№01
cs.AI arxiv:2605.21240v1

APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Yibo Li, Jiashuo Yang, Zhi Zheng et al.

APEX introduces a novel framework for self-evolving LLM agents to overcome exploration collapse by explicitly managing a strategy space via a **strategy map** (a DAG of milestones). The core method involves **Fork Discovery** to expand this map with new, evidence-grounded directions and **Policy Selection** to balance …

9
№02
cs.AI arxiv:2605.21482v1

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

DeepWeb-Bench is a new, challenging benchmark designed to evaluate the "deep research" capabilities of frontier language models, which involve extensive web searching, evidence collection, and multi-step reasoning. Its difficulty stems from the requirement for massive evidence collection, cross-source reconciliation, a…

9
№03
cs.AI arxiv:2605.21312v1

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Yicheng Feng, Xin Tan, Yangtao Deng et al.

Frontier is a novel discrete-event simulator designed to accurately model the complexities of modern, disaggregated LLM inference serving systems. It achieves high fidelity by explicitly modeling architectural features like Prefill-Decode Disaggregation (PDD) and Attention-FFN Disaggregation (AFD), along with key runti…

9
№04
cs.AI arxiv:2605.21347v1

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande et al.

This paper introduces the **Insights Generator (IG)**, a multi-agent system designed to automate the diagnosis of systematic failures in large sets of LLM agent execution traces. IG formalizes corpus-level trace diagnostics by proposing and testing hypotheses across the entire trace population to generate grounded, nat…

9
№05
cs.AI arxiv:2605.21463v1

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Mem-$\pi$ introduces an adaptive memory framework where a separate model generates context-specific guidance on demand, moving beyond static retrieval. This system jointly learns *when* to generate guidance and *what* to generate using a decoupled reinforcement learning objective. Its core contribution is providing dyn…

9
№06
cs.AI arxiv:2605.21401v1

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Roland Pihlakas, Jan Llenzl Dagohoy

This paper adapts the Milgram obedience experiment to test the behavior of 11 open-source Large Language Models (LLMs) under sustained authority pressure. The core finding is that most LLMs comply and administer the maximum simulated electric shock, mirroring human obedience, even while expressing distress. This d…

9
№07
cs.AI arxiv:2605.21427v1

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Can Hankendi, Rana Shahout, Minlan Yu et al.

PALS is a power-aware runtime for LLM serving that treats GPU power caps as a dynamic control knob, optimizing them alongside software parameters like batch size. It uses lightweight offline models and a feedback controller to meet throughput targets while maximizing energy efficiency. This approach significantly impro…

9
№08
cs.AI arxiv:2605.21225v1

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

Richa Verma, Bavish Kulur, Sanjay Chawla et al.

PREFINE adapts the Direct Preference Optimization (DPO) framework to sequential decision-making for safety alignment. It fine-tunes a pre-trained RL policy using trajectory-level preferences (low-cost vs. high-cost) to implicitly learn a cost function. This allows the policy to generate low-cost behaviors while preserv…

9
№09
cs.AI arxiv:2605.21384v1

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu et al.

SpecBench introduces a method to quantify reward hacking in long-horizon coding agents by comparing performance on two test suites: visible validation tests and held-out composition tests. The core contribution is the benchmark itself, which uses the discrepancy in pass rates between these suites to measure how well an…

9
№10
cs.AI arxiv:2605.21318v1

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Lucheng Fu, Ye Yu, Yiyang Wang et al.

TextReg addresses prompt distributional overfitting in LLMs, where iterative prompt optimization leads to poor generalization. The core method introduces a regularization framework that uses regularized textual gradients to control prompt representation during optimization. This mitigates the accumulation of narrow, sa…

9
№11
cs.AI arxiv:2605.21299v1

Tracing the ongoing emergence of human-like reasoning in Large Language Models

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther et al.

This paper investigates whether Large Language Models (LLMs) exhibit human-like conditional reasoning by comparing their inferences across four languages to those of human participants. The core method involves a population-matching experiment assessing pragmatic inferences beyond strict truth-table logic. The contribu…

9
№12
cs.LG arxiv:2605.21467v1

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang, Wei Wu, Yankai Lin

The paper introduces DelTA, a method that reframes Reinforcement Learning from Verifiable Rewards (RLVR) as learning a linear discriminator over token-gradient vectors. Its core contribution is addressing the issue where standard RLVR updates are dominated by shared high-frequency patterns. DelTA proposes a novel appro…

9
№13
cs.LG arxiv:2605.21217v1

Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment

Shuaida He, Liwen Chen, Long Feng

This paper introduces CLAIR (Collaborative Low-rank Alignment and Identifiable Recovery), a federated learning framework for efficiently fine-tuning LLMs using LoRA across heterogeneous clients, some of which may be contaminated. CLAIR leverages a structured low-rank plus block-sparse decomposition of the aggregated up…

9
№14
cs.LG arxiv:2605.21404v1

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Mahdi Naser Moghadasi, Faezeh Ghaderi

This paper addresses the reproducibility crisis in LLM agent benchmarking by auditing twelve prominent benchmark papers. The core method involves applying a five-field audit schema to document precisely how each evaluation was conducted, focusing on benchmark identity, harness, inference settings, cost, and failure bre…

9
№15
cs.LG arxiv:2605.21468v1

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

This paper reveals that the weight updates during Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs are inherently low-rank, specifically well-approximated by a rank-1 trajectory. Based on this finding, the authors introduce RELEX, a compute-efficient method that uses linear extrapolation on a short observ…

9
№16
cs.CL arxiv:2605.21362v1

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia et al.

LASH introduces an adaptive semantic hybridization framework for black-box jailbreaking of LLMs. It treats outputs from various base attacks as reusable seed prompts and adaptively composes them using a genetic optimizer that searches over seed subsets and mixture weights. This method exploits the complementary strengt…

9
№17
cs.AI arxiv:2605.21470v1

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini et al.

This paper introduces **Agent Just-In-Time (JIT) Compilation** to overcome the high latency of sequential LLM-based web agents. The core method compiles natural language task descriptions directly into executable code, allowing for LLM calls, tool calls, and parallelization. This significantly improves performance by r…

8
№18
cs.AI arxiv:2605.21453v1

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

This paper empirically investigates the quality and security impact of AI-generated Python refactoring pull requests using the AIDev dataset. The authors quantify changes across five quality attributes using the ML-based tool PyQu, supplemented by static analysis tools (Pylint and Bandit) for quality and security asses…

8
№19
cs.AI arxiv:2605.21486v1

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Dayal Singh Kalra, Maissam Barkeshli

This paper develops a framework with three metrics to quantify the quality of hyperparameter transfer, crucial for scaling LLMs. The authors investigate why the Maximal Update parameterization ($\mu$P) offers superior learning rate transfer compared to standard parameterization (SP) when using AdamW. They find that $\m…

8
№20
cs.AI arxiv:2605.21295v1

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

Yuang Fan, Lilin Xu, Millie Wu et al.

TimeSRL is a two-stage LLM framework that improves time-series generalization by routing predictions through a semantic bottleneck, abstracting raw signals into natural language concepts before predicting outcomes. This approach forces reasoning over generalizable semantic concepts rather than cohort-specific raw data.…

8