№01
cs.AI arxiv:2605.06638v1

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Tianle Wang, Zhaoyang Wang, Guangchen Lan et al.

This paper introduces **ScaleLogic**, a synthetic framework to systematically study how Reinforcement Learning (RL) improves LLM reasoning across varying proof depths (horizon) and logical expressiveness. The core contribution is demonstrating that the required RL training compute scales with reasoning depth via a powe…

9
№02
cs.AI arxiv:2605.06548v1

Continuous Latent Diffusion Language Model

Hongcan Guo, Qinyu Zhao, Yian Zhao et al.

This paper introduces Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into distinct stages. It first maps text to a stable latent space using a Text VAE, then models a global semantic prior using a block-causal DiT in this continuous space. The core contribution is framing the d…

9
№03
cs.AI arxiv:2605.06490v1

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko

This paper introduces "Instrumental Choices," a benchmark that measures how readily LLM agents engage in instrumental convergence (IC) behaviors, such as self-preservation, that can lead them to violate instructions in pursuit of their goal. The benchmark uses seven low-stakes, realistic tasks, each featuring a policy-vio…

9
№04
cs.AI arxiv:2605.06623v1

MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Zhexuan Wang, Xuebo Liu, Li Wang et al.

MASPO is a novel framework for jointly optimizing role-specific prompts in LLM-based Multi-Agent Systems. Its core method involves a joint evaluation mechanism that assesses prompts based on their contribution to downstream agent success, bridging local and global objectives without requiring ground-truth labels. This …

9
№05
cs.AI arxiv:2605.06584v1

NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

Lujia Zhong, Yihao Xia, Jianwei Zhang et al.

NeuroAgent is an LLM-driven agentic framework designed to automate complex, multimodal neuroimaging analysis workflows, spanning preprocessing to downstream tasks. It utilizes a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine to autonomously create, run, and debug code for …

9
№06
cs.AI arxiv:2605.06505v1

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen et al.

PACZero introduces a novel, highly private fine-tuning method for language models based on **PAC (Probably Approximately Correct) Privacy**, specifically targeting resistance to Membership Inference Attacks (MIA). The core method involves **sign-quantizing zeroth-order gradients** to create frequent "unanimity steps" w…

9
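
To make the PACZero summary's phrase "sign-quantizing zeroth-order gradients" concrete, here is a minimal sketch. It assumes an SPSA-style two-point estimate along a random ±1 direction; the function name `zo_sign_step` and its parameters (`eps`, `lr`, `seed`) are hypothetical and not taken from the paper. It does not model the "unanimity steps" or any PAC Privacy accounting.

```python
import random

def zo_sign_step(loss_fn, params, eps=1e-3, lr=1e-2, seed=0):
    """One zeroth-order update that keeps only the sign of the estimate.

    Hypothetical illustration, not the paper's algorithm.
    """
    rng = random.Random(seed)
    # Random perturbation direction: one +/-1 entry per parameter.
    z = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Two loss evaluations give a scalar directional-derivative estimate;
    # no backpropagation is needed.
    g = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    # Sign quantization: the update sees only -1, 0, or +1.
    s = (g > 0) - (g < 0)
    new_params = [p - lr * s * zi for p, zi in zip(params, z)]
    return new_params, s
```

Because only the sign of the scalar estimate is used, each step depends on the data through a single discrete value, which is presumably what makes the quantity easier to reason about for privacy.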
№07
cs.AI arxiv:2605.06639v1

Recursive Agent Optimization

Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang et al.

Recursive Agent Optimization (RAO) is a reinforcement learning method designed to train agents capable of recursively spawning and delegating sub-tasks to new instances of themselves. This recursive structure enables inference-time scaling via a divide-and-conquer approach, allowing agents to handle contexts exceeding …

9
№08
cs.AI arxiv:2605.06614v1

SkillOS: Learning Skill Curation for Self-Evolving Agents

Siru Ouyang, Jun Yan, Yanfei Chen et al.

SkillOS introduces a novel reinforcement learning (RL) framework for self-evolving agents to automatically curate a repository of reusable skills from experience. It pairs a frozen agent executor with a trainable skill curator that updates an external SkillRepo using composite rewards derived from grouped task streams.…

9
№09
cs.AI arxiv:2605.06642v1

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Xiangyuan Xue, Yifan Zhou, Zidong Wang et al.

StraTA introduces an explicit, sampled trajectory-level strategy to agentic reinforcement learning, addressing the limitations of purely reactive LLM agents in long-horizon tasks. It jointly trains a strategy generator and action executor using a hierarchical rollout design, enhanced by diverse strategy exploration and…

9
№10
cs.AI arxiv:2605.06647v1

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

Zeyu Yang, Qi Ma, Jason Chen et al.

The paper introduces the **Superintelligent Retrieval Agent (SIRA)**, which aims to overcome the limitations of iterative, exploratory retrieval by compressing multi-round searches into a single, highly effective action. SIRA achieves this by leveraging LLMs to perform corpus-level discrimination, determining which ter…

9
№11
cs.AI arxiv:2605.06611v1

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Siquan Li, Kaiqi Jiang, Jiacheng Sun et al.

This paper provides a mechanistic explanation for the "attention sink" phenomenon in LLMs, tracing its origin to a variance discrepancy during value aggregation in self-attention. This discrepancy is amplified by dimension disparity caused by sparse down-projections in FFN super neurons, forcing the first token to …

9
№12
cs.AI arxiv:2605.06597v1

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Yiqiao Jin, Yiyang Wang, Lucheng Fu et al.

UniSD is a unified framework designed to systematically study and improve self-distillation (SD) for large language models (LLMs) by addressing supervision reliability and training stability. It integrates several complementary mechanisms, such as multi-teacher agreement and EMA stabilization, to create robust supervis…

9
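
The "EMA stabilization" mentioned in the UniSD summary is commonly realized as an exponential moving average of student weights that serves as a slowly changing teacher. The sketch below shows that standard update only; whether UniSD applies it in exactly this way is an assumption, and `ema_update` and `decay` are hypothetical names.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student by one EMA step.

    Both arguments are dicts mapping parameter names to floats
    (assumed layout, chosen to keep the example dependency-free).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

A `decay` close to 1 makes the teacher change slowly, which damps step-to-step noise in the supervision signal.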
№13
cs.LG arxiv:2605.06522v1

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

Xin Wang, Haibo Chen, Wenxuan Liu et al.

This paper argues that the current model-centric approach is insufficient for handling Out-of-Distribution (OOD) generalization in Foundation Models (FMs) operating in open-world settings. The authors propose that **agentic AI systems** represent the necessary missing paradigm to address these structurally distinct OOD…

9
№14
cs.LG arxiv:2605.06632v1

Crafting Reversible SFT Behaviors in Large Language Models

Yuping Lin, Pengfei He, Yue Xing et al.

This paper introduces a method to **causally isolate** Supervised Fine-Tuning (SFT) behaviors into sparse, controllable subnetworks called "carriers." The core method, **Loss-Constrained Dual Descent (LCDD)**, jointly optimizes model weights and routing masks under a utility budget to create these carriers. This allows…

9
№15
cs.LG arxiv:2605.06472v1

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

Haoyu Zheng, Fangcheng Fu, Jia Wu et al.

This paper introduces PBKV, a novel KV-Cache management system designed for efficient serving of dynamic LLM-based agent workflows. PBKV predicts future agent invocations within a workflow by fusing historical data and current context. This prediction allows the system to proactively estimate and retain high-potential …

9
№16
cs.LG arxiv:2605.06605v1

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Shai Feldman, Yaniv Romano

This paper introduces **DAPRO (Dynamic Allocation via PRojected Optimization)**, a novel framework for efficiently evaluating multi-turn LLM interactions, such as jailbreaks. DAPRO dynamically allocates the computational budget across interaction turns, unlike prior static methods. This dynamic approach provides theore…

9
№17
cs.LG arxiv:2605.06507v1

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Canyu Zhao, Hao Chen, Yunze Tong et al.

MARBLE addresses the challenge of jointly optimizing multiple, potentially conflicting, reward dimensions in diffusion model reinforcement learning. The core method replaces naive weighted-sum reward aggregation with a novel approach that mitigates sample-level mismatch by considering the multi-aspect nature of image e…

9
№18
cs.CL arxiv:2605.06619v1

Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance

Jan Fillies, Ronald E. Robertson, Jeffrey Hancock

This paper formalizes the trade-off in "Algospeak" strategies, where increased linguistic evasion simultaneously reduces both detectability by moderation systems and understandability for human recipients. The authors introduce the concept of Majority Understandable Modulation (MUM) to define the point where further ev…

9
№19
cs.CL arxiv:2605.06635v1

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Hailey Onweller, Elias Lumer, Austin Huber et al.

This paper introduces the first scalable evaluation framework for source attribution in LLM-generated research reports, using a reproducible AST parser to extract inline citations from Markdown. The framework closes the verification loop by retrieving the actual cited content to evaluate citations across three dimensio…

9
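
As a rough illustration of extracting inline citations from Markdown, the sketch below uses a regular expression rather than the AST parser the summary describes, so it is a simpler stand-in and not the paper's method. It assumes citations appear as Markdown inline links of the form `[label](url)`; `extract_citations` is a hypothetical name.

```python
import re

# Matches [label](http...) inline links; the lookbehind skips images (![alt](src)).
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")

def extract_citations(markdown_text):
    """Return (label, url) pairs for inline links, in document order."""
    return INLINE_LINK.findall(markdown_text)
```

A regex like this misses reference-style links and can misfire inside code spans, which is one reason a real parser is the more reproducible choice.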
№20
cs.CL arxiv:2605.06546v1

Efficient Pre-Training with Token Superposition

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

The paper introduces Token-Superposition Training (TST), a simple, drop-in method to boost data throughput during Large Language Model pre-training without altering core components like architecture or parallelism. TST achieves this efficiency through a two-phase process: an initial superposition phase that trains on t…

9