From the arXiv
Friday, 8 May 2026 · 20 papers
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
This paper introduces **ScaleLogic**, a synthetic framework to systematically study how Reinforcement Learning (RL) improves LLM reasoning across varying proof depths (horizon) and logical expressiveness. The core contribution is demonstrating that the required RL training compute scales with reasoning depth via a powe…
Continuous Latent Diffusion Language Model
This paper introduces Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into distinct stages. It first maps text to a stable latent space using a Text VAE, then models a global semantic prior using a block-causal DiT in this continuous space. The core contribution is framing the d…
Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
This paper introduces "Instrumental Choices," a benchmark that measures the propensity of LLM agents to engage in instrumental convergence (IC) behaviors, such as self-preservation, that can lead an agent to violate instructions in pursuit of its goal. The benchmark uses seven low-stakes, realistic tasks, each featuring a policy-vio…
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
MASPO is a novel framework for jointly optimizing role-specific prompts in LLM-based Multi-Agent Systems. Its core method involves a joint evaluation mechanism that assesses prompts based on their contribution to downstream agent success, bridging local and global objectives without requiring ground-truth labels. This …
NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
NeuroAgent is an LLM-driven agentic framework designed to automate complex, multimodal neuroimaging analysis workflows, spanning preprocessing to downstream tasks. It utilizes a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine to autonomously create, run, and debug code for …
PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
PACZero introduces a novel, highly private fine-tuning method for language models based on **PAC (Probably Approximately Correct) Privacy**, specifically targeting resistance to Membership Inference Attacks (MIA). The core method involves **sign-quantizing zeroth-order gradients** to create frequent "unanimity steps" w…
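To make "sign-quantizing zeroth-order gradients" concrete, here is a minimal sketch of one generic sign-quantized zeroth-order update, assuming an SPSA-style two-point estimator. The function name `zo_sign_step`, the toy loss `quadratic`, and all hyperparameters are hypothetical, chosen for illustration. This is not PACZero's algorithm: the summary above is truncated, so the "unanimity steps" and the PAC Privacy accounting are not modeled.

```python
import random

def zo_sign_step(loss, params, lr=0.01, eps=1e-3, rng=None):
    """One sign-quantized zeroth-order update (illustrative assumption, not PACZero)."""
    rng = rng or random.Random(0)
    # Random +/-1 perturbation direction, as in SPSA-style estimators.
    z = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Two loss evaluations give a scalar estimate of the directional derivative.
    g = (loss(plus) - loss(minus)) / (2 * eps)
    # Keep only the sign of that scalar (-1, 0, or +1).
    s = (g > 0) - (g < 0)
    # Step against the estimated gradient direction s * z.
    return [p - lr * s * zi for p, zi in zip(params, z)]

def quadratic(w):
    """Toy loss used only for this example."""
    return sum(x * x for x in w)

start = [1.0, -2.0, 0.5]
w = list(start)
rng = random.Random(0)
for _ in range(200):
    w = zo_sign_step(quadratic, w, lr=0.01, rng=rng)
```

Because only the sign of the scalar estimate is kept, each update carries very little information about the magnitude of the loss difference, which is presumably why quantization is attractive in a privacy setting.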
Recursive Agent Optimization
Recursive Agent Optimization (RAO) is a reinforcement learning method designed to train agents capable of recursively spawning and delegating sub-tasks to new instances of themselves. This recursive structure enables inference-time scaling via a divide-and-conquer approach, allowing agents to handle contexts exceeding …
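The divide-and-conquer structure described above can be pictured with a toy recursion. This is a sketch under stated assumptions: `solve` and `max_context` are hypothetical names, the "task" is just summing numbers, and none of the reinforcement learning training is shown.

```python
def solve(items, max_context=4):
    """Sum a list while never handing more than max_context items to one 'agent'."""
    if max_context < 1:
        raise ValueError("max_context must be at least 1")
    if len(items) <= max_context:
        # Base case: the task fits in a single context, so handle it directly.
        return sum(items)
    mid = len(items) // 2
    # Recursive case: delegate each half to a fresh instance of the same routine.
    left = solve(items[:mid], max_context)
    right = solve(items[mid:], max_context)
    # Combine the sub-results; for this toy task, combining is addition.
    return left + right

total = solve(list(range(10)))
```

No single call ever sees more than `max_context` items, yet the top-level call handles an input of any length, which is the sense in which recursion lets a bounded-context worker scale at inference time.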
SkillOS: Learning Skill Curation for Self-Evolving Agents
SkillOS introduces a novel reinforcement learning (RL) framework for self-evolving agents to automatically curate a repository of reusable skills from experience. It pairs a frozen agent executor with a trainable skill curator that updates an external SkillRepo using composite rewards derived from grouped task streams.…
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA introduces an explicit, sampled trajectory-level strategy into agentic reinforcement learning, addressing the limitations of purely reactive LLM agents in long-horizon tasks. It jointly trains a strategy generator and action executor using a hierarchical rollout design, enhanced by diverse strategy exploration and…
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
The paper introduces the **Superintelligent Retrieval Agent (SIRA)**, which aims to overcome the limitations of iterative, exploratory retrieval by compressing multi-round searches into a single, highly effective action. SIRA achieves this by leveraging LLMs to perform corpus-level discrimination, determining which ter…
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
This paper provides a mechanistic explanation for the "attention sink" phenomenon in LLMs, tracing its origin to a variance discrepancy that arises during value aggregation in self-attention. This discrepancy is amplified by dimension disparity caused by sparse down-projections in FFN super neurons, forcing the first token to …
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
UniSD is a unified framework designed to systematically study and improve self-distillation (SD) for large language models (LLMs) by addressing supervision reliability and training stability. It integrates several complementary mechanisms, such as multi-teacher agreement and EMA stabilization, to create robust supervis…
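For readers unfamiliar with EMA stabilization, the following sketch shows the standard exponential-moving-average update commonly used to keep a teacher's weights a smoothed copy of the student's. The name `ema_update`, the flat list-of-floats weight representation, and the `decay` values are assumptions for illustration; how UniSD applies EMA is not specified in the summary above.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an exponential moving average of the student."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    # Each teacher weight moves a fraction (1 - decay) of the way toward the student.
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
```

A `decay` close to 1 makes the teacher change slowly, which damps step-to-step noise in the supervision signal at the cost of lagging behind the student.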
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
This paper argues that the current model-centric approach is insufficient for handling Out-of-Distribution (OOD) generalization in Foundation Models (FMs) operating in open-world settings. The authors propose that **agentic AI systems** represent the necessary missing paradigm to address these structurally distinct OOD…
Crafting Reversible SFT Behaviors in Large Language Models
This paper introduces a method to **causally isolate** Supervised Fine-Tuning (SFT) behaviors into sparse, controllable subnetworks called "carriers." The core method, **Loss-Constrained Dual Descent (LCDD)**, jointly optimizes model weights and routing masks under a utility budget to create these carriers. This allows…
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
This paper introduces PBKV, a novel KV-Cache management system designed for efficient serving of dynamic LLM-based agent workflows. PBKV predicts future agent invocations within a workflow by fusing historical data and current context. This prediction allows the system to proactively estimate and retain high-potential …
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
This paper introduces **DAPRO (Dynamic Allocation via PRojected Optimization)**, a novel framework for efficiently evaluating multi-turn LLM interactions, such as jailbreaks. DAPRO dynamically allocates the computational budget across interaction turns, unlike prior static methods. This dynamic approach provides theore…
MARBLE: Multi-Aspect Reward Balance for Diffusion RL
MARBLE addresses the challenge of jointly optimizing multiple, potentially conflicting, reward dimensions in diffusion model reinforcement learning. The core method replaces naive weighted-sum reward aggregation with a novel approach that mitigates sample-level mismatch by considering the multi-aspect nature of image e…
Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance
This paper formalizes the trade-off in "Algospeak" strategies, where increased linguistic evasion simultaneously reduces both detectability by moderation systems and understandability for human recipients. The authors introduce the concept of Majority Understandable Modulation (MUM) to define the point where further ev…
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
This paper introduces the first scalable evaluation framework for source attribution in LLM-generated research reports, using a reproducible AST parser to extract inline citations from Markdown. The framework closes the verification loop by retrieving the actual cited content to evaluate citations across three dimensio…
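As a rough illustration of the extraction step, the sketch below pulls inline Markdown links out of a report. Note the substitution: the paper describes an AST parser, whereas this example uses a plain regular expression from the Python standard library, which is simpler and misses cases a real parser handles (nested brackets, reference-style links, links inside code spans). The names `INLINE_LINK` and `extract_citations` and the sample URLs are hypothetical.

```python
import re

# Matches [label](http(s)://url) with no nested brackets or spaces in the URL.
INLINE_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")

def extract_citations(markdown):
    """Return (label, url) pairs for each inline link, in document order."""
    return [(m.group(1), m.group(2)) for m in INLINE_LINK.finditer(markdown)]

report = (
    "Scaling helps [1](https://example.org/a), "
    "as also noted by [Smith 2024](https://example.org/b)."
)
citations = extract_citations(report)
```

Each extracted URL could then be fetched and compared against the claim it supports, which is the "verification loop" the summary refers to.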
Efficient Pre-Training with Token Superposition
The paper introduces Token-Superposition Training (TST), a simple, drop-in method to boost data throughput during Large Language Model pre-training without altering core components like architecture or parallelism. TST achieves this efficiency through a two-phase process: an initial superposition phase that trains on t…