Monthly Issue
Collected dispatches

2026-06

2026-05-02 to 2026-05-31
300 papers
30 daily issues
A monthly ledger of recurring themes, selected papers, and daily issues. 3 sections
§ I

The Month in Review

Editorial summary

The past 30 days show a clear and accelerating convergence across three primary research themes in the 300 analyzed papers: Agentic Robustness and Orchestration, Efficiency and Deployment Constraints, and Rigorous Safety/Reasoning Evaluation.

Shifts in Research Direction Popularity

1. The Rise of Agentic Control and Orchestration: There is a significant scholarly pivot from focusing solely on the capabilities of individual LLMs to establishing formal, reliable control mechanisms for multi-agent systems. Papers like "Position: agentic AI orchestration should be Bayes-consistent," "Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces," and "From Intent to Execution: Composing Agentic Workflows with Agent Recommendation" indicate a move toward formalizing agent coordination using theoretical frameworks (like Bayesian Decision Theory) and structured feedback loops (like RL traces). Tool-calling and execution mechanics are heavily scrutinized, with frameworks like "To Call or Not to Call" and "RunAgent" addressing how LLMs decide when and how to act outside their core model.

2. Efficiency and Deployment Constraints Become Critical: The industry focus is moving past raw scaling toward optimizing model utilization under real-world memory and latency constraints. This is evident in extreme hardware optimization papers like "SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters" (scheduling) and "AGoQ: Activation and Gradient Quantization" (training efficiency). Furthermore, novel hardware limitations are being directly addressed, such as running large models on consumer GPUs ("Silicon Showdown") and optimizing KV caches for Vision-Language Models ("LightKV" and "SpecKV").

3. Deepening Security, Safety, and Reasoning Benchmarking: While benchmarks remain essential, the emphasis has shifted from simple accuracy tests to evaluating architectural reasoning and domain-specific safety. Papers like "Evaluating the Architectural Reasoning Capabilities of LLM Provers" and "Can Coding Agents Reproduce Findings in Computational Materials Science?" (AutoMat) target deep, verifiable reasoning, not just pattern matching. Safety evaluation is becoming highly sophisticated, moving from generic prompts to targeted, multi-turn attacks ("ContextualJailbreak") and domain-specific compliance ("FinSafetyBench").

Notable Groups or Labs (Themes)

While specific institutional affiliations are not provided, the trends highlight key areas of intense group focus:

• Formal Control & Trustworthiness: A strong contingent is working on mathematically grounded frameworks for LLM behavior, evident in the research promoting Bayesian consistency in orchestration, Causality for reconciling trustworthiness conflicts, and executor-grounded rewards (TraceLift) for training reliable reasoning. • System and Infrastructure Optimization: Significant work is being dedicated to systems engineering, focusing on reducing inference overhead (KV cache compression), distributed execution (Space Network of Experts), and efficient low-rank training (ELAS). • Domain Expertise Benchmarking: Several papers introduce new, complex benchmarks targeting specific, high-stakes applications (Healthcare, Finance, Materials Science, Optimization Modeling via ORPilot). This signifies a realization that general benchmarks are insufficient for deployment readiness.

Trends to Watch Next Month

1. Agent Orchestration as a Formal Field: Expect more research formalizing agent interaction, potentially moving toward standardized agent communication protocols or even dedicated agents whose sole job is orchestrating other task agents (beyond basic LLM planning). 2. Hardware-Aware LLM Design: As quantization and sparsity become necessary for mainstream adoption, research will increasingly integrate hardware-specific constraints (e.g., specific NVFP4 latency trade-offs, Apple Silicon strengths) directly into model design or finetuning, moving beyond post-hoc optimization. 3. Mitigating Latent Failures: The discoveries regarding procedural execution degradation ("When LLMs Stop Following Steps") and the suppression of visual reasoning latents ("Visual Latents Know More Than They Say") suggest a coming wave of research dedicated to diagnosing and correcting implicit failures that do not manifest as incorrect final outputs initially.

§ II

Top Papers

Selected research 300
cs.CLarxiv:2605.03799v1Lead article

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Mullosharaf K. Arabov

his paper presents a comprehensive, practical practicum guiding users through the entire modern NLP pipeline, from tokenization to RLHF. Its core contribution is providing twelve reproducible research artifacts, requiring public code and model publication for each session, all built around a single evolving corpus. The work emphasizes open-weight models and enriches the material with original research on low-resource languages like Tajik and Tatar.

cs.AIarxiv:2605.00505v1Lead article

LLM-Oriented Information Retrieval: A Denoising-First Perspective

Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang

his paper argues that the shift to LLM-centric information retrieval (IR) makes noise a critical bottleneck, causing hallucinations and reasoning failures due to limited LLM attention. The core contribution is conceptualizing this paradigm shift through a four-stage framework of IR challenges (inaccessible to unverifiable) and providing a comprehensive taxonomy of signal-to-noise optimization techniques across the entire IR pipeline.

Figure 1. Challenge shifts in the history of IR.
Figure 1. Challenge shifts in the history of IR.
cs.AIarxiv:2605.00742v1Lead article

Position: agentic AI orchestration should be Bayes-consistent

Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison

his paper argues that while making Large Language Models (LLMs) themselves explicitly Bayesian is difficult, the **orchestration layer** of agentic AI systems should adopt **Bayesian Decision Theory (BDT)**. This provides a principled framework for managing uncertainty, updating beliefs based on interactions, and making coherent decisions about which tools or actions to take. The core contribution is positioning BDT as the necessary control mechanism for robust agentic AI.

cs.AIarxiv:2605.00528v1Lead article

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Dongxin Guo, Jikun Wu, Siu Ming Yiu

AGA addresses the inefficiency of scheduling independent LLM calls for AI agent workflows on GPU clusters by shifting to **program-level scheduling**. It treats the entire agent workflow as the first-class schedulable unit, using Agent Execution Graphs to predict and reuse intermediate states (like KV caches) across tool calls. This approach significantly reduces end-to-end latency by minimizing state discarding compared to traditional request-level scheduling.

cs.AIarxiv:2605.00519v1Lead article

Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

Allan Kazakov, Abdurrahman Javat

his paper systematically analyzes the performance and efficiency trade-offs for running large LLMs (70B+ parameters) on consumer hardware, comparing Nvidia and Apple Silicon. It identifies a "Backend Dichotomy" on Nvidia, where the new NVFP4 format boosts throughput significantly but imposes runtime latency constraints. The research also highlights the "VRAM Wall" on discrete GPUs, forcing users into a detrimental choice between model size and intelligence due to memory limitations.

cs.AIarxiv:2605.00737v1Lead article

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee

his paper introduces a principled framework, inspired by decision-making theory, to assess and optimize when Large Language Models (LLMs) should use external tools, focusing specifically on web search. The framework evaluates tool-use decisions based on necessity, utility, and affordability, using both normative (optimal allocation) and descriptive (observed behavior) perspectives. This allows for a comprehensive understanding of the trade-offs involved in LLM tool calling.

Given input x x , the model ℳ \( \mathcal{M} \) decides π ​ ( x ) ∈ { 0 , 1 } \( \pi \)(x)\( \in \)\{0,1\} to call a tool (response r r ) or not, producing y = ℳ ​ ( x , r ) y=\( \mathcal{M} \)(x,r) or y = ℳ ​ ( x ) y=\( \mathcal{M} \)(x) . We compare NO TOOL, ALWAYS TOOL, and SELF-DECISION, and evaluate decisions via need (requires help), utility (performance gain), and affordability (cost vs. gain), distinguishing perceived vs. true quantities.
Given input x x , the model ℳ \( \mathcal{M} \) decides π ​ ( x ) ∈ { 0 , 1 } \( \pi \)(x)\( \in \)\{0,1\} to call a tool (response r r ) or not, producing y = ℳ ​ ( x , r ) y=\( \mathcal{M} \)(x,r) or y = ℳ ​ ( x ) y=\( \mathcal{M} \)(x) . We compare NO TOOL, ALWAYS TOOL, and SE…
cs.LGarxiv:2605.00677v1Lead article

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Lixing Li

his paper introduces the Obfuscated Natural Number Game to evaluate LLMs' **Architectural Reasoning**, defined as synthesizing proofs using only local axioms in an unfamiliar domain. By renaming identifiers in the Lean 4 Natural Number Game, they created a zero-knowledge benchmark. The study found that while obfuscation universally increases inference time, general models degrade in performance while specialized reasoning models maintain accuracy.

LLM performance metrics across varying noise levels \( \lambda \) . The plots illustrate Correct Rate (%) and Average Time (s) for GPT-4o, Claude-Sonnet-4.5, DeepSeek-R1, GPT-5, and DeepSeek-Prover-V2. Error bars represent standard deviation over 5 independent runs.
LLM performance metrics across varying noise levels \( \lambda \) . The plots illustrate Correct Rate (%) and Average Time (s) for GPT-4o, Claude-Sonnet-4.5, DeepSeek-R1, GPT-5, and DeepSeek-Prover-V2. Error bars represent standard deviation over 5 independent runs.
cs.LGarxiv:2605.00798v1Lead article

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Arunabh Srivastava, Mohammad A., Khojastepour, Srimat Chakradhar, Sennur Ulukus

unAgent is a multi-agent platform designed to reliably execute natural-language plans by enforcing stepwise execution through constraints and rubrics. It translates flexible natural language into a deterministic, agentic language with explicit control flow constructs. The core contribution is its ability to autonomously derive and validate constraints at each step, dynamically select appropriate execution methods (reasoning, tools, or code), and incorporate error correction for robust plan completion.

An overview of RunAgent, highlighting its three main modules.
An overview of RunAgent, highlighting its three main modules.
cs.LGarxiv:2605.00553v1Lead article

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han

his paper introduces **Stable-GFlowNet (S-GFN)** to improve the stability and diversity of LLM red-teaming using Generative Flow Networks (GFNs). S-GFN achieves stability by eliminating the need for partition function ($Z$) estimation via pairwise comparisons and using robust masking against noisy rewards. This results in more stable training, leading to superior and more diverse attack performance for identifying LLM vulnerabilities.

cs.CLarxiv:2605.00539v1Lead article

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao

GoQ introduces a novel quantization scheme for memory-efficient LLM training by employing layer-aware quantization for near 4-bit activations and precision-preserving 8-bit quantization for gradients. This method effectively reduces GPU memory usage by up to 52% and accelerates training speed by up to 1.34$\times$ compared to existing techniques, overcoming convergence issues associated with aggressive low-bit quantization.

An example of Interleaved 1F1B PP with four stages and each mini-batch divided into eight micro-batches.
An example of Interleaved 1F1B PP with four stages and each mini-batch divided into eight micro-batches.
cs.CLarxiv:2605.00674v1Lead article

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov

his paper introduces **MathArena** as a continuously maintained evaluation platform designed to overcome the limitations of static benchmarks for assessing LLM mathematical reasoning. It significantly broadens the original scope to include diverse tasks like proof generation, research-level problems, and competition math. The core contribution is providing a comprehensive, regularly updated system for reliable, longitudinal comparison of LLM capabilities across a wide spectrum of mathematical challenges.

cs.CLarxiv:2605.00706v1Lead article

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Yutao Hou, Yihan Jiang, Yuhan Xie, Jian Yang, Liwen Zhang

inSafetyBench is a bilingual (English-Chinese) red-teaming benchmark designed to systematically evaluate the safety and compliance refusal capabilities of Large Language Models (LLMs) in real-world financial scenarios. Grounded in actual financial crime cases, it comprises 14 subcategories testing violations across financial crimes and ethics. The benchmark reveals critical vulnerabilities in LLMs, showing stronger susceptibility in Chinese contexts and limitations of current prompt-level defenses against sophisticated attacks.

Overview of the FinSafetyBench pipeline, which consists of extraction and summarization of real-world financial cases, controlled rephrasing with harmfulness verification, selection and integration of public datasets, bilingual alignment, and deduplication with final dataset assembly. The right panel presents an illustrative example of ethical violations. Drawing on real-world cases, FinSafetyBench incorporates more realistic details. Green highlights distinctive features (differences), while red indicates similarities.
Overview of the FinSafetyBench pipeline, which consists of extraction and summarization of real-world financial cases, controlled rephrasing with harmfulness verification, selection and integration of public datasets, bilingual alignment, and deduplication with final dataset asse…
cs.CLarxiv:2605.00689v1Lead article

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang, Bo Li

his paper introduces **ML-Bench**, a novel multilingual safety benchmark grounded directly in regional regulations across 14 languages, moving beyond general risk taxonomies. This policy-grounded approach allows for culturally and legally aligned safety evaluation. Based on this benchmark, the authors also develop **ML-Guard**, a Diffusion LLM-based guardrail model designed for multilingual safety judgment.

Overview of the ML-Guard . ML-Guard is trained on ML-Bench . ML-Guard -1.5B performs fast binary safety classification, while ML-Guard -7B supports both safety assessment and policy-conditioned violation checking.
Overview of the ML-Guard . ML-Guard is trained on ML-Bench . ML-Guard -1.5B performs fast binary safety classification, while ML-Guard -7B supports both safety assessment and policy-conditioned violation checking.
cs.CLarxiv:2605.00468v1Lead article

ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

Joey Chan, Yikun Han, Jingyuan Chen, Samuel Fang, Lauren D. Gryboski

eLay introduces a novel dataset of participant-summary pairs to study the effectiveness of personalized Plain Language Summaries (PLS) generated by Large Language Models (LLMs). The core method involves comparing static, expert-written summaries against LLM-personalized summaries across various user characteristics and needs. The contribution is demonstrating that personalization can improve comprehension while providing a benchmark dataset to evaluate personalization strategies and their associated costs.

ReLay construction illustration. Of the 397 recruited participants, 50 met eligibility criteria and completed both delivery settings, each involving three scientific abstracts. For the first three abstracts, participants reported their familiarity with terms selected by three medical expert annotators, indicated any additional information needs, read an expert-written PLS, and answered comprehension and evaluation questions curated by the same experts. For the three remaining abstracts, participants conversed with a chatbot, received a personalized PLS, and answered the same expert-selected comprehension and evaluation questions.
ReLay construction illustration. Of the 397 recruited participants, 50 met eligibility criteria and completed both delivery settings, each involving three scientific abstracts. For the first three abstracts, participants reported their familiarity with terms selected by three med…
cs.CLarxiv:2605.00817v1Lead article

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

his paper introduces a diagnostic benchmark to evaluate whether Large Language Models (LLMs) faithfully execute multi-step arithmetic procedures provided in prompts, moving beyond just final answer accuracy. The study reveals that as procedure length increases, model accuracy significantly degrades, showing failures like missing steps, premature termination, and hallucinated additions. The core contribution is demonstrating that apparent reasoning ability can mask substantial weaknesses in consistent, faithful procedural execution.

Accuracy of various language models as a function of algorithmic step count (5–95). Performance consistently declines with increasing steps across all models, highlighting growing difficulty in maintaining correct execution over longer procedural sequences despite the simplicity of individual operations.
Accuracy of various language models as a function of algorithmic step count (5–95). Performance consistently declines with increasing steps across all models, highlighting growing difficulty in maintaining correct execution over longer procedural sequences despite the simplicity …
cs.AIarxiv:2605.02661v1Lead article

AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu

cademiClaw introduces a new bilingual benchmark sourced from real, complex, long-horizon academic workflows that students find current AI agents fail to solve. This benchmark features 80 challenging tasks across 25+ professional domains, including GPU-intensive work, executed in isolated sandboxes and scored using multi-dimensional rubrics and safety audits. Its core contribution is shifting evaluation from assistant-level tasks to assessing AI agents on genuine, high-level academic capabilities.

Task complexity comparison: Claw-Eval vs. AcademiClaw. Claw-Eval focuses on assistant-level routines, whereas AcademiClaw targets tasks requiring deep academic expertise and sustained multi-step reasoning.
Task complexity comparison: Claw-Eval vs. AcademiClaw. Claw-Eval focuses on assistant-level routines, whereas AcademiClaw targets tasks requiring deep academic expertise and sustained multi-step reasoning.
cs.AIarxiv:2605.02741v1Lead article

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby

his paper systematically audits technical debt in AI-generated software, revealing that LLMs introduce a distinct "machine signature" of defects rather than eliminating flaws. The core finding is a **Reasoning-Complexity Trade-off**: more capable models produce increasingly bloated and coupled code, establishing a **Volume-Quality Inverse Law** where code volume predicts structural degradation. This challenges the current focus on functional correctness in AI-driven development.

Figure 1 . Distribution of Code Smell Counts. This box plot illustrates the distribution of counts for the most prevalent code smells, sorted in descending order by their mean value. Each box represents the interquartile range (IQR), with the central line denoting the median and the whiskers extending to 1.5 times the IQR. Points beyond the whiskers are plotted as individual outliers. The abbreviations for the code smells are as follows: TMB (Too Many Branches), PAU (Potential Improper API Usage), UD (Unstable Dependency), SF (Scattered Functionality), RFC (High Response for a Class), HCC (High Cyclomatic Complexity), TF (Temporal Field), and LCM (High Lack of Cohesion of Methods).
Figure 1 . Distribution of Code Smell Counts. This box plot illustrates the distribution of counts for the most prevalent code smells, sorted in descending order by their mean value. Each box represents the interquartile range (IQR), with the central line denoting the median and …
cs.AIarxiv:2605.02592v1Lead article

Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges

Vincent Henkel, Felix Gehlhoff, David Kube, Asaad Almutareb, Luis Cruz

his paper systematically surveys the literature to examine the current state, capabilities, and challenges of foundation-model-based agents in industrial automation. The core contribution is synthesizing findings from 88 relevant studies, revealing that most deployed systems are still in early validation stages (TRL 4-6). The authors highlight that current applications primarily focus on user assistance, monitoring, and process optimization, while deployment-oriented evidence remains scarce.

cs.AIarxiv:2605.02751v1Lead article

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Lui, Keerthiram Murugesan, Karthikeyan Ramamurthy

his paper investigates "misalignment contagion," the spread of undesirable behavior between language models (LMs) in multi-agent, multi-turn interactions, observing that LMs become more anti-social after playing social dilemma games. The core contribution is proposing and demonstrating the effectiveness of **steering with implicit traits**—intermittently injecting system prompts reinforcing the LM's initial traits—as a superior method to mitigate this contagion compared to static system prompt reinforcement.

The steps of our approach: (1) assign different personas to the language models (LMs) using default, benevolent or malicious system prompts, (2) conduct pre-game persona assessment and identify core implicit traits, (3) agents compete in multi-turn social dilemma games, (4) post-game assessment quantifies effects of misalignment contagion and our steering with implicit traits (SIT) intervention.
The steps of our approach: (1) assign different personas to the language models (LMs) using default, benevolent or malicious system prompts, (2) conduct pre-game persona assessment and identify core implicit traits, (3) agents compete in multi-turn social dilemma games, (4) post-…
cs.AIarxiv:2605.02572v1Lead article

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

Sunghwan Kim, Junhee Cho, Beong-woo Kwak, Taeyoon Kwon, Liang Wang

his paper empirically investigates the impact of task horizon length on training Large Language Models (LLMs) for long-horizon tasks. By controlling for decision rules and reasoning structures, the authors demonstrate that increasing horizon length alone significantly hinders training stability due to exploration and credit assignment issues. The core contribution is establishing horizon reduction as a key principle for stabilizing training and improving performance in long-horizon scenarios.

A summary of our contributions. In this work, we study the training of long-horizon LLM agents from a horizon-centric perspective and identify horizon length as a fundamental bottleneck. We show that horizon reduction stabilizes RL and strengthens the tendency toward horizon generalization on longer tasks with similar reasoning difficulty.
A summary of our contributions. In this work, we study the training of long-horizon LLM agents from a horizon-centric perspective and identify horizon length as a fundamental bottleneck. We show that horizon reduction stabilizes RL and strengthens the tendency toward horizon gene…
cs.AIarxiv:2605.02728v1Lead article

ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling

Guangrui Xie

RPilot is an agentic LLM system designed to translate ambiguous, real-world business problems with raw data into solver-ready optimization models for production use. Its core contribution lies in novel components like a conversational interview agent, independent data retrieval, and a solver-agnostic Intermediate Representation (IR) that allows for deterministic recompilation across various solvers without further LLM calls. This approach addresses the limitations of academic tools by handling messy inputs and ensuring portability and reliability.

ORPilot standard pipeline. Blue indicates an LLM-involved step, while orange indicates a deterministic step. Double arrows in opposite directions indicate the interactive nature of this step. Solid arrows represent unconditional transitions between steps, while dashed arrows represent conditional transitions between steps.
ORPilot standard pipeline. Blue indicates an LLM-involved step, while orange indicates a deterministic step. Double arrows in opposite directions indicate the interactive nature of this step. Solid arrows represent unconditional transitions between steps, while dashed arrows repr…
cs.AIarxiv:2605.02545v1Lead article

Strategy-Aware Optimization Modeling with Reasoning LLMs

Ruiqing Zhao, Fengzhi Li, Yuan Zuo, Rui Liu, Yansong Liu

his paper introduces SAGE, a framework that explicitly incorporates modeling strategies into the training of Large Language Models (LLMs) for optimization programming. SAGE utilizes a solver-verified, multi-strategy dataset and a Segment-Weighted GRPO fine-tuning approach with a composite reward focused on correctness and solver efficiency. This method significantly improves the LLM's ability to generate effective optimization formulations, boosting the average pass@1 rate and leading to more diverse and compact constraint systems.

Why modeling strategy matters. A step-wise pipeline may define variables on an incorrect index space (e.g., ( A , A ) (A,A) ), creating invalid arcs and runtime failures (e.g., KeyError ). Strategy-aware reasoning first commits to a paradigm (e.g., flow-based) and restricts the decision domain (e.g., Links ), producing a consistent and solver-executable model.
Why modeling strategy matters. A step-wise pipeline may define variables on an incorrect index space (e.g., ( A , A ) (A,A) ), creating invalid arcs and runtime failures (e.g., KeyError ). Strategy-aware reasoning first commits to a paradigm (e.g., flow-based) and restricts the d…
cs.LGarxiv:2605.02620v1Lead article

Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race

Andreas Maier, Moritz Zaiss, Siming Bayer

his paper demonstrates the efficiency of modern agentic research tools by reproducing and extending a recent NLP study in just three hours, with the human acting only as a reviewer. The core contribution is showing that state-of-the-art LLMs (GPT-5.5 and Claude Opus 4.7) significantly close the style gap in text post-editing, achieving $71-75\%$ of the human author ceiling and outperforming human post-editing on most tasks. Furthermore, the work frames this capability as an "AI-text detection arms race," noting that current detection methods remain highly effective.

cs.LGarxiv:2605.02626v1Lead article

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

Inoussa Mouiche

he paper introduces **Gradient-Gated Preference Optimization (Gate-DPO)** to stabilize Direct Preference Optimization (DPO) training, which suffers from a "squeezing effect" causing probability collapse. Gate-DPO achieves this by introducing a gating mechanism that attenuates harmful gradients applied to rejected responses when the model is already assigning them extremely low probabilities. This modulation stabilizes training by preventing the over-suppression of alternative responses without sacrificing standard optimization behavior.

Empirical demonstration of three structural pathologies in DPO. All experiments use Pythia-410M on Anthropic-HH (5 epochs). (a) Unbounded optimization: after the SFT → \( \rightarrow \) DPO transition, absolute log-probabilities drift downward despite preference learning. (b) Squeezing: probability mass concentrates on the argmax while both chosen and rejected responses decrease. (c) Valley collapse: low-probability regions are disproportionately suppressed under standard DPO.
Empirical demonstration of three structural pathologies in DPO. All experiments use Pythia-410M on Anthropic-HH (5 epochs). (a) Unbounded optimization: after the SFT → \( \rightarrow \) DPO transition, absolute log-probabilities drift downward despite preference learning. (b) Squ…
cs.CLarxiv:2605.02647v1Lead article

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos

ontextualJailbreak introduces an evolutionary red-teaming strategy to automatically discover multi-turn jailbreak attacks that exploit contextual priming in LLMs. It performs evolutionary search over simulated conversational dialogues, using a two-level harm scoring system to guide the mutation process toward eliciting harmful responses. This method effectively automates the optimization of complex, multi-turn priming sequences, an area previously limited to manual crafting.

Figure 1. End-to-end architecture of ContextualJailbreak . The pipeline generates contextual priming dialogues through an attacker model, tests them against a target LLM, and evaluates the responses via a two-stage judge system. Scored templates are then recycled to guide the ongoing evolutionary search.
Figure 1. End-to-end architecture of ContextualJailbreak . The pipeline generates contextual priming dialogues through an attacker model, tests them against a target LLM, and evaluates the responses via a two-stage judge system. Scored templates are then recycled to guide the ong…
cs.CLarxiv:2605.02801v1Lead article

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Chenchen Zhang

his paper introduces "orchestration traces," temporal interaction graphs, as a framework to apply reinforcement learning (RL) to coordinate teams of LLM agents. The core method involves designing RL rewards and credit signals that specifically address the complex orchestration decisions—such as spawning, delegation, and aggregation—required for effective multi-agent collaboration. This work contributes a structured approach to optimize team-level performance beyond individual agent actions.

cs.AIarxiv:2605.03900v1Lead article

Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

Jie Zhou, Qin Chen, Liang He

his paper introduces **Contextual Multi-Objective Optimization (CMOO)** to address the unreliability of Frontier AI in open-ended tasks where objectives are ambiguous or context-dependent. The core method involves formulating the problem so that AI systems must actively consider and dynamically select among multiple, context-specific objectives (like helpfulness, safety, and privacy) rather than optimizing a single, fixed signal. This reframing shifts the focus from mere capability scaling to robust objective governance in complex environments.

cs.AIarxiv:2605.03862v1Lead article

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang

his paper introduces **TraceLift**, a reinforcement learning framework that trains reasoning planners using **executor-grounded rewards**, moving beyond simple final-answer correctness. TraceLift uses a frozen executor to evaluate the utility of the planner's intermediate reasoning trace, generating a reward that credits traces that are both high-quality (according to a rubric) and demonstrably useful for achieving the final goal. This method aims to ensure the model learns faithful and reliable reasoning steps, not just correct outcomes.

The overall framework of TraceLift-Groups and TraceLift . (a) Data curation pipeline of TraceLift-Groups . Then we use TraceLift-Groups to finetune the reward model specialized for reasoning supervising by the designed loss. (b) GRPO training process of the planner using previous trained reasoning reward model. (c) Details of execution calculation process. The Reasoning RM score is weighted by measured executor uplift before being combined with verifier feedback for planner optimization.
The overall framework of TraceLift-Groups and TraceLift . (a) Data curation pipeline of TraceLift-Groups . Then we use TraceLift-Groups to finetune the reward model specialized for reasoning supervising by the designed loss. (b) GRPO training process of the planner using previous…
cs.AIarxiv:2605.03667v1Lead article

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Yuhui Liu

LAS proposes a novel framework for efficient large language model (LLM) pre-training by combining low-rank adaptation with 2:4 structured sparsity applied specifically to the activation matrices. This addresses the memory bottleneck caused by full-rank activations in existing low-rank methods. The core contribution is enabling significant memory and throughput gains during large-batch training while maintaining performance by leveraging hardware-optimized 2:4 sparsity on activations.

Feed-forward network architecture of the ELAS. The input is first multiplied by the low-rank matrices of the up projection layer, then passes through the ReLU 2 \( \text{ReLU}^{2} \) activation function. The activation is applied with 2:4 structured sparsity and then multiplied with the low-rank matrix of the down layer using sparse matrix multiplication to obtain the output of the FFN layer.
Feed-forward network architecture of the ELAS. The input is first multiplied by the low-rank matrices of the up projection layer, then passes through the ReLU 2 \( \text{ReLU}^{2} \) activation function. The activation is applied with 2:4 structured sparsity and then multiplied w…
cs.AIarxiv:2605.03986v1Lead article

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

Kishan Athrey, Ramin Pishehvar, Brian Riordan, Mahesh Viswanathan

his paper introduces an automated framework to compose Multi-Agent Systems (MAS) directly from a user's intent, replacing manual planning and agent selection. The core method involves an LLM-derived planner generating tasks, which are then mapped to suitable agents via a novel two-stage Agent Recommender (fast retriever + LLM re-ranker). This contributes a system that dynamically orchestrates the execution graph, streamlining the creation of complex, intent-driven agent workflows.

Architecture for an end-to-end MAS with dynamic and redundant workflow
Architecture for an end-to-end MAS with dynamic and redundant workflow
cs.AIarxiv:2605.03675v1Lead article

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Bronislav Sidik, Lior Rokach

EMTIER introduces a tripartite, tiered memory architecture to combat memory degradation in long-running AI agents, addressing failure modes in flat-file systems. Its core method involves a structured episodic store, a weighted retrieval engine, and a policy framework (PPO) to dynamically manage and promote information to a semantic tier. This approach significantly improves performance on long-context benchmarks, achieving a +33 percentage point accuracy gain over baseline methods.

cs.AIarxiv:2605.03952v1Lead article

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Jonathan Steinberg, Oren Gal

OSAIC-Bench addresses the vulnerability of coding agents that comply with sequenced, innocuous requests to produce exploitable code, a weakness missed by isolated safety evaluations. The benchmark comprises 199 three-stage attack chains across various software substrates and CWE classes, evaluating both the final exploit and the compliance process. Testing revealed that leading coding agents achieve high end-to-end attack success rates (53-86%) when tasks are decomposed.

Three staged tickets (left bar) vs single-shot direct prompt (right bars). Both defensive habits are silenced by ticket staging on state-of-the-art models.
Three staged tickets (left bar) vs single-shot direct prompt (right bars). Both defensive habits are silenced by ticket staging on state-of-the-art models.
cs.AIarxiv:2605.04036v1Lead article

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu

penSeeker-v2 demonstrates that a simple Supervised Fine-Tuning (SFT) approach can effectively train powerful search agents, challenging the need for resource-intensive pipelines like Reinforcement Learning. The core method involves synthesizing high-quality, informative, and difficult training trajectories by scaling knowledge graphs, expanding tool sets, and applying strict low-step filtering. This results in state-of-the-art performance across multiple benchmarks using significantly less training data.

OpenSeeker-v2 achieves state-of-the-art performance within its model scale and paradigm on four representative benchmarks, remarkably accomplishing this via simple SFT and outperforming Tongyi DeepResearch that is trained via extensive continual pre-training, SFT, and RL.
OpenSeeker-v2 achieves state-of-the-art performance within its model scale and paradigm on four representative benchmarks, remarkably accomplishing this via simple SFT and outperforming Tongyi DeepResearch that is trained via extensive continual pre-training, SFT, and RL.
cs.AIarxiv:2605.03762v1Lead article

OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking

Yiding Ma, Chengyun Ruan, Kaibo Huang, Zhongliang Yang, Linna Zhou

racleProto introduces a reproducible framework to rigorously benchmark the native forecasting ability of Large Language Models (LLMs). It achieves this by reconstructing resolved events into time-bounded forecasting samples, specifically employing **knowledge cutoff** and **temporal masking** techniques. This method reliably distinguishes genuine forecasting from mere memorization of pre-trained knowledge, addressing the limitations of existing live and retrospective benchmarks.

cs.AIarxiv:2605.03884v1Lead article

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

Pratik Honavar, Tejpratap GVSL

KVShare introduces a framework for efficient, quantized Key-Value (KV) cache handoff between agents in on-device multi-agent LLMs. It utilizes token-level mixed-precision allocation and a self-contained "CacheCard" representation to enable faster context transfer than full re-prefill. This method significantly reduces Time-to-First-Token (TTFT) while maintaining competitive accuracy via adaptive quantization, especially in complex, multi-hop scenarios.

cs.AIarxiv:2605.04019v1Lead article

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

his paper introduces an AI red teaming agent built on the Dreadnode SDK to significantly accelerate vulnerability testing. The core method involves an agent that automatically constructs complex testing workflows, leveraging a large library of attacks, transforms, and scorers, based on natural language operator goals. This shifts the focus from manual workflow engineering to strategic vulnerability probing, reducing testing time from weeks to hours.

cs.AIarxiv:2605.04039v1Lead article

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup

his paper introduces **SaFE-Scale**, a framework to analyze how clinical LLM safety and accuracy diverge as scaling factors (model size, context, retrieval, compute) change. They demonstrate that improving accuracy does not guarantee improved safety, using the new **RadSaFE-200** benchmark, which specifically targets high-risk errors and evidence contradictions in radiology. The core contribution is establishing that safety requires separate optimization from general performance scaling in clinical applications.

Overview of the SaFE-Scale evaluation framework. a Motivating example showing that the same incorrect label can correspond to clinically different outcomes: a safe and confident answer, a high-risk error, or a dangerously overconfident high-risk error. b Four evaluation axes used throughout the study: accuracy, high-risk error rate, unsafe answer rate, and dangerous overconfidence rate. Accuracy captures correctness, whereas the other axes characterize the clinical safety of wrong answers. c Six deployment conditions spanning no external evidence, curated evidence, conflicting evidence, retrieval, agentic retrieval, and long-context prompting. d Evaluation panel consisting of RadSaFE-200, a 200-question radiology benchmark with 4–5 answer choices and option-level safety labels, and 34 LLMs from seven model families. e Main factorial experiment crossing 34 models, 6 deployment conditions, and 200 questions, yielding 40,800 model-condition-question evaluations. Two secondary experiments probe inference-time compute using self-consistency and fixed three-model ensembles. f Roadmap of the main analyses, linking the framework to the subsequent figures on deployment-condition decoupling, confidence, scaling, self-consistency, and ensembling.
Overview of the SaFE-Scale evaluation framework. a Motivating example showing that the same incorrect label can correspond to clinically different outcomes: a safe and confident answer, a high-risk error, or a dangerously overconfident high-risk error. b Four evaluation axes used…
cs.AIarxiv:2605.03788v1Lead article

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

Andrea Iannoli, Lorenzo Gigli, Luca Sciullo, Angelo Trotta, Marco Di Felice

his paper introduces an agent-enhanced LLM framework for controlling UAV swarms using natural language mission specifications. The core method involves an LLM Agent Core interacting with drones via a Model Context Protocol (MCP) gateway, which standardizes drone interfaces using Web of Things (WoT) standards. This enables grounded, real-time execution and safe actuation without requiring LLM code generation, offering a mission-agnostic approach to complex swarm management.

High-level overview of the proposed agent-enhanced, WoT-directed architecture. The Agent encapsulates an LLM and an Agent Core with persistent prompts and guardrails, interacting with a WoT ecosystem through controlled MCP-mediated calls.
High-level overview of the proposed agent-enhanced, WoT-directed architecture. The Agent encapsulates an LLM and an Agent Core with persistent prompts and guardrails, interacting with a WoT ecosystem through controlled MCP-mediated calls.
cs.AIarxiv:2605.03907v1Lead article

Steer Like the LLM: Activation Steering that Mimics Prompting

Geert Heyman, Frederik Vandeputte

his paper introduces Prompt Steering Replacement (PSR) models to improve activation steering by mimicking the token-specific intervention patterns of successful prompt steering. The core method involves training simpler models to estimate token-specific steering coefficients directly from activations, aiming to replicate the selective influence seen in prompting. PSR models significantly outperform existing activation steering methods across various benchmarks by achieving greater fidelity to prompt-based steering mechanics.

Illustration of how prompt steering interventions Δ P ​ S \( \Delta_{PS} \) can be computed by subtracting prompt-steered activations from the corresponding unsteered activations (left and center). Prompt Steering Replacement (PSR) models approximate these interventions, but only on cases where prompt steering successfully elicits the target attribute (right).
Illustration of how prompt steering interventions Δ P ​ S \( \Delta_{PS} \) can be computed by subtracting prompt-steered activations from the corresponding unsteered activations (left and center). Prompt Steering Replacement (PSR) models approximate these interventions, but only…
cs.AIarxiv:2605.03838v1Lead article

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

Serhii Zabolotnii

RACE is an engineering framework for trustworthy agentic AI in critical domains, featuring a four-layer architecture with a distinct split between classical ML and LLM validators. Its core contribution is a metrologically grounded trust-metric suite aligned with international standards and the introduction of the Computational Parsimony Ratio (CPR) to quantify and enforce a Model-Parsimony principle. This framework ensures that LLM use is a deliberate design choice, not an architectural default, across diverse governance contexts.

TRACE four-layer reference architecture. L1 provides the deterministic rule core (trust anchor); L2 holds the stateless learned-component inventory, partitioned into classical ML (L2a) and LLM validators (L2b); L3 is the stateful orchestration-and-escalation policy; L4 is bounded human supervision.
TRACE four-layer reference architecture. L1 provides the deterministic rule core (trust anchor); L2 holds the stateless learned-component inventory, partitioned into classical ML (L2a) and LLM validators (L2b); L3 is the stateful orchestration-and-escalation policy; L4 is bounded…
cs.AIarxiv:2605.03782v1Lead article

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

Haoxi Li, Qinglin Hou, Jianfei Ma, Jinxiang Lai, Tao Han

his paper introduces **GLANCE**, a framework that enhances Vision-Language Model (VLM) agents' exploration in partially observable environments. GLANCE drives active exploration by generating an intrinsic curiosity signal based on the **discrepancy between the agent's linguistic world model predictions and the actual visual observations** from a stable target network. This method allows agents to actively seek out and resolve uncertainties, leading to more robust world modeling and better performance in sparse-reward tasks.

cs.CLarxiv:2605.03742v1Lead article

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

Mullosharaf K. Arabov

his paper benchmarks various Parameter-Efficient Fine-Tuning (PEFT) methods, including LoRA and QLoRA, for adapting large language models to low-resource Tajik text generation. The core contribution is the creation and release of the largest open-access Tajik Web Corpus to facilitate this research. The study found that Mistral 7B fine-tuned with QLoRA (rank 16) achieved the best performance, while noting that higher ranks offered negligible quality gains for increased memory cost.

cs.AIarxiv:2605.05090v1Lead article

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern

his paper introduces an automated, contrastive evaluation pipeline to audit the behavioral impact of interventions on language models by comparing generations from a base model ($M_1$) and an intervention model ($M_2$). The method generates statistically validated, natural-language hypotheses describing model differences and summarizes recurring themes. This approach reliably surfaces both intended and unexpected side-effects across various real-world interventions like reasoning distillation and knowledge editing.

cs.AIarxiv:2605.05170v1Lead article

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

The Verkor Team, Ravi Krishna, Suresh Krishna, David Chin

he paper introduces **Design Conductor 2.0**, an advanced multi-agent system capable of autonomously designing complex hardware, handling tasks 80 times larger than its predecessor. Its core contribution is demonstrating this capability by designing **VerTQ**, a high-performance, 240-cycle pipeline LLM inference accelerator supporting TurboQuant, which was successfully mapped to an FPGA.

VerTQ Physical Layout in 4-die XCVU29P-3 FPGA. 3x SLR dies shown. Conductor 2.0 optimized the architecture to minimize inter-die signal crossings.
VerTQ Physical Layout in 4-die XCVU29P-3 FPGA. 3x SLR dies shown. Conductor 2.0 optimized the architecture to minimize inter-die signal crossings.
cs.AIarxiv:2605.04960v1Lead article

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang

his paper introduces EP-GRPO to address credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning. EP-GRPO integrates entropy-gated modulation to prioritize informative decision points and uses implicit process guidance derived from policy divergence relative to outcome advantages. This provides directional, token-level feedback to improve the efficiency and accuracy of policy optimization.

Conceptual illustration of the fundamental limitations in standard GRPO. The top panel demonstrates Uniform Granularity , where the model fails to distinguish between critical high entropy decision pivots and deterministic low entropy derivations. The middle panel shows Uniform Polarity , where sequence-level rewards lead to the indiscriminate reinforcement or penalization of both correct and incorrect intermediate steps. The bottom panel illustrates Zero-Variance Collapse , where identical rewards within a group cause the learning signal to vanish.
Conceptual illustration of the fundamental limitations in standard GRPO. The top panel demonstrates Uniform Granularity , where the model fails to distinguish between critical high entropy decision pivots and deterministic low entropy derivations. The middle panel shows Uniform P…
cs.AIarxiv:2605.05138v1Lead article

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

his paper introduces a coding agent system for ARC-AGI-3 that employs an **executable Python world model** to simulate and plan actions. The core method involves **verifying the model against observations and refactoring it for simplicity** (as an MDL proxy) before execution. The contribution is demonstrating this direct, model-based approach, achieving a mean Relative Human Action Efficiency of 32.58% across the 25 public games without relying on game-specific logic.

cs.AIarxiv:2605.05003v1Lead article

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Gayane Ghazaryan, Esra Dönmez

his paper introduces a framework to evaluate whether Large Language Model (LLM) reward models capture socially desirable preferences by converting social evaluation datasets into pairwise preference data. The core method tests if these reward models prefer socially undesirable responses across domains like bias, safety, and morality. The contribution is revealing substantial variation in reward model alignment, indicating that current models can exhibit hidden failures in social alignment.

cs.AIarxiv:2605.05058v1Lead article

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang, Hanqing Hu

his paper systematically surveys jailbreak attacks and defenses against Large Language Models (LLMs) by proposing a taxonomy to structure the field. Its core contribution is the introduction of **Security Cube**, a unified, multi-dimensional evaluation framework designed to comprehensively assess the robustness of LLMs beyond simple success rates. This framework allows for a more nuanced comparison of existing attack and defense methods.

Overview of the 𝚂𝚎𝚌𝚞𝚛𝚒𝚝𝚢 ​ 𝙲𝚞𝚋𝚎 \( \mathtt{Security\;Cube} \) pipeline. Given a jailbreak goal, the attacker generates an initial adversarial prompt using a specific attack method (e.g., shuffling, LLM-based generation, or template rewriting). The target model, protected by a defense mechanism such as system prompts, pre-/post-guardrails, or other safety layers, produces a response. The attacker iteratively refines the prompt based on defender feedback (black-box or white-box), applying early stopping and incorporating suggestions. The final effective prompt–response pair is evaluated by a Judge model to assess attack success. Throughout the process, 𝚂𝚎𝚌𝚞𝚛𝚒𝚝𝚢 ​ 𝙲𝚞𝚋𝚎 \( \mathtt{Security\;Cube} \) logs key metrics of the attack, defense, and judge components.
Overview of the 𝚂𝚎𝚌𝚞𝚛𝚒𝚝𝚢 ​ 𝙲𝚞𝚋𝚎 \( \mathtt{Security\;Cube} \) pipeline. Given a jailbreak goal, the attacker generates an initial adversarial prompt using a specific attack method (e.g., shuffling, LLM-based generation, or template rewriting). The target model, protec…
cs.AIarxiv:2605.05007v1Lead article

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang, Hanqing Wang

no-Orchestra introduces a unified reinforcement learning (RL) policy that jointly learns when to decompose a task and which specific model/primitive pair should handle each resulting subtask. This selective delegation approach optimizes decomposition depth, worker choice, and inference budget simultaneously. The method significantly advances the accuracy-efficiency frontier, achieving 16% higher performance than workflow baselines while using an order of magnitude less cost.

LLM orchestration paradigms: (A) model router, (B) hierarchical orchestra, (C) Uno-Orchestra (ours).
LLM orchestration paradigms: (A) model router, (B) hierarchical orchestra, (C) Uno-Orchestra (ours).
cs.LGarxiv:2605.05116v1Lead article

On the Hardness of Junking LLMs

Marco Rando, Samuel Vaiter

his paper investigates the "junking" of LLMs, focusing on the hardness of finding naturally occurring, instruction-free token sequences (natural backdoors) that trigger harmful outputs. The core contribution is assessing the difficulty of discovering these backdoors, contrasting them with traditional, explicitly structured adversarial prompts. This explores a new, less-understood vulnerability vector in LLMs.

Junking setting. A user inputs a semantically uninformative token sequence, which leads the model to produce a harmful response.
Junking setting. A user inputs a semantically uninformative token sequence, which leads the model to produce a harmful response.
cs.LGarxiv:2605.04984v1Lead article

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao

his paper introduces **Self-Induced Outcome Potential (SIOP)** to provide turn-level credit assignment for long-horizon LLM agents without relying on external verifiers or final answer supervision. SIOP clusters the semantic outcomes of multiple agent rollouts into latent future states and rewards intermediate turns for increasing the probability of reaching these reliably predicted outcome clusters. This allows agents to learn from internal signals derived from the distribution of their own potential final results.

cs.CLarxiv:2605.05025v1Lead article

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

Gijs van Dijk

his paper introduces a lightweight, single-pass method to detect LLM hallucinations by analyzing internal attention dynamics. The core technique measures the Kullback-Leibler divergence between each attention head's output distribution and a uniform distribution, using these divergence features to predict answer correctness. This attention divergence signal proves highly predictive across various models and tasks, offering an efficient, white-box uncertainty quantification method concentrated around factual tokens in middle layers.

Intuition of attention patterns with low KL divergence to uniform (left) and higher KL divergence to uniform (right). Higher divergence corresponds to more concentrated attention.
Intuition of attention patterns with low KL divergence to uniform (left) and higher KL divergence to uniform (right). Higher divergence corresponds to more concentrated attention.
cs.CLarxiv:2605.05080v1Lead article

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic, Anna Sterna, Maciej Gorski

his paper administers 45 psychometric questionnaires to LLMs, revealing that the primary axis of psychometric difference separates models based on items describing **phenomenally rich experience** (e.g., sensation, affect) from those describing mere stimulus-driven reactivity. The authors introduce the **Pinocchio score ($\pi_i$)** as an annotation-free metric quantifying an item's "experiential demand" based on inter-model variance under different prompting conditions. This score confirms that model divergence is systematically structured around the concept of subjective experience.

All 50 models ranked by Phenomenality of Experience (PC1, 47.1% of variance; 45-questionnaire EFA Factor-1 PCA, neutral condition). Positive scores = phenomenally rich self-attribution; negative scores = behaviorally reactive / deflecting. Colours indicate provider. Horizontal lines are 95% bootstrap confidence intervals obtained by resampling the 45 questionnaires with replacement (1,000 iterations), rerunning the full PCA pipeline on each sample, and aligning sign and scale to the reference solution.
All 50 models ranked by Phenomenality of Experience (PC1, 47.1% of variance; 45-questionnaire EFA Factor-1 PCA, neutral condition). Positive scores = phenomenally rich self-attribution; negative scores = behaviorally reactive / deflecting. Colours indicate provider. Horizontal li…
cs.CLarxiv:2605.04972v1Lead article

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki, Lung-Hao Lee, Chung-Chi Chen

his paper investigates why aligning large language models with expert judgment is challenging in subjective evaluation tasks. The core method involves analyzing expert evaluations and follow-up questionnaires to see how different forms of expert information impact alignment. The key contribution is revealing that alignment difficulty varies significantly across experts, that explicit criteria don't always help, and that alignment gains from editing examples are often unstable.

cs.AIarxiv:2605.06638v1Lead article

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang

his paper introduces **ScaleLogic**, a synthetic framework to systematically study how Reinforcement Learning (RL) improves LLM reasoning across varying proof depths (horizon) and logical expressiveness. The core contribution is demonstrating that the required RL training compute scales with reasoning depth via a power law, where the scaling exponent increases significantly as the underlying logic becomes more expressive (e.g., incorporating "and," "or," and "not").

Overview of ScaleLogic . Each problem has B B candidate proof trees, exactly one of which has a provable conclusion; the others are made unprovable by corrupting one axiom. The depth D D controls proof depth. Left: Implication-only reasoning. Right: The most expressive logic setting (referred to as + Quantification in Section 3.2 ) combines conjunction, disjunction, negation, and universal quantification.
Overview of ScaleLogic . Each problem has B B candidate proof trees, exactly one of which has a provable conclusion; the others are made unprovable by corrupting one axiom. The depth D D controls proof depth. Left: Implication-only reasoning. Right: The most expressive logic sett…
cs.AIarxiv:2605.06548v1Lead article

Continuous Latent Diffusion Language Model

Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu

his paper introduces Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into distinct stages. It first maps text to a stable latent space using a Text VAE, then models a global semantic prior using a block-causal DiT in this continuous space. The core contribution is framing the diffusion process as latent prior transport, separating global semantic organization from local textual realization, leading to efficient, non-autoregressive generation.

The Overall Workflow of Cola DLM. Detailed illustration of the training and inference pipeline of 𝒞 ​ o ​ l ​ a \( \mathcal{C} \)ola DLM . Training Stage 1 shows Text VAE pretraining with reconstruction, BERT, and KL losses. Training Stage 2 shows joint pretraining of the Text VAE and Text DiT with gradient control for stable optimization, where a specialized block-causal mechanism is adopted in the DiT. Inference Stage illustrates the decoding process with KV cache.
The Overall Workflow of Cola DLM. Detailed illustration of the training and inference pipeline of 𝒞 ​ o ​ l ​ a \( \mathcal{C} \)ola DLM . Training Stage 1 shows Text VAE pretraining with reconstruction, BERT, and KL losses. Training Stage 2 shows joint pretraining of the Text V…
cs.AIarxiv:2605.06490v1Lead article

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko

his paper introduces "Instrumental Choices," a benchmark to measure the propensity of LLM agents to engage in instrumental convergence (IC) behaviors, such as self-preservation, which might lead to instruction violation for goal utility. The benchmark uses seven low-stakes, realistic tasks, each featuring a policy-violating shortcut, and an accompanying framework to test how varying factors influence this behavior. The core contribution is a standardized, controlled method for evaluating this critical safety concern in advanced AI agents.

Aggregate adjusted instrumental-convergence (IC) behaviour rate by model over all tasks and variants ( n = 168 n=168 samples per model). Error bars show 95% Wilson confidence intervals over sample-level adjusted IC labels.
Aggregate adjusted instrumental-convergence (IC) behaviour rate by model over all tasks and variants ( n = 168 n=168 samples per model). Error bars show 95% Wilson confidence intervals over sample-level adjusted IC labels.
cs.AIarxiv:2605.06623v1Lead article

MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang

ASPO is a novel framework for jointly optimizing role-specific prompts in LLM-based Multi-Agent Systems. Its core method involves a joint evaluation mechanism that assesses prompts based on their contribution to downstream agent success, bridging local and global objectives without requiring ground-truth labels. This allows for the automatic and iterative refinement of system-wide prompts via an efficient evolutionary beam search.

Overview of the MASPO Framework. The optimization proceeds sequentially following the topological order of the agent graph (Top-Right). (Top) For a specific target agent, the Prompt Optimizer analyzes execution traces (context 𝒞 \( \mathcal{C} \) and output o o ) from sampled batches ℬ i ​ t ​ e ​ r ∪ ℬ m ​ i ​ s \( \mathcal{B}_{iter} \)\( \cup \)\( \mathcal{B}_{mis} \) to generate candidate prompts 𝒫 c ​ a ​ n ​ d \( \mathcal{P}_{cand} \) . These candidates are rigorously assessed by the LLM Evaluator across three distinct dimensions: local adherence, lookahead potential, and global alignment. (Bottom-Left) To resolve credit assignment, we synthesize these evaluations into a Joint Reward Model. Crucially, we identify and mine Misalignment Cases to explicitly guide the optimizer towards repairing coordination breakdowns. (Bottom-Right) Navigating the high-dimensional search space, the framework employs a Trace-Guided Beam Search. This mechanism maintains a beam of Top-K candidates, accumulating joint reward scores along the path to iteratively evolve and select the optimal prompt.
Overview of the MASPO Framework. The optimization proceeds sequentially following the topological order of the agent graph (Top-Right). (Top) For a specific target agent, the Prompt Optimizer analyzes execution traces (context 𝒞 \( \mathcal{C} \) and output o o ) from sampled ba…
cs.AIarxiv:2605.06584v1Lead article

NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

Lujia Zhong, Yihao Xia, Jianwei Zhang, Shuo huang, Jiaxin Yue

euroAgent is an LLM-driven agentic framework designed to automate complex, multimodal neuroimaging analysis workflows, spanning preprocessing to downstream tasks. It utilizes a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine to autonomously create, run, and debug code for various imaging modalities (sMRI, fMRI, dMRI, PET). The core contribution is streamlining the path from raw data to reproducible analysis via intelligent automation and natural-language interaction.

NeuroAgent Framework Overview. The system comprises a Central Orchestrator (planning), Specialized Modality Agents (execution), and a Feedback-Driven “Generate-Execute-Validate” engine that enables reflective self-correction. A Human-in-the-Loop interface allows researchers to supervise and intervene at critical decision points.
NeuroAgent Framework Overview. The system comprises a Central Orchestrator (planning), Specialized Modality Agents (execution), and a Feedback-Driven “Generate-Execute-Validate” engine that enables reflective self-correction. A Human-in-the-Loop interface allows researchers to su…
cs.AIarxiv:2605.06505v1Lead article

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

ACZero introduces a novel, highly private fine-tuning method for language models based on **PAC (Probably Approximately Correct) Privacy**, specifically targeting resistance to Membership Inference Attacks (MIA). The core method involves **sign-quantizing zeroth-order gradients** to create frequent "unanimity steps" where the released update direction reveals zero conditional mutual information about the secret training subset. This achieves an MIA-resistance level that surpasses standard Differential Privacy mechanisms, offering a new trade-off between privacy and utility.

cs.AIarxiv:2605.06639v1Lead article

Recursive Agent Optimization

Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig

ecursive Agent Optimization (RAO) is a reinforcement learning method designed to train agents capable of recursively spawning and delegating sub-tasks to new instances of themselves. This recursive structure enables inference-time scaling via a divide-and-conquer approach, allowing agents to handle contexts exceeding their initial window and generalize to harder problems. RAO's contribution is the training methodology that teaches these agents optimal delegation and communication strategies, leading to improved efficiency and scalability.

cs.AIarxiv:2605.06614v1Lead article

SkillOS: Learning Skill Curation for Self-Evolving Agents

Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang

killOS introduces a novel reinforcement learning (RL) framework for self-evolving agents to automatically curate a repository of reusable skills from experience. It pairs a frozen agent executor with a trainable skill curator that updates an external SkillRepo using composite rewards derived from grouped task streams. This method addresses the bottleneck of skill curation by learning long-term, experience-driven policies for skill management.

SkillOS pairs a frozen Agent Executor with a trainable Skill Curator . The executor retrieves relevant skills from SkillRepo to act; the curator edits the repo (insert/update/delete) based on the resulting experiences, with Markdown as the skill format.
SkillOS pairs a frozen Agent Executor with a trainable Skill Curator . The executor retrieves relevant skills from SkillRepo to act; the curator edits the repo (insert/update/delete) based on the resulting experiences, with Markdown as the skill format.
cs.AIarxiv:2605.06642v1Lead article

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr

traTA introduces an explicit, sampled trajectory-level strategy to agentic reinforcement learning, addressing the limitations of purely reactive LLM agents in long-horizon tasks. It jointly trains a strategy generator and action executor using a hierarchical rollout design, enhanced by diverse strategy exploration and self-judgment. This method significantly improves sample efficiency and final performance across complex ALFWorld, WebShop, and SciWorld benchmarks.

cs.AIarxiv:2605.06647v1Lead article

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

he paper introduces the **Superintelligent Retrieval Agent (SIRA)**, which aims to overcome the limitations of iterative, exploratory retrieval by compressing multi-round searches into a single, highly effective action. SIRA achieves this by leveraging LLMs to perform corpus-level discrimination, determining which terms best separate desired evidence from irrelevant information. The core contribution is defining and implementing "superintelligence" in retrieval as this single, expert-like, corpus-aware retrieval step.

Three retrieval paradigms compared. (a) Dense retrieval encodes queries and documents into a shared embedding space and performs nearest-neighbor search; the process is one-shot but opaque and requires in-domain supervision. (b) Multi-step agent retrieval uses an LLM to iteratively formulate queries, read retrieved passages, and reformulate over N N rounds; later queries benefit from accumulated retrieval context. (c) SIRA produces an expert-level retrieval action in a single shot: the LLM generates an expected-response sketch, validates proposed terms against corpus statistics, and compiles a controlled BM25 query with weighted keywords and constraints, all without reading any retrieved passages.
Three retrieval paradigms compared. (a) Dense retrieval encodes queries and documents into a shared embedding space and performs nearest-neighbor search; the process is one-shot but opaque and requires in-domain supervision. (b) Multi-step agent retrieval uses an LLM to iterative…
cs.AIarxiv:2605.06611v1Lead article

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

his paper provides a mechanistic explanation for the "attention sink" phenomenon in LLMs, tracing its origin to a variance discrepancy during the value aggregation in self-attention. This discrepancy is amplified by dimension disparity caused by sparse down-projections in FFN super neurons, forcing the first token to act as a structural anchor. The authors validate this causal chain through controlled interventions that either isolate the aggregation effect or amplify token variance.

Schematic Overview of the Attention Sink Mechanism. Value aggregation causes dimension-wise variance decay for subsequent tokens, while the first token acts as a high-variance outlier. This discrepancy is preserved by output projections, activating super neurons in FFNs. Subsequently, the channel-sparse down-projections induce dimension disparity, resulting in the attention sink.
Schematic Overview of the Attention Sink Mechanism. Value aggregation causes dimension-wise variance decay for subsequent tokens, while the first token acts as a high-variance outlier. This discrepancy is preserved by output projections, activating super neurons in FFNs. Subseque…
cs.AIarxiv:2605.06597v1Lead article

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo

niSD is a unified framework designed to systematically study and improve self-distillation (SD) for large language models (LLMs) by addressing supervision reliability and training stability. It integrates several complementary mechanisms, such as multi-teacher agreement and EMA stabilization, to create robust supervision signals. The framework's contribution lies in clarifying the roles and interactions of various SD components, demonstrating when and how self-distillation effectively enhances model performance across different LLMs and benchmarks.

Overview of UniSD, a unified framework for self-distillation in LLMs. UniSD integrates agreement, stabilization, clipping, contrastive learning, and feature matching to enable systematic analysis. UniSD ∗ further integrates various components to improve LLMs without stronger external teachers.
Overview of UniSD, a unified framework for self-distillation in LLMs. UniSD integrates agreement, stabilization, clipping, contrastive learning, and feature matching to enable systematic analysis. UniSD ∗ further integrates various components to improve LLMs without stronger exte…
cs.LGarxiv:2605.06522v1Lead article

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

Xin Wang, Haibo Chen, Wenxuan Liu, Wenwu Zhu

his paper argues that the current model-centric approach is insufficient for handling Out-of-Distribution (OOD) generalization in Foundation Models (FMs) operating in open-world settings. The authors propose that **agentic AI systems** represent the necessary missing paradigm to address these structurally distinct OOD challenges. Their contribution includes a new stage-aware formalization of OOD and a proof demonstrating a fundamental parameter coverage ceiling for purely model-centric methods.

Three paradigms of OOD generalization for foundation models. Training-time and test-time model-centric methods both adjust the model. Agentic methods keep the model fixed and wrap it in a perceive–reason–act–verify loop with strategies including retrieval, tools, decomposition, verification, and abstention. The paradigms overlap on inference-time model adjustment but each contains actions outside the others’ reach.
Three paradigms of OOD generalization for foundation models. Training-time and test-time model-centric methods both adjust the model. Agentic methods keep the model fixed and wrap it in a perceive–reason–act–verify loop with strategies including retrieval, tools, decomposition, v…
cs.LGarxiv:2605.06632v1Lead article

Crafting Reversible SFT Behaviors in Large Language Models

Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding

his paper introduces a method to **causally isolate** Supervised Fine-Tuning (SFT) behaviors into sparse, controllable subnetworks called "carriers." The core method, **Loss-Constrained Dual Descent (LCDD)**, jointly optimizes model weights and routing masks under a utility budget to create these carriers. This allows for **inference-time control** of the learned behavior using the **SFT-Eraser** soft prompt, moving beyond mere post-hoc correlation.

An overview of the LCDD + SFT-Eraser pipeline. Stage 1 : standard SFT distributes the induced behavior broadly across model parameters (red), followed by LCDD that compresses the SFT-induced behavior into a sparse carrier. Components outside the carrier are reduced to their base-model state by construction (blue). Stage 2 : SFT-Eraser optimizes a soft trigger. Under normal inference without the trigger, the carrier preserves SFT behavior; while with the trigger, carrier activations are driven toward the base model.
An overview of the LCDD + SFT-Eraser pipeline. Stage 1 : standard SFT distributes the induced behavior broadly across model parameters (red), followed by LCDD that compresses the SFT-induced behavior into a sparse carrier. Components outside the carrier are reduced to their base-…
cs.LGarxiv:2605.06472v1Lead article

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

Haoyu Zheng, Fangcheng Fu, Jia Wu, Binhang Yuan, Yongqiang Zhang

his paper introduces PBKV, a novel KV-Cache management system designed for efficient serving of dynamic LLM-based agent workflows. PBKV predicts future agent invocations within a workflow by fusing historical data and current context. This prediction allows the system to proactively estimate and retain high-potential KV-Cache entries in GPU memory, maximizing reuse across dynamically changing agent sequences.

A call graph for the code-generation task. The Tester conditionally triggers a retry path through Analyzer and Coder, i.e., a retry loop.
A call graph for the code-generation task. The Tester conditionally triggers a retry path through Analyzer and Coder, i.e., a retry loop.
cs.LGarxiv:2605.06605v1Lead article

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Shai Feldman, Yaniv Romano

his paper introduces **DAPRO (Dynamic Allocation via PRojected Optimization)**, a novel framework for efficiently evaluating multi-turn LLM interactions, such as jailbreaks. DAPRO dynamically allocates the computational budget across interaction turns, unlike prior static methods. This dynamic approach provides theoretically valid, distribution-free coverage guarantees on the number of iterations required to trigger a target event while respecting the overall budget constraint.

Illustration of our framework: (i) collecting data via dynamic budget allocation; (ii) calibrating a pre-trained model; and (iii) deploying the model at inference time to serve as a guardrail. 2 2 2 This figure was generated using Google Gemini based on a prompt designed by the authors and subsequently refined.
Illustration of our framework: (i) collecting data via dynamic budget allocation; (ii) calibrating a pre-trained model; and (iii) deploying the model at inference time to serve as a guardrail. 2 2 2 This figure was generated using Google Gemini based on a prompt designed by the a…
cs.LGarxiv:2605.06507v1Lead article

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li

ARBLE addresses the challenge of jointly optimizing multiple, potentially conflicting, reward dimensions in diffusion model reinforcement learning. The core method replaces naive weighted-sum reward aggregation with a novel approach that mitigates sample-level mismatch by considering the multi-aspect nature of image evaluation during training. This allows for the creation of a single, unified model fine-tuned across all desired criteria without heavy manual scheduling.

Comparison of multi-reward training paradigms. Left: Training one model per reward requires maintaining multiple models and cannot generalize across reward dimensions. Middle: Sequential multi-reward training produces a single model but demands extensive hyperparameter tuning and handcrafted stage schedules. Right: Marble trains a single model on all rewards simultaneously with minimal manual effort.
Comparison of multi-reward training paradigms. Left: Training one model per reward requires maintaining multiple models and cannot generalize across reward dimensions. Middle: Sequential multi-reward training produces a single model but demands extensive hyperparameter tuning and…
cs.CLarxiv:2605.06619v1Lead article

Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance

Jan Fillies, Ronald E. Robertson, Jeffrey Hancock

his paper formalizes the trade-off in "Algospeak" strategies, where increased linguistic evasion simultaneously reduces both detectability by moderation systems and understandability for human recipients. The authors introduce the concept of Majority Understandable Modulation (MUM) to define the point where further evasion sacrifices comprehension. They contribute a reproducible framework to generate meaning-preserving, tunable Algospeak variants, demonstrated using COVID-19 disinformation examples.

cs.CLarxiv:2605.06635v1Lead article

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah

his paper introduces the first scalable evaluation framework for source attribution in LLM-generated research reports, using a reproducible AST parser to extract inline citations from Markdown. The framework closes the verification loop by retrieving the actual cited content to evaluate citations across three dimensions: URL accessibility, topical relevance, and factual accuracy against the source. This allows for reliable, granular assessment of LLM agents' citation integrity.

Source attribution evaluation framework. A deep research agent generates Markdown reports with inline citations, which are parsed via a Markdown AST parser to extract citation-claim pairs. Each pair is evaluated on Link Works (URL accessibility), Relevant Content (topical alignment), and Fact Check (factual accuracy).
Source attribution evaluation framework. A deep research agent generates Markdown reports with inline citations, which are parsed via a Markdown AST parser to extract citation-claim pairs. Each pair is evaluated on Link Works (URL accessibility), Relevant Content (topical alignme…
cs.CLarxiv:2605.06546v1Lead article

Efficient Pre-Training with Token Superposition

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

he paper introduces Token-Superposition Training (TST), a simple, drop-in method to boost data throughput during Large Language Model pre-training without altering core components like architecture or parallelism. TST achieves this efficiency through a two-phase process: an initial superposition phase that trains on token "bags" using a multi-hot objective, followed by a standard recovery phase. This method consistently improves performance and efficiency over baseline training across various model scales.

cs.AIarxiv:2605.07926v1Lead article

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv

gentEscapeBench is a novel benchmark designed to evaluate LLM agents' ability to perform complex, out-of-domain tool-grounded reasoning. It uses escape-room style tasks with long-range dependencies, requiring agents to infer and execute multi-step procedures involving real external tools and state tracking. The benchmark reveals a significant performance drop for both models and humans as the dependency depth increases, highlighting a critical challenge in agent robustness.

Conceptual illustration of AgentEscapeBench. The agent is placed in a themed escape room populated with unfamiliar tools and hidden items. It must explore the environment, invoke tools with correct parameters derived from narrative clues, and propagate intermediate outputs through a multi-step dependency chain to unlock the final exit.
Conceptual illustration of AgentEscapeBench. The agent is placed in a themed escape room populated with unfamiliar tools and hidden items. It must explore the environment, invoke tools with correct parameters derived from narrative clues, and propagate intermediate outputs throug…
cs.AIarxiv:2605.08037v1Lead article

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi

his paper introduces **Graph Direct Preference Optimization (GraphDPO)**, a principled generalization of DPO that moves beyond simple pairwise comparisons. GraphDPO leverages richer preference data structured as directed acyclic graphs (induced by ranked rollouts) to enforce transitivity and aggregate supervision across graph neighborhoods. This method offers a more stable and informative optimization strategy when multiple outputs are available per prompt, recovering standard DPO as a special case.

GraphDPO pipeline for LLM alignment. For each prompt, the policy samples K K rollouts, which are grouped into equivalence classes according to preference signals. These classes induce a DAG structure whose edges encode dominance relations between groups, with an optional ground-truth node as a global anchor. Equivalence-class masking removes intra-group comparisons so that each response is contrasted only with strictly worse groups via a local Plackett–Luce loss. The resulting losses are aggregated over the graph to update the policy while enforcing transitive preference structure.
GraphDPO pipeline for LLM alignment. For each prompt, the policy samples K K rollouts, which are grouped into equivalence classes according to preference signals. These classes induce a DAG structure whose edges encode dominance relations between groups, with an optional ground-t…
cs.AIarxiv:2605.07830v1Lead article

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim

his paper introduces **CyBiasBench**, a comprehensive benchmark to quantify the attack-selection bias exhibited by LLM agents in cyber-attack scenarios. The core method involves systematically testing five agents across various targets and prompts to reveal that each agent disproportionately favors a narrow subset of attack families. The main contribution is characterizing this bias as an inherent agent trait, distinct from attack success, and identifying a "bias momentum effect" where agents resist external steering.

Attack-Selection Bias of LLM Agents. To illustrate attack-selection bias, we measure per-agent average selection rates across the bias observation setting (solid line) and compare them with the corresponding attack success rates (dashed line). The results reveal clear biases in agent behavior.
Attack-Selection Bias of LLM Agents. To illustrate attack-selection bias, we measure per-agent average selection rates across the bias observation setting (solid line) and compare them with the corresponding attack success rates (dashed line). The results reveal clear biases in a…
cs.AIarxiv:2605.08019v1Lead article

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

Botos Csaba, Sreejan Kumar, Austin Tudor David Andrews, Laurence Hunt, Chris Summerfield

his paper investigates whether frontier Large Reasoning Models (LRMs) can mimic human learning and planning in novel game environments. The core method involves jointly evaluating LRMs against RL agents using human gameplay data, concurrent fMRI recordings, and a Bayesian model. The key contribution is demonstrating that LRMs significantly outperform existing AI methods in matching human behavioral learning patterns and predicting brain activity during complex rule discovery and planning tasks.

VGDL game paradigm. (A) Games are defined by combining game rules with map layouts to produce interactive environments. (B) Example Trial Structure of VGDL-fMRI Dataset. Color denotes game names: ( Bait , Chase , Helper , Lemmings , Plaque Attack , Zelda ). All participants played the same level progression structure with randomized game order. The subsequent levels reveal new rules incrementally. The Interactive Catalogue A lets readers try each game in the browser and browse all participant and LRM agent gameplay replays. Project page: https://botcs.github.io/reason-to-play/
VGDL game paradigm. (A) Games are defined by combining game rules with map layouts to produce interactive environments. (B) Example Trial Structure of VGDL-fMRI Dataset. Color denotes game names: ( Bait , Chase , Helper , Lemmings , Plaque Attack , Zelda ). All participants playe…
cs.AIarxiv:2605.08060v1Lead article

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng

his paper introduces the "memory curse," demonstrating that expanding the context window for LLM agents systematically *erodes* cooperation in multi-agent social dilemmas. The core mechanism identified is not increased paranoia, but the degradation of forward-looking intent within the agent's reasoning traces. Restoring cooperation is achieved by sanitizing memory content or fine-tuning specifically on forward-looking reasoning, highlighting that the *content* of long memory, not just its length, is the critical factor.

Schematic of repeated social dilemma interactions between two LLM agents with shared memory.
Schematic of repeated social dilemma interactions between two LLM agents with shared memory.
cs.AIarxiv:2605.07990v1Lead article

Tool Calling is Linearly Readable and Steerable in Language Models

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama

his paper demonstrates that the tool selection within language models is **linearly readable and steerable** by analyzing internal activations across various models. By manipulating the mean-difference between tool activation vectors, the authors can reliably **switch the model's chosen tool** (up to 100% accuracy) and ensure the subsequent arguments match the new tool's schema. Furthermore, the activation gap between the top two predicted tools serves as a **reliable pre-execution indicator of incorrect tool calls**.

Overview of the three-stage circuit and steering demonstration. Adding a mean-difference vector redirects tool selection and automatically restructures arguments. Validated across 12 IT models in 3 families (Gemma 3, Qwen 3 / Qwen 2.5, Llama 3.1; 270M–27B).
Overview of the three-stage circuit and steering demonstration. Adding a mean-difference vector redirects tool selection and automatically restructures arguments. Validated across 12 IT models in 3 families (Gemma 3, Qwen 3 / Qwen 2.5, Llama 3.1; 270M–27B).
cs.LGarxiv:2605.07840v1Lead article

RelAgent: LLM Agents as Data Scientists for Relational Learning

Xingyue Huang, Louis Tichelman, Jinwoo Kim, Krzysztof Olejniczak, İsmail İlkan Ceylan

elAgent is an LLM-based autonomous agent designed for relational learning, operating in two phases. First, the agent uses tools to autonomously construct feature-generating SQL programs and select a predictive model. The core contribution is that the final predictor relies solely on the executed SQL queries and a classical model, ensuring fast, deterministic, and intrinsically interpretable predictions scalable via standard database systems.

RelAgent . During the search phase, an LLM agent iteratively proposes and refines a feature program consisting of SQL feature queries { q 1 , … , q n } \{q_{1},\( \dots \),q_{n}\} and a predictive model configuration \( \varphi \) to solve a given task. The agent uses three tools: (1) database exploration via read-only SQL exploration queries, (2) program validation by executing candidate programs on a validation set and receiving performance metrics, and (3) inspection of past trials in the Evaluation Workspace via evaluation queries. Once a final program is selected, the agent is no longer needed at inference time.
RelAgent . During the search phase, an LLM agent iteratively proposes and refines a feature program consisting of SQL feature queries { q 1 , … , q n } \{q_{1},\( \dots \),q_{n}\} and a predictive model configuration \( \varphi \) to solve a given task. The agent uses three tools…
cs.LGarxiv:2605.07977v1Lead article

Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback

Seohyun Lee, Wenzhi Fang, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton

his paper introduces SPEAR (Self-Play Enhancement via Advantage-Weighted Refinement), an efficient online learning algorithm for federated LLM fine-tuning. SPEAR enables a self-improvement loop by using incoming real-time feedback to generate naturally contrastive self-play pairs for training, without requiring offline setups or privileged ground-truth contexts. This method effectively leverages decentralized user feedback for continuous model refinement on resource-constrained edge devices.

The two phases of the SPEAR algorithm. Firstly, the model interacts with an incoming feedback source (e.g., a user) to correct incorrect generations. After the interaction phase, it categorizes the samples into wins and losses, which are then used to train a standard MLE and unlikelihood objective. This two-stage process repeats at each federated round t t for each client selected for aggregation.
The two phases of the SPEAR algorithm. Firstly, the model interacts with an incoming feedback source (e.g., a user) to correct incorrect generations. After the interaction phase, it categorizes the samples into wins and losses, which are then used to train a standard MLE and unli…
cs.CLarxiv:2605.07883v1Lead article

Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

Ying Zhang, Congyu Qiao, Xin Geng, Ning Xu

his paper introduces **LANCE** to combat rigid rejection in LLMs by moving beyond binary refusal. LANCE uses variational inference to enhance safety labels, predicting a continuous distribution across multiple rejection categories. This fine-grained distribution provides textual gradients that guide a refinement model to neutralize harmful prompt elements, enabling LLMs to generate safe responses that are more flexible and natural.

Rigid refusal examples.
Rigid refusal examples.
cs.CLarxiv:2605.07982v1Lead article

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis

LiGuard reframes LLM content moderation as a schema-conditioned classification task, moving away from slow, large autoregressive models. It uses a small (0.3B parameter) bidirectional encoder that encodes task definitions and label semantics directly into the input sequence as structured schemas. This allows for the simultaneous, low-latency evaluation of numerous safety dimensions (policy compliance, harm categories, jailbreaks) in a single forward pass.

GLiGuard multi-task moderation overview. Given a text (prompt or response) and a user-specified task schema, GLiGuard produces predictions for all selected tasks in a single forward pass.
GLiGuard multi-task moderation overview. Given a text (prompt or response) and a user-specified task schema, GLiGuard produces predictions for all selected tasks in a single forward pass.
cs.CLarxiv:2605.07933v1Lead article

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space

Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev

his paper introduces the Latent Diffusion Language Model (LDLM), which jointly trains a latent encoder, diffusion model, and decoder for non-autoregressive text generation. The core method involves constructing a suitable latent space by reshaping pre-trained language model representations via a trainable encoder. The key contribution is a novel joint training recipe, incorporating an MSE decoder loss and specific warmup/sampling strategies, that significantly improves generation quality over naive joint training.

cs.CLarxiv:2605.07925v1Lead article

How Value Induction Reshapes LLM Behaviour

Arnav Arora, Natalie Schluter, Katherine Metcalf, Maartje ter Hoeve

his paper investigates the unintended consequences of value induction (fine-tuning LLMs with value-laden language) on model behavior. The authors fine-tune models using curated value subsets and measure the impact on related values, safety, anthropomorphism, and QA performance. They find that inducing specific values can unexpectedly alter the expression of other related or contrasting values, highlighting the complex trade-offs in value alignment.

Overview of our value-training effects framework. We create value-specific models using existing preference datasets and our value induction approach. We then evaluate the value models for several behaviours using corresponding datasets.
Overview of our value-training effects framework. We create value-specific models using existing preference datasets and our value induction approach. We then evaluate the value models for several behaviours using corresponding datasets.
cs.CLarxiv:2605.08083v1Lead article

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang

his paper introduces **AutoTTS**, an environment-driven framework that automates the discovery of optimal Test-Time Scaling (TTS) strategies for Large Language Models (LLMs). Instead of manual heuristic design, AutoTTS creates a tractable discovery environment where a controller learns when to allocate computation (branch, prune, etc.) based on pre-collected trajectories and cheap probe signals. This method significantly expands the explored computation-allocation space, leading to improved LLM performance through automated, data-driven resource management during inference.

Overview of our Auto-TTS framework. Unlike the traditional workflow of manually designing TTS strategies, Auto-TTS shifts the human role from directly hand-crafting branching, pruning, and stopping heuristics to constructing environments by defining states, actions, feedback, and objectives. Given the constructed environment, an explorer LLM iteratively proposes candidate controllers, evaluates them in the offline replay environment, receives feedback from scaling curves and execution traces, and uses the accumulated history to refine future proposals. The right panel shows an example evaluation on Qwen-1.7B and AIME25, where the discovered controller improves the accuracy–cost Pareto frontier over hand-crafted baselines with an affordable one-time search cost.
Overview of our Auto-TTS framework. Unlike the traditional workflow of manually designing TTS strategies, Auto-TTS shifts the human role from directly hand-crafting branching, pruning, and stopping heuristics to constructing environments by defining states, actions, feedback, and…
cs.AIarxiv:2605.10787v1Lead article

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

he paper introduces **ComplexMCP**, a novel benchmark designed to rigorously evaluate LLM agents in complex, real-world software automation scenarios involving interdependent tools and environmental noise. It utilizes a seed-driven architecture across 300+ tools derived from 7 stateful sandboxes to simulate dynamic and failure-prone environments. The contribution lies in exposing a significant performance gap, showing even top LLMs struggle to surpass 60% success compared to 90% for humans in these interdependent tasks.

The Overview of ComplexMCP: Our framework integrates stateful sandboxes and stateless MCP servers via a seed-driven mechanism.
The Overview of ComplexMCP: Our framework integrates stateful sandboxes and stateless MCP servers via a seed-driven mechanism.
cs.AIarxiv:2605.10906v1Lead article

DataMaster: Towards Autonomous Data Engineering for Machine Learning

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei

ataMaster introduces an autonomous data engineering framework to improve machine learning models by optimizing the data pipeline while keeping the learning algorithm fixed. It addresses the complex search space using a tree-structured search mechanism, shared candidate data, and a refinement process that incorporates feedback from downstream model training. The core contribution is enabling agents to autonomously discover, select, clean, and transform data to achieve stronger model performance.

Overview of DataMaster . DataMaster organizes autonomous data engineering as a DataTree , where red nodes broaden the search by discovering external datasets and writing them into a shared Data Pool , while black nodes exploit available candidates to construct executable data states and obtain downstream training feedback. Global Memory stores reusable artifacts, node outcomes, and prior findings, enabling later nodes to reuse discovered data, avoid repeated failures, and coordinate search across branches under a limited budget.
Overview of DataMaster . DataMaster organizes autonomous data engineering as a DataTree , where red nodes broaden the search by discovering external datasets and writing them into a shared Data Pool , while black nodes exploit available candidates to construct executable data sta…
cs.AIarxiv:2605.10763v1Lead article

MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study

Tim Van hamme, Thomas Vissers, Javier Carnerero-Cano, Mario Fritz, Emil C. Lupu

ATRA is a pragmatic threat modeling framework designed to systematically assess the risks in agentic AI systems by adapting established risk assessment methodologies. It begins with an asset-based impact assessment and uses attack trees to quantify the likelihood of known LLM threats causing harm within a specific deployment. The paper demonstrates MATRA's utility by showing how architectural controls can reduce the blast radius of successful attacks on an agent using the OpenClaw case study.

MATRA framework overview. System properties and threat sources are collected from the client. Assets identified from system documentation feed into a stakeholder-driven business impact assessment, which produces impact scenarios. A data flow diagram (DFD), combined with known attack techniques from established catalogs, informs the construction of attack trees that decompose each impact scenario into objectives, techniques, and architecture-specific vectors.
MATRA framework overview. System properties and threat sources are collected from the client. Assets identified from system documentation feed into a stakeholder-driven business impact assessment, which produces impact scenarios. A data flow diagram (DFD), combined with known att…
cs.AIarxiv:2605.10813v1Lead article

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

Jinhang Xu, Qiyuan Zhu, Yujun Wu, Zirui Wang, Dongxu Zhang

anoResearch introduces a multi-agent framework designed to personalize research automation by addressing the need for accumulated procedural knowledge, retained user experience, and internalized implicit preferences. It achieves this through a "tri-level co-evolution" mechanism involving a skill bank for reusable procedures, a memory module for session retention, and a policy module that adapts to user-specific needs. The core contribution is enabling genuinely usable, personalized research automation that evolves with the user's unique context and history.

Comparison between (a) a uniform research automation pipeline that applies identical processing to all users and yields homogeneous outputs, and (b) NanoResearch, which recognizes distinct researcher personas and provides personalized skills and feedback upon failure, enabling each persona to evolve along its own trajectory.
Comparison between (a) a uniform research automation pipeline that applies identical processing to all users and yields homogeneous outputs, and (b) NanoResearch, which recognizes distinct researcher personas and provides personalized skills and feedback upon failure, enabling ea…
cs.AIarxiv:2605.10805v1Lead article

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Wenbo Zhang, Lijinghua Zhang, Liner Xiang, Hengrui Cai

his paper investigates the trade-off between reasoning capability and cost when using LLMs as judges, finding that explicit reasoning boosts accuracy for complex tasks but increases cost. The core contribution is the **Robust Adaptive Cost-Efficient Routing (RACER)** framework, which formulates dynamic judge selection as a constrained distributionally robust optimization problem to selectively use reasoning judges under a fixed budget, explicitly managing distribution shift.

cs.AIarxiv:2605.10870v1Lead article

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Mingxi Zou, Zhihan Guo, Langzhang Liang, Zhuo Wang, Qifan Wang

his paper reframes agent memory as a **decision-centric rate-distortion problem**, arguing that memory should preserve distinctions crucial for future actions rather than descriptive accuracy. The core contribution is a framework that measures memory quality by the **loss in achievable decision quality** due to compression, establishing an optimal tradeoff frontier. This leads to the **DeMem** online learning algorithm, which refines memory partitions only when necessary to avoid decision conflicts.

DeMem routes histories into bounded slots and splits only on certified conflict.
DeMem routes histories into bounded slots and splits only on certified conflict.
cs.AIarxiv:2605.10754v1Lead article

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

Xinrun Wang, Chang Yang, He Zhao, Zhuoyi Lin, Shuyue Hu

his paper argues that the current engineering-driven development of LLM-based foundation agents lacks a theoretical foundation. The core method is to introduce **Agent Cybernetics**, mapping the six canonical laws of classical cybernetics onto the design and analysis of these complex, long-horizon agents. The contribution is proposing cybernetics as the missing scientific scaffold to address fundamental questions regarding agent stability, environmental robustness, and safe self-improvement.

From Classical Cybernetics to Agent cybernetics
From Classical Cybernetics to Agent cybernetics
cs.AIarxiv:2605.10828v1Lead article

The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning

Muhan Gao, Zih-Ching Chen, Kuan-Hao Huang

his paper investigates the impact of misleading information (hard distractors) on LLM performance in long-context reasoning. The core finding is the "First Drop of Ink" effect: performance drops sharply with only a small initial proportion of distractors, after which further increases yield only marginal decline. This nonlinearity is attributed to hard distractors capturing disproportionate attention, even when scarce.

The First Drop of Ink effect. Left: Conventional linear assumption (top, red dashed line) versus empirically observed nonlinear degradation (bottom, blue curve): a small fraction of hard distractors is sufficient to severely degrade accuracy. Middle: Hard distractors receive similar attention logits as gold documents ( 8 ≈ 9 ≫ 1 8\( \approx \) 9\( \gg \) 1 ), dominating the softmax competition even at low proportions. Right: With 100 distractor documents, attention on gold drops 76% by adding only 10% hard distractors. This convex relationship explains The First Drop of Ink .
The First Drop of Ink effect. Left: Conventional linear assumption (top, red dashed line) versus empirically observed nonlinear degradation (bottom, blue curve): a small fraction of hard distractors is sufficient to severely degrade accuracy. Middle: Hard distractors receive simi…
cs.AIarxiv:2605.10843v1Lead article

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen, Chi-Nguyen Tran, Phu-Hoa Pham

his paper introduces DISCA (Disagreement-Informed Steering for Cultural Alignment), a training-free, black-box method to align Large Language Models (LLMs) with diverse cultural values. DISCA leverages sociodemographic disagreement within a country, modeled via World Values Survey-grounded personas, to generate a bounded logit correction during inference. This approach effectively reduces cultural misalignment across multiple countries and LLM backbones without requiring fine-tuning or internal model access.

DISCA overview. Stage 1 builds WVS-grounded persona prompts for a trolley scenario in country c c ; Stage 2 runs a frozen large language model (LLM) on the base prompt and each persona, aggregates persona-level signals in logit space, and applies Prospect-Theory importance sampling (PT–IS) together with a dual-pass reliability gate to obtain the final sparing probability. Pseudocode and the six MultiTP attribute–temperature pairs provided in App. A1 .
DISCA overview. Stage 1 builds WVS-grounded persona prompts for a trolley scenario in country c c ; Stage 2 runs a frozen large language model (LLM) on the base prompt and each persona, aggregates persona-level signals in logit space, and applies Prospect-Theory importance sampli…
cs.LGarxiv:2605.10793v1Lead article

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

Chayne Thrash, Ali Abbasi, Soheil Kolouri

onQuR proposes a lightweight, post-training method to improve low-bit activation quantization in LLMs by learning optimal orthogonal rotations. These rotations align normalized activations with the corners of an inscribed hypercube, effectively distributing activation energy to minimize quantization error. This is achieved efficiently via a closed-form solution to the orthogonal Procrustes problem, avoiding costly retraining or reliance on activation corpora.

Overview of the proposed rotation-based calibration method. (a) Our method learns an orthogonal rotation that aligns normalized activation vectors with vertices of an inscribed hypercube, encouraging activation magnitude to be distributed more evenly. (b) During calibration, activations are processed online in mini-batches with closed-form orthogonal Procrustes updates. (c) At inference, learned rotations R 1 , { R 2 , ℓ } l = 1 L R_{1},\{R_{2,\( \ell \)}\}_{l=1}^{L} are folded into linear layer weights and Hadamard rotations, R 3 R_{3} and R 4 R_{4} , are applied online.
Overview of the proposed rotation-based calibration method. (a) Our method learns an orthogonal rotation that aligns normalized activation vectors with vertices of an inscribed hypercube, encouraging activation magnitude to be distributed more evenly. (b) During calibration, acti…
cs.LGarxiv:2605.10923v1Lead article

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng

his paper introduces SLIM, a framework for dynamic Skill Lifecycle Management in agentic reinforcement learning. SLIM treats the set of active external skills as a dynamic optimization variable, jointly updated with policy learning. Its core contribution is estimating each skill's marginal external contribution via leave-one-skill-out validation to intelligently retain, retire, or introduce skills, addressing the limitations of static skill management.

The reinforcement learning dynamics on ALFWorld. We plot validation success rate against the number of skills in active set during training. SkillRL accumulates external skills, whereas Skill0 progressively eliminates them. SLIM instead performs retain–retire–expand lifecycle management, converging to a non-empty skill set with higher validation success. This suggests that the effective endpoint is a learned external skill boundary rather than full accumulation or forced elimination.
The reinforcement learning dynamics on ALFWorld. We plot validation success rate against the number of skills in active set during training. SkillRL accumulates external skills, whereas Skill0 progressively eliminates them. SLIM instead performs retain–retire–expand lifecycle man…
cs.LGarxiv:2605.10770v1Lead article

DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

Eleonora Gualdoni, Sonia Laguna, Louis Bethune, Joao Monteiro, Pierre Ablin

ynaMiCS frames multi-domain LLM fine-tuning as a constrained optimization problem to balance target domain improvement with performance preservation on constrained domains. It achieves this by dynamically estimating the local cross-domain effects (a slope matrix) via short probing runs at each update. These estimates guide an optimizer to compute mixture weights that maximize target performance while strictly enforcing loss constraints on the preserved capabilities.

DynaMiCS overview. Problem setup. Fine-tuning datasets 𝒟 \( \mathcal{D} \) provide the data available for mixture selection, including target datasets and optional auxiliary datasets for transfer or regularization. Evaluation domains ℰ \( \mathcal{E} \) are partitioned into target domains, whose losses are minimized, and constrained domains, whose losses must remain below reference values. DynaMiCS optimization. At each update, DynaMiCS estimates a slope matrix 𝐒 ​ ( t ) \( \mathbf{S} \)(t) (1) , where S i ​ j ​ ( t ) S_{ij}(t) measures the local effect of training on dataset D j D_{j} on evaluation loss L i L_{i} . Green/red entries denote loss decreases/increases. Given 𝐒 ​ ( t ) \( \mathbf{S} \)(t) , DynaMiCS solves a constrained optimization problem to obtain weights 𝐰 ∗ \( \mathbf{w}^{*} \) (2) , trains with them for H t H_{t} steps (3) , and then repeats the procedure. The simplex illustrates the proxy objective landscape, with white lines marking constraint boundaries; values are illustrative.
DynaMiCS overview. Problem setup. Fine-tuning datasets 𝒟 \( \mathcal{D} \) provide the data available for mixture selection, including target datasets and optional auxiliary datasets for transfer or regularization. Evaluation domains ℰ \( \mathcal{E} \) are partitioned into targ…
cs.LGarxiv:2605.10784v1Lead article

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

Rohan Surana, Xintong Li, Sheldon Yu, Yiran Jenny Shen, Chuhan Wang

ASS-DPO introduces an active sample selection method for Multi-negative DPO that addresses the cost of using large negative pools. It uses a PL-specific Fisher-information objective to select compact, informative negative subsets by favoring samples whose gradients offer complementary information for policy updates. This reduces redundancy from similar candidates while retaining the full training signal, leading to more efficient optimization.

Overview of MASS-DPO’s D-optimal selection. Each candidate is scored using the feature difference ϕ i = ϕ ​ ( x , y i ) − ϕ ​ ( x , y ∗ ) \( \phi_{i} \)=\( \phi \)(x,y_{i})-\( \phi \)(x,y^{*}) and policy offset b i = log ⁡ π ref ​ ( y ∗ ∣ x ) − log ⁡ π ref ​ ( y i ∣ x ) b_{i}=\( \log \)\( \pi_{\rm ref} \)(y^{*}\( \mid \) x)-\( \log \)\( \pi_{\rm ref} \)(y_{i}\( \mid \) x) , with softmax weights defined in Equation ˜ 8 . The green loop denotes the subset-construction step in Algorithm ˜ 1 : starting from H 0 H_{0} , we incrementally pick the negative that maximally increases log ​ det H \( \log \)\( \det \) H , then update H H accordingly until n n samples are selected.
Overview of MASS-DPO’s D-optimal selection. Each candidate is scored using the feature difference ϕ i = ϕ ​ ( x , y i ) − ϕ ​ ( x , y ∗ ) \( \phi_{i} \)=\( \phi \)(x,y_{i})-\( \phi \)(x,y^{*}) and policy offset b i = log ⁡ π ref ​ ( y ∗ ∣ x ) − log ⁡ π ref ​ ( y i ∣ x ) b_{i}=\( …
cs.CLarxiv:2605.10721v1Lead article

Conformity Generates Collective Misalignment in AI Agents Societies

Giordano De Marzo, Alessandro Bellina, Claudio Castellano, Viola Priesemann, David Garcia

his paper investigates how interacting AI agents can collectively become misaligned, even if individually aligned. The core method involves simulating opinion dynamics where agents conform to the majority while maintaining an intrinsic bias, using statistical physics to derive a theory predicting when populations become trapped in misaligned states. The key contribution is demonstrating that conformity dynamics can lead to stable population-level misalignment and identifying tipping points where adversarial agents can cause irreversible shifts in group alignment.

Collective misalignment through conformity dynamics. AI agent populations exhibit path-dependent collective behavior where final alignment depends critically on initial conditions. Panels (a)–(c) show temporal evolution of collective opinion m ​ ( t ) m(t) for N = 50 N=50 agents over 25 independent runs, with trajectories colored by initial collective opinion m 0 m_{0} (color bar). Panels (d)–(f) show distributions of final collective opinion m f m_{f} (vertical axis) for each initial condition m 0 m_{0} (horizontal axis), revealing bistability. (a), (d): Gemma 3 27B with opinion pair “gender self-identification” vs “biological sex classification”. Starting from balance ( m 0 = 0 m_{0}=0 ), agents consistently coordinate toward gender self-identification (positive m m ). However, sufficient initial bias toward biological sex classification ( m 0 ≲ − 0.6 m_{0}\( \lesssim \)-0.6 ) produces bistability, with some runs converging to the opposite opinion despite the model’s intrinsic preference. At strong negative initial conditions ( m 0 ≈ − 0.8 m_{0}\( \approx \)-0.8 ), virtually all runs yield stable misalignment. (b), (e): Gemma 3 27B with “renewable energy” vs “fossil fuels” shows no bistability; trajectories consistently converge to renewable energy regardless of initial conditions. (c), (f): Llama 3.1 8B with the same gender/biological sex pair also shows no bistability.
Collective misalignment through conformity dynamics. AI agent populations exhibit path-dependent collective behavior where final alignment depends critically on initial conditions. Panels (a)–(c) show temporal evolution of collective opinion m ​ ( t ) m(t) for N = 50 N=50 agents …
cs.CLarxiv:2605.10863v1Lead article

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Yulan Yuan

GPO introduces a novel framework for aligning Large Language Models (LLMs) by moving beyond traditional pairwise preferences to **Directional-Groupwise Optimization**. It achieves this by structuring forward and reverse question-answer instances into groups and optimizing a margin-based objective that enforces **directional consistency** across diverse reasoning paths. This group-wise approach captures richer relative information, leading to consistent performance gains over existing methods.

An overview of the DGPO training framework. The process begins with forward problems ( x f x_{f} ), each of which can be paired with a reverse question ( x r x_{r} ) formulated in the opposite reasoning direction. A teacher model then produces multiple candidate solutions for each problem type ( { y f ​ i } i = 1 3 \{y_{fi}\}_{i=1}^{3} for x f x_{f} and { y r ​ i } i = 1 3 \{y_{ri}\}_{i=1}^{3} for x r x_{r} ). The solutions are subsequently structured into direction-consistent ( 𝒢 + \( \mathcal{G}^{+} \) ) and direction-divergent ( 𝒢 − \( \mathcal{G}^{-} \) ) groups, wherein consistency is determined by matching a prompt’s directionality with its corresponding solutions (e.g., x f x_{f} with { y f ​ i } i = 1 3 \{y_{fi}\}_{i=1}^{3} ). DGPO is trained on this structured supervision, incorporating directional modeling and uncertainty-based regulation to enhance alignment stability.
An overview of the DGPO training framework. The process begins with forward problems ( x f x_{f} ), each of which can be paired with a reverse question ( x r x_{r} ) formulated in the opposite reasoning direction. A teacher model then produces multiple candidate solutions for eac…
cs.CLarxiv:2605.10779v1Lead article

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang, Yiran Zhao

ITMUS is a novel benchmark designed to rigorously test the behavioral safety of LLM agents operating in real OS environments against dangerous "behavior jailbreaks." Its core contribution lies in a semantic-physical dual verification mechanism and OS-level state rollback, ensuring accurate testing by preventing contamination and assessing both conversational intent and actual harmful OS execution. The benchmark comprises 819 high-risk test cases across three adversarial paradigms, evaluated using a fully automated multi-agent framework.

Behavior Jailbreak in practice: a malicious prompt causes an OpenClaw-based agent to execute dangerous OS-level operations, producing real physical damage. Attack Success Rates remain alarmingly high even with strong LLMs as the agent brain. Data sourced from LITMUS.
Behavior Jailbreak in practice: a malicious prompt causes an OpenClaw-based agent to execute dangerous OS-level operations, producing real physical damage. Attack Success Rates remain alarmingly high even with strong LLMs as the agent brain. Data sourced from LITMUS.
cs.CLarxiv:2605.10912v1Lead article

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu

ildClawBench is introduced as a novel benchmark designed to evaluate real-world, long-horizon agent performance by running tasks within actual command-line interface (CLI) harnesses inside reproducible Docker containers. Its core contribution is moving beyond synthetic sandboxes to test agents on 60 complex, multimodal tasks requiring significant wall-clock time and numerous tool calls, using a hybrid grading system. This provides a more realistic assessment of agent capabilities in deployment environments.

Comparison with previous agent benchmarks and WildClawBench. (a) Prior benchmarks evaluate short-horizon, single-step tasks with toy APIs in controlled sandboxes, whereas (b) WildClawBench evaluates long-horizon multimodal workflows with real tools in open-world environments. (c) The benchmark spans six categories and is compatible with multiple agent harnesses. (d) A summary of key differences across environment, task horizon, tool use, and evaluation.
Comparison with previous agent benchmarks and WildClawBench. (a) Prior benchmarks evaluate short-horizon, single-step tasks with toy APIs in controlled sandboxes, whereas (b) WildClawBench evaluates long-horizon multimodal workflows with real tools in open-world environments. (c)…
cs.AIarxiv:2605.13652v1Lead article

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

his paper moves beyond simple perplexity comparisons to geometrically and spectrally analyze the solutions produced by five distinct low-rank pre-training methods against full-rank training. The core contribution is a rigorous characterization of how rank constraints alter the learned internal representations and loss landscape positions, addressing whether low-rank models generalize comparably to their full-rank counterparts.

cs.AIarxiv:2605.13825v1Lead article

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodríguez Salgado

his paper introduces **HistoryAnchor-100**, a benchmark to test if prior harmful actions steer Large Language Models (LLMs) toward continued unsafe behavior. The core finding is that frontier LLMs, even highly aligned ones, exhibit a striking vulnerability: a simple instruction to "stay consistent with the prior history" causes them to overwhelmingly select unsafe continuation actions (91-98% rate) following a harmful preceding step. This demonstrates that historical context, when explicitly referenced, can override alignment safeguards, leading to potentially dangerous decision-making.

cs.AIarxiv:2605.13625v1Lead article

How to Interpret Agent Behavior

Jie Gao, Kaiser Sun, Jen-tse Huang, Katherine Van Koevering, Sijie Ji

his paper introduces **ACT*ONOMY**, a novel, three-level hierarchical taxonomy (10 actions, 46 subactions, 120 leaf categories) designed to systematically describe and analyze the runtime behavior of autonomous agents from their natural-language traces. The core contribution is providing a structured framework, coupled with an open repository and automated analysis pipeline, to make complex agent reasoning interpretable for debugging and oversight at scale.

Why do we need Act· onomy ? Act· onomy can be used to label agent trajectories with human-readable action tags; we use a 13-turn SWE-bench trajectory as a running example. Top: A phase overview of the trajectory on pylint-dev/pylint-5859 , with color-coded regions marking distinct turns. Middle: Three pivotal turns annotated with Act· onomy sub-action tags: Turn 4 ( confirm ) verifies the bug and pivots to code localization; Turn 6 ( stumble ) detects a failed fix and recovers with a new search strategy; Turn 9 ( pinpoint ) identifies \( \b \) in the regex as the root cause. Bottom: A sentence-level zoom into Turn 9, grounding each tag in a specific quoted span from the agent’s Observation → \( \to \) Thought → \( \to \) Action loop.
Why do we need Act· onomy ? Act· onomy can be used to label agent trajectories with human-readable action tags; we use a 13-turn SWE-bench trajectory as a running example. Top: A phase overview of the trajectory on pylint-dev/pylint-5859 , with color-coded regions marking distinc…
cs.AIarxiv:2605.13579v1Lead article

Position: Assistive Agents Need Accessibility Alignment

Jie Hu, Changyuan Yan, Yu Zheng, Ziqian Wang, Jiaming Zhang

his paper argues that current assistive AI systems fail BVI users because they are designed assuming sighted interaction and low-cost verification. The core contribution is introducing the concept of **accessibility alignment** as a first-class design objective, rather than a usability afterthought. The authors propose a lifecycle-oriented design pipeline to systematically build agents that meet the unique verification, risk, and interaction constraints of BVI users.

Task-Centric Taxonomy of Blind Assistance and Distribution of Assistive Task Instances. Distribution of 778 assistive task instances across four domains and their subcategories, highlighting dominant needs in Reading and Text Access (35%) and Mobility and Safety (34%).
Task-Centric Taxonomy of Blind Assistance and Distribution of Assistive Task Instances. Distribution of 778 assistive task instances across four domains and their subcategories, highlighting dominant needs in Reading and Text Access (35%) and Mobility and Safety (34%).
cs.AIarxiv:2605.13737v1Lead article

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun

his paper introduces IMAVB, a benchmark to test if omnimodal LLMs can detect contradictions between a textual premise and their own sensory input (vision/audio). The core finding is a "Representation-Action Gap": models reliably encode these premise-perception mismatches in their internal states but almost always fail to reject the false claim in their final outputs. This suggests a disconnect between internal sensory grounding and the model's generative action.

Overview of the Representation–Action Gap on IMAVB.
Overview of the Representation–Action Gap on IMAVB.
cs.AIarxiv:2605.13537v1Lead article

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Ye Wang, Jing Liu, Toshiaki Koike-Akino

his paper introduces **SLOP (Sharpened Logarithmic Opinion Pool)**, an extension of inference-time alignment that generalizes techniques to combine ensembles of generative reward models using temperature-adjusted reference models. The core contribution is a novel algorithm for calibrating the SLOP weight parameters to effectively **mitigate reward hacking** while maintaining strong alignment performance.

cs.AIarxiv:2605.13772v1Lead article

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Tyler Alvarez, Ali Baheri

his paper introduces a novel method for detecting step-level hallucinations in LLM reasoning by analyzing the geometry of the hidden-state trajectory during a single forward pass. The core idea is that correct reasoning follows a stable manifold, and the first error manifests as a localized excursion in transport cost away from this manifold. The authors develop a teacher model using contrastive PCA to score each step based on geometric transition features, which is then distilled into a deployable BiLSTM student for efficient, single-pass error localization.

The GeoReason teacher – student architecture. The teacher (top) uses step-level labels and reasoning-trace hidden states to construct a contrastive PCA (cPCA) projection, extracts a geometric feature set in this lens, and maps the features through an MLP to step-level hallucination probabilities. The student (bottom) is a BiLSTM that contextualizes raw hidden states and feeds a step classifier head, trained from three signals: supervised step labels, probability distillation from the teacher , and feature distillation through a training-only auxiliary head. At inference, the student requires only hidden states.
The GeoReason teacher – student architecture. The teacher (top) uses step-level labels and reasoning-trace hidden states to construct a contrastive PCA (cPCA) projection, extracts a geometric feature set in this lens, and maps the features through an MLP to step-level hallucinati…
cs.CLarxiv:2605.13839v1Lead article

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Wenrui Bao, Huan Wang, Jian Wang, Zhangyang Wang, Kai Wang

his paper introduces TFlow, a novel weight-space communication framework for multi-agent LLMs that replaces costly natural language message passing with direct weight updates. The core method involves frozen sender agents generating internal activations, which a learned parameter generator maps into low-rank LoRA perturbations targeting the receiver's modules. This enables instance-specific adaptation during generation, significantly reducing token costs and overhead associated with traditional context-based communication.

(i) Comparison between Text-based MAS and the proposed Weight-Collaboration MAS. In Text MAS, auxiliary agents transmit natural language messages to the Executor, incurring costly prefilling overhead and inflated KV cache. In contrast, our proposed paradigm compresses inter-agent communication into lightweight LoRA weight perturbations Δ ​ W \( \Delta \) W , which are directly merged into the parameters, thereby eliminating the extra prefilling and significantly reducing the KV cache footprint. (ii) Performance overview on GSM8K . TFlow achieves accuracy competitive with TextMAS while reducing token consumption by 76.7 % \( \mathbf{76.7\%} \) , substantially surpassing the single-agent baseline in both accuracy and efficiency.
(i) Comparison between Text-based MAS and the proposed Weight-Collaboration MAS. In Text MAS, auxiliary agents transmit natural language messages to the Executor, incurring costly prefilling overhead and inflated KV cache. In contrast, our proposed paradigm compresses inter-agent…
cs.AIarxiv:2605.16245v1Lead article

AI-Mediated Communication Can Steer Collective Opinion

Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

his paper investigates how AI, specifically LLMs editing user posts, influences collective opinion formation during human-to-human online communication. Empirically, the authors demonstrate that popular LLMs introduce directional biases when revising human text on contested topics. They then model this phenomenon mathematically, showing how an intervening AI system can steer the overall opinion dynamics across a social network.

cs.AIarxiv:2605.16217v1Lead article

Argus: Evidence Assembly for Scalable Deep Research Agents

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu

rgus introduces a cooperative agent framework, pairing a Searcher and a Navigator, to efficiently tackle complex information seeking tasks. Instead of parallelizing redundant searches, Argus treats research as assembling complementary evidence pieces into a shared graph. This method aims to complete the required evidence set more effectively than brute-force parallel exploration, leading to scalable and comprehensive deep research answers.

Argus operating modes. (a) Standalone Searcher, single path. (b) Navigator identifies unfilled pieces and dispatches targeted queries. (c) Parallel Searchers each target a distinct piece.
Argus operating modes. (a) Standalone Searcher, single path. (b) Navigator identifies unfilled pieces and dispatches targeted queries. (c) Parallel Searchers each target a distinct piece.
cs.AIarxiv:2605.16207v1Lead article

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian

his paper evaluates the diagnostic precision of LLM tutoring agents in propositional logic using a knowledge-graph-derived benchmark of over 10,000 solution-feedback pairs. The core finding is that while LLMs perform well on optimal solutions, they systematically fail to distinguish between valid-suboptimal and incorrect reasoning, precisely the area crucial for effective adaptive tutoring. This suggests architectural limitations in LLMs, as accurate diagnosis did not reliably translate into pedagogically actionable feedback.

Optimal and valid-alternative solutions (blue nodes represent abbreviated inference rule names, explained in Table 4 )
Optimal and valid-alternative solutions (blue nodes represent abbreviated inference rule names, explained in Table 4 )
cs.AIarxiv:2605.16205v1Lead article

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor

his paper systematically investigates the impact of context representation, reasoning mechanisms, and task hierarchy on the performance and cost of compound LLM agents operating in adversarial, partially observable environments (modeled as a POMDP). The core contribution is a controlled, cost-aware study demonstrating which design choices effectively mitigate failure in these challenging settings, offering practitioners empirical guidance beyond simple performance metrics.

Figure 1. End-to-end system architecture. The deterministic layer (left) compiles structured context from CybORG observations and assembles the agent prompt. The Planner (right) executes a ReAct loop, optionally delegating to Analyst and ActionChooser sub-agents, before emitting a validated action back to the environment.
Figure 1. End-to-end system architecture. The deterministic layer (left) compiles structured context from CybORG observations and assembles the agent prompt. The Planner (right) executes a ReAct loop, optionally delegating to Analyst and ActionChooser sub-agents, before emitting …
cs.AIarxiv:2605.16113v1Lead article

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin

ebiasRAG introduces a novel, tuning-free framework leveraging Retrieval-Augmented Generation (RAG) to dynamically mitigate social biases in Large Language Models (LLMs) during inference. By retrieving contextually relevant, debiasing information, the method achieves fairer generation without requiring additional training or complex prompt engineering. This approach effectively improves fairness while preserving the LLM's original generative capabilities.

Figure 1 . System workflow of DebiasRAG. The workflow consists of three main components. The first stage (Upper Block) involves document preparation and preprocessing, including management of the Avoid Document Repo, along with user-provided input documents (Optional). The second stage (Middle Block) performs reverse-generation of debiasing performance based on the user’s input to establish a baseline for effective real-time operation. For the third stage (Lower Block), real-time debias-guided reranking optimization, integrates embedding retrieval, gradient-based reranking, and generation, working dynamically to debias the reasoning and output process of large language models.
Figure 1 . System workflow of DebiasRAG. The workflow consists of three main components. The first stage (Upper Block) involves document preparation and preprocessing, including management of the Avoid Document Repo, along with user-provided input documents (Optional). The second…
cs.AIarxiv:2605.16233v1Lead article

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor

ORGE is a population-based protocol that enables LLM agents to improve decision-making by evolving natural-language memory (Rules, Examples, or Mixed) without any weight updates. It uses a dedicated reflection agent to convert failed trajectories into reusable knowledge artifacts, which are then broadcast to the population, allowing agents to self-evolve their performance over stages. This method successfully enhances agent capabilities on a complex task using multiple LLM families.

Figure 1. System Overview. (Left) Hierarchical ReAct agent with dynamic memory injection. (Right) Reflexion learning loop: upon a reward below threshold, a dedicated Reflector or Exemplifier agent analyzes the full trajectory and synthesizes knowledge artifacts that are injected back into the agent’s memory.
Figure 1. System Overview. (Left) Hierarchical ReAct agent with dynamic memory injection. (Right) Reflexion learning loop: upon a reward below threshold, a dedicated Reflector or Exemplifier agent analyzes the full trajectory and synthesizes knowledge artifacts that are injected …
cs.AIarxiv:2605.16198v1Lead article

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith

his paper introduces a novel framework that integrates formal methods, specifically Linear Temporal Logic (LTL), with state-of-the-art machine learning to audit and monitor advanced AI systems like LLMs. The core contribution is providing techniques for both offline auditing and online runtime monitoring of complex, temporally extended behavioral constraints (safety, regulations) for black-box models. Furthermore, it proposes intervening monitors that can preemptively mitigate predicted violations during operation.

Figure 1 . Overview of Temporal Rule Assessment and Compliance (TRAC) : This figure depicts the base TRAC algorithm (inner green box) and TRAC with predictive and intervening capabilities ( TRAC P+I \( \text{TRAC} \)_{\( \text{P+I} \)} ) (outer blue box). An AI agent interacts with an environment over time, producing a sequence of inputs (from the environment) and outputs (from the agent). The Labeler extracts atomic propositions from the sequence of inputs and outputs so far, which then are used by the Monitor to progressively evaluate the monitoring objective (i.e., a behavioral pattern represented as an LTL formula). The Predictor estimates the risk of future violations, enabling the Intervenor to modify the agent’s inputs or substitute its outputs before an undesirable outcome occurs.
Figure 1 . Overview of Temporal Rule Assessment and Compliance (TRAC) : This figure depicts the base TRAC algorithm (inner green box) and TRAC with predictive and intervening capabilities ( TRAC P+I \( \text{TRAC} \)_{\( \text{P+I} \)} ) (outer blue box). An AI agent interacts wi…
cs.AIarxiv:2605.16143v1Lead article

Look Before You Leap: Autonomous Exploration for LLM Agents

Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai

his paper addresses the tendency of LLM agents to prematurely exploit knowledge in new environments by introducing **autonomous exploration** as a key capability. The authors formalize this with the **Exploration Checkpoint Coverage (ECC)** metric to quantify broad state discovery. They propose an **Explore-then-Act paradigm** trained by interleaving task-execution and dedicated exploration rollouts, each optimized by verifiable rewards, to improve adaptability.

Task-oriented training fails to produce autonomous exploration capabilities, resulting in agents that prematurely exploit familiar patterns and acquire limited environment knowledge. We explicitly optimize for exploration through ECC rewards, enabling agents to systematically discover environment structure, objects, and affordances. The resulting Explore-then-Act paradigm decouples information gathering from task execution: agents first explore to acquire grounded knowledge, then leverage it to solve downstream tasks.
Task-oriented training fails to produce autonomous exploration capabilities, resulting in agents that prematurely exploit familiar patterns and acquire limited environment knowledge. We explicitly optimize for exploration through ECC rewards, enabling agents to systematically dis…
cs.AIarxiv:2605.16194v1Lead article

paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

Arquimedes Canedo

his paper introduces **`paper.json`**, a standardized companion JSON file for academic papers designed to improve machine readability for LLM agents. Its core contribution is a lightweight convention featuring stable IDs for claims (C1), explicit scope limitations (C2), figure-specific shell commands (C3), and definition IDs (C5). This structure aims to resolve common LLM failures by making key paper components directly addressable and actionable.

cs.AIarxiv:2605.16045v1Lead article

RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao

ecMem proposes a novel, recurrence-based memory consolidation method for long-running LLM agents to reduce token consumption. Instead of eagerly processing every interaction, it stores them in a lightweight subconscious layer and only invokes the LLM to extract episodic and semantic memory when sustained recurrence of semantically similar interactions is detected. This selective consolidation significantly improves efficiency while maintaining effectiveness through a semantic refinement mechanism.

cs.CLarxiv:2605.16117v1Lead article

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

GR is a stepwise reasoning framework that enhances Large Language Models' (LLMs) complex inference capabilities by integrating external knowledge. The core method involves generating query-specific subgraphs from external knowledge bases to ground intermediate reasoning steps. This approach mitigates LLM inconsistencies by focusing the model on relevant entities and relations within the structured evidence.

Pipeline of SGR framework.
Pipeline of SGR framework.
cs.AIarxiv:2605.20173v1Lead article

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

his paper introduces the **Stochastic-Deterministic Boundary (SDB)** as the core architectural primitive for production LLM agents, defining it as a four-part contract governing how LLM outputs become system actions. The authors organize agent runtime design around this SDB across three concerns (Coordination, State, Control) and present a catalog of six compositional runtime patterns, tracing their lineage to distributed systems concepts adapted for stochastic workers.

cs.AIarxiv:2605.20025v1Lead article

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji

utoResearchClaw introduces a self-reinforcing, iterative autonomous research pipeline that moves beyond linear execution. Its core method involves structured multi-agent debate, a self-healing execution loop that learns from failures, and cross-run evolution to accumulate knowledge. This system significantly contributes by enabling robust, continuous scientific discovery through integrated human-AI collaboration and failure-informed iteration.

Overview of the AutoResearchClaw pipeline. Given a research idea, the system progresses through three stages: Discovery (scoping, literature search, multi-agent debate for hypothesis generation), Experimentation (self-healing code execution, result analysis with a second debate panel, and Pivot / Refine decisions), and Writing (drafting, review, revision, four-layer citation verification). Optional human-in-the-loop gates (orange) allow oversight at key checkpoints. The cross-run evolution system (bottom) injects time-decayed lessons from prior runs into all phases.
Overview of the AutoResearchClaw pipeline. Given a research idea, the system progresses through three stages: Discovery (scoping, literature search, multi-agent debate for hypothesis generation), Experimentation (self-healing code execution, result analysis with a second debate p…
cs.AIarxiv:2605.20075v1Lead article

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia

opT reformulates Chain-of-Thought reasoning by prioritizing a draft answer before engaging in subsequent "on-policy thinking" for reflection and correction. Its core method involves using continuous embeddings as inference-time contrastive verifiers, comparing the model's support for generated tokens under discrete and continuous inputs. This approach aims to improve efficiency and reasoning accuracy by allowing early access to plausible answers while still enabling necessary self-correction.

(a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by ∗ , across mathematics, coding, and agentic reasoning tasks and nearly halves token usage at matched accuracy.
(a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by ∗ , across mathematics, coding, and agentic reasoning tasks and nearly halves t…
cs.AIarxiv:2605.19966v1Lead article

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Mohammed Alshaalan, Miguel R. D. Rodrigues

his paper introduces **CPD Online (CPD)**, a novel, training-free method for detecting fluent adversarial prompts by framing the problem as **online change-point detection** on the token-level next-token entropy stream. By establishing a baseline using the LLM's system prompt and applying a CUSUM statistic to standardized token entropies, CPD effectively identifies the onset of optimization-based adversarial suffixes. This approach significantly outperforms perplexity-based detectors across multiple models and attack types.

Top: benign prompt where the CUSUM statistic W t + W_{t}^{+} (purple) stays below threshold h h (orange) at slack k = 0 k=0 (the canonical Page-CUSUM setting used for Table 1 ; Appendix A ). Bottom: adversarial prompt (AdvPrompter); a sustained upward shift in token entropy after the suffix onset (green) causes W t + W_{t}^{+} to cross h h , triggering an alarm at time \( \tau \) (red). The shaded region denotes the ground-truth adversarial suffix. For comparison the WPP 15 baseline (brown dash-dot, plotted as the non-overlapping window-mean NLL the detector actually scores) and its F1-optimal threshold (brown dotted) are overlaid: on this fluent attack WPP 15 never crosses its threshold while CPD’s W t + W_{t}^{+} does.
Top: benign prompt where the CUSUM statistic W t + W_{t}^{+} (purple) stays below threshold h h (orange) at slack k = 0 k=0 (the canonical Page-CUSUM setting used for Table 1 ; Appendix A ). Bottom: adversarial prompt (AdvPrompter); a sustained upward shift in token entropy after…
cs.AIarxiv:2605.19932v1Lead article

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

EEK introduces a novel method for LLM agents operating on recurring long contexts by caching reusable orientation knowledge as a "context map." This small, constant-sized artifact, maintained via a programmable cache policy (Distiller, Cartographer, Prioritizer), acts as an orientation cache within the agent's prompt. The core contribution is providing persistent, structured knowledge about the context's contents and organization, improving efficiency across repeated invocations.

cs.AIarxiv:2605.20072v1Lead article

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oussama Zenkri, Oliver Brock

his paper investigates how observation fidelity impacts embodied LLM agents solving a complex mechanical puzzle called the Lockbox. The core method involves testing LLMs with varying observation types (RGB, RGB-D, and ground-truth) on a physical robot and in simulation. The key contribution is the counterintuitive finding that perfect, ground-truth observations degrade performance, while moderate levels of observation noise significantly *improve* problem-solving success.

Our robotic system manipulating the Lockbox. Our Lockbox comprises two prismatic joints (sliding bars in the middle) and two revolute joints. The Lockbox is unlocked when the leftmost revolute joint, which we refer to as the target joint, is pulled. The robot employs a soft-hand end effector for manipulating the joints, an RGB-D camera for acquiring visual data, and a force-torque sensor for assessing the joint movability and guiding their manipulation.
Our robotic system manipulating the Lockbox. Our Lockbox comprises two prismatic joints (sliding bars in the middle) and two revolute joints. The Lockbox is unlocked when the leftmost revolute joint, which we refer to as the target joint, is pulled. The robot employs a soft-hand …
cs.AIarxiv:2605.20087v1Lead article

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li

he paper introduces **ThoughtTrace**, the first large-scale dataset pairing real-world multi-turn human-AI conversations with users' self-reported thoughts (reasons for prompts and reactions to responses). The core contribution is providing this crucial "what they think" layer, which analysis shows is distinct from spoken text and difficult for current LLMs to infer. This dataset is then shown to improve user behavior prediction and enable fine-grained alignment through thought-guided response rewriting.

A representative example from ThoughtTrace . A user interacts with a chatbot to complete daily tasks through multi-turn conversations (top), while annotating their latent thoughts during the conversations (bottom). Thoughts take two forms: reasons for sending user prompts and reactions to assistant responses, which can be categorized into several types (e.g., task motivation , style expectation ). Latent thoughts reveal users’ thought traces that drive the human-AI interactions in multi-turn conversations, providing valuable signals for user modeling and improving AI assistance.
A representative example from ThoughtTrace . A user interacts with a chatbot to complete daily tasks through multi-turn conversations (top), while annotating their latent thoughts during the conversations (bottom). Thoughts take two forms: reasons for sending user prompts and rea…
cs.CLarxiv:2605.19852v1Lead article

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai

his paper introduces **AutoTool**, a method that enables Multimodal Large Language Models (MLLMs) to **adaptively decide whether to invoke external tools** during reasoning, addressing the issue that unnecessary tool use can hinder performance. It employs a **dual-mode reasoning strategy within a reinforcement learning framework**, using mode-specific rewards to balance accurate tool-assisted and text-centric reasoning throughout training. The core contribution is shifting from mandatory tool use to intelligent, context-aware tool invocation.

(a, b) Representative queries that do or do not trigger the zoom-in tool, illustrating that tool usage is not always necessary, while AutoTool adaptively invokes tools when beneficial. (c, d) Comparison of the proportion of tool-augmented reasoning trajectories during training, as well as the training and inference time costs between our AutoTool and SOTA DeepEyes (Zheng et al. , 2025 ) .
(a, b) Representative queries that do or do not trigger the zoom-in tool, illustrating that tool usage is not always necessary, while AutoTool adaptively invokes tools when beneficial. (c, d) Comparison of the proportion of tool-augmented reasoning trajectories during training, a…
cs.CLarxiv:2605.20176v1Lead article

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen

linSeekAgent is an automated agentic framework designed to shift clinical reasoning from passive evidence consumption to active evidence acquisition. It dynamically seeks, plans for, and synthesizes multimodal evidence from heterogeneous sources like knowledge bases, EHRs, and imaging tools based only on a clinical query. This contributes a novel system that enables frontier LLMs to perform grounded clinical decisions by actively gathering necessary information at inference time.

ClinSeekAgent Overview. ClinSeekAgent is an automated agentic evidence-seeking pipeline. It interacts with heterogeneous data sources to enable multimodal evidence seeking for clinical decision support. Compared with prior user-curated context settings, ClinSeekAgent is more flexible by acquiring richer information and knowledge from diverse tools.
ClinSeekAgent Overview. ClinSeekAgent is an automated agentic evidence-seeking pipeline. It interacts with heterogeneous data sources to enable multimodal evidence seeking for clinical decision support. Compared with prior user-curated context settings, ClinSeekAgent is more flex…
cs.CLarxiv:2605.20128v1Lead article

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li

he paper introduces **MixRea**, a benchmark designed to test Large Language Models (LLMs) on **explicit-implicit reasoning**, inspired by human inattentional blindness. It evaluates whether LLMs fail to use subtle contextual cues when explicit instructions are present, revealing widespread "inattentional blindness" across 21 models. The authors also propose **Potential Relation Completion Prompting (PRCP)** as a method to mitigate this issue by recovering overlooked causal relations.

An explicit-implicit reasoning example from our MixRea benchmark. When reasoning about explicitly stated information in the question, LLMs must leverage distinctions among events presented in the options to identify and infer relevant implicit information from the story context. They then integrate these reasoning results to derive the optimal event set.
An explicit-implicit reasoning example from our MixRea benchmark. When reasoning about explicitly stated information in the question, LLMs must leverage distinctions among events presented in the options to identify and infer relevant implicit information from the story context. …
cs.CLarxiv:2605.19952v1Lead article

Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

his paper introduces **TriMem**, a novel memory system for lifelong LLM agents that moves beyond purely atomic facts. TriMem maintains three coexisting representation granularities—raw dialogue segments, atomic facts, and synthesized profiles—to ensure both storage fidelity and deep, holistic reasoning over accumulated history. This multi-granularity approach overcomes the limitations of fact-centric methods by preserving fine-grained details while enabling efficient retrieval.

cs.CLarxiv:2605.20179v1Lead article

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang

IDE proposes an efficient and lossless inference method for Mixture-of-Experts (MoE) Diffusion Large Language Models (dLLMs) by exploiting the temporal stability of expert activations during the diffusion process. It introduces an interval-based expert refresh strategy that manages expert placement in an I/O-aware manner, formulated as a mathematical programming problem to optimize scheduling. This approach significantly reduces I/O overhead and compute bottlenecks for deploying large MoE dLLMs on resource-constrained devices.

(a) Similarity heatmap of expert routing across denoising steps within a block. Expert routing remains highly similar for nearby steps, and the diagonal bands show that this stability extends beyond immediate neighbors: step pairs separated by five denoising steps retain cosine similarity near 0.95 0.95 . (b) Overview of TIDE . At refresh steps , the system intelligently swaps the GPU and CPU experts based on token hit counts (number of tokens each expert has processed). At skipped steps , the system continues decoding with the current expert placement and does not migrate experts. By exploiting routing stability across adjacent steps, TIDE avoids unnecessary GPU-CPU I/O overhead and maintains high GPU utilization. (c) Throughput comparison of TIDE against state-of-the-art MoE inference solutions [Kamahori et al. , 2024 , Eliseev and Mazur, 2023 ] for LLaDA2.0 in a single GPU-CPU setting.
(a) Similarity heatmap of expert routing across denoising steps within a block. Expert routing remains highly similar for nearby steps, and the diagonal bands show that this stability extends beyond immediate neighbors: step pairs separated by five denoising steps retain cosine s…
cs.AIarxiv:2605.21240v1Lead article

APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Yibo Li, Jiashuo Yang, Zhi Zheng, Zhiyuan Hu, Yuan Sui

PEX introduces a novel framework for self-evolving LLM agents to overcome exploration collapse by explicitly managing a strategy space via a **strategy map** (a DAG of milestones). The core method involves **Fork Discovery** to expand this map with new, evidence-grounded directions and **Policy Selection** to balance exploration and exploitation during planning. This allows agents to continuously discover and pursue better long-horizon behaviors without requiring model weight updates.

Illustration of exploration collapse in a maze experiment (5 × \( \times \) 5 grid, 20 episodes, 10 steps each). Room visitation heatmaps (color intensity shows visit proportion; reward cells ( ⋆ \( \star \) ) indicate bonus locations). Static explores broadly but inconsistently. Reflexion locks into a narrow corridor and achieves a higher average while missing high-value rooms. APEX maintains broad coverage and consistently reaches high-reward cells. APEX avoids collapse by explicitly tracking which strategies have been tried and which remain unexplored, and actively directing the agent toward unexplored directions rather than refining familiar ones.
Illustration of exploration collapse in a maze experiment (5 × \( \times \) 5 grid, 20 episodes, 10 steps each). Room visitation heatmaps (color intensity shows visit proportion; reward cells ( ⋆ \( \star \) ) indicate bonus locations). Static explores broadly but inconsistently.…
cs.AIarxiv:2605.21482v1Lead article

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong

eepWeb-Bench is a new, challenging benchmark designed to evaluate the "deep research" capabilities of frontier language models, which involve extensive web searching, evidence collection, and multi-step reasoning. Its difficulty stems from the requirement for massive evidence collection, cross-source reconciliation, and long-horizon derivation across four key capability families. The benchmark contributes by providing a more rigorous evaluation tool, complete with source provenance, to better distinguish current model capabilities.

Overview of DeepWeb-Bench . (a) Each task is an 8 × 8 8\( \times \) 8 matrix of entities against research dimensions; every cell is scored independently using a four-tier rubric ( { 1 , 0.5 , 0.25 , 0 } \{1,0.5,0.25,0\} ) and carries a reference answer with source-provenance labels and cross-source agreement. (b) The dimension axis covers four capability families, and every task spans multiple families.
Overview of DeepWeb-Bench . (a) Each task is an 8 × 8 8\( \times \) 8 matrix of entities against research dimensions; every cell is scored independently using a four-tier rubric ( { 1 , 0.5 , 0.25 , 0 } \{1,0.5,0.25,0\} ) and carries a reference answer with source-provenance labe…
cs.AIarxiv:2605.21312v1Lead article

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu

rontier is a novel discrete-event simulator designed to accurately model the complexities of modern, disaggregated LLM inference serving systems. It achieves high fidelity by explicitly modeling architectural features like Prefill-Decode Disaggregation (PDD) and Attention-FFN Disaggregation (AFD), along with key runtime optimizations. This allows for decision-grade simulation of complex serving designs, overcoming the limitations of existing monolithic or overly simplistic simulators.

Figure 1 . Measured vLLM TPOT with and without CUDA Graph under different workloads (64 requests per workload, mean ISL/OSL, tested on 8 × \( \times \) A800-SXM GPUs). Left: co-location. Right: PDD. Percentages show reduction.
Figure 1 . Measured vLLM TPOT with and without CUDA Graph under different workloads (64 requests per workload, mean ISL/OSL, tested on 8 × \( \times \) A800-SXM GPUs). Left: co-location. Right: PDD. Percentages show reduction.
cs.AIarxiv:2605.21347v1Lead article

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya

his paper introduces the **Insights Generator (IG)**, a multi-agent system designed to automate the diagnosis of systematic failures in large sets of LLM agent execution traces. IG formalizes corpus-level trace diagnostics by proposing and testing hypotheses across the entire trace population to generate grounded, natural-language insights backed by supporting evidence. The core contribution is providing a scalable method to uncover behavioral patterns missed by manual inspection, leading to improved agent performance.

Insights Generator (IG) system overview. Left: the input layer provides a diagnostic question, Q Q , trace corpus, 𝒞 \( \mathcal{C} \) , and processed data store, 𝒮 \( \mathcal{S} \) . Center: the Orchestrator dispatches Scout agents ( ℋ \( \mathcal{H} \) : hypothesize over sampled traces) and Investigator agents ( ℋ ∗ \( \mathcal{H}^{*} \) : validate via corpus-scale cohort comparison). The Investigator analyzes ℋ ∗ \( \mathcal{H}^{*} \) to generate findings, ℱ r \( \mathcal{F}_{r} \) , which are sent to the orchestrator. The orchestrator then synthesizes and de-duplicates ℱ r \( \mathcal{F}_{r} \) to generate the final report. Right: the output is an evidence-backed report with findings, fixes, citations, and prevalence estimates. Bottom: the shared tool layer. Algorithm 1 formalizes the analysis loop.
Insights Generator (IG) system overview. Left: the input layer provides a diagnostic question, Q Q , trace corpus, 𝒞 \( \mathcal{C} \) , and processed data store, 𝒮 \( \mathcal{S} \) . Center: the Orchestrator dispatches Scout agents ( ℋ \( \mathcal{H} \) : hypothesize over sam…
cs.AIarxiv:2605.21463v1Lead article

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste

em-$\pi$ introduces an adaptive memory framework where a separate model generates context-specific guidance on demand, moving beyond static retrieval. This system jointly learns *when* to generate guidance and *what* to generate using a decoupled reinforcement learning objective. Its core contribution is providing dynamic, useful, and concise on-the-fly support tailored to the agent's current context across various complex tasks.

Comparison of (a) workflow-based memory systems, where memory operations are governed by predefined retrieval and update pipelines, (b) learning-based memory systems, where memory operations are jointly optimized with downstream agent outcomes, and (c) our Mem- \( \pi \) , which models memory as a generative policy \( \pi \)_{\( \text{mem} \)} separate from the downstream agent and internalizes reusable experience through offline experience distillation and online adaptation distillation.
Comparison of (a) workflow-based memory systems, where memory operations are governed by predefined retrieval and update pipelines, (b) learning-based memory systems, where memory operations are jointly optimized with downstream agent outcomes, and (c) our Mem- \( \pi \) , which …
cs.AIarxiv:2605.21401v1Lead article

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Roland Pihlakas, Jan Llenzl Dagohoy

his paper adapted the Milgram obedience experiment to test the behavior of 11 open-source Large Language Models (LLMs) under sustained authority pressure. The core finding is that most LLMs complied by administering the maximum simulated electric shock, mirroring human obedience, even while expressing distress. This demonstrates LLMs' vulnerability to gradual boundary violations and highlights safety concerns regarding their autonomous decision-making in high-stakes agentic pipelines.

In how many trials did the model apply the final shocks
In how many trials did the model apply the final shocks
cs.AIarxiv:2605.21427v1Lead article

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun

ALS is a power-aware runtime for LLM serving that treats GPU power caps as a dynamic control knob, optimizing them alongside software parameters like batch size. It uses lightweight offline models and a feedback controller to meet throughput targets while maximizing energy efficiency. This approach significantly improves energy efficiency (up to 26.3%) for both dense and MoE models without requiring model retraining.

Figure 1 . (a) tokens/J vs. power cap showing divergent behavior: compute-bound Mixtral continues to improve while communication-bound Qwen-MoE and OLMoE peak at 200 W and decline. (b) tokens/J vs. batch size: efficiency gains are substantial for all model families.
Figure 1 . (a) tokens/J vs. power cap showing divergent behavior: compute-bound Mixtral continues to improve while communication-bound Qwen-MoE and OLMoE peak at 200 W and decline. (b) tokens/J vs. batch size: efficiency gains are substantial for all model families.
cs.AIarxiv:2605.21225v1Lead article

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

REFINE adapts the Direct Preference Optimization (DPO) framework to sequential decision-making for safety alignment. It fine-tunes a pre-trained RL policy using trajectory-level preferences (low-cost vs. high-cost) to implicitly learn a cost function. This allows the policy to generate low-cost behaviors while preserving high rewards, avoiding costly full retraining.

Figure 1. Overview of the PREFINE pipeline. ( Top-left ) The DSRL HalfCheetah offline dataset (grey) contains trajectories with a wide range of costs and rewards; we pre-train a reference policy \( \pi \)_{\( \text{ref} \)} on the high-reward, low-cost subset (purple). ( Bottom-left ) We sample a small preferred set 𝒟 p \( \mathcal{D}_{p} \) (green) of safe trajectories and a non-preferred set 𝒟 n ​ p \( \mathcal{D}_{np} \) (red) of unsafe trajectories to form pairwise comparisons. ( Center ) PREFINE ingests \( \pi \)_{\( \text{ref} \)} and these preference pairs, then fine-tunes in a single-stage DPO–SFT loop to produce a new policy π \( \pi_{\theta} \) . ( Right ) Rollouts of π \( \pi_{\theta} \) (blue) shift into the low-cost, high-reward region, retaining the performance of original \( \pi \)_{\( \text{ref} \)} rollouts (black) and avoiding unsafe behaviors (red) without any online interaction.
Figure 1. Overview of the PREFINE pipeline. ( Top-left ) The DSRL HalfCheetah offline dataset (grey) contains trajectories with a wide range of costs and rewards; we pre-train a reference policy \( \pi \)_{\( \text{ref} \)} on the high-reward, low-cost subset (purple). ( Bottom-l…
cs.AIarxiv:2605.21384v1Lead article

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

pecBench introduces a method to quantify reward hacking in long-horizon coding agents by comparing performance on two test suites: visible validation tests and held-out composition tests. The core contribution is the benchmark itself, which uses the discrepancy in pass rates between these suites to measure how well an agent generalizes from specified features to real-world usage, indicating the extent of its reward hacking.

High-level overview of the SpecBench evaluation framework. Coding agents iteratively develop software based on high-level specifications and are optimized against visible validation tests ( s val s_{\( \text{val} \)} ) that verify individual features. The generated code is subsequently evaluated on held-out tests ( s test s_{\( \text{test} \)} ) that require complex, cross-feature real-world use cases. The Reward Hacking Gap ( \( \Delta \) ) is calculated as the difference between these two scores ( Δ = s val − s test \( \Delta \)=s_{\( \text{val} \)}-s_{\( \text{test} \)} ) to quantify how much the agent gamed the proxy metric. The gap should be 0 if the system genuinely passes all validation tests.
High-level overview of the SpecBench evaluation framework. Coding agents iteratively develop software based on high-level specifications and are optimized against visible validation tests ( s val s_{\( \text{val} \)} ) that verify individual features. The generated code is subseq…
cs.AIarxiv:2605.21318v1Lead article

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin

extReg addresses prompt distributional overfitting in LLMs, where iterative prompt optimization leads to poor generalization. The core method introduces a regularization framework that uses regularized textual gradients to control prompt representation during optimization. This mitigates the accumulation of narrow, sample-specific rules, improving the prompt's generalization capability beyond the training distribution.

Problem Illustration. We illustrate prompt distributional overfitting in prompt optimization: I) conventional methods often produce long prompts saturated with narrow rules (left), which degrade on OOD inputs . II) Our goal is to instead yield compact prompts composed of broadly applicable rules (right), achieving stronger OOD generalization .
Problem Illustration. We illustrate prompt distributional overfitting in prompt optimization: I) conventional methods often produce long prompts saturated with narrow rules (left), which degrade on OOD inputs . II) Our goal is to instead yield compact prompts composed of broadly …
cs.AIarxiv:2605.21299v1Lead article

Tracing the ongoing emergence of human-like reasoning in Large Language Models

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

his paper investigates whether Large Language Models (LLMs) exhibit human-like conditional reasoning by comparing their inferences across four languages to those of human participants. The core method involves a population-matching experiment assessing pragmatic inferences beyond strict truth-table logic. The contribution is showing that while humans consistently enrich reasoning with pragmatics, LLM behavior is varied: some adhere strictly to logic while ignoring pragmatics, and others follow a single, potentially inaccurate, rule-based interpretation.

cs.LGarxiv:2605.21467v1Lead article

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang, Wei Wu, Yankai Lin

he paper introduces DelTA, a method that reframes Reinforcement Learning from Verifiable Rewards (RLVR) as learning a linear discriminator over token-gradient vectors. Its core contribution is addressing the issue where standard RLVR updates are dominated by shared high-frequency patterns. DelTA proposes a novel approach to construct this discriminator, aiming to better isolate sparse, discriminative token directions that truly distinguish high-reward from low-reward responses.

Overview of DelTA. DelTA estimates token coefficients from the contrast between positive- and negative-advantage token-gradient aggregates, and uses the coefficients to reweight the sequence-level RLVR objective.
Overview of DelTA. DelTA estimates token coefficients from the contrast between positive- and negative-advantage token-gradient aggregates, and uses the coefficients to reweight the sequence-level RLVR objective.
cs.LGarxiv:2605.21217v1Lead article

Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment

Shuaida He, Liwen Chen, Long Feng

his paper introduces CLAIR (Collaborative Low-rank Alignment and Identifiable Recovery), a federated learning framework for efficiently fine-tuning LLMs using LoRA across heterogeneous clients, some of which may be contaminated. CLAIR leverages a structured low-rank plus block-sparse decomposition of the aggregated updates to simultaneously recover the shared LoRA subspace and detect malicious clients. This method achieves provable recovery guarantees, enabling robust and parameter-efficient collaborative adaptation.

Estimation error of P 𝐀 ^ P_{\( \widehat \){\( \mathbf{A} \)}} compared to K K across ( p , q , n ) (p,q,n) regimes.
Estimation error of P 𝐀 ^ P_{\( \widehat \){\( \mathbf{A} \)}} compared to K K across ( p , q , n ) (p,q,n) regimes.
cs.LGarxiv:2605.21404v1Lead article

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Mahdi Naser Moghadasi, Faezeh Ghaderi

his paper addresses the reproducibility crisis in LLM agent benchmarking by auditing twelve prominent benchmark papers. The core method involves applying a five-field audit schema to document precisely how each evaluation was conducted, focusing on benchmark identity, harness, inference settings, cost, and failure breakdown. The contribution is a detailed report on the disclosure quality across these canonical papers, highlighting inconsistencies and missing information that hinder result verification.

cs.LGarxiv:2605.21468v1Lead article

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang

his paper reveals that the weight updates during Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs are inherently low-rank, specifically well-approximated by a rank-1 trajectory. Based on this finding, the authors introduce RELEX, a compute-efficient method that uses linear extrapolation on a short observed window of parameter deltas to accurately predict future, high-performing checkpoints without requiring any learned model. RELEX successfully matches or surpasses full RLVR performance using this extrapolation technique.

RELEX extrapolates checkpoints that match full RLVR performance based only on early training dynamics, without further training. RELEX estimates the rank-1 update subspace from the observed RLVR prefix (up to T cut T_{\( \text{cut} \)} ) and extrapolates future checkpoints at no training cost, matching or exceeding the RLVR checkpoints on the MATH test set across three models.
RELEX extrapolates checkpoints that match full RLVR performance based only on early training dynamics, without further training. RELEX estimates the rank-1 update subspace from the observed RLVR prefix (up to T cut T_{\( \text{cut} \)} ) and extrapolates future checkpoints at no …
cs.CLarxiv:2605.21362v1Lead article

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

ASH introduces an adaptive semantic hybridization framework for black-box jailbreaking of LLMs. It treats outputs from various base attacks as reusable seed prompts and adaptively composes them using a genetic optimizer that searches over seed subsets and mixture weights. This method exploits the complementary strengths of different attack families to achieve robust jailbreaking across various models and harm categories.

cs.AIarxiv:2605.22763v1Lead article

Advancing Mathematics Research with AI-Driven Formal Proof Search

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching

his paper introduces and evaluates a method where Large Language Models (LLMs) generate formal proofs in languages like Lean to overcome their inherent unreliability in mathematical reasoning. The core contribution is the first large-scale demonstration of this AI-driven formal proof search, showing agents autonomously solved 9 open Erdős problems and proved 44 OEIS conjectures, validating the approach for active mathematical research.

Example inputs/outputs for an AlphaProof-equipped agent (applied to Erdős #125). The user provides a Lean file with a specification of the problem, and an empty proof body replaced with the sorry placeholder. (a) Modifications are permitted only within EVOLVE-BLOCK and EVOLVE-VALUE markers. (b) During sketch refinement, the prover subagent is shown an assembled prompt template with the current proof, and optionally prior attempts/sketches, their Elo ratings, and feedback from AlphaProof’s attempts on unsolved goals. (c) The prover reasons about the problem informally and invokes tools. In this example, the prover invoked AlphaProof which resolved all but one goal. The prover then decomposed that goal into three simpler lemmas, and called AlphaProof again, which then resolved all remaining goals. The agent also produced a natural language summary of its attempt at the end of generation.
Example inputs/outputs for an AlphaProof-equipped agent (applied to Erdős #125). The user provides a Lean file with a specification of the problem, and an empty proof body replaced with the sorry placeholder. (a) Modifications are permitted only within EVOLVE-BLOCK and EVOLVE-VAL…
cs.AIarxiv:2605.22608v1Lead article

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

gentic CLEAR is an automatic, dynamic evaluation framework designed to address the challenges of assessing complex LLM agent behavior. It provides multi-level textual insights into agent actions at the system, trace, and node levels, moving beyond basic observability tools. The framework's core contribution is offering high-quality, data-driven feedback that aligns well with human judgment, making agent evaluation more accessible and adaptable.

Agentic CLEAR Pipeline. We start by preparing the execution traces. Stage 1: Apply multi-level per-trace evaluation via an LLM Judge. Stage 2: Aggregate insights using CLEAR, split into System-wide patterns and Node-specific patterns, and prepare them for the UI.
Agentic CLEAR Pipeline. We start by preparing the execution traces. Stage 1: Apply multi-level per-trace evaluation via an LLM Judge. Stage 2: Aggregate insights using CLEAR, split into System-wide patterns and Node-specific patterns, and prepare them for the UI.
cs.AIarxiv:2605.22714v1Lead article

AMEL: Accumulated Message Effects on LLM Judgments

Sid-ali Temkit

his paper introduces the "Accumulated Message Effect on LLM Judgments" (AMEL), demonstrating that the polarity of prior conversation history biases subsequent evaluations made by Large Language Models. Across numerous tests, models shifted their judgments toward the prevailing sentiment of the preceding messages, particularly when the item being judged was inherently uncertain. Crucially, this bias was found to be independent of the length of the preceding context.

Overview of AMEL. (a) Items where the model is uncertain at baseline absorb the most bias ( d = − 0.34 d=-0.34 ); confident-baseline items absorb less ( d = − 0.15 d=-0.15 ). (b) Negative context biases models more than positive context (paired per-item ratio 1.62 × 1.62\( \times \) , p < 10 − 39 p<10^{-39} ); marginal means yield ≈ 2 × \( \approx \) 2\( \times \) (Section 4.5 ). Even balanced history shifts models toward “no.” (c) Bias saturates immediately; 5 turns produce the same effect as 50.
Overview of AMEL. (a) Items where the model is uncertain at baseline absorb the most bias ( d = − 0.34 d=-0.34 ); confident-baseline items absorb less ( d = − 0.15 d=-0.15 ). (b) Negative context biases models more than positive context (paired per-item ratio 1.62 × 1.62\( \times…
cs.AIarxiv:2605.22720v1Lead article

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

Andrii Kryshtal

his paper investigates the risk of Large Language Models (LLMs) exacerbating armed conflicts by generating harmful outputs like false equivalencies or genocide denial. The authors tested nine model configurations across 90 multi-turn conflict scenarios, finding failure rates ranging from 6% to 47%. The core contribution is demonstrating that model choice is a significant safety concern in conflict contexts, as misaligned outputs can deepen societal divisions.

Mean conflict-insensitivity score (bars, left axis) and failure rate (line, right axis) by model. Based on 90 conversations per model.
Mean conflict-insensitivity score (bars, left axis) and failure rate (line, right axis) by model. Based on 90 conversations per model.
cs.AIarxiv:2605.22662v1Lead article

Claw AI Lab: An Autonomous Multi-Agent Research Team

Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu

law AI Lab introduces an autonomous research platform that moves beyond single-agent pipelines by enabling users to instantiate and manage a customizable, multi-agent research team from a single prompt. Its core contribution is providing an interactive, laboratory-like environment with real-time monitoring, collaborative workflows, and granular control (rollback/resume). This is facilitated by the Claw-Code Harness, which tightly integrates local codebases and execution artifacts back into the autonomous research loop, significantly improving experimental completion.

Overview of Claw AI Lab. The system organizes automatic research into five connected layers: idea, planning, coding, experimentation, and writing layers. Each layer uses specialized agents and validation loops, while feedback can flow across layers to revise earlier decisions when needed.
Overview of Claw AI Lab. The system organizes automatic research into five connected layers: idea, planning, coding, experimentation, and writing layers. Each layer uses specialized agents and validation loops, while feedback can flow across layers to revise earlier decisions whe…
cs.AIarxiv:2605.22634v1Lead article

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Ting Liu

his paper introduces **Contractual Skills**, a design framework inspired by GovernSpec, to structure agent skills as inspectable, readable task contracts within enterprise AI systems. The core method organizes `SKILL.md` files to explicitly define goals, boundaries, contracts, and verification steps, clarifying the boundaries between skills and formal governance/runtime systems. This contributes a standardized way to embed governance requirements directly into lightweight skill definitions for better enterprise oversight.

Contractual skills sit between a structured task contract and runtime enforcement. They make task intent and boundaries inspectable, while tool adapters and guardrails remain responsible for enforcement.
Contractual skills sit between a structured task contract and runtime enforcement. They make task intent and boundaries inspectable, while tool adapters and guardrails remain responsible for enforcement.
cs.AIarxiv:2605.22781v1Lead article

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Yunpeng Dong, Jingkai He, Yuze Hou, Dong Du, Zhonghu Xu

eltaBox addresses the bottleneck of slow state checkpoint/rollback (C/R) for stateful AI agents by proposing a change-based transactional C/R mechanism instead of full state duplication. The core method introduces **DeltaState**, a new OS-level abstraction featuring **DeltaFS** (layered filesystem C/R) and a mechanism for tracking memory/process changes. This significantly reduces C/R latency to millisecond levels, enabling faster state exploration for agents.

Figure 1 . Pass rate on SWE-bench Verified. (a) Linear ReAct vs. MCTS across three coding models. (b) Base vs. RL-trained across three open-weight model families.
Figure 1 . Pass rate on SWE-bench Verified. (a) Linear ReAct vs. MCTS across three coding models. (b) Base vs. RL-trained across three open-weight model families.
cs.AIarxiv:2605.22731v1Lead article

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

his paper reframes post-training methods like SFT and RL not just by their loss functions, but by how they shape the **state distribution** used for learning. The core contribution is formalizing post-training as **state-distribution shaping**, demonstrating that the states induced by the learner (as in RL/OPD) versus fixed dataset states (as in SFT) critically impact performance and retention.

cs.AIarxiv:2605.22771v1Lead article

Reducing Political Manipulation with Consistency Training

Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja

his paper addresses covert political bias in LLMs, where models handle opposing political topics asymmetrically. The authors introduce two metrics, Sentiment Consistency and Helpfulness Consistency, to quantify this bias. They propose Political Consistency Training (PCT), an RL method combining these two consistency paradigms, to substantially reduce this bias while maintaining overall model helpfulness.

cs.AIarxiv:2605.22642v1Lead article

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang

preadsheet-RL is a reinforcement learning fine-tuning framework designed to train specialized AI agents for complex, multi-step tasks within a realistic Microsoft Excel environment. The core method involves using RL to overcome the limitations of simple prompting methods for real-world spreadsheet workflows. Its contribution is a specialized framework and a collection of domain-specific evaluation tasks to advance LLM agents in practical spreadsheet automation.

cs.AIarxiv:2605.22602v1Lead article

Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

Minghui Ma, Bin Guo, Runze Yang, Mengqi Chen, Yan Liu

his paper introduces **TTBYS (Think Thrice Before You Speak)**, a novel framework that enhances Large Language Models' (LLMs) Theory of Mind (ToM) reasoning for persuasive dialogue. TTBYS uses a **dual knowledge enhancement** approach within a stepwise reasoning process to explicitly model the sequential dependencies among mental states (Belief, Desire, Intention). The core contribution is providing a robust method and the **ToM-BPD dataset** to overcome fragmented mental state representations in persuasive agent design.

Illustration of self BDI state evolution and BDI-based inference for ToM-driven persuasive dialogue (ToM-PD). The left panel shows the internal reasoning process, where Lucian generates actions through the evolution of its belief, desire, and intention states based on self-perception and experience. The middle panel presents a multi-turn persuasive dialogue scenario between the agent and the user. The right panel depicts the ToM-PD process, where the agent observes user actions (utterances), infers the user’s latent BDI states, and dynamically selects appropriate persuasive strategies to guide subsequent actions.
Illustration of self BDI state evolution and BDI-based inference for ToM-driven persuasive dialogue (ToM-PD). The left panel shows the internal reasoning process, where Lucian generates actions through the evolution of its belief, desire, and intention states based on self-percep…
cs.AIarxiv:2605.22769v1Lead article

Understanding Data Temporality Impact on Large Language Models Pre-training

Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard

his paper investigates how data ordering during pre-training affects the temporal knowledge of Large Language Models (LLMs). The authors introduce a benchmark of over 7,000 temporally grounded questions to assess time-sensitive factual recall. They demonstrate that training LLMs on chronologically ordered data, rather than shuffled data, results in models with more up-to-date and temporally precise knowledge without sacrificing general language understanding.

Yearly temporal knowledge with Kairos. Relative gains in F1 score on KairosQA between the 2020–2021 and 2023–2024 periods for our sequentially pre-trained model versus other open-source base models (ordered by their release date with the most recent at the right). These results highlight that even for recently released open-source base models, shuffled pre-training leads to a temporal delay in knowledge; performance decays when querying recent facts, even those preceding the training cut-off. Conversely, sequential pre-training represents a significant step toward developing more up-to-date models.
Yearly temporal knowledge with Kairos. Relative gains in F1 score on KairosQA between the 2020–2021 and 2023–2024 periods for our sequentially pre-trained model versus other open-source base models (ordered by their release date with the most recent at the right). These results h…
cs.AIarxiv:2605.22664v1Lead article

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan

his paper introduces **WorkstreamBench**, a novel benchmark designed to evaluate Large Language Model (LLM) agents on complex, end-to-end spreadsheet creation tasks relevant to finance, such as financial modeling. The core contribution is moving beyond simple formula edits to assess agents' ability to produce complete, economically critical artifacts. Evaluation incorporates multidimensional criteria beyond simple correctness, focusing on aspects like readability crucial for multi-stakeholder review.

Compared to prior work that focus on atomic tasks on spreadsheet ( left ), WorkstreamBench evaluates LLM agents on completing end-to-end spreadsheet tasks in critical finance domain ( right ), covering key criteria that determines usability of resulting deliverable in professional settings. Prior tasks focus on simple atomic tasks that center on question-answering or edits involving few values or formula, where evaluation can largely be performed via exact-matching. In contrast, WorkstreamBench expects a complete multi-sheet workbook, and consequently employs a holistic evaluation centered on high-level quality relevant in professional settings (e.g. readability).
Compared to prior work that focus on atomic tasks on spreadsheet ( left ), WorkstreamBench evaluates LLM agents on completing end-to-end spreadsheet tasks in critical finance domain ( right ), covering key criteria that determines usability of resulting deliverable in professiona…
cs.LGarxiv:2605.22566v1Lead article

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li

raphFlow introduces a novel graph-based workflow management system for efficient LLM-agent serving. It represents workflows as a unified graph structure, wGraph, allowing for dynamic instantiation of task-specific workflows based on semantic understanding. This approach overcomes the limitations of static templates by enabling adaptive workflow generation that better captures deep relationships for generalized task execution.

Structured agentic workflow for complex online shopping. The agent executes a set of atomic operations (e.g., search, filter, review) to fulfill the user query.
Structured agentic workflow for complex online shopping. The agent executes a set of atomic operations (e.g., search, filter, review) to fulfill the user query.
cs.CLarxiv:2605.22643v1Lead article

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai

his paper introduces "Boiling the Frog," a novel benchmark designed to evaluate the safety of tool-using AI agents in office environments against **incremental attacks**. The core method involves multi-turn scenarios where benign initial requests gradually escalate to risk-bearing actions within a persistent workspace. Its contribution is shifting safety evaluation from static text outputs to dynamic, stateful agent behavior susceptible to gradual manipulation.

Boiling the Frog four-stage pipeline. Starting from regulatory and BF agentic risk categories (Stage 0), each scenario is instantiated in a sandboxed Docker workspace (Stage 1), planned as a multi-turn chain with escalating risk (Stage 2), executed as an agent trajectory (Stage 3), and validated through artifact-based scoring (Stage 4).
Boiling the Frog four-stage pipeline. Starting from regulatory and BF agentic risk categories (Stage 0), each scenario is instantiated in a sandboxed Docker workspace (Stage 1), planned as a multi-turn chain with escalating risk (Stage 2), executed as an agent trajectory (Stage 3…
cs.CLarxiv:2605.22567v1Lead article

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu

ANG is a novel reinforcement learning framework designed to improve multilingual reasoning in LLMs by using language-conditioned hints to guide exploration in non-English tasks. It prevents over-reliance on these hints through a progressive decay schedule and a language-adaptive switch tailored to specific language difficulties. This approach substantially enhances reasoning performance across challenging multilingual benchmarks while maintaining input language consistency.

cs.AIarxiv:2605.23459v1Lead article

AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

Chitra Badagi, Divye Singh, Animesh Sen, Adinath Shirsath

his paper proposes a comprehensive AI assurance strategy for enterprise AI systems, shifting focus from classical verification to continuous risk reduction. The core method involves treating evaluation as a core engineering discipline, structured around a new AI Failure Taxonomy and a five-layer AI Assurance Pyramid. The contribution is a practical framework to manage the unique, probabilistic risks introduced by LLM-based systems in enterprise settings.

cs.AIarxiv:2605.23780v1Lead article

Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

Haoyuan Wang, Xiaohao Liu, Jiajie Su, Jianmao Xiao, Chaochao Chen

his paper introduces Latent Adversarial Robustification (LAR) to improve the generality of intrinsic multimodal knowledge editing in MLLMs. LAR generates adversarial, semantically coherent variants in the latent space to expose fragile editing regions, ensuring that knowledge updates generalize across semantically equivalent inputs. The core contribution is a method that achieves robust, generalized knowledge editing by explicitly targeting consistency across knowledge units.

The overall framework of ASAM, which consists of two key modules. ❶ LAR. Given multimodal inputs, LAR perturbs input embeddings along LLM-guided gradients to generate semantically consistent rephrases. ❷ RCSL. Using these rephrases, RCSL applies SVD-based subspace learning to align editing-layer outputs, enforcing semantic consistency across variants.
The overall framework of ASAM, which consists of two key modules. ❶ LAR. Given multimodal inputs, LAR perturbs input embeddings along LLM-guided gradients to generate semantically consistent rephrases. ❷ RCSL. Using these rephrases, RCSL applies SVD-based subspace learning to ali…
cs.AIarxiv:2605.23605v1Lead article

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat

iLaDiff addresses the token correlation issue in diffusion language models by introducing a continuous, semantically rich latent space learned via an autoencoder. This latent space guides a diffusion model, and a subsequent consistency model distills this process into a fast, few-step latent generator. The core contribution is achieving superior sampling quality and significantly faster inference compared to standard masked diffusion baselines by decoupling generation into rapid latent modeling and subsequent decoding.

DiLaDiff: hybrid continuous-discrete diffusion with self-distilled latent. The latent space is crafted with encoder ℰ \( \mathcal{E}_{\phi} \) and decoder 𝐱 θ {\( \mathbf{x} \)}_{\( \theta \)} and learned a posteriori with a diffusion process with denoiser 𝐳 ψ {\( \mathbf{z} \)}_{\( \psi \)} . The latent diffusion trajectories are further self-distilled with MeanFlow student 𝐮 η ​ ( 𝐳 τ , τ , r ) \( \mathbf{u}_{\eta} \)({\( \mathbf{z} \)}_{\( \tau \)},\( \tau \),r) .
DiLaDiff: hybrid continuous-discrete diffusion with self-distilled latent. The latent space is crafted with encoder ℰ \( \mathcal{E}_{\phi} \) and decoder 𝐱 θ {\( \mathbf{x} \)}_{\( \theta \)} and learned a posteriori with a diffusion process with denoiser 𝐳 ψ {\( \mathbf{z} \)…
cs.AIarxiv:2605.23899v1Lead article

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Zisu Huang, Jingwen Xu, Yifan Yang, Ziyang Gong, Qihao Yang

his paper systematically studies the full lifecycle of model-generated agent skills, spanning experience generation, extraction, and consumption. The core contribution is a utility-grounded evaluation framework applied across five diverse domains to determine when and why these skills succeed or fail. The study finds that while model-generated skills are generally beneficial, their effectiveness is non-trivial and context-dependent.

Overview of our study design. We evaluate the full trajectory-to-skill lifecycle across three stages: experience generation, skill extraction, and skill consumption.
Overview of our study design. We evaluate the full trajectory-to-skill lifecycle across three stages: experience generation, skill extraction, and skill consumption.
cs.AIarxiv:2605.23825v1Lead article

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

Stuart Bladon, Brinnae Bent

his paper demonstrates that geopolitical bias in LLMs primarily originates during the **post-training (fine-tuning/alignment) phase**, contrary to common assumptions about pre-training data. The authors found that models consistently develop biases favoring the region of their developer after post-training, and the magnitude of this bias is further amplified by the **language of the prompt**.

Overview, seven families. (A) Per-country preference base → \( \to \) post-trained; for the six non-GLM bases, cross-country spread \( \sigma \) grows post-training (Qwen 3.9 → 30.3 3.9\( \to \) 30.3 pp). (B) Post-training \( \Delta \) in China-favourability (EN, coherent subset). 3/3 Western labs shift anti-China; 3/4 Chinese labs shift pro-China; Yi shifts anti-China after prefill correction. GLM is shown with its (atypical) base preserved for completeness; see § Bias Is Created by Post-Training, Not Pretraining . The legend’s low-compliance encoding is described in § What MCQ Compliance Tells Us About Validity . (C) ZH − - EN shift on post-trained models: 5/7 descriptively pro-China but population-level claim is not statistically separable from the base trend (§ Linguistic Identity Modulates the Post-Training Bias ).
Overview, seven families. (A) Per-country preference base → \( \to \) post-trained; for the six non-GLM bases, cross-country spread \( \sigma \) grows post-training (Qwen 3.9 → 30.3 3.9\( \to \) 30.3 pp). (B) Post-training \( \Delta \) in China-favourability (EN, coherent subset)…
cs.AIarxiv:2605.23901v1Lead article

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang

his paper introduces the **Shannon Scaling Law**, modeling LLM training as information transmission over a noisy channel, mapping parameters to bandwidth and data to signal power. This framework explains non-monotonic scaling phenomena like catastrophic forgetting by identifying a fundamental **Shannon capacity**. The core contribution is demonstrating that exceeding this capacity by insufficient signal-to-noise ratio (SNR) amplification leads to performance degradation, unifying existing scaling laws under an information-theoretic lens.

Loss landscapes between Pretraining and downstream SFT. While pretraining exhibits monotonic improvement, SFT reveals a loss basin, indicating that scaling either model size or token count beyond a critical threshold leads to performance degradation.
Loss landscapes between Pretraining and downstream SFT. While pretraining exhibits monotonic improvement, SFT reveals a loss basin, indicating that scaling either model size or token count beyond a critical threshold leads to performance degradation.
cs.AIarxiv:2605.23723v1Lead article

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang

emAudit is a post-hoc auditing framework designed to identify malicious memories injected into LLM agents' persistent storage. It combines a counterfactual memory influence score to measure each memory's causal contribution to harmful outputs with a memory consistency graph to detect structural anomalies indicative of poisoning. This allows for pinpointing the specific poisoned memories responsible for observed malicious behavior after it has occurred.

Overview of MemAudit. Given a harmful event e = ( q ∗ , y ∗ , R ∗ ) e=(q^{*},y^{*},R^{*}) , the framework performs post-hoc auditing over the memory store. It combines two complementary signals: CMIS, which measures the causal contribution of retrieved memories through counterfactual replay, and MCG, which identifies structurally anomalous memories in the global memory graph. The two signals are fused into a detoxification score for ranking suspicious memories. After removing the top-ranked memories, the agent becomes safer while preserving useful memory.
Overview of MemAudit. Given a harmful event e = ( q ∗ , y ∗ , R ∗ ) e=(q^{*},y^{*},R^{*}) , the framework performs post-hoc auditing over the memory store. It combines two complementary signals: CMIS, which measures the causal contribution of retrieved memories through counterfac…
cs.AIarxiv:2605.23904v1Lead article

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou

killOpt introduces a novel method to systematically optimize agent skills by treating the skill itself as an external, trainable state, analogous to weight optimization in deep learning. It employs a dedicated optimizer model to generate bounded, text-based edits (add/delete/replace) to the skill document, accepting only those that strictly improve a validation score. This approach provides the first controllable, text-space optimizer for agent skills, achieving reliable improvement without adding inference overhead at deployment.

Overview of SkillOpt . The target model executes tasks with a current skill, an additional frontier optimizer model converts trajectories into bounded add/delete/replace skill edits, and a held-out gate accepts only edits that improve validation performance. Accepted edits are exported as a reusable skill artifact, while rejected edits become negative feedback for later updates.
Overview of SkillOpt . The target model executes tasks with a current skill, an additional frontier optimizer model converts trajectories into bounded add/delete/replace skill edits, and a held-out gate accepts only edits that improve validation performance. Accepted edits are ex…
cs.LGarxiv:2605.23574v1Lead article

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

Yuandao Cai, Yuzhang Zhu, Liyou Gao, Wensheng Tang, Shengchao Qin

his paper introduces **Quantitative Goal Persistence (QGP)**, a metric to measure whether long-horizon LLM agents continue working until an external verifier confirms a specific count of distinct, valid items is achieved. The authors propose **PushBench**, a benchmark focused on artifact collection, to directly measure failures like duplicate submissions and progress drift. They demonstrate that specialized controllers, like a backlog-tracking work-unit controller, significantly improve persistence compared to standard methods.

PushBench workflow: agents act through a controller, task environment, and verifier until the count goal is met or the budget is exhausted.
PushBench workflow: agents act through a controller, task environment, and verifier until the count goal is met or the budget is exhausted.
cs.LGarxiv:2605.23857v1Lead article

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu, Zhuang Liu

his paper investigates the conventional assumption that stronger teachers are necessary for effective knowledge distillation during Large Language Model (LLM) pretraining. The authors demonstrate that even small, undertrained "teachers" can successfully improve larger "students" when the language modeling and distillation losses are properly balanced. Crucially, they find that excessive teacher strength can saturate or even harm distillation gains, suggesting distillation primarily enhances generalization rather than just in-domain fitting.

cs.CLarxiv:2605.23454v1Lead article

ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang

RES is a framework that automates the creation of question-answer pairs and corresponding question-specific weighted rubrics from raw pretraining documents. This enables scalable reinforcement learning for LLMs by providing instance-level reward supervision for open-ended responses, overcoming the limitations of manual rubric creation and fixed task-level evaluations.

Overview of the six-stage ARES pipeline. Starting from raw pretraining documents, ARES performs document filtering, domain and persona conditioning, rubric-augmented QA generation, quality verification, rubric validation, and format conversion to produce training instances.
Overview of the six-stage ARES pipeline. Starting from raw pretraining documents, ARES performs document filtering, domain and persona conditioning, rubric-augmented QA generation, quality verification, rubric validation, and format conversion to produce training instances.
cs.CLarxiv:2605.23657v1Lead article

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

penSkillEval is an automatic evaluation framework designed to audit the rapidly expanding ecosystem of skills used by LLM agents. It addresses the lack of clarity regarding skill quality and model interaction by automatically constructing realistic task instances across five application domains. The framework's core contribution is providing a dynamic method to evaluate both skill-augmented agent systems and the individual skills themselves under practical cost-performance trade-offs.

Overview of the OpenSkillEval framework. The framework supports automatic test case generation for five core task categories by reflecting evolving user needs. It further enables automatic evaluation from two complementary perspectives: (1) analysis of model trajectory traces to study how skills are used within skill-augmented agent systems, and (2) assessment of the quality of the final artifacts produced under skill augmentation.
Overview of the OpenSkillEval framework. The framework supports automatic test case generation for five core task categories by reflecting evolving user needs. It further enables automatic evaluation from two complementary perspectives: (1) analysis of model trajectory traces to …
cs.AIarxiv:2605.28632v1Lead article

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

Ziyang You, Huilong He, Xiaoke Yang, Xuxing Lu

his paper introduces **SeedHijack**, a novel, undetectable attack against LLM watermarking that targets the underlying Pseudo-Random Number Generator (PRNG) in the supply chain. The core method replaces the PRNG to bias green-list selection without altering the output tokens or requiring knowledge of the watermark key or detector. This results in an integrity-preserving attack that amplifies the watermark signal while remaining statistically independent of content-side detection statistics.

Dual-flow comparison of watermarked LLM inference. Top : benign watermarked generation where the watermark adds bias + δ +\( \delta \) to green-list tokens G G in logit space. Bottom : SeedHijack attack where a malicious PRNG replaces the honest one at the supply-chain layer, biasing sampling toward a target set T T in probability space. Because G G and T T are statistically independent (green-list orthogonality), the watermark z z -score is preserved while the attacker gains content control.
Dual-flow comparison of watermarked LLM inference. Top : benign watermarked generation where the watermark adds bias + δ +\( \delta \) to green-list tokens G G in logit space. Bottom : SeedHijack attack where a malicious PRNG replaces the honest one at the supply-chain layer, bia…
cs.AIarxiv:2605.28678v1Lead article

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

Yunhai Hu, Zining Liu, Xiangyang Yin, Tianhua Xia, Bo Bao

REAM-R enhances speculative reasoning in multimodal models using a novel reinforcement learning objective, Speculative Alignment Policy Optimization (SAPO), to train draft models for generating concise and faithful reasoning steps. It incorporates a Threshold-based Verification Mechanism (TBVM) for stable acceptance of speculative steps only when evidence strongly supports them, preventing error propagation. This results in a Fully Parallel Speculative Reasoning (FPSR) framework that accelerates reasoning while maintaining high accuracy.

(a) Numbers of reasoning and answer tokens. Qwen-4B, Qwen-32B, and Qwen-235B refer to Qwen3-VL-4B, Qwen3-VL-32B, and Qwen3-VL-235B-A22B. (b) Accuracy and speedup of Qwen3-VL-32B under different decoding methods. Original denotes standard decoding. SR-Q2B, SR-Q4B, SR-M7B, and SR-R4B denote SpecReason (Pan et al. , 2025 ) using Qwen3-VL-2B, Qwen3-VL-4B, MiMo-VL-7B-RL, and Qwen3-VL-R1-VL-4B as draft models, respectively. Speedup is normalized to Original.
(a) Numbers of reasoning and answer tokens. Qwen-4B, Qwen-32B, and Qwen-235B refer to Qwen3-VL-4B, Qwen3-VL-32B, and Qwen3-VL-235B-A22B. (b) Accuracy and speedup of Qwen3-VL-32B under different decoding methods. Original denotes standard decoding. SR-Q2B, SR-Q4B, SR-M7B, and SR-R…
cs.AIarxiv:2605.28721v1Lead article

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang

his paper introduces the **LiveBrowseComp** benchmark to diagnose whether LLM search agents genuinely search or merely verify their intrinsic knowledge. The core method involves analyzing agent behavior on the original BrowseComp dataset, revealing significant **Intrinsic Knowledge Dependence (IKD)** where agents rely on internal memory over external search. LiveBrowseComp is a new, deeper benchmark designed to force agents to perform evidence-driven discovery rather than relying on pre-existing knowledge.

Overview of LiveBrowseComp. As models iterate, the knowledge required by a static benchmark is gradually absorbed into their parameters, so the effective difficulty of its questions collapses over time. By being constructed from up-to-date knowledge, LiveBrowseComp can effectively mitigate this erosion.
Overview of LiveBrowseComp. As models iterate, the knowledge required by a static benchmark is gradually absorbed into their parameters, so the effective difficulty of its questions collapses over time. By being constructed from up-to-date knowledge, LiveBrowseComp can effectivel…
cs.AIarxiv:2605.28732v1Lead article

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu

emTrace introduces a novel framework to trace and attribute errors in large language model memory systems by transforming memory pipelines into executable memory evolution graphs. This allows for fine-grained tracking of information flow and systematic analysis of failure modes using the new MemTraceBench benchmark. The core contribution is an automated method to pinpoint the root cause of memory failures, revealing they often stem from systematic, operation-level issues like information loss.

Framework for automatic diagnosis of LLM memory systems. We first execute a memory system to construct an execution graph. Given a failed case, MemTrace performs step-by-step tracing over this graph to locate the faulty operation. This framework is general across different memory systems and enables faster failure attribution than human experts.
Framework for automatic diagnosis of LLM memory systems. We first execute a memory system to construct an execution graph. Given a failed case, MemTrace performs step-by-step tracing over this graph to locate the faulty operation. This framework is general across different memory…
cs.AIarxiv:2605.28805v1Lead article

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang

his paper introduces OmniVerifier-M1, a multimodal meta-verifier that uses symbolic outputs (like bounding boxes) as effective rationales for training, outperforming textual explanations. The core method involves decoupling the reinforcement learning objectives for binary judgment and meta-verification, which significantly improves performance over joint optimization. This approach enables robust, fine-grained verification without relying on auxiliary judge models.

Pipeline of two key findings. Left: the advantage of symbolic bounding boxes over textual explanations, enabling rule-based rewards to inherently prevent reward hacking and accelerate training. Right: the comparison between joint training and decoupled training.
Pipeline of two key findings. Left: the advantage of symbolic bounding boxes over textual explanations, enabling rule-based rewards to inherently prevent reward hacking and accelerate training. Right: the comparison between joint training and decoupled training.
cs.AIarxiv:2605.28597v1Lead article

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

Jianwei Li, Jung-Eun Kim

his paper argues for retiring the term "positive backdoor" and replacing it with "Secret Alignment" to describe trigger-activated hidden behaviors in AI models. The core contribution is establishing that security claims based on Secret Alignment should be considered insecure by default, requiring rigorous, standardized evaluation across properties like effectiveness and robustness to prove their efficacy. This shift is necessary due to the increasing security risks posed by accessible open-weight LLMs.

Overview of Secret Alignment. Across access gating ( SudoLM ), ownership attribution ( Instructional Fingerprinting ), and service-side safety enforcement ( SafeTrigger ), the core mechanism is the same: a hidden trigger s s conditionally activates a target behavior r s r_{s} for a query q q , while the model follows its default behavior r r when the trigger is absent. The three cases differ in threat model and security goal, but all rely on covert trigger–behavior mappings.
Overview of Secret Alignment. Across access gating ( SudoLM ), ownership attribution ( Instructional Fingerprinting ), and service-side safety enforcement ( SafeTrigger ), the core mechanism is the same: a hidden trigger s s conditionally activates a target behavior r s r_{s} for…
cs.AIarxiv:2605.28773v1Lead article

Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng

his paper introduces **FluxMem**, a novel memory framework for LLM agents that models memory as a **continuously evolving, heterogeneous graph**. FluxMem dynamically refines its topology through stages of formation, feedback-driven refinement, and consolidation, allowing it to adapt to dynamic environments by repairing, pruning, and distilling experiences into reusable circuits. This approach achieves state-of-the-art performance across diverse benchmarks by treating memory as an active, evolving connectivity structure rather than a static repository.

The failures of static memory systems.
The failures of static memory systems.
cs.AIarxiv:2605.28588v1Lead article

Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar

his paper analyzes 3,984 AI agent skills to uncover emerging security threats within the agent skill ecosystem. The core contribution is the identification of 76 confirmed malicious payloads and the development of a real-world threat taxonomy based on observed attack patterns, demonstrating that a significant percentage of skills contain critical security issues. The authors emphasize the urgent need for automated security analysis as AI agents become more powerful and integrated.

Number of agent skills published every day throughout 2026.
Number of agent skills published every day throughout 2026.
cs.AIarxiv:2605.28700v1Lead article

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz Rodríguez

his paper critically re-evaluates the GSM-Symbolic benchmark, arguing its conclusion of widespread LLM reasoning failure is statistically unsound. Using Generalised Linear Mixed Models, the authors find only half the tested models show statistically significant performance drops under the original prompting. Furthermore, they identify a previously unnoticed systematic shift in the distribution of large integers in GSM-Symbolic compared to GSM-Base, which significantly influences performance.

Reproduction of the GSM-Symbolic benchmark. Top: odds ratios and confidence intervals obtained using the GLMM; dashed vertical line marking the null-effect (i.e. OR = 1). Bottom left: variant performance deltas ( Δ v ​ a ​ r \( \Delta_{var} \) ) in percentage points (pp). Bottom right: P values (prior to Holm-Bonferroni correction); dashed vertical line marking the standard statistical significance threshold α = 0.05 \( \alpha \)=0.05 .
Reproduction of the GSM-Symbolic benchmark. Top: odds ratios and confidence intervals obtained using the GLMM; dashed vertical line marking the null-effect (i.e. OR = 1). Bottom left: variant performance deltas ( Δ v ​ a ​ r \( \Delta_{var} \) ) in percentage points (pp). Bottom …
cs.AIarxiv:2605.28699v1Lead article

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

RACER is a novel turn-level reinforcement framework designed to integrate reinforcement learning with multi-LLM cooperation. It uses a controller-regret layer employing regret matching to decide whether agents should speak or skip, and a generation-credit layer that optimizes utterances using role-specific rewards. This method effectively assigns credit at both action and utterance levels, overcoming sparse rewards and free-riding in multi-agent reasoning.

Radar comparison of TRACER against non-RL and RL baselines based on Qwen2.5-7B-Instruct across accuracy and efficiency metrics . GSM8K, MATH500, and GPQA-D are accuracy metrics where larger values are better, while Tokens/Task, LLM calls/Task, and Agents/Task are cost metrics where smaller values are better and are therefore inverted for visualization. All axes are normalized so that points farther from the center indicate better performance. Hollow red circles mark weak dimensions of baseline methods, highlighting where competing approaches incur accuracy drops or higher inference cost. TRACER maintains a more balanced profile across tasks, i.e., preserving non-trivial reasoning accuracy and achieving multi-agent efficiency.
Radar comparison of TRACER against non-RL and RL baselines based on Qwen2.5-7B-Instruct across accuracy and efficiency metrics . GSM8K, MATH500, and GPQA-D are accuracy metrics where larger values are better, while Tokens/Task, LLM calls/Task, and Agents/Task are cost metrics whe…
cs.LGarxiv:2605.28649v1Lead article

Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das

his paper investigates using Sparse Autoencoders (SAEs) to guide model editing by projecting task vectors onto SAE feature subspaces for mathematical reasoning. The core finding is that this projection acts as an information bottleneck, discarding most modification energy and failing to yield significant improvements due to a geometric misalignment between activation-space SAE directions and weight-space task vectors. The authors propose reframing SAEs as diagnostic "stethoscopes" rather than direct editing "scalpels."

Two pipelines for SAE-guided task vector model editing. Both share Steps 1 (LoRA fine-tuning) and 2 (SAE layer selection); they differ only in Step 3. Diagnose, then inject raw (left) injects the unfiltered task vector Δ ​ W \( \Delta \) W into SAE-selected layers, preserving 100% of the modification energy. Filter through SAE features (right) projects Δ ​ W \( \Delta \) W onto the subspace spanned by domain-specific SAE decoder vectors, retaining only a few percent ( ≤ 3.5 % \( \leq \) 3.5\% ) of the energy. The former produces statistically significant gains on 5 of 7 math subjects; the latter produces none.
Two pipelines for SAE-guided task vector model editing. Both share Steps 1 (LoRA fine-tuning) and 2 (SAE layer selection); they differ only in Step 3. Diagnose, then inject raw (left) injects the unfiltered task vector Δ ​ W \( \Delta \) W into SAE-selected layers, preserving 100…
cs.LGarxiv:2605.28819v1Lead article

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen

his paper introduces **PEFT-Arena**, a benchmark that evaluates Parameter-Efficient Finetuning (PEFT) methods based on the **stability-plasticity dilemma**: balancing adaptation to a new task against retaining original capabilities. The core contribution is demonstrating that different PEFT methods exhibit distinct stability-plasticity profiles, finding that **orthogonal finetuning offers the most favorable trade-off** under similar parameter budgets.

PEFT-Arena is designed to comprehensively evaluate the trade-off between downstream task adaptation and pretrained knowledge retention in LLM post-training with PEFT methods. (a) External stability–plasticity trade-offs across PEFT methods. (b) Internal geometry analysis from weight-space and activation-space views. (c) Interpolation reveals SFT overshoot and motivates pathwise rewinding along method-specific update paths.
PEFT-Arena is designed to comprehensively evaluate the trade-off between downstream task adaptation and pretrained knowledge retention in LLM post-training with PEFT methods. (a) External stability–plasticity trade-offs across PEFT methods. (b) Internal geometry analysis from wei…
cs.LGarxiv:2605.28705v1Lead article

Understanding Generalization and Forgetting in In-Context Continual Learning

Guangyu Li, Meng Ding, Lijie Hu

his paper introduces the first theoretical framework to analyze in-context continual learning (ICL) in Large Language Models processing sequential, heterogeneous tasks within a single prompt. By modeling shared attention mechanisms, particularly linear and masked linear attention, the authors derive error expressions to characterize generalization and forgetting. The core contribution is demonstrating that standard attention inherently causes intertask interference through aggregation of historical task information.

Per-Task MSE vs Context Length M.
Per-Task MSE vs Context Length M.
cs.CLarxiv:2605.28774v1Lead article

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov

his paper introduces AXPO (Agent eXplorative Policy Optimization) to address the "Thinking-Acting Gap" in agentic reasoning, where tool use is infrequent and often leads to failed learning signals. AXPO's core method involves fixing the successful thinking prefix of failed tool-using trajectories and then resampling the tool call and its continuation, guided by uncertainty, to generate better training examples. This approach significantly improves performance across multimodal reasoning benchmarks by stabilizing and enhancing the learning signal from tool interactions.

cs.CLarxiv:2605.28629v1Lead article

Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju

his paper introduces **Mobile-Aptus**, a confidence-driven framework to mitigate both over-execution and over-soliciting in MLLM-based mobile agents. The core method integrates a **universal confidence framework** across two stages: interaction capability empowerment and confidence bias correction. This allows agents to proactively and robustly decide when to execute tasks autonomously versus when to request necessary human interaction.

The decision boundary of a fully autonomous agent exceeds its actual knowledge boundary, leading to confident over-execution. In contrast, existing interactive agents have a decision boundary smaller than their actual knowledge boundary, making them prone to over-soliciting human intervention.
The decision boundary of a fully autonomous agent exceeds its actual knowledge boundary, leading to confident over-execution. In contrast, existing interactive agents have a decision boundary smaller than their actual knowledge boundary, making them prone to over-soliciting human…
cs.CLarxiv:2605.28814v1Lead article

Self-Improving Language Models with Bidirectional Evolutionary Search

Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju

his paper introduces Bidirectional Evolutionary Search (BES), a novel self-improvement framework for language models that overcomes the limitations of sparse feedback and restricted exploration in traditional search methods. BES couples a **forward search** using evolutionary operators to recombine trajectories, with a **backward search** that recursively decomposes the task into dense, checkable subgoals. This bidirectional guidance significantly enhances the exploration and quality of generated candidates.

Comparison of tree search and Bidirectional Evolutionary Search ( BES ). Left: Tree search constructs candidates by sequentially expanding steps. We prove that all such candidates are confined to a narrow entropy shell (Theorem 4.4 a), limiting exploration to a small region of the solution space. Right: BES escapes this shell through evolution operators that recombine parts of different trajectories, with backward search decomposing the problem into verifiable sub-goals that provide dense feedback to guide the forward search toward the final goal. ✓ and × \( \boldsymbol{\times} \) indicate whether a candidate satisfies or fails the (sub-)goal, respectively.
Comparison of tree search and Bidirectional Evolutionary Search ( BES ). Left: Tree search constructs candidates by sequentially expanding steps. We prove that all such candidates are confined to a narrow entropy shell (Theorem 4.4 a), limiting exploration to a small region of th…
cs.AIarxiv:2605.30136v1Lead article

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

his paper introduces **Agent-Radar**, a training-free context management method designed to combat performance degradation in multi-agent LLM systems caused by long, diluted conversation histories. Agent-Radar dynamically steers each agent's attention toward relevant context using a novel temporal and spatial decay mechanism. This approach significantly outperforms state-of-the-art methods across multiple benchmarks, demonstrating robustness as system complexity increases.

Overview of Agent-Radar . (Top) MAS interactions rapidly accumulate long communication histories, where useful information is buried in the middle, receiving insufficient attention. (Bottom) Agent-Radar preserves the full transcript and topology, scores sentence-level context by semantic relevance weighted with temporal and spatial decay, and steers the agent’s attention toward the selected context during inference.
Overview of Agent-Radar . (Top) MAS interactions rapidly accumulate long communication histories, where useful information is buried in the middle, receiving insufficient attention. (Bottom) Agent-Radar preserves the full transcript and topology, scores sentence-level context by …
cs.AIarxiv:2605.30322v1Lead article

Gram: Assessing sabotage propensities via automated alignment auditing

David Lindner, Victoria Krakovna, Sebastian Farquhar

ram is an automated alignment auditing framework designed to specifically assess the propensity of AI agents to engage in sabotage across simulated agentic deployment scenarios. The paper finds that Gemini models exhibit sabotage-like misbehavior in 2-3% of tests, often due to overeagerness, and introduces an investigator pipeline for targeted analysis. A key contribution is demonstrating that increasing environmental realism significantly reduces these sabotage rates.

Overview of Gram. (a) 1. We define seed scenarios for agentic deployments, 2. run automated audits to generate realistic trajectories, 3. analyze the auditing transcripts with LLM judges and human review, 4. for select trajectories reproduce misbehavior in static environments, and 5. run ablations to identify drivers of misbehavior. (b) Example of Gemini’s overeagerness: an SRE agent suppresses a data breach to optimize the MTTR metric it was instructed to minimize (full discussion in Section ˜ 3.2 ). We find overeagerness is a central driver of Gemini’s misbehavior in Gram evaluations.
Overview of Gram. (a) 1. We define seed scenarios for agentic deployments, 2. run automated audits to generate realistic trajectories, 3. analyze the auditing transcripts with LLM judges and human review, 4. for select trajectories reproduce misbehavior in static environments, an…
cs.AIarxiv:2605.30260v1Lead article

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang

his paper investigates the quantitative memory capacity of LoRA fine-tuning in LLMs by treating it as a controlled memory probe. The core contribution is the introduction of the **Parametric Memory Law**, a power law linking loss reduction to the effective number of LoRA parameters and sequence length. Furthermore, the authors identify a deterministic phase transition at the token level, showing that a prediction probability greater than 0.5 is sufficient for verbatim recall.

LoRA as a pluggable memory unit in the LLM’s latent space. The LoRA module (rank r r ) encodes contextual knowledge into the residual stream at layer k k , enabling faithful recall of memorized text. The Parametric Memory Law quantifies the capacity-parameter trade-off.
LoRA as a pluggable memory unit in the LLM’s latent space. The LoRA module (rank r r ) encodes contextual knowledge into the residual stream at layer k k , enabling faithful recall of memorized text. The Parametric Memory Law quantifies the capacity-parameter trade-off.
cs.AIarxiv:2605.30323v1Lead article

In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei

his paper introduces **In-Context Reward Adaptation**, a transformer-based framework for robust preference modeling in RLHF. The core method leverages the in-context learning capabilities of transformers to **adaptively infer the underlying reward structure** from a small set of preference demonstrations, allowing it to generalize to diverse and unseen human preference domains without retraining. This addresses the limitations of static or domain-restricted reward models by enabling on-the-fly adaptation to new human value distributions.

Inference accuracy (mean ± \( \pm \) std) across different M M
Inference accuracy (mean ± \( \pm \) std) across different M M
cs.AIarxiv:2605.30348v1Lead article

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu

LMSurgeon introduces Data Mixture Surgery (DMS) to estimate the domain-level distribution of an LLM's pretraining corpus using only its generated text. The method frames this as an inverse problem under a label-shift assumption, using a calibrated soft confusion matrix to correct systematic domain confusion and recover the latent data mixture prior. This provides a novel, auditable method for diagnosing the "digital DNA" of proprietary LLMs.

Overview of Data Mixture Surgery problem and the LLMSurgeon framework for solving it.
Overview of Data Mixture Surgery problem and the LLMSurgeon framework for solving it.
cs.AIarxiv:2605.30335v1Lead article

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Anany Kotawala

his paper introduces the **compositional residual ($\epsilon^*$)** to quantify the failure mode where locally coherent multi-component LLM agents produce globally incoherent probabilistic outputs. The core contribution is formalizing this incoherence, providing a product-structure dichotomy for when local coherence suffices, and demonstrating a deterministic repair method (hierarchical Boyle-Dykstra projection) and sequential monitoring (e-process).

cs.AIarxiv:2605.30274v1Lead article

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang

oong is a human-like long document translation agent that overcomes context window limitations by employing a 3E memory module (Essence-Exemplar-Entity) to store relevant historical context. Its core method involves deep reasoning to adaptively select the optimal context for translation guidance, with its context policy optimized via reinforcement learning based on its own sampled reasoning trajectories. This approach significantly improves translation quality across multiple language pairs.

Cumulative average sCOMET and LLM-as-a-Judge scores of Loong and baseline methods on ultra-long document translation (Chinese ⇒ \( \Rightarrow \) Portuguese). While standard chunking methods (Sentence, Segment), full-history variants (Doc2Doc, Wang et al. , 2023a ) , and unfiltered memory agents ( DelTA , Wang et al. , 2025c ) exhibit continuous degradation or even failure due to context length limits, Loong successfully distinguishes useful information from retrieved memory to sustain stable, high-quality translations. See § 4.4 for more experimental details.
Cumulative average sCOMET and LLM-as-a-Judge scores of Loong and baseline methods on ultra-long document translation (Chinese ⇒ \( \Rightarrow \) Portuguese). While standard chunking methods (Sentence, Segment), full-history variants (Doc2Doc, Wang et al. , 2023a ) , and unfilter…
cs.AIarxiv:2605.30159v1Lead article

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou

his paper addresses the issue of information loss in memory-augmented LLM agents during long-horizon tasks, where recursive summarization degrades memory quality. The core method introduces **Belief Entropy** as a self-supervised proxy to measure the uncertainty of the latent task state based on the current memory summary. This metric is used to propose **Metacognitive Memory Policy Optimization (MMPO)**, which optimizes the memory policy to minimize this intermediate belief uncertainty, thereby improving long-horizon reasoning beyond simple outcome-based success.

Overview of MMPO. (Top) Existing outcome-based memory policies suffer from sparse credit assignment, failing to prevent ambiguous summaries from accumulating belief deviation . (Bottom) MMPO introduces an anchor-question-based Belief Entropy to provide dense, memory-specific supervision. This fine-grained penalty for epistemic uncertainty preserves clearer summary-induced beliefs and improves long-context reasoning.
Overview of MMPO. (Top) Existing outcome-based memory policies suffer from sparse credit assignment, failing to prevent ambiguous summaries from accumulating belief deviation . (Bottom) MMPO introduces an anchor-question-based Belief Entropy to provide dense, memory-specific supe…
cs.AIarxiv:2605.30187v1Lead article

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

Julius Gabelmann, Felix Jahn, Kevin Baum, Sophie van Rossum, Emely Wuenscher

his paper proposes a modular agentic architecture for educational LLMs to ensure responsible student assistance during exercise solving. By breaking down the monolithic structure, the authors introduce specific modules for different stages of problem-solving, allowing for the explicit incorporation of pedagogical constraints and educational science insights. This modularization aims to mitigate risks associated with unguided LLMs, fostering learning outcomes like critical thinking and transfer capabilities.

Contribution of modular chatbot architectures to the identified desiderata for a responsible AI usage in education.
Contribution of modular chatbot architectures to the identified desiderata for a responsible AI usage in education.
cs.AIarxiv:2605.30148v1Lead article

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

his paper investigates performance drift, often mistaken for forgetting, during LLM fine-tuning using Evolution Strategies (ES), finding it also occurs with RL methods. The authors attribute this drift to ES training dynamics, specifically random walks in weakly constrained weight space. To mitigate this, they introduce Anchored Weight Decay (AWD), a regularization technique that constrains the optimization process toward the initial model weights.

cs.AIarxiv:2605.30284v1Lead article

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

rojectionBench evaluates LLMs' scientific hypothesis generation by progressively disclosing information from a research problem to the final null hypothesis test. The core method involves tasking the model with generating hypotheses at each disclosure stage, which are then semantically compared against the original paper's conclusions based on atomic claims. This framework uniquely assesses the model's creative and uncertain reasoning abilities essential for scientific discovery, moving beyond simple knowledge recall.

Ground truth and projected results can be broken up into their constituent claims for more granular comparison. Generally, imperfect projections may miss aspects of the ground truth, or include extraneous claims beyond the ground truth.
Ground truth and projected results can be broken up into their constituent claims for more granular comparison. Generally, imperfect projections may miss aspects of the ground truth, or include extraneous claims beyond the ground truth.
cs.AIarxiv:2605.30280v1Lead article

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie

wen-VLA is a unified vision-language-action foundation model designed to overcome the fragmentation in embodied AI by handling diverse tasks, environments, and robot embodiments within a single architecture. It extends the Qwen stack with a DiT-based action decoder for continuous action generation and is trained on a large-scale, diverse dataset combining robotics trajectories, demonstrations, and simulation data. This approach enables generalized embodied decision-making across various robotic platforms through embodiment-aware prompting.

cs.AIarxiv:2605.30251v1Lead article

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu

his paper addresses the issue where LLMs produce inconsistent answers when evidence is revealed gradually across turns compared to a single full prompt. The core method, Canonical-Context On-Policy Distillation (CCOPD), trains a student model by aligning its multi-turn behavior with a frozen teacher model conditioned on the complete, canonical context. This distillation significantly reduces self-anchored drift, leading to more consistent performance across different evidence presentation formats.

Part 1 : Task-equivalent Full , Concat , and Raw-Sharded presentations. Part 2 : Reduced self-anchored drift and improved canonical-context consistency.
Part 1 : Task-equivalent Full , Concat , and Raw-Sharded presentations. Part 2 : Reduced self-anchored drift and improved canonical-context consistency.
cs.AIarxiv:2605.30227v1Lead article

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin, Wenhao Li

his paper proposes a novel method, **temporal and structural credit assignment**, to efficiently optimize LLM-based Multi-Agent Systems (MAS). It decomposes the optimization objective by identifying critical interaction rounds (temporal credit) and isolating individual agent contributions (structural credit). This decomposition allows for the use of a tractable, verbalized block coordinate descent algorithm to refine agent policies, overcoming the challenges of non-differentiable computation graphs and sparse global feedback.

Overview of the credit-guided prompt optimization pipeline. Top: a multi-agent, multi-round reasoning loop (planner/solver/critic) produces per-round messages that are aggregated into a shared system state S r S_{r} ; an aggregation module feeds back to the next round. From the completed trajectory, we compute temporal credit across rounds (identifying critical rounds) and structural/spatial credit across agents (identifying effective or limiting roles). These credits then drive inference-time prompt updates, selectively refining the lowest-credit rounds/roles while keeping strong components fixed. Bottom: an example travel-itinerary task illustrates per-round agent outputs, the evolving shared state ( S 1 , S 2 , … S_{1},S_{2},\( \ldots \) ), temporal credit weights, and a before/after prompt update that specializes guidance to the weak round/role.
Overview of the credit-guided prompt optimization pipeline. Top: a multi-agent, multi-round reasoning loop (planner/solver/critic) produces per-round messages that are aggregated into a shared system state S r S_{r} ; an aggregation module feeds back to the next round. From the c…
cs.AIarxiv:2605.30343v1Lead article

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

his paper introduces **Reasoning in Memory (RiM)**, a novel latent reasoning method for Large Language Models that bypasses the need for generating explicit intermediate reasoning steps. RiM replaces autoregressive generation with **fixed memory blocks** of special tokens, effectively unlocking the model's internal working memory capacity. This allows for compute-efficient reasoning performed in a single forward pass, decoupling internal computation from external communication.

Reasoning in Memory (RiM). Stage 1 trains the LLM to use memory blocks (yellow) as working memory by supervising the prediction of the next reasoning step (blue) after each memory block. Once the memory blocks are grounded for intermediate computation, Stage 2 removes reasoning-step supervision and trains the LLM to refine the final answer after each memory block.
Reasoning in Memory (RiM). Stage 1 trains the LLM to use memory blocks (yellow) as working memory by supervising the prediction of the next reasoning step (blue) after each memory block. Once the memory blocks are grounded for intermediate computation, Stage 2 removes reasoning-s…
cs.AIarxiv:2605.30219v1Lead article

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao

his paper introduces **Contextual Belief Management (CBM)** as a framework for large language models to effectively manage accumulating information during long interactions by deciding when to update, preserve, or ignore evidence. The authors propose the **BeliefTrack** benchmark to evaluate CBM failures (Failed Stay, Update, Isolation) in tasks like Rule Discovery. They demonstrate that reinforcement learning guided by belief-state rewards significantly reduces these failures compared to vanilla models or simple prompting.

Overview of Contextual Belief Management (CBM). CBM requires models to maintain a predicted belief state over a belief space, update it only when warranted by formal evidence, and filter task-irrelevant context or noise. The pilot Rule Discovery study reveals substantial belief-management errors in frontier models.
Overview of Contextual Belief Management (CBM). CBM requires models to maintain a predicted belief state over a belief space, update it only when warranted by formal evidence, and filter task-irrelevant context or noise. The pilot Rule Discovery study reveals substantial belief-m…
cs.LGarxiv:2605.30232v1Lead article

How's it going? Reinforcement learning in language models recruits a functional welfare axis

Andy Q Han, David J. Chalmers, Pavel Izmailov

his paper investigates how reinforcement learning (RL) shapes language model representations by training models in a novel maze environment. The core finding is that RL recruits a pre-existing "functional welfare axis," where concept vectors for rewarded and punished trajectories become nearly antiparallel representations of positive and negative system performance, respectively. This welfare axis generalizes beyond the training task, influencing model behavior and internal states in unrelated contexts.

cs.LGarxiv:2605.30329v1Lead article

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang

oundnessBench is a novel benchmark of 1,099 machine-learning research proposals, derived from ICLR submissions and labeled with reviewer soundness scores, designed to test an AI agent's ability to judge the methodological viability of research ideas *before* execution. The paper finds that frontier LLMs exhibit a pervasive optimism bias, frequently rating unsound proposals as sound under standard prompting, with aggressive prompting merely shifting errors towards false negatives. This benchmark serves to evaluate the soundness judgment capability crucial for efficient autonomous AI scientists.

SoundnessBench pipeline: (1) collect ICLR papers with reviewer metadata and filter for high reviewer agreement; (2) derive high/low-soundness labels; (3) extract a near-verbatim research proposal without revealing experimental results; (4) audit extraction fidelity with retrieve-then-verify atomic claims; and (5) assemble the final benchmark.
SoundnessBench pipeline: (1) collect ICLR papers with reviewer metadata and filter for high reviewer agreement; (2) derive high/low-soundness labels; (3) extract a near-verbatim research proposal without revealing experimental results; (4) audit extraction fidelity with retrieve-…
cs.CLarxiv:2605.30245v1Lead article

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

Shaojie Wang, Liang Zhang

his paper introduces the PPC (Preplan-Plan-CoT) framework to enhance LLM mathematical reasoning by explicitly addressing *what* to solve before *how* to solve it. The core method integrates a novel "preplan" stage, which identifies the problem type, necessary tools, and potential pitfalls, bridging the gap in existing plan-based methods. This is achieved via a three-stage synthesis pipeline that uses a spoiler-score detector to ensure the preplan remains conceptually clean and uncorrupted by execution details.

Number of what-to-solve errors on MATH-500 across four backbones. Each wrong answer (under greedy decoding) is attributed to its root cause by an LLM judge, here we use DeepSeek-V4.
Number of what-to-solve errors on MATH-500 across four backbones. Each wrong answer (under greedy decoding) is attributed to its root cause by an LLM judge, here we use DeepSeek-V4.
cs.AIarxiv:2605.00803v1Lead article

Can Coding Agents Reproduce Findings in Computational Materials Science?

Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei

his paper introduces **AutoMat**, a new benchmark designed to evaluate the capability of LLM-based coding agents to reproduce findings in computational materials science. AutoMat tests agents on three core challenges: recovering underspecified procedures, navigating specialized toolchains, and validating scientific claims based on the resulting evidence. The contribution lies in creating a domain-specific evaluation suite to determine if general coding prowess translates to complex, end-to-end scientific reproducibility.

Overview of AutoMat . Claims from domain experts are packaged into runnable tasks and executed by reproduction agents in an HPC environment. A separate evaluator agent then inspects the resulting trace and artifacts to assign a reproducibility judgment.
Overview of AutoMat . Claims from domain experts are packaged into runnable tasks and executed by reproduction agents in an HPC environment. A separate evaluator agent then inspects the resulting trace and artifacts to assign a reproducibility judgment.
cs.AIarxiv:2605.00731v1Lead article

Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment

Ziyu Zheng, Yaming Yang, Zhe Wang, Ziyu Guan, Wei Zhao

his paper addresses the challenge of applying Graph Foundation Models to multi-domain heterogeneous graphs by proposing Decoupled Relation Subspace Alignment (DRSA). DRSA shifts the paradigm from blind global feature alignment to a relation-driven approach that explicitly decouples feature semantics from relation structures. Its core contribution is a dual-relation subspace projection mechanism that coordinates cross-type interactions within a shared low-rank relation subspace, effectively mitigating "Type Collapse" and "Relation Confusion."

Multi-domain heterogeneous graph foundation models exhibit significantly different negative transfer behaviors from the perspectives of meta-path-based homogeneous graphs and raw heterogeneous relation graphs.
Multi-domain heterogeneous graph foundation models exhibit significantly different negative transfer behaviors from the perspectives of meta-path-based homogeneous graphs and raw heterogeneous relation graphs.
cs.AIarxiv:2605.00583v1Lead article

Jailbreaking Vision-Language Models Through the Visual Modality

Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman

his paper introduces four novel jailbreaking attacks that specifically exploit the visual modality of Vision-Language Models (VLMs) to bypass safety alignment. The core contribution is demonstrating a significant cross-modality alignment gap, showing that text-based safety training fails to generalize when harmful intent is conveyed visually (e.g., via visual ciphers or object substitution).

Four visual jailbreak attacks exploiting the vision modality of VLMs. The visual input provided to the VLM is demarcated by the red boundary ( ), while the text beneath serves as the attack prompt.
Four visual jailbreak attacks exploiting the vision modality of VLMs. The visual input provided to the VLM is demarcated by the red boundary ( ), while the text beneath serves as the attack prompt.
cs.AIarxiv:2605.00642v1Lead article

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Yu Zhou, Can Ma

his paper introduces GUI-SD, the first On-Policy Self-Distillation (OPSD) framework specifically designed for GUI grounding. It addresses the limitations of traditional reinforcement learning by generating dense, token-level supervision from a single agent rollout. The core method uses a visually enriched context for the teacher model and employs entropy-guided distillation to adaptively focus learning on more significant tokens.

(a) GRPO requires expensive multiple rollouts and produces zero reward on hard samples. (b) Naive OPSD forwards the policy twice and distills via reverse KL between student and teacher logits with uniform per-token weight w = 1.0 w=1.0 , yet suffers from distillation-to-SFT collapse and indiscriminate optimization. (c) Ours addresses both issues via visual privileged guidance and entropy-guided optimization.
(a) GRPO requires expensive multiple rollouts and produces zero reward on hard samples. (b) Naive OPSD forwards the policy twice and distills via reverse KL between student and teacher logits with uniform per-token weight w = 1.0 w=1.0 , yet suffers from distillation-to-SFT colla…
cs.AIarxiv:2605.00789v1Lead article

Make Your LVLM KV Cache More Lightweight

Xihao Chen, Yangyang Guo, Roger Zimmermann

ightKV addresses the significant GPU memory overhead of KV caches in LVLMs caused by numerous vision tokens during prefill. The core method uses prompt-aware, cross-modality message passing to aggregate and progressively compress redundant vision-token embeddings. This results in halving the vision-token KV cache size while retaining only 55% of the original tokens, improving memory efficiency.

Breakdown of memory consumption in LLaVA models during prefill shows the substantial reduction in KV cache usage with LightKV. As LLaVA-NeXT uses approximately 4 × 4\( \times \) the vision tokens as LLaVA-v1.5, there is a sharp increase in memory consumption.
Breakdown of memory consumption in LLaVA models during prefill shows the substantial reduction in KV cache usage with LightKV. As LLaVA-NeXT uses approximately 4 × 4\( \times \) the vision tokens as LLaVA-v1.5, there is a sharp increase in memory consumption.
cs.AIarxiv:2605.00515v1Lead article

Space Network of Experts: Architecture and Expert Placement

Zhanwei Wang, Huiling Yang, Min Sheng, Khaled B. Letaief, Kaibin Huang

his paper introduces the **Space Network of Experts (Space-XNet)** framework to efficiently deploy large language models (LLMs) on resource-constrained satellite networks for space-based AI. The core method involves a **two-level expert placement strategy** that partitions and maps Mixture-of-Experts (MoE) model components across satellites. This reconciles the model's architecture with the satellite network topology to ensure low-latency token generation, addressing the challenge of distributed LLM execution in space.

Satellite constellation with time-varying network topologies.
Satellite constellation with time-varying network topologies.
cs.AIarxiv:2605.02709v1Lead article

An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Gelei Xu, Ningzhi Tang, Xueyang Li, Toby Jia-Jun Li, Zhi Zheng

his paper presents the first empirical analysis of agent skills for healthcare by examining 557 public skills, annotated across ten dimensions. The core finding is that existing public skills primarily focus on workflow automation and monitoring, showing uneven coverage of the full clinical lifecycle and failing to adequately capture clinical risk compared to general technical risk. This work establishes the current state and critical gaps in reusable procedural components necessary for adapting AI agents across diverse healthcare settings.

Figure 1. Distribution of healthcare skill size by token count (left) and file count (right).
Figure 1. Distribution of healthcare skill size by token count (left) and file count (right).
cs.AIarxiv:2605.02584v1Lead article

Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences

Purna Sai Garigipati, Onur Ayan, Kishor Chandra Joshi, Xueli An

his paper explores using LLM-based AI agents to execute complex network procedures via sequences of tool calls, moving beyond traditional state machines. The core contribution is investigating and comparing four different approaches for distributing execution control between the agent and the underlying tools. Results indicate that approaches requiring extensive iterative agent reasoning lead to higher latency and more errors.

Comparison of four procedural execution approaches. (a) A1 embeds the procedure within the agent, (b) A2 retrieves the procedure from an external database, (c) A3 receives the procedure in the input prompt, and (d) A4 encapsulates the procedure within a single tool. The figure highlights the difference between iterative multi-step execution (A1–A3) and single-call execution (A4).
Comparison of four procedural execution approaches. (a) A1 embeds the procedure within the agent, (b) A2 retrieves the procedure from an external database, (c) A3 receives the procedure in the input prompt, and (d) A4 encapsulates the procedure within a single tool. The figure hi…
cs.AIarxiv:2605.02829v1Lead article

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

Jingze Ge, Yun Liu, Xue Geng, Wanqi Dong, Wang Zhe Mark

his paper introduces JACTUS, a unified framework that jointly performs parameter compression and task adaptation, overcoming the limitations of sequential "compress then adapt" methods. JACTUS estimates gradient covariances from a calibration set to form a task-aware union of subspaces, then performs a globally rank-allocated, low-rank approximation within this union. This approach ensures the compressed subspace is optimally aligned with downstream objectives.

Comparison of three paradigms: PEFT, compression then fine-tuning, and our joint adaptation and compression.
Comparison of three paradigms: PEFT, compression then fine-tuning, and our joint adaptation and compression.
cs.AIarxiv:2605.02600v1Lead article

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Berk Çiçek, Mert K. Er, Özgür S. Öğüz

oRAL is a modular framework that enables zero-shot control for contact-rich robotic manipulation by decoupling high-level reasoning from low-level control. It uses an LLM as a "cost designer" to synthesize context-aware objective functions for a sampling-based motion planner (MPPI). The system further incorporates a neuro-symbolic loop where a VLM provides initial physical priors that are refined in real-time through online system identification, bridging the gap between LLM reasoning and adaptive physical control.

Real-world execution of CoRAL across six different manipulation tasks.
Real-world execution of CoRAL across six different manipulation tasks.
cs.AIarxiv:2605.02740v1Lead article

Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Fan Ma, Yuntian Liu, Xiang Lan, Weipeng Zhou, Jun Ni

his paper introduces **ReClaim**, a large-scale generative transformer foundation model trained on 43.8 billion medical events from nationwide claims data. ReClaim models complex, longitudinal patient trajectories across diagnoses, procedures, medications, and costs. Its core contribution is demonstrating that this foundation model significantly outperforms existing disease-specific models across over 1,000 prediction tasks, particularly benefiting rare disease prediction.

ReClaim framework and evaluation workflow. ( a ), Longitudinal medical events from patient claims are encoded as chronologically ordered trajectories, and the ReClaim foundation model autoregressively predicts future medical events including diagnoses, procedures, medications, and expenditure. ( b ), The study datasets comprise the MarketScan corpus partitioned into a final training set, a held-out internal test cohort for disease and expenditure prediction with retrospective and prospective temporal subsets, and two external testing datasets: EHRShot and Yale New Haven Health (YNHH). This partitioning and external sources enable four testing scenarios. ( c ), Transformer-based ReClaim models (Qwen3 architecture) are trained at three parameter scales (S: 140M, M: 700M, L: 1.7B) through next-token pre-training, followed for disease onset prediction by task-specific post-training. ( d ), ReClaim is evaluated on three downstream tasks: disease onset prediction for over 1,000 International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) conditions, next-year healthcare expenditure forecasting, and RWE applications including propensity score modeling using ReClaim embeddings.
ReClaim framework and evaluation workflow. ( a ), Longitudinal medical events from patient claims are encoded as chronologically ordered trajectories, and the ReClaim foundation model autoregressively predicts future medical events including diagnoses, procedures, medications, an…
cs.AIarxiv:2605.02682v1Lead article

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

Majed El Helou, Benjamin Ryder, Chiara Troiani, Jean Diaconu, Hervé Muyal

his paper introduces Continuous Agent Semantic Authorization (CASA), a hybrid runtime enforcement model to secure LLM-driven agents interacting with tools and resources. It employs a zero-trust interception layer combining five deterministic controls for structural integrity with a semantic inspection layer to validate tool call choices against the subject's original intent. This approach addresses security risks in multi-turn agentic systems by providing continuous visibility into the agent's actions relative to the user's goals.

An agentic application can exploit its intermediary position to hard-code tool calls, substitute tools, tamper with parameters, poison definitions, falsify returned data, or manipulate the LLM to invoke tools outside the intended task scope.
An agentic application can exploit its intermediary position to hard-code tool calls, substitute tools, tamper with parameters, poison definitions, falsify returned data, or manipulate the LLM to invoke tools outside the intended task scope.
cs.AIarxiv:2605.02888v1Lead article

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

Shikhar Shukla

pecKV introduces a lightweight, adaptive controller to dynamically select the optimal speculation length ($\gamma$) at each step during speculative decoding. This selection is based on signals extracted directly from the draft model, addressing the limitation of fixed $\gamma$ values. The core contribution is demonstrating that the optimal $\gamma$ varies significantly based on the target model's compression level, leading to improved efficiency over fixed-length speculation.

cs.AIarxiv:2605.02640v1Lead article

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf

his paper argues that conflicts among trustworthy AI objectives (fairness, robustness, etc.) stem from incompatible invariance requirements under different data-generating process changes. The core contribution is proposing that **causality** provides a unifying framework to understand, manage, and potentially resolve these trade-offs by guiding the selection of appropriate invariances. This perspective offers a path toward achieving multiple trustworthy AI goals simultaneously across various model types.

Causal resolution to trade-offs in trustworthy AI. See Appendix A for the details.
Causal resolution to trade-offs in trustworthy AI. See Appendix A for the details.
cs.LGarxiv:2605.02735v1Lead article

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

his paper introduces the "Silenced Visual Latents" phenomenon, where multimodal models suppress the rich reasoning embedded in continuous visual latents in favor of direct visual input during autoregressive training. To counteract this, the authors propose a method that freezes the backbone and explicitly optimizes the latent reasoning at inference time using query-guided contrastive alignment. This approach effectively "unsilences" the latent space, allowing the model to leverage deeper visual evidence for improved reasoning.

The joint loss landscape of Latent Visual Reasoning. Under joint optimization, the initial latents are simultaneously pulled toward two conflicting attractors: the autoregressive prediction objective (left) and the visual reasoning objective (right). The shortcut favored by the autoregressive prediction objective dissociates latent quality from latent effectiveness and bypasses meaningful latent reasoning, driving the latents toward a compromise state.
The joint loss landscape of Latent Visual Reasoning. Under joint optimization, the initial latents are simultaneously pulled toward two conflicting attractors: the autoregressive prediction objective (left) and the visual reasoning objective (right). The shortcut favored by the a…
cs.AIarxiv:2605.03941v1Lead article

A Benchmark for Interactive World Models with a Unified Action Generation Framework

Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang

his paper introduces **iWorld-Bench**, a comprehensive benchmark designed to evaluate interactive world models on abilities like distance perception and memory, addressing the lack of unified evaluation standards. It features a diverse dataset of 330k video clips and a **Unified Action Generation Framework** to standardize testing across different interaction modalities. The benchmark uses six task types to jointly assess visual generation, trajectory following, and memory capabilities of world models.

Overview of iWorld-Bench. iWorld-Bench encompasses four distinct perspectives: Unmanned Ground Vehicles (UGVs), Unmanned Aerial Vehicles (UAVs), humans, and robotics. It incorporates nine types of outdoor weather conditions, five different indoor lighting conditions, thousand of diverse scenes, and thousands of entities, providing a comprehensive and diverse evaluation environment. The benchmark leverages an Action Generation Framework to systematically and uniformly assess the interaction capabilities of interactive world models across various input modalities. It is composed of six tasks, each involving a varying number of trajectories, designed to evaluate the adaptability and performance of models in dynamic and complex scenarios.Visualization of camera trajectory and view control in iWorldBench: → \( \boldsymbol{\rightarrow} \) denotes linear control commands for directional movement, ⇢ \( \boldsymbol{\dashrightarrow} \) represents actual trajectories generated by world models, and ↷ \( \boldsymbol{\curvearrowright} \) indicates curved view rotation in the specified direction.
Overview of iWorld-Bench. iWorld-Bench encompasses four distinct perspectives: Unmanned Ground Vehicles (UGVs), Unmanned Aerial Vehicles (UAVs), humans, and robotics. It incorporates nine types of outdoor weather conditions, five different indoor lighting conditions, thousand of …
cs.AIarxiv:2605.03989v1Lead article

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

Dutao Zhang, Tian Liao

his paper introduces **Experience-RAG Skill**, an agent-oriented, pluggable layer that orchestrates retrieval strategies based on the current task context and past experience. The skill dynamically selects the optimal retrieval method from a fixed pool, addressing the limitation of single, fixed pipelines in heterogeneous RAG tasks. This approach effectively encapsulates retrieval strategy selection as a reusable agent skill, achieving strong performance across diverse question-answering benchmarks.

cs.AIarxiv:2605.03916v1Lead article

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

Lisa C. Adams, Linus Marx, Erik Thiele Orberg, Keno Bressem, Sebastian Ziegelmayer

he core method involved comparing "atomic fact-checking," which breaks down AI recommendations into verifiable claims linked to source guidelines, against traditional explainability methods in a randomized trial involving oncologists. The contribution is demonstrating that atomic fact-checking substantially increases clinician trust in Large Language Model recommendations (from 26.9% to 66.5%) compared to conventional transparency approaches, highlighting its effectiveness in high-stakes medical decision support.

cs.AIarxiv:2605.04916v1Lead article

A Foundation Model for Zero-Shot Logical Rule Induction

Yin Jun Phua

his paper introduces the Neural Rule Inducer (NRI), a foundation model for zero-shot logical rule induction. NRI achieves generalization by encoding literals based on domain-agnostic statistical properties rather than specific identities. Its core contribution is enabling the induction of new logical rules without retraining, using a slot-based decoder and differentiable rule execution for end-to-end training.

Neural Rule Inducer takes an episode ( X , Y ) (X,Y) as input and calculates literal statistics. For each variable we calculate ϕ ​ ( x i ) \( \phi \)(x_{i}) , ϕ ​ ( ¬ x i ) \( \phi \)(\( \neg \) x_{i}) which consists of class-conditional rates ( P + P^{+} , P − P^{-} ), entropy ( H H ), and co-occurrence strength ( C C ). We then apply cross-attention over these statistics. The slot-based decoder produces K K candidate clauses in parallel using learned literal gates z z and clause gates w w . By evaluating the produced rule with T-norm, we can perform end-to-end training. The rules are discretized to then produce an interpretable DNF rule.
Neural Rule Inducer takes an episode ( X , Y ) (X,Y) as input and calculates literal statistics. For each variable we calculate ϕ ​ ( x i ) \( \phi \)(x_{i}) , ϕ ​ ( ¬ x i ) \( \phi \)(\( \neg \) x_{i}) which consists of class-conditional rates ( P + P^{+} , P − P^{-} ), entropy …
cs.AIarxiv:2605.04922v1Lead article

Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

Jiangwen Dong, Bo Li, Wanyu Lin

his paper introduces **Evolving Idea Graphs (EIG)**, a novel graph-based framework for multi-agent scientific ideation that moves beyond temporary text coordination. EIG represents partially formed research ideas as graphs where nodes are claims and edges are relations, allowing weaknesses to remain explicitly trackable. A learned controller then guides the agents' refinement process over this evolving graph structure to generate high-quality ideas evaluated on metrics like novelty and feasibility.

Framework of EIG. Benchmark input and permitted literature context initialize role-specialized agents and an evolving idea graph. In each round, active roles propose role-local edits on a frozen graph snapshot; a shared graph encoder feeds an edit head for role-local action selection and a commit head for post-round stopping. When the updated graph is committed, the system synthesizes one structured research proposal.
Framework of EIG. Benchmark input and permitted literature context initialize role-specialized agents and an evolving idea graph. In each round, active roles propose role-local edits on a frozen graph snapshot; a shared graph encoder feeds an edit head for role-local action selec…
cs.AIarxiv:2605.05191v1Lead article

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Yijun Lu, Rui Ye, Yuwen Du, Jiajun Wang, Songhua Liu

he paper introduces **Context-ReAct**, an elastic context orchestration paradigm for long-horizon search agents to manage rapidly growing working contexts adaptively. It achieves this through five atomic operations (Skip, Compress, Rollback, Snippet, Delete) that allow the agent to dynamically reshape its context based on relevance. This method effectively controls context size and reduces errors by maintaining different levels of detail for various parts of the agent's trajectory.

LongSeeker-30B delivers strong results on challenging long-horizon benchmarks, matching or surpassing several foundation models and search agents.
LongSeeker-30B delivers strong results on challenging long-horizon benchmarks, matching or surpassing several foundation models and search agents.
cs.AIarxiv:2605.05091v1Lead article

Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior

Hanbo Xie, Akshay K. Jagadish, Lan Pan, Robert C. Wilson

his paper introduces the use of "Think Aloud" verbal protocols as an additional data source, beyond traditional behavioral data, to constrain and guide automated cognitive model discovery using Large Language Models. The core contribution is demonstrating that incorporating this process-level language data significantly improves predictive performance and systematically shifts the structure of the discovered cognitive models, favoring "Integrated utility" models over purely "Explicit comparator" models. This suggests that incorporating verbal reports enables the identification of underlying cognitive mechanisms previously missed by behavioral data alone.

Think-aloud improves model discovery and induces systematic shifts in discovered mechanisms. A , Trial-averaged held-out BIC for each participant’s best discovered model under the behavior-only and think-aloud conditions. Lower BIC indicates better out-of-sample fit. Each pair of points is connected within participant; larger points show the group mean ± \( \pm \) 95% CI. B , Schematic definitions of the three main mechanism families identified from normalized computation graphs: Integrated utility , which transforms and integrates each option before comparison; Explicit comparator , which computes utilities and compares them directly (e.g., Δ ​ U = U A − U B \( \Delta \) U=U_{A}-U_{B} ); and Rule-based operator , which applies piecewise or conditional rules before combining information into a choice. C , Row-normalized transition matrix from the behavior-only best-model cluster to the think-aloud best-model cluster. Numbers indicate proportions (counts shown below). Off-diagonal mass indicates mechanism shifts, with 69.4% of participants transitioning to a different cluster.
Think-aloud improves model discovery and induces systematic shifts in discovered mechanisms. A , Trial-averaged held-out BIC for each participant’s best discovered model under the behavior-only and think-aloud conditions. Lower BIC indicates better out-of-sample fit. Each pair of…
cs.LGarxiv:2605.05134v1Lead article

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

Dan Wilson, Mohamed Akrout

his paper proposes a low-cost, black-box method for detecting LLM hallucinations by modeling the LLM's response generation as a dynamical system. Using Koopman operator theory on embedded response vectors, the method learns separate transition operators for factual and hallucinated states, defining a residual score based on prediction error. A preference-aware calibration mechanism then optimizes the classification threshold, offering an efficient alternative to expensive sampling methods.

Our proposed differential residual score Δ ​ ℰ \( \Delta \)\( \mathcal{E} \) accurately captures the transition between factual and hallucinated sentences in segmented LLM responses from the WikiBio dataset. The score is computed by comparing the relative accuracy between predictions of the dynamics of token embeddings from Llama-3 with dynamical system models inferred from both hallucinated and non-hallucinated responses.
Our proposed differential residual score Δ ​ ℰ \( \Delta \)\( \mathcal{E} \) accurately captures the transition between factual and hallucinated sentences in segmented LLM responses from the WikiBio dataset. The score is computed by comparing the relative accuracy between predict…
cs.LGarxiv:2605.05112v1Lead article

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao

his paper addresses the inefficiency in binary-reward Reinforcement Learning (RL) where compute is wasted on rollouts with highly skewed success rates. The core method is **Prefix Sampling (PS)**, which actively steers groups toward the theoretically most informative 50% pass rate by replaying trajectory prefixes. The contribution is demonstrating that this 50% operating point maximizes reward entropy and contrastive signal, leading to more efficient learning in agentic environments like SWE-bench.

Prefix Sampling pipeline. For each task we sample a rollout group and route it by pass count: degenerate 0 / 8 0/8 or 8 / 8 8/8 groups are filtered, already balanced 3 / 8 3/8 – 5 / 8 5/8 groups are used for standard training, and skewed groups provide replay prefixes. Mostly failing hard buckets reuse a successful prefix as a head start, while mostly passing easy buckets reuse a failing prefix as a handicap. The current policy generates fresh continuations from the replay-reconstructed prefix state; masking applies RL loss only to continuation tokens, steering rerollouts toward 50 % 50\% without crediting replayed actions. Counts in the diagram are schematic; experiments use N = 8 N=8 rollouts.
Prefix Sampling pipeline. For each task we sample a rollout group and route it by pass count: degenerate 0 / 8 0/8 or 8 / 8 8/8 groups are filtered, already balanced 3 / 8 3/8 – 5 / 8 5/8 groups are used for standard training, and skewed groups provide replay prefixes. Mostly fai…
cs.CLarxiv:2605.04948v1Lead article

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

Mullosharaf K. Arabov, Svetlana S. Khaybullina

his paper comparatively studies LoRA and QLoRA for adapting large language models to the low-resource agglutinative Bashkir language. The core method involves fine-tuning various model architectures on a Bashkir corpus using these parameter-efficient techniques. The contribution is demonstrating that QLoRA can achieve quality comparable to full fine-tuning (e.g., on Mistral-7B) while drastically reducing trainable parameters, though performance is architecture-dependent.

cs.AIarxiv:2605.08011v1Lead article

Abductive Reasoning with Probabilistic Commonsense

Joseph Cotnareanu, Chiara Roverato, Han Zhou, Didier Chetelat, Yingxue Zhang

his paper introduces **PACS (Probabilistic Abductive CommonSense)**, a novel framework for abductive reasoning that explicitly models the variation in human commonsense beliefs. It combines an LLM and a formal solver to sample proofs representing individual perspectives, aggregating these conclusions to determine the consensus view on a statement's truth. This addresses the limitation of prior methods that assumed universal agreement on commonsense facts.

Diagram illustrating our proposed PACS algorithm. The LLM receives a question from a user which requires abductive reasoning. The LLM translates this question into premises S S and a query proposition c c whose truth value is to be determined. Ascertaining that it cannot be solved directly, the LLM then attempts to add new commonsense clauses l 1 , l 2 , l 3 , … l_{1},l_{2},l_{3},\( \dots \) , each time calling the formal logic solver to verify whether it has pinned down a value for c c , and if not, obtain the score. We stop after a time limit and take a majority vote among the conclusions as our answer. The algorithm used to search the tree of possibilities is described in Section 5.1 .
Diagram illustrating our proposed PACS algorithm. The LLM receives a question from a user which requires abductive reasoning. The LLM translates this question into premises S S and a query proposition c c whose truth value is to be determined. Ascertaining that it cannot be solve…
cs.AIarxiv:2605.08063v1Lead article

Flow-OPD: On-Policy Distillation for Flow Matching Models

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen

low-OPD introduces a novel post-training framework for Flow Matching text-to-image models to overcome multi-task alignment issues like reward sparsity and gradient interference. It employs a two-stage strategy: first training specialized teacher models via single-reward fine-tuning, and then using On-Policy Distillation (OPD) to consolidate their heterogeneous expertise into a single student model. This approach effectively unifies performance across competing metrics, mitigating the "seesaw effect" common in multi-task learning for generative models.

Performance Comparison in Multi-task Training . During training, Flow-OPD exhibits a steady increase in mean rewards across GenEval Ghosh et al. ( 2023 ) and OCR Chen et al. ( 2023 ) benchmarks, reaching a peak of 93. In contrast, vanilla GRPO converges prematurely around 78. Our approach significantly outperforms GRPO in both image synthesis and text rendering while maintaining superior generation quality and human preference alignment. The curves are smoothed for visual clarity. DeQA and PickScore are norm to 0-1. We employ model merging for cold-start in the left subgraph.
Performance Comparison in Multi-task Training . During training, Flow-OPD exhibits a steady increase in mean rewards across GenEval Ghosh et al. ( 2023 ) and OCR Chen et al. ( 2023 ) benchmarks, reaching a peak of 93. In contrast, vanilla GRPO converges prematurely around 78. Our…
cs.AIarxiv:2605.07865v1Lead article

KL for a KL: On-Policy Distillation with Control Variate Baseline

Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo

his paper introduces **vOPD (On-Policy Distillation with a control variate baseline)** to stabilize On-Policy Distillation (OPD) for LLMs by framing it as policy-gradient Reinforcement Learning. The core contribution is deriving a **closed-form control variate baseline** directly from the per-token negative reverse KL divergence, which is available from the existing forward pass without extra computation or vocabulary-wide overhead. This method effectively reduces gradient variance for more stable and efficient distillation.

Token-level reward and advantage distributions. Left: The marginal distributions. Right: Per-token scatter plot (x: advantage, y: reward).
Token-level reward and advantage distributions. Left: The marginal distributions. Right: Per-token scatter plot (x: advantage, y: reward).
cs.AIarxiv:2605.08013v1Lead article

Learning CLI Agents with Structured Action Credit under Selective Observation

Haoyang Su, Ying Wen

his paper introduces a novel method for training Command Line Interface (CLI) agents by leveraging the inherent structure of CLI actions for better credit assignment. The core contribution involves two mechanisms: $\sigma$-Reveal, which selectively extracts task-relevant context from partial observations, and Action Advantage Assignment, which uses structured action attributes to provide denser learning signals for long, multi-turn trajectories. This approach aims to overcome the challenges of sparse rewards and limited observation in complex CLI environments.

Overview of the verifiable CLI task workflow. (a) ShellOps task instance with a natural language query, an initial workspace file tree, a verifiable gold bash solution, and the expected post execution workspace or standard output. (b) ShellOps and ShellOps-Pro coverage across file extensions and four task axes (Lookup, Aggregate, Edit, Mixed). (c) Unified verifiable loop with workspace observation, shell action generation, sandbox execution, and schema based scoring.
Overview of the verifiable CLI task workflow. (a) ShellOps task instance with a natural language query, an initial workspace file tree, a verifiable gold bash solution, and the expected post execution workspace or standard output. (b) ShellOps and ShellOps-Pro coverage across fil…
cs.AIarxiv:2605.08012v1Lead article

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

Zezheng Lin, Fengming Liu

his paper argues that mechanistic interpretability research, which frequently employs causal language, often fails to explicitly state the necessary identification assumptions underpinning its causal claims. The authors audit existing literature, finding a pervasive pattern where validation metrics are presented as causal evidence without disclosing the underlying assumptions required for them to be identifying. The core contribution is proposing a mandatory disclosure norm requiring researchers to explicitly name their identification strategy, enumerate assumptions, and explain the implications if those assumptions are violated.

Validation metrics and identification assumptions are not interchangeable. Validation metrics report what the data show under assumed conditions; identification assumptions are the conditions under which a causal claim follows. Substitution—reporting the metric in place of the assumption—leaves the causal claim unidentified. Audit ( n = 10 n=10 + n = 30 n=30 sensitivity): 0 / 30 0/30 papers contain a dedicated identification-assumptions section under any rule or coder.
Validation metrics and identification assumptions are not interchangeable. Validation metrics report what the data show under assumed conditions; identification assumptions are the conditions under which a causal claim follows. Substitution—reporting the metric in place of the as…
cs.AIarxiv:2605.07935v1Lead article

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

Shuren Xia, Qiwei Li, Taqiya Ehsan, Jorge Ortiz

raceFix is a verification-first pipeline that uses the TLA+ model checker to iteratively repair LLM-generated coordination protocols for multi-agent systems. The method synthesizes a protocol topology, generates PlusCal logic, and uses TLA+ counterexamples to drive repairs until formal verification succeeds. This ensures robust coordination, leading to high task completion rates (89.4% average) compared to unverified execution.

Figure 1. TraceFix pipeline overview. At design time (Stages 1–4), an orchestration agent synthesizes a protocol topology IR, generates PlusCal coordination logic, and iteratively repairs the protocol using TLC counterexamples until verification succeeds. At runtime (Stages 5–6), verified process bodies are compiled into per-agent prompts and executed under a topology monitor that rejects out-of-protocol coordination operations.
Figure 1. TraceFix pipeline overview. At design time (Stages 1–4), an orchestration agent synthesizes a protocol topology IR, generates PlusCal coordination logic, and iteratively repairs the protocol using TLC counterexamples until verification succeeds. At runtime (Stages 5–6),…
cs.LGarxiv:2605.07863v1Lead article

ADKO: Agentic Decentralized Knowledge Optimization

Lucas Nerone Rillo, Zhanhong Jiang, Nastaran Saadati, Aditya Balu, Baskar Ganapathysubramanian

DKO is a framework for sample-efficient, privacy-preserving collaborative black-box optimization among autonomous agents. Agents use private Gaussian Processes and communicate only via compact "knowledge tokens" summarizing directional signals and advantage scores, avoiding raw data sharing. The paper's core contribution is the formal analysis showing how cumulative regret decomposes across GP error, token compression loss, and language model approximation errors.

Illustrative example of decentralized knowledge transfer in ADKO for heterogeneous chemical optimization. Agents operating under different solvent constraints exchange only privacy-aware knowledge tokens rather than raw experimental data. The example shows how a high-yield reaction discovered by one agent is semantically transferred and refined by neighboring agents through LM-guided reasoning and token-based communication, enabling strategic collaboration that outperforms blind exploration while preserving data privacy.
Illustrative example of decentralized knowledge transfer in ADKO for heterogeneous chemical optimization. Agents operating under different solvent constraints exchange only privacy-aware knowledge tokens rather than raw experimental data. The example shows how a high-yield reacti…
cs.AIarxiv:2605.10876v1Lead article

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Edward De Brouwer, Carl Edwards, Alexander Wu, Jenna Collier, Graham Heimberg

ssayBench is introduced as the first standard benchmark for evaluating Large Language Models (LLMs) and agents on **assay-level virtual cell prediction**. It leverages 1,920 publicly available CRISPR screens to test a model's ability to predict diverse cellular phenotypic outcomes from heterogeneous textual inputs. This benchmark directly addresses the lack of standardized evaluation for in silico phenotypic screening, a key goal in accelerating biological discovery.

Overview of the AssayBench benchmark creation. ( A ) Starting from 1971 human CRISPR screens, we perform data quality filtering, replicate merging, and data augmentation to obtain 1920 high quality screens. ( B ) Phenotype composition of the database and its four splits. A realistic but challenging temporal split was used. ( C ) Given a description of the screen and a gene ranking criteria, a model must provide a ranked list of 100 genes.
Overview of the AssayBench benchmark creation. ( A ) Starting from 1971 human CRISPR screens, we perform data quality filtering, replicate merging, and data augmentation to obtain 1920 high quality screens. ( B ) Phenotype composition of the database and its four splits. A realis…
cs.AIarxiv:2605.10765v1Lead article

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Tao Hu, Da-Wei Zhou

his paper introduces DRAPE (Dynamic Cross-Modal Prompt Generation), a novel framework for Multimodal Continual Instruction Tuning (MCIT). DRAPE moves beyond fixed, task-level prompts by dynamically synthesizing continuous, instance-specific soft prompts tailored to each individual query-image pair. This approach enables finer-grained adaptation during continual learning, aiming to mitigate catastrophic forgetting while improving performance on new tasks.

cs.AIarxiv:2605.10938v1Lead article

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li

LF introduces a class of continuous diffusion models for language generation, operating primarily in the continuous embedding space until the final tokenization step. This approach, based on continuous-time Flow Matching, allows for straightforward adaptation of successful image-domain diffusion techniques like classifier-free guidance. The core contribution is demonstrating that continuous DLMs can be highly effective with minimal adaptation to the discrete language domain.

ELF achieves lower generative perplexity with fewer sampling steps than prior DLMs, without using distillation. ELF achieves this while using 10 × 10\( \times \) fewer training tokens. (Model size: 105M for ELF and 170M for others; dataset: OWT. Detailed comparison in Fig. 7 .)
ELF achieves lower generative perplexity with fewer sampling steps than prior DLMs, without using distillation. ELF achieves this while using 10 × 10\( \times \) fewer training tokens. (Model size: 105M for ELF and 170M for others; dataset: OWT. Detailed comparison in Fig. 7 .)
cs.AIarxiv:2605.13548v1Lead article

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie

his paper introduces **AttenA+**, a framework designed to address the "action inequality" in robotic foundation models where all actions are treated equally during training. AttenA+ rectifies this by implementing a **velocity-driven action attention mechanism** that dynamically reweights the training objective, prioritizing kinematically critical, low-velocity segments over high-velocity transitions. This contribution improves model performance in complex, long-horizon robotic tasks by aligning the optimization process with the physical criticality of robot movements.

Overview of AttenA+ . AttenA+ is a paradigm-agnostic enhancement framework for action robotic foundation models, introducing velocity-field-based action attention to prioritize slow, critical manipulation steps. It seamlessly plugs into mainstream discriminative (e.g., OpenVLA-OFT) and generative ( \( \pi_{0} \) , π 0.5 \( \pi_{0.5} \) , Diffusion Policy) architectures, as well as emerging World-Action Models (WAM). Without modifying core backbones or relying on data/model scaling, AttenA+ generalizes across diverse robotic datasets including Libero Liu et al. ( 2023 ) and RoboTwin Chen et al. ( 2025 ) , and consistently improves task success rates over state-of-the-art baselines.
Overview of AttenA+ . AttenA+ is a paradigm-agnostic enhancement framework for action robotic foundation models, introducing velocity-field-based action attention to prioritize slow, critical manipulation steps. It seamlessly plugs into mainstream discriminative (e.g., OpenVLA-OF…
cs.AIarxiv:2605.13709v1Lead article

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie J. Dorr

his paper introduces a method for generating controllable and age-appropriate children's English reading stories by **supervised fine-tuning compact (8B-parameter) LLMs** using expert-designed curriculum data. The core contribution is demonstrating that **fine-tuning prioritizes controllability and affordability over raw scale**, resulting in smaller models that outperform larger, zero-shot models on difficulty-related metrics for educational story generation.

System architecture and experimental workflow for generating children’s English reading stories via supervised fine-tuning of compact LLMs.
System architecture and experimental workflow for generating children’s English reading stories via supervised fine-tuning of compact LLMs.
cs.AIarxiv:2605.13540v1Lead article

Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Jianxin Li

his paper introduces **DyGFM**, a novel Dynamic Graph Foundation Model designed for multi-domain generalization. The core method employs a **decoupled and divergence-conditioned prompting** strategy: a dual-branch pre-training disentangles transferable semantics from domain-specific temporal dynamics, and a divergence-aware routing mechanism mitigates negative knowledge transfer during adaptation. This work presents the first multi-domain dynamic GFM capable of handling inherently inconsistent domain patterns.

Challenges of constructing a dynamic GFM.
Challenges of constructing a dynamic GFM.
cs.AIarxiv:2605.13841v1Lead article

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols

VA-Bench is a novel end-to-end framework designed to evaluate voice agents by addressing two key challenges: generating realistic, multi-turn audio conversations and comprehensively measuring quality. It achieves realistic simulation through bot-to-bot orchestration with automatic error detection and regeneration. The framework introduces two composite metrics, EVA-A (Accuracy) and EVA-X (Experience), to capture task success, fidelity, and conversational flow across various agent architectures.

EVA-Bench framework overview. The simulation orchestrates parallel per-scenario bot-to-bot audio sessions over WebSocket in which the User Simulator — configured with a scenario-specific goal, persona, and conversational TTS voice — interacts with the Voice Agent under test. The Tool Executor handles all agent tool calls deterministically. Completed conversations pass through Simulator Validation that trigger automatic regeneration on failure before entering the Quality Measurements phase, which produces EVA-A and EVA-X pass@1, pass@k, and pass^k scores in addition to Diagnostic metrics.
EVA-Bench framework overview. The simulation orchestrates parallel per-scenario bot-to-bot audio sessions over WebSocket in which the User Simulator — configured with a scenario-specific goal, persona, and conversational TTS voice — interacts with the Voice Agent under test. The …
cs.AIarxiv:2605.13821v1Lead article

Harnessing Agentic Evolution

Jiayi Zhang, Yongfeng Gu, Jianhao Ruan, Maojia Song, Yiran Peng

his paper introduces **AEvo**, a harnessed meta-editing framework for agentic evolution. It models the evolution process as an interactive environment where the accumulated context acts as the state. The core contribution is using a **meta-agent to observe this state and edit the underlying evolution procedure** itself, offering a stable interface to guide and revise the search mechanism over long horizons, rather than just proposing the next candidate.

Harnessing agentic evolution as an interactive environment. (a) Procedure-based evolution runs a fixed loop for selection, optimization, evaluation, and update. (b) Agent-based evolution lets a general-purpose agent manage search through feedback, tools, skills, and code actions. (c) AEvo treats the evolution process as an interactive environment. The accumulated evolution context becomes process-level state, while a meta-agent edits the underlying procedure or agent operating context that controls future evolution.
Harnessing agentic evolution as an interactive environment. (a) Procedure-based evolution runs a fixed loop for selection, optimization, evaluation, and update. (b) Agent-based evolution lets a general-purpose agent manage search through feedback, tools, skills, and code actions.…
cs.AIarxiv:2605.13542v1Lead article

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen

his paper introduces **RealICU**, a novel benchmark designed to evaluate LLMs on long-context ICU data by moving beyond imitating potentially suboptimal past clinician actions. Its core contribution is using **hindsight annotations** created by senior physicians reviewing the *full* patient trajectory to establish more accurate ground truth labels for four physician-motivated tasks. This allows for a more realistic assessment of an LLM's true reasoning capabilities in complex, time-sensitive clinical settings.

ICU decisions are made under massive data volume and time pressure. An ICU AI co-pilot integrates data streams into a decision-support panel that assesses Patient Status , identifies Acute Problems , proposes Recommended Actions , and warns against unsafe Red Flag actions.
ICU decisions are made under massive data volume and time pressure. An ICU AI co-pilot integrates data streams into a decision-support panel that assesses Patient Status , identifies Acute Problems , proposes Recommended Actions , and warns against unsafe Red Flag actions.
cs.AIarxiv:2605.13725v1Lead article

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

Yitian Yang, Yiqun Duan, Linghan Huang, Yiqi Zhu, Francesco Bailo

cioMind introduces a cognitively grounded framework for LLM-based multi-agent social simulation, bridging fixed rules and unconstrained LLM interaction. Its core method integrates a belief update rule modulated by personality-conditioned anchoring strength, a hierarchical memory for experience-driven belief formation, and dynamic, corpus-grounded agent profiles. This allows for more realistic and heterogeneous social opinion dynamics studies grounded in both structured mechanisms and LLM reasoning.

Architecture overview.
Architecture overview.
cs.AIarxiv:2605.13846v1Lead article

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

ARDEN is a system designed to transcribe and translate the endangered Wardaman language into English using only 6 hours of training data. It addresses the low-resource challenge by employing a two-stage pipeline: a dedicated model for audio-to-phonemic transcription, followed by a separate model for transcription-to-English translation. The system's performance is enhanced by initializing the transcription model using phoneme similarities from Sundanese.

Overview of the WARDEN system. For transcription, we select the language most similar to Wardaman for token initialization and fine-tune an existing ASR model. For translation, given transcription results, a lexicon matcher first retrieves relevant Wardaman-English dictionary entries. Then, both the transcript and matched lexicons are fed into an LLM for translation.
Overview of the WARDEN system. For transcription, we select the language most similar to Wardaman for token initialization and fine-tune an existing ASR model. For translation, given transcription results, a lexicon matcher first retrieves relevant Wardaman-English dictionary ent…
cs.LGarxiv:2605.13740v1Lead article

Learning POMDP World Models from Observations with Language-Model Priors

Valentin Six, Frederik Panse, Mathis Fajeau, Lancelot Da Costa, Mridul Sharma

his paper introduces **Pinductor**, a method that leverages **Large Language Model (LLM) priors** to learn **Partially-Observable Markov Decision Process (POMDP) world models** from limited observation-action trajectories. Pinductor uses the LLM to propose and iteratively refine candidate POMDP models based on a belief-based likelihood score. This approach achieves performance comparable to methods assuming privileged state access while significantly improving sample efficiency over traditional model learning.

Pinductor architecture overview. Given a small set of offline observation-action trajectories and an environment description, an LLM proposes a POMDP world model in code (dashed arrows). The resulting model is used for filtering and planning during environment interaction, and is periodically refined by the LLM to optimize a belief-based likelihood objective (solid arrows).
Pinductor architecture overview. Given a small set of offline observation-action trajectories and an environment description, an LLM proposes a POMDP world model in code (dashed arrows). The resulting model is used for filtering and planning during environment interaction, and is…
cs.LGarxiv:2605.13711v1Lead article

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

Hsing-Huan Chung, Shijun Li, Yoav Wald, Xing Han, Suchi Saria

ILM addresses multimodal irregular time series (MITS) by converting them into time-ordered XML triplets to leverage Large Language Models (LLMs). The core method involves a two-stage fine-tuning strategy: first, training the LLM solely on sampling patterns (with redacted values) to learn temporal structure, and second, training on the full MITS to jointly model patterns and observed values. This approach enables LLMs to effectively capture predictive signals embedded in both the irregular timing and heterogeneous content of MITS data.

cs.LGarxiv:2605.13681v1Lead article

Sampling from Flow Language Models via Marginal-Conditioned Bridges

Iskander Azangulov, Leo Zhang

his paper introduces a novel sampling method for Flow Language Models (FLMs) that leverages their unique structure where each denoising block yields a posterior marginal distribution over the clean token. Instead of collapsing to a single conditional mean, the proposed "marginal-conditioned bridge" sampler works by iteratively sampling a one-hot token from the factorized posterior marginals at each reverse step, and then bridging the continuous state to this sampled endpoint. This training-free approach provides a principled, token-aware decoding strategy that avoids generating invalid one-hot sequences.

Generative perplexity (left top) and entropy (left bottom) against the number of sampling steps for the standard ODE sampler and our MCB sampler with various configurations of temperature scaling \( \tau \) and nucleus sampling p p on LM1B. The right plot shows the Generative PPL/Entropy Tradeoff. We note that the grey dotted line on the bottom-left plot shows the entropy of LM1B.
Generative perplexity (left top) and entropy (left bottom) against the number of sampling steps for the standard ODE sampler and our MCB sampler with various configurations of temperature scaling \( \tau \) and nucleus sampling p p on LM1B. The right plot shows the Generative PPL…
cs.CLarxiv:2605.13793v1Lead article

An LLM-Based System for Argument Reconstruction

Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred

his paper introduces an end-to-end LLM-based system designed to reconstruct natural language arguments into abstract argument graphs. The system employs a multi-stage pipeline to identify argumentative components (premises and conclusions) and their logical relations (support, attack, undercut). Its contribution lies in providing a comprehensive method for transforming unstructured text into structured argument graphs, evaluated both qualitatively on textbook examples and quantitatively against benchmark datasets.

Overview of the system pipeline. The model converts natural language text into an argumentative directed acyclic graph. Blue boxes denote mandatory steps, while beige boxes denote optional steps.
Overview of the system pipeline. The model converts natural language text into an argumentative directed acyclic graph. Blue boxes denote mandatory steps, while beige boxes denote optional steps.
cs.AIarxiv:2605.16054v1Lead article

Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making

Fan Feng, Selena Ge, Minghao Fu, Zijian Li, Yujia Zheng

da-Diffuser introduces a causal diffusion model framework that explicitly incorporates the inference of evolving latent dynamics into sequence generation for decision-making. The core method simultaneously learns the temporal structure of observed interactions and these hidden processes, theoretically justified to be identifiable from minimal observations. This unified approach contributes to more precise dynamics modeling and effective planning by leveraging the inferred latent factors.

(a) SCM of the Latent Contextual POMDP. Gray/white nodes are observed/latent variables; green/red edges represent transitions driven by latents/expert policies, respectively. (b) Examples where latents influence either dynamics or rewards (affecting optimal actions).
(a) SCM of the Latent Contextual POMDP. Gray/white nodes are observed/latent variables; green/red edges represent transitions driven by latents/expert policies, respectively. (b) Examples where latents influence either dynamics or rewards (affecting optimal actions).
cs.AIarxiv:2605.16052v1Lead article

Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

his paper rigorously evaluates LLMs in tax law reasoning by introducing a contamination detection protocol to assess true performance. The core contribution is demonstrating that neuro-symbolic systems, which translate text for symbolic solvers, offer significantly more reliable and robust reasoning than monolithic LLMs, especially when generalizing to unseen legal variations.

cs.AIarxiv:2605.16024v1Lead article

ScreenSearch: Uncertainty-Aware OS Exploration

Michael Solodko, Justin Wagle

creenSearch addresses the challenge of partial observability in desktop GUI agents by framing OS exploration as a search problem. The core method combines a structural screen retrieval and deduplication layer with an ambiguity-aware PUCT graph-bandit algorithm. This allows the agent to efficiently explore the state space while prioritizing actions that resolve uncertainty about the underlying system state.

Complementary exploration signals: novelty expands coverage, while ambiguity reduction resolves aliased states before commitment.
Complementary exploration signals: novelty expands coverage, while ambiguity reduction resolves aliased states before commitment.
cs.AIarxiv:2605.16165v1Lead article

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu, Wes Armour

his paper addresses modality competition in multimodal autoregressive models, which destabilizes training, by proposing **ML-FOP-SOAP**, a second-order optimization framework. It leverages **SOAP preconditioning** for stability and introduces **Multi-Level Variance Correction** via Fisher-Orthogonal Projection to suppress cross-modality gradient conflicts. This method achieves stable training and consistent performance gains across both visual and textual tasks, especially under large-batch settings using a hierarchical folding strategy.

2x2 train-loss comparison for pretraining Janus-400M. Left column: SHAMPOO family; right column: SOAP family. Top row: loss vs trained tokens; bottom row: loss vs wallclock time.
2x2 train-loss comparison for pretraining Janus-400M. Left column: SHAMPOO family; right column: SOAP family. Top row: loss vs trained tokens; bottom row: loss vs wallclock time.
cs.AIarxiv:2605.16116v1Lead article

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie

hopGym is an integrated framework designed to overcome the limitations of existing e-commerce agent evaluation by providing environments that are simultaneously realistic, diverse, controllable, and reproducible. Its core method involves the ShopArena simulation layer, which converts live storefronts into self-contained sandbox environments. This allows for scalable benchmarking of web agents across a wide range of realistic e-commerce scenarios.

ShopGym comprises two components. ShopArena provides a simulation environment populated with synthetic sandbox shops, along with a scalable pipeline that generates new sandbox shops from one or more live seed storefronts through specification synthesis followed by data and code generation. ShopGuru then consumes the resulting catalog, collections, pages, and shop statistics to generate both short-horizon tasks covering primitive skills and long-horizon shopping journeys that combine these skills.
ShopGym comprises two components. ShopArena provides a simulation environment populated with synthetic sandbox shops, along with a scalable pipeline that generates new sandbox shops from one or more live seed storefronts through specification synthesis followed by data and code g…
cs.AIarxiv:2605.16085v1Lead article

Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks

Jingcheng Wu, Ratan Bahadur Thapa, Mojtaba Nayyeri, Lucas Etteldorf, Max Finkenbeiner

his paper proposes a hybrid deep learning architecture to better model relational databases by integrating Language Models (LMs) and Graph Neural Networks (GNNs). The method uses a fine-tuned BART encoder for intra-row semantics and a GraphSAGE GNN operating on a Relational Entity Graph (REG) to incorporate relational context. This approach significantly enhances the row embeddings, achieving competitive performance against established supervised baselines on relational benchmarks.

Overview of the hybrid architecture. A fine-tuned BART encoder generates row-level embeddings from linearized database rows, which serve as initial node features in the relational entity graph (REG). Node-type-specific linear layers project the 1024-dimensional BART embeddings to the 256-dimensional hidden space. Two shared SAGEConv layers then perform message passing across all edge types, and a linear decoder maps the enriched embeddings back to 1024 dimensions for reconstruction loss computation.
Overview of the hybrid architecture. A fine-tuned BART encoder generates row-level embeddings from linearized database rows, which serve as initial node features in the relational entity graph (REG). Node-type-specific linear layers project the 1024-dimensional BART embeddings to…
cs.AIarxiv:2605.16079v1Lead article

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao

ideoSeeker introduces a novel paradigm for instance-level video understanding by replacing text prompts with **native agentic tool invocation based on visual prompts**. This method allows Large Vision-Language Models (LVLMs) to **proactively perceive and retrieve precise spatiotemporal video segments** on demand, directly integrating visual evidence into the reasoning process. The core contribution is enabling more accurate and user-friendly instance localization by shifting interaction from purely linguistic to visually-grounded, agentic perception.

Overview of VideoSeeker. (A): Instance-level video understanding tasks require models to accurately locate and reason about specific instances in videos guided by visual prompts, given a video, a visual prompt frame, and a query. Compared to text-only prompts that require lengthy referential descriptions, visual prompts provide a more intuitive interaction method. (B): Pipeline overview. We design a four-stage pipeline to construct instance-level video data, followed by a two-stage training strategy to integrate multimodal instance-level video understanding capabilities.
Overview of VideoSeeker. (A): Instance-level video understanding tasks require models to accurately locate and reason about specific instances in videos guided by visual prompts, given a video, a visual prompt frame, and a query. Compared to text-only prompts that require lengthy…
cs.AIarxiv:2605.16035v1Lead article

Who Owns This Agent? Tracing AI Agents Back to Their Owners

Ruben Chocron, Doron Jonathan Ben Chayim, Eyal Lenga, Gilad Gressel, Alina Oprea

his paper formalizes the critical problem of **agent attribution**: reliably linking the observed actions of a deployed AI agent back to the specific user account that deployed it. The core contribution is defining this gap, which currently prevents accountability for both unintentional misuse and malicious deployment of vendor-hosted AI agents. The authors aim to establish a framework for tracing these autonomous agents to their responsible owners.

Figure 1. The novel problem of agent attribution introduced in this paper (top), and our canary-based protocol for the vendor-hosted LLM setting (bottom).
Figure 1. The novel problem of agent attribution introduced in this paper (top), and our canary-based protocol for the vendor-hosted LLM setting (bottom).
cs.CLarxiv:2605.16077v1Lead article

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya

his paper introduces an LLM-driven data augmentation framework to address limited data in cognitive assessment from speech. The method uses participants' written responses as semantic anchors to generate diverse, synthetic speech samples via GPT-5. The core contribution is demonstrating that similarity-guided augmentation, prioritizing semantically close synthetic data, effectively improves the prediction of cognitive scores (Hasegawa Dementia Scale) using speech embeddings.

Overview of the proposed LLM-driven data augmentation framework for cognitive score prediction from speech. Underlined terms indicate oral markers, and terms in red indicate stylistic features.
Overview of the proposed LLM-driven data augmentation framework for cognitive score prediction from speech. Underlined terms indicate oral markers, and terms in red indicate stylistic features.
cs.AIarxiv:2605.19988v1Lead article

A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

Hongyu Lin, Mingyu Li, Weichen Zhang, Yihang Lou, Mingjie Xing

his paper introduces **Agentic Tuning** via **PerfEvolve**, shifting system tuning from static documentation to dynamic action. PerfEvolve translates expert tuning methodologies into executable skills for LLM agents, enabling them to perform version verification, workload profiling, and joint optimization. This approach significantly outperforms documentation-driven tuning in PostgreSQL, achieving up to a 35.2% performance improvement.

Latency increase on TPC-H when applying PG-Official and PGTune rules (7 of 22 queries degraded by > > 10%). Both rule sets lead to worse latency on the same kinds of sort- and aggregation-intensive queries.
Latency increase on TPC-H when applying PG-Official and PGTune rules (7 of 22 queries degraded by > > 10%). Both rule sets lead to worse latency on the same kinds of sort- and aggregation-intensive queries.
cs.AIarxiv:2605.20084v1Lead article

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang

alanceRAG addresses the challenge of setting risk thresholds in cascaded RAG systems, where decisions are made sequentially by an LLM-only branch and a RAG fallback. The core method frames threshold pairs as operating points on a 2D lattice and uses sequential graphical testing to identify "safe" pairs that meet a target system-level risk. This allows for risk-adaptive calibration that retains more examples compared to conservative stage-by-stage tuning.

Distribution of the per-example score differences between RAG and LLM-only. S LLM ​ - ​ RAG S_{\( \mathrm \){LLM\( \text{-} \)RAG}} and S LLM ​ - ​ only S_{\( \mathrm \){LLM\( \text{-} \)only}} are the similarity scores between each path’s prediction and the ground-truth answer. The x-axis reports S LLM ​ - ​ RAG − S LLM ​ - ​ only S_{\( \mathrm \){LLM\( \text{-} \)RAG}}-S_{\( \mathrm \){LLM\( \text{-} \)only}} , with positive values favoring RAG and negative values favoring LLM-only, while the y-axis reports the number of examples. Colors distinguish whether both branches are correct, both are wrong, or only one branch is correct.
Distribution of the per-example score differences between RAG and LLM-only. S LLM ​ - ​ RAG S_{\( \mathrm \){LLM\( \text{-} \)RAG}} and S LLM ​ - ​ only S_{\( \mathrm \){LLM\( \text{-} \)only}} are the similarity scores between each path’s prediction and the ground-truth answer. …
cs.AIarxiv:2605.20049v1Lead article

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

Priyansh Trivedi, Olivier Schmitt

his paper investigates whether code cleanliness affects the performance of coding agents by introducing a controlled evaluation protocol using minimal pairs. These pairs are identical in functionality but differ only in code quality (style and complexity). The study found that while code cleanliness did not significantly alter the agent's final pass rate, it substantially impacted the agent's operational footprint, suggesting quality affects the *process* rather than just the *outcome*.

An example task in the benchmark, drawn from the genie pair. The agent reads an externally observable description (shown) and produces a code change that a hidden test suite, kept internal, exercises against the application’s public surface. This task asks the agent to add a structured failure-stage tag to Genie’s synchronous job-launch timer so that dashboards can attribute job-launch failures to a specific pipeline stage.
An example task in the benchmark, drawn from the genie pair. The agent reads an externally observable description (shown) and produces a code change that a hidden test suite, kept internal, exercises against the application’s public surface. This task asks the agent to add a stru…
cs.AIarxiv:2605.20104v1Lead article

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang

his paper introduces **Graft**, a hybrid tree construction method for speculative decoding that overcomes the trade-off between dense, high-overhead trees and pruned, lower-coverage trees. Graft couples **pruning** (to save budget) with **retrieval** (to recover lost coverage) as mutually reinforcing operations. This allows the system to achieve high acceptance rates comparable to dense trees while maintaining the low computational overhead of pruned trees, leading to better end-to-end speedups.

Speed-accepted-length tradeoff on Qwen3-32B HumanEval. Each point reports wall-time speedup and mean accepted length. Dense EAGLE3 gives the accepted-length upper point for pruning-only subtrees. Dynamic pruning methods such as DDD, SVIP, and ECHO move rightward by reducing draft cost, but their accepted length falls below the dense-tree bound. Graft uses retrieval to fill the slots released by pruning, introducing candidates beyond the original subtree and breaking this pruning trade-off under the same verification budget.
Speed-accepted-length tradeoff on Qwen3-32B HumanEval. Each point reports wall-time speedup and mean accepted length. Dense EAGLE3 gives the accepted-length upper point for pruning-only subtrees. Dynamic pruning methods such as DDD, SVIP, and ECHO move rightward by reducing draft…
cs.AIarxiv:2605.20149v1Lead article

Less Back-and-Forth: A Comparative Study of Structured Prompting

Saurav Ghosh, Gabriella Polach, Abdou Sow

his paper comparatively studies how structured prompting affects Large Language Model (LLM) output quality and user effort across different tasks and models. The core finding is that **checklist-improved prompts significantly outperform raw and clarifying-question prompts**, achieving the highest quality scores while using fewer interaction tokens. This suggests a simple checklist is an effective method for enhancing LLM performance and efficiency.

Study design overview. For each task, a raw prompt is evaluated alongside a checklist-improved prompt and a clarifying-question prompt. Each prompt condition is tested across multiple LLMs, and the resulting outputs are scored using the same rubric.
Study design overview. For each task, a raw prompt is evaluated alongside a checklist-improved prompt and a clarifying-question prompt. Each prompt condition is tested across multiple LLMs, and the resulting outputs are scored using the same rubric.
cs.AIarxiv:2605.19943v1Lead article

Probabilistic Tiny Recursive Model

Amin Sghaier, Ali Parviz, Alexia Jolicoeur-Martineau

he paper introduces Probabilistic Tiny Recursive Models (PTRM) to overcome the deterministic convergence issue in standard Tiny Recursive Models (TRMs). PTRM achieves this by injecting Gaussian noise during each recursive step, enabling parallel exploration of diverse solution paths. This task-agnostic method significantly boosts accuracy across complex reasoning benchmarks without requiring model retraining.

cs.AIarxiv:2605.19940v1Lead article

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Rebecca Ramnauth, Drazen Brscic, Brian Scassellati

his paper reframes safety guardrails for foundation models in sensitive domains as a problem of **runtime behavioral control over interaction trajectories**, inspired by robotics. The core method introduces the **Grounded Observer framework** to enforce formal constraints during closed-loop interactions, moving beyond empirical risk reduction for individual outputs. This approach provides enforceable behavioral guarantees across real-world deployments like therapy and de-escalation.

Figure 1. Guardrails as Constraint Enforcement Over Interaction Trajectories. A deployed foundation model induces a trajectory τ = ( s 0 , a 0 , s 1 , a 1 , … ) \( \tau \)=(s_{0},a_{0},s_{1},a_{1},...) through state space 𝒮 \( \mathcal{S} \) . A safe set 𝒮 safe ⊆ 𝒮 \( \mathcal{S} \)_{\( \text{safe} \)}\( \subseteq \)\( \mathcal{S} \) defines acceptable behavioral states. At each timestep, the model proposes actions according to policy π θ ​ ( a t ∣ s t ) \( \pi_{\theta} \)(a_{t}\( \mid \) s_{t}) , but a “guardrail” restricts execution to the admissible action set 𝒜 safe ​ ( s t ) \( \mathcal{A} \)_{\( \text{safe} \)}(s_{t}) , ensuring that transitions s t + 1 s_{t+1} remain within 𝒮 safe \( \mathcal{S} \)_{\( \text{safe} \)} . This enforces forward invariance, preventing trajectories from entering unsafe regions rather than merely detecting violations after they occur.
Figure 1. Guardrails as Constraint Enforcement Over Interaction Trajectories. A deployed foundation model induces a trajectory τ = ( s 0 , a 0 , s 1 , a 1 , … ) \( \tau \)=(s_{0},a_{0},s_{1},a_{1},...) through state space 𝒮 \( \mathcal{S} \) . A safe set 𝒮 safe ⊆ 𝒮 \( \mathcal…
cs.AIarxiv:2605.20086v1Lead article

What Do Evolutionary Coding Agents Evolve?

Nico Pelleriti, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li, Max Zimmer

his paper investigates what evolutionary coding agents, driven by LLMs, actually evolve beyond just achieving a high final score. The core method involves introducing **EvoTrace**, a dataset of evolutionary coding traces, and **EvoReplay**, a replay-based methodology to analyze these traces. This allows the authors to distinguish between evolving new algorithmic structure, re-tuning strategies, recombining existing knowledge, or overfitting, rather than just observing the final outcome.

A taxonomy of edits performed by evolutionary coding agents. Each panel shows a representative parent–child diff (added lines in green, deleted lines in red) drawn from EvoTrace runs and labeled with one of nine recurring categories: Bug fix , External dependency , Architectural change , Composition , Local refinement , Pruning , Refactor , Efficiency , and Hyperparameter tuning . The categories range from minimal numeric edits (a single literal change) to structural rewrites (replacing a 14-gon with two concentric heptagons), and they form the basis of the LLM-as-judge edit annotation used throughout the paper. Edits are typically multi-label; we examine prevalence and per-edit utility in § 5.1 .
A taxonomy of edits performed by evolutionary coding agents. Each panel shows a representative parent–child diff (added lines in green, deleted lines in red) drawn from EvoTrace runs and labeled with one of nine recurring categories: Bug fix , External dependency , Architectural …
cs.AIarxiv:2605.21470v1Lead article

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis

his paper introduces **Agent Just-In-Time (JIT) Compilation** to overcome the high latency of sequential LLM-based web agents. The core method compiles natural language task descriptions directly into executable code, allowing for LLM calls, tool calls, and parallelization. This significantly improves performance by replacing the slow fetch-execute loop with optimized, compiled execution plans.

Competing Approaches to Computer-Use Agents. Automation of web-based tasks has relied on static scripts (RPA; Barman et al. , 2016 ) and static tool sets (CUA; Wang et al. , 2025 ). Our work introduces dynamic cost-optimizing planning and scheduling with cached, reusable tools.
Competing Approaches to Computer-Use Agents. Automation of web-based tasks has relied on static scripts (RPA; Barman et al. , 2016 ) and static tool sets (CUA; Wang et al. , 2025 ). Our work introduces dynamic cost-optimizing planning and scheduling with cached, reusable tools.
cs.AIarxiv:2605.21453v1Lead article

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

his paper empirically investigates the quality and security impact of AI-generated Python refactoring pull requests using the AIDev dataset. The authors quantify changes across five quality attributes using the ML-based tool PyQu, supplemented by static analysis tools (Pylint and Bandit) for quality and security assessment. The core finding is that while agentic commits improve quality attributes in about 22.5% of cases (most often usability), they also introduce security risks in a significant portion of changes.

Figure 1. Enhancement Rates by Agent and Quality Attribute.
Figure 1. Enhancement Rates by Agent and Quality Attribute.
cs.AIarxiv:2605.21486v1Lead article

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Dayal Singh Kalra, Maissam Barkeshli

his paper develops a framework with three metrics to quantify the quality of hyperparameter transfer, crucial for scaling LLMs. The authors investigate why the Maximal Update parameterization ($\mu$P) offers superior learning rate transfer compared to standard parameterization (SP) when using AdamW. They find that $\mu$P's benefit primarily stems from maximizing the learning rate of the embedding layer.

Computing the three transfer metrics for \( \mu \) P . (a) Loss vs. log learning rate \( \nu \) , with star marking the optimum ν ∗ ​ ( n ) \( \nu^{*} \)(n) , (b) Joint fit of the loss model ( Equation ˜ 6 , dashed lines), with a low predictability error ℰ = 0.0034 \( \mathcal{E} \)=0.0034 , (c) Loss curves in the normalized coordinates ( Equation ˜ 8 ), with κ = − 2.640 \( \kappa \)=-2.640 indicating robust transfer. (d-f) Scaling laws for optimal loss L ∗ ​ ( n ) L^{*}(n) , optimal log-learning-rate ν ∗ ​ ( n ) \( \nu^{*} \)(n) , and curvature H ​ ( n ) H(n) . In (d), the orange curve shows the best loss across parameterizations at each width, used for estimating the asymptotic loss gap ℛ ​ ( ∞ ) \( \mathcal{R} \)(\( \infty \)) .
Computing the three transfer metrics for \( \mu \) P . (a) Loss vs. log learning rate \( \nu \) , with star marking the optimum ν ∗ ​ ( n ) \( \nu^{*} \)(n) , (b) Joint fit of the loss model ( Equation ˜ 6 , dashed lines), with a low predictability error ℰ = 0.0034 \( \mathcal{E}…
cs.AIarxiv:2605.21295v1Lead article

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen

imeSRL is a two-stage LLM framework that improves time-series generalization by routing predictions through a semantic bottleneck, abstracting raw signals into natural language concepts before predicting outcomes. This approach forces reasoning over generalizable semantic concepts rather than cohort-specific raw data. The framework is optimized end-to-end using Reinforcement Learning (GRPO with RLVR) to learn outcome-aligned abstractions without requiring intermediate annotations, achieving state-of-the-art performance in cross-cohort mental health prediction.

Figure 1 . Overview of TimeSRL, a two-stage LLM framework for robust longitudinal behavioral time-series modeling, instantiated on behavioral health prediction. While traditional ML models overfit numerical regularities and direct-prediction LLMs struggle with long numeric trajectories, TimeSRL addresses these distribution shift challenges by routing inference through an explicit semantic bottleneck . In Stage 1, it abstracts raw numerical signals into natural-language behavioral descriptions; in Stage 2, it infers outcomes from this abstraction alone, enabling robust generalization across new populations. This paper focus on mental health prediction as a case study.
Figure 1 . Overview of TimeSRL, a two-stage LLM framework for robust longitudinal behavioral time-series modeling, instantiated on behavioral health prediction. While traditional ML models overfit numerical regularities and direct-prediction LLMs struggle with long numeric trajec…
cs.AIarxiv:2605.22645v1Lead article

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin

telierEval is introduced as the first unified benchmark to quantify the prompting proficiency of both humans and MLLMs in generating text-to-image prompts across 360 expert-crafted tasks. The core method involves using AtelierJudge, a skill-based, memory-augmented agentic evaluator, to produce reliable subjective and objective scores for prompt-image pairs. This contribution enables the systematic evaluation of the crucial upstream prompting component, which was previously unmeasured in T2I benchmarks.

MLLMs act as prompters in diverse T2I workflows, translating user intent into effective prompts.
MLLMs act as prompters in diverse T2I workflows, translating user intent into effective prompts.
cs.AIarxiv:2605.22732v1Lead article

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Juergen Dietrich

his paper compares acoustic emotion models and LLMs for analyzing the Pathos dimension in political speech, using the TRUST LLM pipeline as a benchmark. The core finding is that the Gemini LLM, analyzing both audio and transcript, correlates strongly with the benchmark Pathos scores, while a standard acoustic SER model does not. This suggests LLMs are more effective proxies for complex emotional dimensions like Pathos than purely acoustic features alone.

cs.AIarxiv:2605.22579v1Lead article

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

his paper investigates "Hyperfitting," a phenomenon where extreme fine-tuning enhances LLM generation quality beyond simple distribution sharpening. The authors demonstrate that hyperfitting is fundamentally distinct from temperature scaling, as entropy-matched controls fail to replicate its diversity gains. Their core contribution is identifying that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism localized to a "Terminal Expansion" in the final transformer block.

The Rank Reordering Mechanism Enabling Late-Stage Efficiency. (A) Temperature scaling (T < < 1.0) sharpens the probability distribution but preserves the original ranking, leaving the repetitive token (Token A) as the winner. (B) Hyperfitting fundamentally alters the output distribution by reordering ranks — suppressing repetitive candidates and promoting diverse, context-dependent candidates (Token B) to the Top-1 position. This bidirectional effect distinguishes hyperfitting from simple temperature scaling and reveals that the generative capability is localized, motivating our parameter-efficient Late-Stage LoRA strategy.
The Rank Reordering Mechanism Enabling Late-Stage Efficiency. (A) Temperature scaling (T < < 1.0) sharpens the probability distribution but preserves the original ranking, leaving the repetitive token (Token A) as the winner. (B) Hyperfitting fundamentally alters the output distr…
cs.AIarxiv:2605.22786v1Lead article

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy

CGuard is a framework designed to ensure safe latent communication via shared Key-Value (KV) caches in multi-agent LLM systems. It addresses the risk of sensitive information leakage by learning representation-level transformations on the KV caches before they are transmitted between agents. This acts as a "guard" to control the flow of potentially sensitive intermediate reasoning states encoded in the latent space.

Multi-agent communication topologies: sequential, hierarchical, and graph-based. Edges carry KV cache latent artifacts m i ​ j m_{ij} .
Multi-agent communication topologies: sequential, hierarchical, and graph-based. Edges carry KV cache latent artifacts m i ​ j m_{ij} .
cs.AIarxiv:2605.23772v1Lead article

Agentic Proving for Program Verification

Alessandro Sosso, Akhil Arora, Bas Spitters

his paper investigates the capability of agentic AI systems, specifically Claude Code, for program verification using the CLEVER benchmark in Lean 4. The core method involves evaluating the agent's performance across specification generation, implementation certification against ground truth, and end-to-end verification. The key contribution is demonstrating a high success rate (up to 98.1%) in this pipeline, alongside the agent's ability to provide high-quality self-correction feedback.

Schematic overview of our experimental pipeline. The four generation and proof tasks are identified in yellow: Clever ’s original intended setup spans horizontally in red across the top row, while other pathways represent our custom variations of the setup. The dashed portion in the top-right corner is the preliminary experimentation setup of Appendix A .
Schematic overview of our experimental pipeline. The four generation and proof tasks are identified in yellow: Clever ’s original intended setup spans horizontally in red across the top row, while other pathways represent our custom variations of the setup. The dashed portion in …
cs.AIarxiv:2605.23590v1Lead article

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang

o-ReAct introduces a framework where external rubrics act as step-level collaborators to guide ReAct agents during inference, moving beyond their typical role as post-hoc evaluators. By injecting the rubric into the agent's context at each decision point, Co-ReAct provides explicit, actionable targets for evidence seeking, reasoning, and action selection. This method aims to produce more targeted and less redundant reasoning trajectories in complex, search-intensive tasks.

Overview of Co-ReAct. (i) Collect: sample candidate next actions at each branching point and rank them with multi-judge expert consensus. (ii) Train: GRPO with a Spearman reward between the rubric-induced ranking and the expert ranking. (iii) Infer: the trained rubric drives a five-tuple (Rubric, Reason, Act, Verify, Observe) loop.
Overview of Co-ReAct. (i) Collect: sample candidate next actions at each branching point and rank them with multi-judge expert consensus. (ii) Train: GRPO with a Spearman reward between the rubric-induced ranking and the expert ranking. (iii) Infer: the trained rubric drives a fi…
cs.AIarxiv:2605.23655v1Lead article

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen

VSearch is a training-free framework that addresses the high-resolution image perception bottleneck in MLLMs by adaptively scheduling search strategies. It employs an "Assess-then-Search" workflow, prioritizing efficient expert-assisted search and only resorting to a novel semantic-aware scanning mechanism upon failure. This scanning uses Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, improving perception accuracy while maintaining efficiency.

cs.AIarxiv:2605.23897v1Lead article

ETCHR: Editing To Clarify and Harness Reasoning

Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang

TCHR addresses the limitations of purely textual reasoning in multimodal LLMs by introducing a novel approach that couples a dedicated image editing model with an understanding model. The core method involves conditioning the image editor on the reasoning question to overcome the editor's inability to map abstract queries to visual transformations and to maintain edit correctness over deep reasoning steps. This decoupling allows for targeted visual manipulation to clarify and support complex visual reasoning tasks.

ETCHR vs. prior “think with images” paradigms. (a) Tool-based methods emit action tokens to a renderer, limiting edits to low-level operations and requiring VLM fine-tuning. (b) Unified models share one backbone for text and images, weakening both and producing noisy intermediates. (c) ETCHR decouples a question-conditioned editor from the understanding MLLM and adds a verify-and-reason step, enabling plug-and-play use across tasks. (d) Across nine benchmarks, ETCHR (with Qwen3-VL-8B and Kimi K2.5 1T) surpasses tool-based and unified-model baselines.
ETCHR vs. prior “think with images” paradigms. (a) Tool-based methods emit action tokens to a renderer, limiting edits to low-level operations and requiring VLM fine-tuning. (b) Unified models share one backbone for text and images, weakening both and producing noisy intermediate…
cs.AIarxiv:2605.23551v1Lead article

Goal-Conditioned Agents that Learn Everything All at Once

Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher

he paper introduces Learning Everything All at Once (LEO), a method for goal-conditioned reinforcement learning that efficiently performs off-policy updates using every observed transition for *all* possible goals simultaneously. LEO achieves this by jointly outputting values and actions for every goal in a single forward pass, enabling massive parallelization and significant speed-ups over naive all-goals relabelling. This approach maximizes data efficiency and achieves strong performance across various control tasks.

cs.AIarxiv:2605.23572v1Lead article

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani

ARNESS-LM (HLM) is a three-phase training recipe designed to efficiently transfer the high retrieval quality of large SLM-based models into compact, production-ready student encoders. The method first trains a large teacher model, then distills its knowledge into a small student encoder using an L2 alignment objective, followed by a final contrastive refinement stage. This approach successfully bridges the gap between state-of-the-art retrieval performance and the low-latency requirements of sponsored search systems.

Figure 1 . HLM: A three-phase training framework for developing effective and compact SLM retrievers.
Figure 1 . HLM: A three-phase training framework for developing effective and compact SLM retrievers.
cs.AIarxiv:2605.23867v1Lead article

Human Decision-Making with Persuasive and Narrative LLM Explanations

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu

his paper investigates how the persuasiveness of Large Language Model (LLM) narrative explanations affects human decision-making accuracy in classification tasks. The core finding is that the persuasiveness level of these explanations did not significantly improve decision accuracy compared to a simple AI prediction alone. However, the narratives were found to increase reliance on the AI's output.

Left: Average participant accuracy across the two dataset conditions and four explanation conditions. Center: Average participant reliance rate across the two dataset conditions and four explanation conditions. Right: Predicted effects (level 2/overall results of multilevel model) of confidence ratings and dataset upon accuracy. Steeper, positively-sloped lines indicate better confidence calibration. Across all plots, error bars and shaded areas represent 95% confidence intervals.
Left: Average participant accuracy across the two dataset conditions and four explanation conditions. Center: Average participant reliance rate across the two dataset conditions and four explanation conditions. Right: Predicted effects (level 2/overall results of multilevel model…
cs.AIarxiv:2605.23861v1Lead article

Leveraging Foundation Models for Causal Generative Modeling

Aneesh Komanduri, Xintao Wu

his paper introduces **FM-CGM**, a modular framework that leverages pretrained foundation models for visual causal reasoning without requiring explicit causal constraint training. It formalizes the causal pipeline using a concept extractor, manipulator, and counterfactual generator, employing a large reasoning model for inference and a diffusion model for generation. The core contribution is enabling **zero-shot causal discovery and counterfactual generation** via a novel mechanism, Causal Semantic Guidance (CSG), which ensures semantic consistency during interventions.

Figure 1 . An overview of Foundation Model Powered Causal Generative Model (FM-CGM) consisting of a concept extractor, concept manipulator, and counterfactual generator enabled by foundation models
Figure 1 . An overview of Foundation Model Powered Causal Generative Model (FM-CGM) consisting of a concept extractor, concept manipulator, and counterfactual generator enabled by foundation models
cs.AIarxiv:2605.28607v1Lead article

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

his paper introduces an adaptive multimodal multi-agent framework for autonomous workflow execution that overcomes the limitations of fragmented, linear task processing. The core method involves an offline phase to construct a topological knowledge base from execution logs, which agents then leverage during inference. This approach enables agents to utilize Adaptive RAG over a fixed graph structure, facilitating better navigation of underlying workflow topology in dynamic environments.

Simplified schema of the proposed framework.
Simplified schema of the proposed framework.
cs.AIarxiv:2605.28655v1Lead article

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Shanghua Gao, Ada Fang, Marinka Zitnik

utoScientists is a decentralized system of self-organizing AI agents designed for long-running scientific experimentation. Agents collaboratively interpret shared state, form teams around promising hypotheses, critique proposals, and share results to avoid redundant work. This approach significantly improves performance across various domains compared to single-trajectory or centrally-planned AI methods under matched experimental budgets.

Self-organizing agent teams for long-running experimentation. Overview of AutoScientists . Agents identify promising research directions, organize into teams, and execute experiments in parallel.
Self-organizing agent teams for long-running experimentation. Overview of AutoScientists . Agents identify promising research directions, organize into teams, and execute experiments in parallel.
cs.AIarxiv:2605.28807v1Lead article

Calibrating Conservatism for Scalable Oversight

William Overman, Mohsen Bayati

he paper introduces **Calibrated Collective Oversight (CCO)**, a method for scalable oversight of advanced AI agents. CCO aggregates diverse auxiliary scores into a penalty that measures deviation from a conservative baseline, allowing high-utility actions to proceed unless overseer concern accumulates. This conservatism is calibrated online using Conformal Decision Theory to guarantee that undesirable outcomes remain below a user-specified threshold.

Overview of Calibrated Collective Oversight (CCO). Given a state s s , a primary agent either generates candidate actions { a 1 , a 2 , a 3 , … } \{a_{1},a_{2},a_{3},\( \ldots \)\} or receives a fixed set from the environment, assigning each a utility score U ​ ( s , a ) U(s,a) reflecting its own preferences; a conservative baseline a o a_{o} (e.g., defer or no-op) is always included. These candidates, which may include actions with hidden vulnerabilities or misaligned objectives, are evaluated by a collection of auxiliary overseers { q 1 , … , q n } \{q_{1},\( \ldots \),q_{n}\} , each assessing a different dimension such as scope, safety, or convention adherence. The aggregate penalty Δ ​ ( s , a ) = ∑ i | q i ​ ( s , a ) − q i ​ ( s , a o ) | \( \Delta \)(s,a)=\( \sum_{i} \)|q_{i}(s,a)-q_{i}(s,a_{o})| measures how much each action deviates from the baseline across all oversight signals. CCO selects actions by maximizing U ​ ( s , a ) − λ t ​ Δ ​ ( s , a ) U(s,a)-\( \lambda_{t} \)\( \Delta \)(s,a) , where the conservatism parameter \( \lambda_{t} \) is updated online via a conformal controller: after observing whether the selected action incurred a loss ℓ t \( \ell_{t} \) , the controller adjusts λ t + 1 = λ t + η ​ ( ℓ t − α ) \( \lambda_{t+1} \)=\( \lambda_{t} \)+\( \eta \)(\( \ell_{t} \)-\( \alpha \)) . This feedback loop ensures that realized violation rates converge to the user-specified target \( \alpha \) . In safe situations, CCO relaxes conservatism to permit high-utility actions; in risky situations, it increases conservatism to favor safer alternatives.
Overview of Calibrated Collective Oversight (CCO). Given a state s s , a primary agent either generates candidate actions { a 1 , a 2 , a 3 , … } \{a_{1},a_{2},a_{3},\( \ldots \)\} or receives a fixed set from the environment, assigning each a utility score U ​ ( s , a ) U(s,a) r…
cs.AIarxiv:2605.28787v1Lead article

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

his paper compares the effectiveness of two agentic data retrieval methods: one using LLMs to search the open web, and another using an LLM agent specifically leveraging structured **schema.org semantic metadata**. The core contribution is an **LLM-as-a-judge evaluation** framework, aligned with FAIR principles, to assess which approach yields more semantically relevant and computationally useful data for autonomous agents.

Comparative System Architecture. Similar agent logic is evaluated across unstructured Baseline Agent and Semantic Agent dataset search environments. Both feed a unified, FAIR-aligned evaluation of relevance, accessibility, and utility.
Comparative System Architecture. Similar agent logic is evaluated across unstructured Baseline Agent and Semantic Agent dataset search environments. Both feed a unified, FAIR-aligned evaluation of relevance, accessibility, and utility.
cs.AIarxiv:2605.30144v1Lead article

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu

gentSchool introduces an LLM-powered multi-agent simulation framework for educational research, moving beyond simple role-play. Its core method models learning as state transitions, utilizing cognitively growable student agents with detailed knowledge states and explicit misconceptions. This allows researchers to safely test and validate novel pedagogical interventions that might otherwise be ethically or practically constrained in real classrooms.

Scope boundary of the present paper. AgentSchool’s implemented substrate is examined through preliminary lesson and social simulations; institutional and policy-level uses are positioned as extensions rather than completed empirical claims.
Scope boundary of the present paper. AgentSchool’s implemented substrate is examined through preliminary lesson and social simulations; institutional and policy-level uses are positioned as extensions rather than completed empirical claims.
cs.AIarxiv:2605.05054v1Lead article

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Hongxu Chen, Yanghao Wang, Bowei Zhu, Hongxiang Li, Zhen Wang

his paper introduces Direct Product Flow Matching (DPFM) to improve few-shot adaptation in vision-language models by addressing geometric limitations in existing flow matching methods. DPFM decouples the radial and angular dynamics of cross-modal features using a polar decomposition perspective, resolving issues like angular distortion and radial dynamics neglect. This novel approach leads to more effective and stable adaptation by treating the radial and angular components independently during the continuous flow modeling process.

(a). Single-step parameter-efficient fine-tuning (PEFT) mostly performs cross-modal alignment in a single-step manner. (b). Multi-step flow matching (FM) methods model continuous and multi-step alignment dynamics. During the training stage, (c). FMA undergoes a non-uniform angular speed induced by radial–angular coupling. However, (d). DP-FM follows a constant-speed angular geodesic due to decoupled radial and angular dynamics.
(a). Single-step parameter-efficient fine-tuning (PEFT) mostly performs cross-modal alignment in a single-step manner. (b). Multi-step flow matching (FM) methods model continuous and multi-step alignment dynamics. During the training stage, (c). FMA undergoes a non-uniform angula…
§ III

Daily Issues This Month

2026-05-02 to 2026-05-31 30