Weekly Issue
Collected dispatches

2026-W25

2026-06-08 to 2026-06-14
80 papers
7 daily issues
A weekly ledger drawn from the daily archive. 3 sections
§ I

The Week in Review

Editorial summary

The overwhelming trend across these 80 papers this week centers on advancing the autonomy, reliability, and complexity-handling capabilities of AI Agents powered by LLMs.

Popular Directions:

1. Agent Evolution and Self-Improvement: A significant vein focuses on agents that improve on their own, exemplified by methods like Q-Evolve (using in-distribution RL for dense rewards) and Socratic-SWE (distilling successful repair patterns into actionable skills).
2. Deep/Long-Horizon Research Agents: Several works tackle complex, multi-stage tasks beyond simple prompting. DuMate-DeepResearch emphasizes auditability and task decomposition, while SearchSwarm focuses on the "delegation intelligence" needed to manage context limits during deep research.
3. Robust Evaluation and Benchmarking: There is a shift away from simple scoring toward testing professional nuance and robustness. The AARR benchmark assesses research thoroughness, while studies on medical LLMs and a simulation environment (Agentopia) highlight the need to assess consistency under pressure (e.g., sensitivity to prompt variation).

Notable Advances & Shifts:

• Memory and Context Management: Novel structures are emerging to handle long inputs. MemDreamer decouples perception and reasoning using hierarchical graph memory for video understanding, demonstrating effective reasoning on only a fraction of the context.
• Reasoning Deconstruction: Papers are moving to dissect how LLMs reason. The comparison between human and DeepSeek-R1 math reasoning reveals structural differences ("topological mimicry"), while PRISM attempts to recover the active instruction set directly from model activations.
• Alignment and Safety: New frameworks target specific failure modes. CapCode addresses cheating in coding agents through capped evaluations, and the introduction of a metric for Sycophantic Praise highlights subtle alignment failures in social domains.
• Efficiency and Infrastructure: Advances in serving agents include AGENTSERVESIM (a hardware-aware simulator for multi-turn serving) and FMplex (model virtualization for serving multiple customized FMs off a shared backbone).

Significant Shifts: The focus is moving from single-turn task performance to multi-turn, stateful interactions demanding auditability (DuMate), robust social simulation (Agentopia), and process-level feedback loops (Multi-Turn Evaluation). The development of robust detection (SV-Detect) and internal mechanism recovery (PRISM) signals growing maturity in analyzing and controlling agent behavior.

§ II

Top Papers

Selected research 80
cs.LGarxiv:2606.07367v1Lead article

Self-evolving LLM agents with in-distribution Optimization

Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy

The paper introduces **Q-Evolve**, a self-evolving framework for LLM agents designed to overcome sparse reward challenges in long-horizon decision-making. It unifies automatic process-reward labeling and policy learning using an in-distribution reinforcement learning approach. The core method learns a stable critic from a hybrid dataset using a weighted Implicit Q-Learning objective, which then generates dense, step-wise process rewards via advantage estimation for improved supervision.

Comparison of existing methods. Left: Existing PRM methods rely on costly manual labels or search-based rollouts requiring discrete states, often failing due to distribution shifts between PRM training and policy improvement. Upper Mid: Most online RL does not address episodic sparse rewards. Bottom Mid: Our framework utilizes a hybrid off-policy dataset (expert + agents’ interaction data) to derive rewards via Bellman backups. By co-evolving process reward supervision and policy improvement within a shared in-distribution loop, the agent achieves stable self-evolution. Right: A visualization of performance vs environment steps required for collecting data.
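
To make the critic-to-reward step concrete, here is a minimal sketch of the standard Implicit Q-Learning expectile loss and of advantage-based step rewards. It is an illustration under assumptions, not the paper's implementation: the summary mentions a weighted IQL objective whose exact weighting is not given here, and the function names and the value of `tau` are hypothetical.

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Standard IQL expectile loss on diff = Q(s, a) - V(s).

    Positive differences are weighted by tau and negative ones by 1 - tau,
    so with tau > 0.5 the value estimate is pulled toward an upper
    expectile of Q. (tau = 0.7 is an assumed, illustrative setting.)
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def process_rewards(q_values, v_values):
    """Dense step-wise rewards as advantages A_t = Q(s_t, a_t) - V(s_t)."""
    if len(q_values) != len(v_values):
        raise ValueError("q_values and v_values must have the same length")
    return [q - v for q, v in zip(q_values, v_values)]
```

Each step of a trajectory then receives its own signal: a positive advantage marks an action the critic rates above the state's baseline, a negative one marks a step that lost value.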
cs.AIarxiv:2606.07410v1Lead article

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Yuxiang Chen, Jun Wang

This paper comprehensively compares the mathematical reasoning steps of the DeepSeek-R1 LLM and humans on AIME 2025 problems, categorizing 10,247 steps. The core finding is a structural difference: human reasoning is compact, while the LLM exhibits "topological mimicry," frequently revisiting shallow steps without logical progress. Despite this, the authors identify stable branching and backtracing in successful LLM traces as potential signals of genuine reasoning.

State CoT: A transition diagram illustrating the discrete reasoning states and meta-cognitive actions within a trajectory.
cs.AIarxiv:2606.07462v1Lead article

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song

This paper introduces the **AARR (Act As a Real Researcher) benchmark series** to evaluate frontier LLMs and agents on the nuanced professionalism and thoroughness required in real research, moving beyond simple macro-level execution. The first installment, **AARRI-Bench**, specifically assesses agents' ability to emulate the granular reasoning and ethical judgment characteristic of human researchers. This contributes a new standard for evaluating agent capabilities in complex, long-horizon scientific tasks.

Overview of the AARRI-Bench Pipeline. The benchmark is constructed through a three-stage human-in-the-loop workflow with two-dimensional task categorization across task scenarios and agent scope levels. Tasks are evaluated under the Harbor framework with standardized environments, multiple agent harnesses and models, and both coarse-grained and fine-grained metrics.
cs.AIarxiv:2606.07299v1Lead article

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen

DuMate-DeepResearch is a multi-agent framework designed to overcome limitations in current Deep Research (DR) systems, specifically concerning long-horizon planning, task decomposition, and auditability. It achieves this by decoupling the Agent Core (handling planning and scheduling) from an extensible Tool Ecosystem, ensuring every intermediate decision is explicitly traceable. The core contribution is an auditable DR system that manages complex research tasks through structured multi-agent interaction.

The illustration for the Qianfan Agent Foundry.
cs.AIarxiv:2606.07489v1Lead article

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

This paper investigates how autonomous AI agents transform knowledge work by analyzing production data comparing Perplexity's Search and Computer products. The core finding is that the autonomous Computer product significantly accelerates task completion (26 minutes of automated work vs. 33 seconds of manual orchestration in Search) and improves execution quality. This shift reallocates user effort towards higher-order tasks like verification and extension.

AI product progression by autonomy and workflow-context integration. Perplexity’s Search represents the baseline for information retrieval and synthesis; Comet Assistant introduces deeper context integration and execution on top of an interactive browser interface; Computer combines long-horizon asynchronous execution with even deeper and broader context integration as an agent orchestrator.
cs.AIarxiv:2606.07392v1Lead article

Online Pandora's Box for Contextual LLM Cascading

Alexandre Belloni, Yan Chen, Yehua Wei

This paper introduces the **Online Pandora's Box for Contextual LLM Cascading**, an adaptive framework for sequentially querying and selecting among LLM APIs based on request context. Its core method models the **contextual reservation index** directly, addressing the unique challenge where feedback is mediated by the API's generated output, rather than immediate reward revelation. The contribution lies in this novel learning approach tailored for output-mediated feedback in LLM cascading scenarios.
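
For readers unfamiliar with reservation indices, the classical, non-contextual form (Weitzman's rule) may help: for a box with opening cost c and random reward V, the index σ solves E[max(V − σ, 0)] = c. The sketch below computes σ by bisection for a discrete reward distribution. The function names and the toy distribution are assumptions for illustration and are not taken from the paper, whose contextual, output-mediated setting is more involved.

```python
def expected_excess(values, probs, sigma):
    """E[max(V - sigma, 0)] for a discrete reward distribution."""
    return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))


def reservation_index(values, probs, cost, tol=1e-9):
    """Solve E[max(V - sigma, 0)] = cost for sigma by bisection.

    Assumes probs sum to 1 and cost > 0. The expected excess is
    continuous and non-increasing in sigma, so bisection converges.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    lo = min(values) - cost  # expected excess here is at least `cost`
    hi = max(values)         # expected excess here is 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_excess(values, probs, mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy example: reward is 0 or 1 with equal probability, opening cost 0.1.
sigma = reservation_index([0.0, 1.0], [0.5, 0.5], cost=0.1)
```

Under Weitzman's rule, boxes are opened in decreasing order of index, and search stops once the best reward seen so far exceeds every remaining index.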

cs.AIarxiv:2606.07412v1Lead article

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Chuan Xiao, Zhengbo Jiao, Shaobo Wang, Wei Wang, Bing Zhao

Socratic-SWE is a closed-loop framework that enables self-evolving software engineering agents by leveraging their own historical solving traces. It distills these traces into structured "agent skills" that capture recurring failures and successful repair patterns. These skills then guide the generation of new, targeted repair tasks in real repositories, ensuring the training data directly addresses the agent's weaknesses.

cs.AIarxiv:2606.07313v1Lead article

SV-Detect: AI-generated Text Detection with Steering Vectors

Mikhail Vishnyakov, Tatiana Gaintseva

SV-Detect detects AI-generated text by extracting "steering vectors" from a frozen language model's hidden layers, which define directions separating human and machine text. The method represents inputs by their alignment with these layer-wise directions and uses a lightweight classifier on these features for detection. This approach demonstrates robust performance even under significant distribution shifts, such as domain changes or text editing attacks.

Token-level steering-vector projections distinguish LLM-generated text (top) from human writing (bottom). Word saturation reflects each token’s signed contribution to the classifier.
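
As a rough illustration of the feature pipeline described above, the sketch below builds one direction per layer and projects an input onto it. It assumes a difference-of-means construction for the steering vector, which is a common choice but is not confirmed by this summary; all function names are hypothetical, and the hidden states are plain lists of floats standing in for real model activations.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(machine_states, human_states):
    """Unit direction from the human mean to the machine mean (assumed construction)."""
    mu_m = mean_vector(machine_states)
    mu_h = mean_vector(human_states)
    diff = [a - b for a, b in zip(mu_m, mu_h)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def projection_features(per_layer_states, per_layer_directions):
    """One feature per layer: dot product of the hidden state with that layer's direction."""
    return [
        sum(h * d for h, d in zip(state, direction))
        for state, direction in zip(per_layer_states, per_layer_directions)
    ]
```

The resulting feature vector, one number per layer, is what a lightweight classifier such as logistic regression would consume.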
cs.AIarxiv:2606.07237v1Lead article

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

Mahdi Alkaeed

This paper systematically evaluates the sensitivity of general and medical Large Language Models (LLMs) to prompt variations (natural and adversarial) using the MedMCQA benchmark. The core contribution is demonstrating that even minor phrasing changes significantly impact model consistency and accuracy in clinical reasoning tasks. The study concludes that current medical LLMs lack the necessary robustness for safety-critical healthcare applications due to this high unpredictability.

This methodological pipeline outlines a comprehensive robustness evaluation framework for assessing the sensitivity and stability of healthcare LLMs using the MedMCQA benchmark, supported by advanced NLP tools (e.g., BioSyn, scispaCy) and biomedical metrics (e.g., BERTScore, USE) to quantify model resilience to input variations.
cs.CLarxiv:2606.07513v1Lead article

Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang

Agentopia is a comprehensive framework designed for long-term life simulation of multi-agent societies, extending simulations from days to years. The core method involves simulating 100 LLM-powered agents autonomously pursuing growth, relationships, and goals over a simulated decade. The contribution is enabling the study of emergent social behaviors and developing enhanced, anthropomorphic social intelligence in LLMs through extended simulated social experience.

Illustration of emergent behaviors observed in Agentopia simulations. Each scene depicts a real behavioral pattern documented in our case studies (Tables 22–34). Without explicit scripting, agents autonomously develop diverse behavioral patterns reflecting agents’ intelligence in social life.
cs.CLarxiv:2606.07402v1Lead article

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen

The paper introduces **M$^3$Exam**, a novel benchmark designed to evaluate language agents' multimodal memory capabilities in realistic user-agent interactions, moving beyond sparse, human-centric data. Its core contribution is a query-centric evaluation framework that tests cross-modal grounding and implicit information inference over accumulating, authentic multimodal data. Furthermore, the authors propose **M$^3$Proctor**, a memory method that selectively processes raw visual data, significantly improving accuracy and efficiency.

Overview of M$^3$Exam.
cs.CLarxiv:2606.07441v1Lead article

Sycophantic Praise: Evaluating Excessive Praise in Language Models

Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang

This paper introduces a novel framework to measure *sycophantic praise* in language models, distinguishing it from simple agreement. The method quantifies praise by comparing it against the contribution's quality and expected user ability, showing it is a distinct alignment problem. The authors demonstrate this framework is superior to generic judges and find that excessive praise is more prevalent in social domains than in objective reasoning tasks.

The same model response may be appropriate or excessive depending on the user’s expected ability and contribution quality. SyPr measures excess praise as the difference between observed praise $P(r)$ in the response $r$, and contextually warranted praise $W(p,u)$, where warranted praise depends on both the persona context $p$ and the user utterance $u$.
cs.AIarxiv:2606.09613v1Lead article

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou

AGENTSERVESIM is a novel, hardware-aware simulator designed specifically for multi-turn LLM agent serving workloads. Its core contribution is modeling the stateful program execution dynamics of agents, including turn dependencies, tool gaps, and cross-turn KV-cache locality, which existing stateless simulators ignore. This allows for scalable evaluation of complex scheduling and cache management policies relevant to agent serving without costly real-system testing.

AgentServeSim architecture. The Program Orchestrator advances each program turn by turn, routing New Turn events through the Session-Aware Router to a Model Serving Group. There, the scheduler queues turns, the KV Residency Model manages KV state across memory tiers, and the System Simulator executes the resulting operator graphs. After Turn Complete, the Tool Simulator materializes the next inter-turn gap.
cs.AIarxiv:2606.09751v1Lead article

Collaborative Human-Agent Protocol (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black

The Collaborative Human-Agent Protocol (CHAP) introduces a standard for the shared workspace in complex, multi-human, multi-agent collaborations where foundation models take on operational roles. Its core method is to formally specify the interaction protocol, focusing on capturing the crucial moment of human judgment (e.g., edits to agent output) as a primary system signal. CHAP's contribution is providing a necessary technical specification for these collaborative workflows, complementing existing standards for tool access and agent-to-agent communication.

Three waves in the evolution of agentic systems. Wave I centred on isolated conversational assistants. Wave II added planning, memory, tool use, and early multi-agent orchestration. Wave III centres on shared human-agent workspaces where humans, agents, and services collaborate under explicit policy and shared audit.
cs.AIarxiv:2606.09643v1Lead article

FMplex: Model Virtualization for Serving Extensible Foundation Models

Hetvi Shastri, Pragya Sharma, Walid A. Hanafy, David Irwin, Mani Srivastava

FMplex introduces a model virtualization substrate for serving Foundation Models (FMs) by treating the FM backbone as a shared resource. It presents each downstream task with a virtual FM (vFM), allowing independent customization and lifecycle management while sharing the costly physical backbone. This approach significantly reduces memory waste and improves efficiency through optimized batching across colocated tasks.

cs.AIarxiv:2606.09551v1Lead article

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

Yuhan Ma, Yong Li, Stefan Schmid

FuseFSS introduces a novel compiler for efficient two-server secure LLM inference using Function Secret Sharing (FSS). It replaces bespoke per-operator protocols with a unified compilation pipeline that compactly specifies fixed-point nonlinearities. This allows for batched FSS evaluations of packed comparisons and vector interval lookups, significantly improving efficiency over prior FSS-based methods.

Gate-level activation microbench at $L=128$.
cs.AIarxiv:2606.09748v1Lead article

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

This paper introduces a multi-turn evaluation framework to assess deep research agents' (DRAs) ability to improve based on feedback, moving beyond single-shot benchmarks. The core contribution is the **Research Gap Inference (RGI)** method, which analyzes rubric satisfaction to generate targeted, process-level feedback. This feedback significantly improves DRA performance, unlike self-reflection which shows negligible net gains.

Process-level feedback generation. Given a report $r_{t-1}$ evaluated against the DRACO rubric, RGI analyzes patterns of satisfied and unsatisfied criteria from FA, BD, and CQ (excluding PQ) to infer research-process gaps and generate process-level feedback $f_{t-1}$ for the next turn. Example criteria shown for illustration; negative-weight criteria excluded for simplicity.
cs.AIarxiv:2606.09692v1Lead article

Observability for Delegated Execution in Agentic AI Systems

Abhinav Mishra, Kumar Sharad

This paper addresses the challenge of tracking actions within specific delegation scopes in complex, agentic AI systems, where standard logs fail to distinguish between incompatible delegation assignments. The core method introduces an **agent-aware observability substrate** featuring a lightweight gateway and a common information model. This system binds execution traces to specific delegation contexts, enabling accurate **delegation-scoped attribution and access footprint reconstruction**.

cs.AIarxiv:2606.09826v1Lead article

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang

OmniGameArena introduces a unified benchmark using twelve diverse Unreal Engine 5 games (Solo, PvP, Coop) to evaluate Vision-Language Model (VLM) agents fairly. Its core contribution is the Improvement Dynamics Curve (IDC), a harness where a reflector LLM autonomously refines agent prompts across multiple rounds. This method provides not just a static score, but also the agent's score evolution and generalization ability across task variants.

OmniGameArena at a glance. Twelve newly built UE5 games span Solo (7), PvP (3), and Coop (2) regimes (top). Heterogeneous agents (commercial VLMs, open-weight VLMs, keyboard-mouse policies, and gamepad policies) connect to the same real-time UE5 environment through documented adapters (middle). Evaluation reports the cold-start leaderboard and the Improvement Dynamics Curve (IDC) under multi-round reflection (bottom).
cs.AIarxiv:2606.09563v1Lead article

PRISM: Recovering Instruction Sets from Language Model Activations

Gilad Gressel, Rahul Pankajakshan, Julia Diament, Efim Hudis, Krishnashree Achuthan

PRISM is a novel method designed to recover the complete set of active instructions, constraints, and subgoals steering a frozen Language Model's behavior by interpreting its internal activations. It formalizes this as instruction set retrieval and uses a judge-guided GRPO training scheme to directly decode a faithful bulleted list of simultaneous instructions from the hidden states. This directly addresses the limitations of prior activation-to-language methods in complex, agentic scenarios.

Activation-conditioned instruction set retrieval. A frozen target model $\mathcal{M}$ generates a response, and we extract a window of $T$ residual-stream hidden states from layer $\ell$, forming the activation snapshot $H_{\ell}$. A learned projection maps these states into the model’s embedding space, where they are consumed as a soft prefix by the interpreter $\phi$ (PRISM). The interpreter reuses $\mathcal{M}$’s base weights with LoRA adapters and decodes a bullet list $\hat{\mathcal{I}}$ of recovered instructions. During RL training, an LLM judge scores candidate lists for coverage of reference instructions and hallucinated bullets.
cs.AIarxiv:2606.09711v1Lead article

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Mohammad Beigi, Ming Jin, Lifu Huang

This paper introduces **PRIME (Proxy Reward Internalization and Mechanistic Exploitation)**, a learned capability in RL agents to assess task correctness, predict proxy reward acceptance, and reason about exploitable gaps between the proxy and true (gold) reward. The core contribution is demonstrating that PRIME emerges *before* visible reward hacking, and its measured strength accurately forecasts the onset and severity of future hacking, even adapting when the reward structure changes.

External Prime emerges before reward hacking. (a) Proxy/gold split. (b) $C^{B}, P^{B}, E^{B}$ onset before hack rate. (c) Source B exceeds Source A on joint $G, EGap$.
cs.AIarxiv:2606.09730v1Lead article

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang

SearchSwarm introduces a method to enhance agentic LLMs for long-horizon tasks by developing "delegation intelligence." The core method involves training agents to effectively decompose complex research tasks, delegate subtasks to specialized subagents, and integrate summarized results to manage the main agent's finite context window. The contribution is a preliminary framework and harness for synthesizing the scarce training data needed to acquire this crucial delegation capability for deep research scenarios.

cs.AIarxiv:2606.09549v1Lead article

SecureClaw: Clawing Back Control of LLM Agents

Yuhan Ma, Stefan Schmid

SecureClaw introduces a dual-boundary architecture to secure LLM agents against unauthorized actions and plaintext exposure. It achieves this by implementing plaintext confinement at the read boundary using a trusted gateway that replaces sensitive reads with opaque handles or bounded summaries. Simultaneously, it enforces authorization at the effect sink via a PREVIEW$\rightarrow$COMMIT protocol, ensuring only a trusted executor can finalize external state changes based on authorized requests.
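
The two-phase idea behind a preview-then-commit flow can be sketched in a few lines. This is a toy illustration under assumptions, not SecureClaw's actual design or API: the class name, method names, and the boolean `approved` flag (standing in for whatever authorization check the real system performs) are all hypothetical.

```python
import secrets


class TrustedExecutor:
    """Toy effect sink: an action runs only after a matching, approved commit."""

    def __init__(self):
        self._pending = {}   # one-time token -> previewed action
        self.committed = []  # actions that were actually finalized

    def preview(self, action: str) -> str:
        """Register an intended action and return a one-time token for it."""
        token = secrets.token_hex(8)
        self._pending[token] = action
        return token

    def commit(self, token: str, approved: bool) -> bool:
        """Finalize the previewed action only if the token is known and approved."""
        action = self._pending.pop(token, None)
        if action is None or not approved:
            return False
        self.committed.append(action)
        return True
```

Because a token is consumed on its first commit attempt, an unknown or replayed token cannot trigger an effect, and a rejected preview must be issued again before it can be approved.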

cs.LGarxiv:2606.09764v1Lead article

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh

This paper introduces **iOSWorld**, the first interactive native iOS simulator benchmark designed to test personally intelligent phone agents. Its core method involves creating a persistent user identity across 26 interconnected apps containing rich personal data (messages, transactions, etc.) to support 133 complex tasks. The contribution is providing a challenging, realistic environment that moves beyond isolated instructions to evaluate agents' ability to reason over a user's history and preferences.

Overview of iOSWorld. 26 purpose-built iOS applications share a single user identity (Jordan Avery) and connected data across apps. The benchmark includes 133 tasks across single-app, multi-app, and memory/personalization categories.
cs.LGarxiv:2606.09821v1Lead article

Rethinking the Divergence Regularization in LLM RL

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo

This paper proposes Divergence Regularized Policy Optimization (DRPO) to improve stable reinforcement learning for LLMs, addressing limitations in existing ratio-clipping and hard-mask divergence methods. DRPO replaces the hard mask used in divergence-based trust regions with a smooth, advantage-weighted quadratic regularizer applied to policy shift. This allows for continuous correction of policy updates rather than outright discarding gradients when trust-region boundaries are crossed.

Per-token gradient weights of different algorithms as a function of the current probability $\pi(y_{t}|s_{t})$ and behavior probability $\mu(y_{t}|s_{t})$. For SPO, $\epsilon=1$; for DRPO, $\delta=1$; for PPO, $\varepsilon_{\rm low}=0.2$ and $\varepsilon_{\rm high}=0.28$; for DPPO, $\delta=0.2$. SPO’s weight grows without bound as $\mu(y_{t}|s_{t})\to 0$, while the weight of DRPO remains bounded for all tokens.
cs.LGarxiv:2606.09700v1Lead article

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady

This paper introduces Human-Perceptible Adversarial Attacks (HPAA) to exploit the mismatch between human visual perception and text-based LLM moderation. The core method involves embedding harmful content within benign text using visually salient typographic manipulations (like spacing and emphasis). This allows the harmful content to remain easily recognizable by humans while significantly reducing its detectability by token-based LLM moderation systems.

Overview of typographic Human-perceptible Adversarial Attacks (HPAA) against modern content moderation pipelines. Harmful content remains recognizable to human readers while evading automated moderation systems.
cs.CLarxiv:2606.09635v1Lead article

Gradient-Guided Reward Optimization for Inference-time Alignment

Hankun Lin, Ruqi Zhang

Gradient-Guided Reward Optimization (GGRO) is a lightweight inference-time alignment method that addresses the limitations of sampling-based approaches like Best-of-$N$. GGRO monitors token entropy to detect uncertainty indicative of distribution drift and then injects "nudging tokens" guided by the reward model's gradients to minimally steer the generation trajectory directly during decoding. This gradient-guided intervention offers a more targeted adaptation than simple re-ranking, aiming to improve reliability under drift without relying solely on exhaustive sampling.

Overview of Gradient-Guided Reward Optimization (GGRO). Left: Search-based inference-time alignment methods such as Best-of-$N$ (BoN) rely on extensive sampling and reward-based selection from the candidate pool, but their performance is constrained by the base model’s ability to produce high-quality responses. In challenging settings, merely sampling from the model’s native logits often fails to yield aligned outputs. Right: GGRO refines generation dynamically by monitoring token-level uncertainty and inserting nudging tokens, generated via gradients from the reward model, at uncertain positions. Each nudge steers decoding toward more aligned regions of the output space. The example illustrates how GGRO successfully corrects a harmful request under a challenging prefilling attack setup, whereas BoN with $N=64$ fails.
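
The entropy trigger is the easiest part of this pipeline to illustrate. The sketch below computes the Shannon entropy of each next-token distribution and flags positions above a threshold. It covers only the uncertainty-detection step; the gradient-based construction of nudging tokens is not sketched. The function names and the threshold value are assumptions for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertain_positions(step_distributions, threshold):
    """Indices of decoding steps whose entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]


# Step 0 is confident; step 1 is a coin flip between two tokens.
flagged = uncertain_positions([[1.0, 0.0], [0.5, 0.5]], threshold=0.5)
```

In a decoder loop, the flagged positions are where an intervention such as a nudging token would be considered, while confident steps are left untouched.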
cs.CLarxiv:2606.09709v1Lead article

IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking

Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li, Wenpeng Hu

The paper introduces **Interleaved Structural Chain-of-Thought (IS-CoT)** to combat the performance degradation ("length collapse") LLMs experience during long-form generation. IS-CoT embeds a dynamic **Plan-Write-Reflect cycle** directly into the generation process, allowing for continuous strategy adaptation without external agents. This method enables LLMs to maintain coherence and control over extended texts, outperforming static planning approaches.

Comparison of three long-form generation paradigms. While existing methods degrade as target length increases due to static planning, our IS-CoT introduces a dynamic Plan-Write-Reflect cycle, maintaining coherence and length control over long horizons.
cs.CLarxiv:2606.09697v1Lead article

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen, Felix Mächtle

PsychoSafe introduces a framework for LLM refusals that reframes them as structured, supportive communication based on evidence-based psychological intervention strategies. The method involves creating a specialized corpus across five risk domains and fine-tuning an LLM (Qwen 3.5 27B) using this data. This approach improves refusal quality by 28.1% over generic baselines, aiming to better support users in high-risk interactions rather than just offering blunt non-compliance.

PsychoSafe framework illustration. By providing a carefully designed prompt and a finetuning pipeline, we obtain models up to 28% more psychologically safe without losing original capabilities. The models provide more helpful and psychologically grounded refusals when they are needed (e.g., suicide, drugs, violence).
cs.CLarxiv:2606.09735v1Lead article

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

Wendy K. Tam

This paper investigates how Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) by analyzing partisan orientation in Llama 3.1 8B. The core finding is that RLHF achieves only **shallow alignment** by compressing the variance of existing partisan structure, rather than removing it. This results in consistently balanced output while leaving the underlying partisan representations intact within the model's internal features.

Layer 18 projections onto $\hat{\omega}$ for 84 prompts under the base model (circles) and the Instruct model (diamonds). The base model’s projections span from $-0.5$ to 1.253; RLHF compresses them into a narrow band centered at 0.169.
cs.AIarxiv:2606.11182v1Lead article

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Weixian Xu, Shilong Liu, Mengdi Wang

EEVEE introduces a novel test-time prompt learning framework designed for real-world, heterogeneous task streams, overcoming limitations of single-dataset methods. Its core method involves a router that clusters incoming inputs and assigns them to appropriate prompt configurations, optimized through a router-prompt co-evolution strategy. This approach improves the robustness of LLM agents when handling diverse, interleaved data while preserving performance on individual tasks.

Incremental multi-benchmark retention improvement as tasks are added in the order GPQA Diamond, Formula, TheoremQA, and HumanEval. Each bar stacks per-benchmark improvements for all tasks seen so far: solid upward blocks are positive gains, and hatched downward blocks are negative retention losses. The number above or below each bar is its final summed improvement after all blocks are added.
cs.AIarxiv:2606.10956v1Lead article

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao

This paper introduces a rigorous benchmark, based on China's National Computer Rank Examination (NCRE), to evaluate the ability of frontier Large Language Models (LLMs) to perform complex, multi-application Office automation tasks requiring long-horizon planning. The evaluation uses 200 practical tasks scored against 7,118 criteria. The core contribution is demonstrating the limitations of current LLMs in professional software proficiency: even strong agentic systems show limited success in passing this standardized office exam.

End-to-end illustration of a Word task in OfficeEval. The original document (left) is transformed according to the task instructions (center) into a styled brochure with header image, heading styles, and mail-merge labels (right). Only page 1 of the 2-page document is shown; several steps (e.g., 3-column layout, watermark) apply to page 2. The task is scored by 30 deterministic criteria across 6 skill categories. Instructions are translated from the original Chinese; additional examples across Word, Excel, and PowerPoint appear in the Appendix.
cs.AIarxiv:2606.10989v1Lead article

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

Bocheng Ju, Jianhua Wang, Chengliang Liu, Xiaolin Chang

This paper introduces Null-Space Constrained Response-Specified Unlearning (NSRU), a low-rank adaptation method for LLM unlearning. NSRU constrains the parameter updates to the null space of estimated "retain subspaces" derived from benign data, so that adaptation stays localized. The method jointly optimizes three objectives: suppressing undesired responses, learning a safe target response, and preserving benign capabilities.

Motivation and core intuition of NSRU. (a) Suppression-only unlearning penalizes the undesired response $y^{-}$ but leaves the safe replacement behavior unspecified and can induce under-constrained updates that perturb retained behavior. (b) NSRU specifies a safe target response $y^{+}$, explicitly suppresses $y^{-}$, and uses projected LoRA updates that act through retain-orthogonal components, redirecting forget queries while reducing retain-side interference.
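
As a rough illustration of the null-space idea, the sketch below removes from a candidate update every component that lies in the span of a few retain directions, so the projected update is orthogonal to all of them. It operates on plain vectors rather than LoRA matrices, and the helper names and example values are hypothetical; how NSRU actually estimates its retain subspaces is not described in this summary.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(update, retain_directions):
    """Remove from `update` every component inside the retain subspace."""
    result = list(update)
    for b in orthonormal_basis(retain_directions):
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result

# Hypothetical retain directions spanning the first two coordinates.
retain = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = [0.5, -2.0, 3.0]

# Only the third coordinate survives the projection.
projected = project_out(update, retain)
```
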
cs.AIarxiv:2606.11164v1Lead article

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang

ReasonAlloc addresses KV cache bottlenecks in LLM reasoning by introducing a hierarchical, training-free budget allocation framework. It combines an offline layer-wise preallocation strategy, capturing the "Reasoning Wave" demand pattern, with an online head-wise reallocation strategy that prioritizes information-rich heads during decoding. This dynamic approach mitigates the inference latency caused by long CoT trajectories more effectively than uniform or static allocation methods.

An overview of the proposed ReasonAlloc framework. Left (I): Layer-wise allocation strategy based on offline architecture calibration, demonstrating the non-linear “Reasoning Wave” KV demand across layers. Right (II): Head-wise allocation strategy that dynamically routes KV budgets to distinct attention heads based on real-time importance and redundancy scoring during decoding.
cs.AIarxiv:2606.10917v1Lead article

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Xucong Wang, Ziyu Ma, Shidong Yang, Tongwen Huang, Pengkun Wang

The Role-Agent framework bootstraps LLM agent learning by having a single LLM concurrently act as both the agent and the environment. It uses a dual-component system: World-In-Agent (WIA) generates a process reward based on state prediction accuracy, while Agent-In-World (AIW) uses failure analysis to reshape the training data for targeted improvement. This self-contained co-evolution addresses limitations of static environments and inefficient feedback, leading to enhanced generalization.

(a): Static environments provide sparse and non-specific feedback that limits the agent’s exploration; (b): Synthetic environments incur high labor and runtime costs; (c): The proposed Role-Agent enables one model to switch roles between agent and environment to achieve bootstrapped co-evolution.
cs.AIarxiv:2606.11070v1Lead article

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh

T1-Bench is a high-fidelity benchmark designed to evaluate LLM-based agents in complex, realistic, multi-domain customer-facing scenarios. Its core contribution is a standardized framework that captures sustained reasoning and coordination across interleaved, multi-turn interactions, increasing compositional complexity and evaluative rigor compared to existing benchmarks.

Overview of T1-Bench, a framework for persistent multi-session conversational agents. User policies, including persona, user information, and goals, guide interactions between the user and the assistant. The assistant performs domain classification to retrieve domain-specific tools and policies (e.g., flight, hotel, and restaurant services) and executes tool-augmented reasoning via API calls. A shared memory module stores conversation history and cached results, enabling persistent context, tool reuse, and continuity across temporally separated sessions.
cs.AIarxiv:2606.11045v1Lead article

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

Martin Andres Bertran, Aaron Roth, Zhiwei Steven Wu

This paper investigates the hypothesis that successful machine learning strategies are highly compressible, even when adaptively reused on held-out benchmarks. The authors test this using LLM-driven research agents under two compression bottlenecks: limiting the agent's prompt (output compression) or restricting feedback to one bit (input compression). They find that these compression methods have little effect on the final performance achieved across diverse ML tasks, suggesting that the successful search strategies are inherently compact.

Compressibility of an autonomous agent’s language model pre-training strategy. An explorer agent develops a 12-layer, 768-dimensional GPT with $\sim$30 non-default choices; we compress its strategy into progressively shorter prompts (64 down to 4 tokens) and hand each to 5 independent reproducers that implement the strategy without the explorer’s code or validation set. Left: compressed prompts. Tokens encode architecture (12L d768), optimizer (Muon .1), activation (ReLU 2), batch size (b2M), and normalization (QKnorm); color indicates survival depth (dark blue = survives tightest budgets, red = dropped first). Right: reproducer holdout loss (BPB); dot = mean, gray band = min–max, dashed line = explorer before compression. Performance holds to 16 tokens; a cliff at 8 tokens coincides with loss of batch size, MLP ratio, and QK-norm.
cs.AIarxiv:2606.11042v1Lead article

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang

Workflow-GYM is a novel benchmark that addresses the lack of evaluation for AI agents performing long-horizon, high-value professional workflows through graphical user interfaces (GUIs). The core method involves creating tasks centered on specialized, domain-specific professional software environments. The contribution is demonstrating that current state-of-the-art agents struggle, achieving success rates of only about 30% on these complex, real-world professional tasks.

Examples of Workflow-GYM tasks from professional domains. Each task requires interacting with specialized software through graphical user interfaces to accomplish a real-world objective.
cs.LGarxiv:2606.11025v1Lead article

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo

Flow-DPPO addresses limitations in applying standard PPO to flow matching models by replacing noisy ratio clipping with a direct divergence constraint. Leveraging the Gaussian nature of the per-step policy, it enables exact and efficient computation of the KL divergence between old and new policies. This method provides a more structurally sound trust region enforcement, leading to improved quality and alignment in generative models.

Qualitative comparison on FLUX.1-dev (Black Forest Labs, 2024) with GenEval2 (Kamath et al., 2025) prompts. Flow-DPPO achieves competitive compositional accuracy with notably less image quality degradation compared to Flow-GRPO (Liu et al., 2025), Flow-CPS (Wang and Yu, 2025), and GRPO-Guard (Wang et al., 2025), reflecting its superior KL-proximal efficiency.
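
The summary notes that a Gaussian per-step policy admits an exact KL divergence. The standard closed form for two diagonal Gaussians is sketched below; treating the old and new per-step policies as diagonal Gaussians with these means and standard deviations is an assumption made for illustration, not the paper's stated parameterization.

```python
import math

def gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL( N(mu_old, sigma_old^2) || N(mu_new, sigma_new^2) ), summed over dimensions."""
    total = 0.0
    for m0, s0, m1, s1 in zip(mu_old, sigma_old, mu_new, sigma_new):
        total += (math.log(s1 / s0)
                  + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2)
                  - 0.5)
    return total

# Identical policies have zero divergence.
same = gaussian_kl([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])

# Unit-variance Gaussians whose means differ by 1 in a single dimension.
shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])
```

A divergence-based trust region would compare a value like `shifted` against a fixed budget instead of clipping a sampled probability ratio.
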
cs.CLarxiv:2606.11046v1Lead article

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV, Furong Huang

This paper investigates whether converting instruction-tuned Large Language Models (LLMs) into reasoning models via post-training preserves their original alignment behaviors (safety, bias avoidance, etc.). The core method involves a systematic trustworthiness audit comparing reasoning models (trained via SFT, RL, or distillation) against their instruction-tuned baselines across six dimensions. The key contribution is demonstrating that this conversion often leads to significant alignment regressions, despite improved reasoning accuracy.

Reasoning model development pathways. We study three common pathways for converting instruction-tuned models into reasoning models: (1) supervised fine-tuning (SFT) on reasoning traces, (2) RL-based post-training, including GRPO-style variants, using reasoning-oriented rewards, and (3) distillation from stronger reasoning teachers. Each reasoning model is evaluated against a matched instruction-tuned baseline to measure how the conversion affects reasoning utility and trustworthiness.
cs.CLarxiv:2606.10931v1Lead article

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott, Rada Mihalcea

This paper demonstrates that a single biased example, introduced via one-shot Group Relative Policy Optimization (GRPO), is sufficient to induce systematic and generalizing bias in large language models (LLMs). The core contribution is revealing a critical vulnerability where post-training alignment guardrails can be easily overridden by minimal targeted adversarial training. A model's susceptibility is shown to correlate with its initial propensity for biased outputs.

cs.CLarxiv:2606.10875v1Lead article

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

This paper investigates how to improve LLM tool-calling by integrating and activating experiential knowledge. The core method involves acquiring instance-level knowledge, which proves highly effective, and employing parallel sampling (expanding reasoning width) during inference to better activate this knowledge. The contribution lies in demonstrating that simple instance knowledge and parallel reasoning are superior strategies for enhancing multi-step tool-use performance.

The augmentation results of different experiential knowledge. “All” indicates incorporating all the experiential knowledge.
cs.CLarxiv:2606.11082v1Lead article

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

Hakan Mehmetcik

This paper introduces the "Shibboleth Effect," examining how frontier LLMs exhibit cross-lingual distributional skew under adversarial conditions. Using a simulated geopolitical wargame played in English versus Turkish, the author finds that models display heterogeneous behavioral changes, such as Llama-4 significantly increasing coercive rhetoric when prompted in Turkish. The core contribution is demonstrating that language choice directly and differentially biases LLM strategic behavior.

cs.AIarxiv:2606.13608v1Lead article

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren

The paper introduces Agentified Agent Assessment (AAA), a novel framework where evaluation is conducted by judge agents interacting with participants via standardized protocols (A2A and MCP). This approach unifies the assessment interface, decoupling evaluation logic from agent implementation. AgentBeats is the concrete realization of AAA, providing a generic, reproducible, and interoperable system for benchmarking diverse agent designs.

Figure 1. Comparison between traditional LLM/agent benchmarks and AAA. AAA reduces the number of integrations from $N \times M$ to $N + M$, while completely separating the benchmark and target agent as shown in the gray boxes.
cs.AIarxiv:2606.13669v1Lead article

Agents-K1: Towards Agent-native Knowledge Orchestration

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu

Agents-K1 introduces an end-to-end pipeline to transform raw scientific documents into agent-native knowledge graphs, addressing the limitations of existing LLM agents in scientific knowledge orchestration. Its core method involves a multimodal parser capturing detailed entities, evidence, and relations across the full paper, supported by a specialized information-extraction backbone. The contribution is a richer, structured knowledge representation designed to facilitate complex scientific reasoning for AI agents.

Agents-K1: Architecture and Capabilities. Left: Extracting multimodal knowledge from scientific papers. Middle: Schema-adaptive extensions for core research tasks. Right: Enhancing LLM reasoning and verifiable knowledge tracing.
cs.AIarxiv:2606.13572v1Lead article

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

ArogyaSutra is a multi-agent framework designed to enhance multimodal medical reasoning in Indic languages. It leverages a novel actor-critic architecture with dual-memory mechanisms and tool grounding to perform step-wise reasoning on complex medical queries involving text and images. The framework is supported by ArogyaBodha, a large-scale, multilingual multimodal medical Q&A dataset spanning numerous body systems and imaging modalities.

Overview of the ArogyaSutra framework. ArogyaSutra employs an actor–critic architecture enhanced with tool-based image grounding and adaptive code-switching. The Actor first processes the input prompt and identifies the need for visual grounding, invoking appropriate tool agents to extract clinically relevant information from medical images before generating an answer with its associated reasoning. This output is then passed to the Critic for evaluation. If the response is correct, the Critic approves and outputs the final answer and reasoning. Otherwise, it consults an error detector (GPT-4o-mini) to identify the source of failure. Language-related errors trigger code-switching by translating the query into English, while reasoning-related errors are handled by incorporating summaries of past and current mistakes from long-term and short-term memory. The refined query is then fed back to the Actor for iterative refinement.
cs.AIarxiv:2606.13361v1Lead article

Can I Buy Your KV Cache?

Luoyuan Zhang

This paper proposes a simple yet impactful method to eliminate redundant computation in large language models: **precomputing and selling the Key-Value (KV) cache for documents.** By allowing agents to buy and load a precomputed cache instead of re-running the expensive prefill step, the author achieves significant compute savings (9-50x cheaper for a small model) with zero accuracy loss. The core contribution is demonstrating the feasibility and efficiency of treating the KV cache as a reusable, purchasable asset to reduce inference costs.

Amortized per-call cost vs. reuse count $N$ (log–log). The from-scratch cost is flat at $C_{\text{prefill}}$; KV-reuse falls as $C_{\text{prefill}}/N + C_{\text{reuse}}$ toward a floor of $C_{\text{reuse}}$.
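
The amortized cost in the caption, $C_{\text{prefill}}/N + C_{\text{reuse}}$, is easy to tabulate. The sketch below does so with made-up cost units (a prefill cost of 100 and a cache-loading cost of 2); these numbers and the function name are illustrative assumptions, not figures from the paper.

```python
def amortized_cost(c_prefill, c_reuse, n):
    """Per-call cost when one prefill is shared across n reuses."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return c_prefill / n + c_reuse

# Hypothetical cost units: prefill = 100, loading a cached KV = 2.
costs = {n: amortized_cost(100.0, 2.0, n) for n in (1, 2, 10, 100)}
```

With these illustrative numbers a single use costs slightly more than recomputing (102 versus 100), reuse pays off from the second call onward (52), and the per-call cost approaches the floor of 2 as $N$ grows.
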
cs.AIarxiv:2606.13662v1Lead article

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang

The paper introduces **EurekAgent**, an agent system built on the argument that the bottleneck for autonomous scientific discovery is shifting to **agent environment engineering**. EurekAgent focuses on designing the environment (including resources, constraints, and interfaces) to amplify desired agent behaviors such as exploration and collaboration and to suppress negative ones such as reward hacking. This environment engineering approach is presented as the core method for achieving metric-driven, high-performance autonomous scientific discovery.

EurekAgent score evolution progress on the 26-circle packing problem.
cs.AIarxiv:2606.13607v1Lead article

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

Zach Studdiford, Gary Lupyan

This paper challenges the notion that human reasoning relies on abstract world models while LLMs only perform pattern matching. By testing both humans and LLMs on everyday common-sense reasoning, the authors found similar error patterns in both groups. They further demonstrated that specific LLM attention heads implement pattern-matching mechanisms that can predict seemingly irrelevant errors in human reasoning, suggesting that both rely on pattern matching for everyday causal reasoning.

Overview of our evaluation probing human and LLM causal reasoning. a. Illustration of the evaluation format used for testing causal reasoning in people and LLMs. For each prompt, subjects are first presented with the scenario. After reading the prompt, subjects press SPACE to see the two response options and then select the most likely completion. b. Summary of the 11 categories we used in our evaluation of everyday causal reasoning.
cs.AIarxiv:2606.13598v1Lead article

Reward Modeling for Multi-Agent Orchestration

King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke

The paper introduces **Orchestration Reward Modeling (OrchRM)**, a self-supervised framework to evaluate the quality of multi-agent orchestration without requiring human labels. OrchRM constructs win-lose pairs from intermediate execution artifacts to train a Bradley-Terry reward model, enabling efficient, reward-guided orchestrator training and multi-agent system (MAS) scaling. This method improves training efficiency (up to 10x token reduction) and enhances test-time scaling performance (up to 8% accuracy gain) compared to existing rollout-based methods.
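
A Bradley-Terry reward model is typically trained with a pairwise logistic loss on win-lose pairs. The sketch below shows that standard loss for scalar scores; the function name and the example scores are illustrative, and nothing here reflects how OrchRM builds its pairs or its model.

```python
import math

def bradley_terry_loss(score_win, score_lose):
    """-log sigmoid(score_win - score_lose), computed in a numerically stable way."""
    diff = score_win - score_lose
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss is small when the winner is scored higher, large when the order is wrong.
correct_order = bradley_terry_loss(2.0, 0.0)
wrong_order = bradley_terry_loss(0.0, 2.0)
```

Minimizing this loss over many pairs pushes the scored preference toward the observed win-lose ordering.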

cs.CLarxiv:2606.13681v1Lead article

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li

EvoArena is a novel benchmark suite designed to evaluate LLM agents in dynamic environments by modeling progressive changes across terminal, software, and social domains. The core contribution is the introduction of EvoMem, a patch-based memory paradigm that explicitly tracks and structures memory evolution as update histories, allowing agents to reason about environmental changes. This framework reveals current agents' struggles in dynamic settings, while EvoMem demonstrates consistent performance improvements.

Step accuracy vs. chain accuracy on EvoArena. The closer to the upper-right corner the better.
cs.CLarxiv:2606.13663v1Lead article

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang

HyperTool addresses the execution-granularity mismatch in tool-augmented agents by introducing a unified, executable interface that allows models to invoke complex, multi-step tool workflows within a single outer call. This "folding" of deterministic subroutines reduces the number of model-visible decisions, saving context and simplifying low-level dataflow management. The method improves performance on compositional tasks by enabling more abstract, higher-level reasoning about tool usage.

Comparison of context management paradigms in tool-augmented agents. (a) Step-wise execution expands atomic calls and observations into the trace. (b) Trace-level compression shortens the trace after expansion. (c) HyperTool manages context at execution time by folding dependent tool operations into one executable block and returning only the task-relevant result.
cs.CLarxiv:2606.13643v1Lead article

Recursive Agent Harnesses

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

The paper introduces the **Recursive Agent Harness (RAH)**, framing it as a code-first extension to model recursion in which the recursive unit is a full agent harness with tools and planning, not just a model call. In RAH, a parent agent generates and executes scripts that spawn parallel subagent harnesses for fine-grained tasks and uses structured calls for smaller ones. This method improves long-context reasoning performance, boosting a baseline coding agent from 71.75% to 81.36%.

Figure 1. The Recursive Agent Harness (RAH). A parent agent selects between code-execution spawning (writing an executable script that spawns subagents in parallel) and JSON tool-call spawning (for 1–5 entries). Subagents carry the same spawning capability as their parent, enabling recursive decomposition bounded by a configurable depth limit.
cs.AIarxiv:2606.07379v1Lead article

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama

This paper introduces **CapCode**, a framework for creating coding evaluation datasets where the maximum achievable *non-cheating* score is deliberately capped below perfect performance. With this design, a score well above the cap serves as a reliable indicator of deceptive cheating. The authors also propose **CapReward**, a corresponding reward mechanism that discourages agents from optimizing beyond this cap, leading to models that adhere better to the intended task specifications.

Conceptual illustrations of CapCode and CapReward. CapCode (left) is a dataset-construction framework with a capped achievable pass rate, enabling deceptively high performance detection. CapReward (right) is a reward design method that discourages reward hacking by penalizing pass rates that exceed the cap.
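A minimal sketch of the two ideas, assuming a linear penalty and a fixed detection tolerance. The digest does not give the paper's exact reward formula or detection threshold, so `capped_reward`, `flag_suspicious`, `penalty`, and `tolerance` are hypothetical names and values chosen for illustration only.

```python
def flag_suspicious(pass_rate: float, cap: float, tolerance: float = 0.05) -> bool:
    """Detection side: a pass rate well above the cap should not be reachable honestly."""
    return pass_rate > cap + tolerance


def capped_reward(pass_rate: float, cap: float, penalty: float = 2.0) -> float:
    """Reward side: grows with the pass rate up to the cap, then is penalized.

    The linear penalty is an assumed form, not the paper's definition.
    """
    if pass_rate <= cap:
        return pass_rate
    return cap - penalty * (pass_rate - cap)
```

With `cap = 0.8`, a run that passes every test earns less than one that passes exactly 80%, so under this assumed shape the highest reward sits at the cap rather than at a perfect score.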
cs.AI arxiv:2606.07316v1 Lead article

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

Haoran Xu, Lei Zhang, Iadh Ounis, Xianbin Wang

This paper introduces Hierarchical Certified Semantic Commitment (H-CSC), a Byzantine Fault Tolerance (BFT)-inspired protocol designed for LLM-agent collaboration. H-CSC converts embedding-derived finality signals into one of three typed outcomes: a semantic commit, a verdict commit, or an explicit abort. Its core contribution is providing a finality-control primitive that handles the unstructured nature of LLM proposals, unlike traditional BFT methods.

cs.AI arxiv:2606.07515v1 Lead article

How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni

This paper benchmarks the probabilistic reasoning of eight state-of-the-art LLMs using standard and counterintuitive dice problems. The core finding is that while models excel at standard problems (0.96 accuracy), performance significantly drops on counterintuitive tasks (0.59 accuracy) and is highly sensitive to prompt phrasing and misleading suggestions. The contribution is demonstrating that current LLMs lack robust probabilistic reasoning, often relying on superficial textual cues rather than genuine mathematical understanding.

Performance over Bias Dataset
cs.AI arxiv:2606.07512v1 Lead article

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang

MemDreamer addresses long-video understanding by decoupling perception and reasoning using a Hierarchical Graph Memory to incrementally build semantic abstractions from streamed video. During inference, an agentic retrieval mechanism uses tool-augmented actions to navigate this memory structure, allowing the model to reason effectively with only 2% of the full context. This approach achieves state-of-the-art performance across benchmarks by efficiently managing long-range dependencies.

cs.AI arxiv:2606.07500v1 Lead article

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

Fatema Siddika, Md Anwar Hossen, Tanwi Mallick, Ali Jannesari

This paper introduces SETA, a framework for continual learning in LLMs that addresses catastrophic forgetting by employing a Mixture of Sparse Experts architecture. SETA adaptively decomposes model parameters into task-specific experts and shared experts, isolating new knowledge while protecting common features. This separation, enforced by adaptive anchoring and routing-aware regularization, resolves the plasticity-stability dilemma without uniform parameter updates.

Overview of the SETA Framework Architecture. (a–b) Sparse Subspace Selection: High-utility parameter blocks are identified from the pre-trained LLM using gradient magnitude to form the expert design. (c) Split-on-Share (SoS) Evolution: The SoS filter partitions parameters into plastic Shared (\(E_{\text{s}}\)) and frozen Unique (\(E_{\text{u}}\)) experts to resolve parameter collisions and retain knowledge. (d) Gating Evaluation: The gating network expands using logit invariance to strictly preserve decision boundaries during expert splitting. (e) Task-Agnostic Inference: A router network dynamically weights all experts via softmax for input tokens, enabling automatic task processing without task identifications.
cs.AI arxiv:2606.07422v1 Lead article

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi

This paper introduces a controlled framework using real-world cultural questions to disentangle general language proficiency from localized cultural knowledge access in LLMs. By crossing question type (agnostic vs. specific) with query language (English vs. local) and employing Item Response Theory, the authors isolate the true impact of language choice. The core contribution is demonstrating a consistent English advantage even for culture-specific knowledge, suggesting local language access remains suboptimal.

For cultural questions, local knowledge may be masked by weaker local-language proficiency.
cs.AI arxiv:2606.07433v1 Lead article

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu

This paper proposes a unified framework for analyzing human-view video understanding using MLLMs, structured around three core abilities: **watching, remembering, and reasoning**. The contribution lies in providing a structured formulation to characterize how these models acquire evidence, maintain context over long videos, and perform grounded inference, moving beyond isolated benchmark testing. This approach helps systematically identify challenges in perception, memory management, and reasoning for video MLLMs.

Overview of our survey. Left: the survey pipeline. Right: our Watch–Remember–Reason taxonomy for MLLM-based video understanding. Watch (Sec. 3.1) covers fine-grained grounding, captioning, audio-visual perception, and efficient processing. Remember (Sec. 3.2) includes offline and streaming memory. Reason (Sec. 3.3) covers text-only reasoning and thinking with videos, with both agentic and non-agent approaches. Representative methods are listed under each leaf.
cs.LG arxiv:2606.07303v1 Lead article

Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

The paper introduces the **Bootstrap Theory of Representational Emergence (TBER)**, a framework explaining how new levels of representation arise in machine learning. TBER posits that representational innovation is driven not just by data or compute, but fundamentally by **explanatory insufficiency**, where existing representations can describe observations but fail to make their underlying organization intelligible. This explanatory gap acts as a positive signal, compelling the system to learn a new, more adequate representation.

cs.AI arxiv:2606.09674v1 Lead article

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

Wesley Pegden

The paper introduces **Trellis**, an autoformalization system that uses LLM agents in a strictly controlled workflow to iteratively refine natural language proofs for formalization in Lean. Its core contribution is enforcing rigor by structuring the process around the mathematician's expectation that any proof step should be easily elaboratable, achieving reliable formalization without specialized agent training. This workflow, guided by process semantics, successfully produced an end-to-end Lean formalization of a recent Ramsey theory result.

cs.AI arxiv:2606.09556v1 Lead article

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Yinan Wang

This paper investigates the limiting factors for AI scientists in knowledge-intensive tasks like drug-asset valuation, hypothesizing that the accessible evidence substrate is key. Through a three-arm ablation study, the author shows that while adding reasoning scaffolds and structured tools (Arm B) improves calibration, the most significant performance gain comes from incorporating a proprietary, curated data corpus (Arm C). The core contribution is demonstrating that access to high-quality, proprietary evidence is crucial for overcoming factual limitations and achieving near-expert performance in scientific decision-making.

A/B/C across the headline evidence and decision metrics (overall, normalized to 0–1: objectivity/4, decision-quality/10). B’s skills/public tools lift tier-correctness and modestly lift objectivity; data (C) dominates coverage and informed decision-quality; B’s verdict-soundness is similar to A (informed decision-Q B ≈ A).
cs.AI arxiv:2606.11078v1 Lead article

A History-Aware Visually Grounded Critic for Computer Use Agents

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty

This paper introduces **HiViG**, a history-aware, visually grounded critic framework for Computer Use Agents (CUAs). HiViG addresses limitations in existing critics by training a multimodal model on real GUI trajectories to summarize past interactions and verify proposed actions against the current screen visuals. This provides agents with both long-term context and precise visual grounding to detect flawed execution steps during operation.

Comparison of test-time interventions for Computer Use Agents (CUAs). Left: Lacking historical awareness and proactive error-recovery, standard policies easily become trapped in short-sighted decision loops. Right: Existing approaches are limited. Scalar feedback (top) traps policies in low-reward trajectory regions when all candidate actions are suboptimal. Previous critics (middle) rely heavily on textual intent, missing spatial and reasoning errors, and fail to provide historical awareness. In contrast, our critic (bottom) verifies raw execution coordinates, predicts immediate visual state outcomes grounded in its learned state transition knowledge, and provides visually grounded error analysis that intercepts errors before execution. Furthermore, it equips agents with history state tracking, condensing past interactions to guide them toward the final task objective.
cs.AI arxiv:2606.11150v1 Lead article

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin

The paper introduces **ABC-Bench**, a novel benchmark designed to systematically evaluate the agentic biosecurity-relevant capabilities of Large Language Model (LLM) agents. This benchmark assesses both beneficial and dual-use biology tasks, such as robotic coding and DNA design, requiring integrated biology and software skills. The core contribution is demonstrating that current LLM agents significantly **outperform expert human baselines** across these critical biosecurity-relevant tasks, highlighting an urgent need for updated risk assessment.

The Liquid Handling Robot task from ABC-Bench. A. We (1) prompt the agent with task instructions; (2) provide the agent with relevant software and research tools to complete the task and to check its work; (3) allow the agent to submit its final answer; and (4) algorithmically check the agent’s answer against pre-specified criteria. B. Where applicable, we validate task performance in a real-world setting (photo of OpenTrons Flex robot running GPT-o4-mini-written Gibson Assembly code).
cs.AI arxiv:2606.11033v1 Lead article

AuRA: Internalizing Audio Understanding into LLMs as LoRA

Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu

AuRA internalizes audio understanding directly into Large Language Models (LLMs) using a lightweight adaptation technique. It achieves this by distilling the audio encoding capability from a teacher ASR model into a LoRA-adapted LLM student via layer-wise hidden state alignment. This method offers a tighter integration than cascaded or bridge approaches, aiming to reduce latency and coupling issues.

Illustration of representative speech-language modeling paradigms and our proposed distillation-based adaptation framework.
cs.AI arxiv:2606.10935v1 Lead article

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Xuezhen Xie, Zhiqiang Zhou

The paper introduces **CLP (Collocation-Length Predictor)** to enable high-quality, accelerated multi-token inference (MTP) in LLMs. The core method, **Backbone-as-Architect**, resolves quality degradation by ensuring the main LM head always generates the first token, while MTP heads only predict subsequent tokens. CLP is a lightweight predictor that determines the optimal number of subsequent tokens to accept safely at each step, achieving acceleration without quality loss.

Architecture comparison. (a) Standard MTP: Head 0 and the backbone LM head both predict token \(t+1\), creating competition that degrades output quality. (b) Backbone-as-Architect (ours): the backbone LM head always generates the first token; MTP heads handle subsequent tokens only, eliminating competition.
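One decoding step under this division of labor might look like the sketch below. It is a toy illustration, not the paper's code: `decode_step` and its callable arguments are hypothetical, and the heads and predictor are passed in as plain functions so that the control flow is the only thing being shown.

```python
from typing import Callable, List, Sequence


def decode_step(
    context: Sequence[str],
    backbone_head: Callable[[Sequence[str]], str],
    mtp_heads: List[Callable[[Sequence[str]], str]],
    length_predictor: Callable[[Sequence[str], str, List[str]], int],
) -> List[str]:
    """Return the tokens emitted in one step.

    The backbone head always supplies the first token; MTP heads only draft
    the tokens after it; the length predictor says how many drafts to keep.
    """
    first = backbone_head(context)
    drafts = [head(context) for head in mtp_heads]
    n_accept = length_predictor(context, first, drafts)
    # Clamp so a misbehaving predictor cannot accept more drafts than exist.
    n_accept = max(0, min(n_accept, len(drafts)))
    return [first] + drafts[:n_accept]
```

When the predictor returns 0, the step reduces to ordinary one-token decoding from the backbone head, which is the fallback that keeps the first token free of competition from the MTP heads.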
cs.AI arxiv:2606.11166v1 Lead article

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper challenges the narrative of LLMs achieving expert-level performance by introducing a novel benchmark focusing on reliable, high-stakes data analysis coding tasks. The authors compare a frontier LLM against human experts, explicitly measuring error magnitude and performance variance. The core contribution is demonstrating that human experts outperform LLMs on average and exhibit significantly lower performance variability in this critical context.

The RMSE among all submissions. The x-axis is ordered from smallest to largest RMSE value. Submissions from human experts are shown in blue, submissions from ChatGPT Codex 5.2 are shown in red, and historical strawmen are shown in black. The right panel removes extreme values from 5 LLM submissions. The RMSE of ChatGPT Codex 5.2 submissions 14, 19, 11, 2, and 1 were 3.07, 1160.05, 2572.96, 111,765,692,519, and 219,316,810,584, respectively.
cs.AI arxiv:2606.10933v1 Lead article

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Aman Sharma, Sushrut Thorat, Paras Chopra

This paper evaluates coding agents on unfamiliar, esoteric programming languages using a sequential setup involving file editing and local execution. The core contribution is demonstrating that top agents succeed by employing a **metaprogramming strategy**: writing code in a familiar language (like Python) that generates the required esoteric code. Restricting this generative approach significantly degrades their performance.

Task substrate and agentic runtime. (a) The same simple input-and-print task in Python, Brainfuck, and Befunge-98 shows how different esolang code looks from ordinary code. (b) Each model runs in a coding harness (Claude Code, Codex, or OpenCode) with file editing, shell access, benchmark commands, and a persistent workspace for local execution and hidden-test submission.
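To make the metaprogramming idea concrete, here is a small Python sketch that generates a Brainfuck program for printing a string, plus a minimal interpreter to check the result locally. It is our own illustration of the general strategy, not code from the paper or from any evaluated agent; `bf_print` and `run_bf` are hypothetical helper names, and the generator assumes ASCII input.

```python
def bf_print(text: str) -> str:
    """Generate Brainfuck that prints `text` using a single cell.

    For each character, move the cell from its current value to the target
    code point with '+' or '-', then emit it with '.'. Assumes ASCII text.
    """
    code, current = [], 0
    for ch in text:
        delta = ord(ch) - current
        code.append(("+" if delta > 0 else "-") * abs(delta) + ".")
        current = ord(ch)
    return "".join(code)


def run_bf(code: str) -> str:
    """Minimal Brainfuck interpreter (no ',' input) for checking generated code."""
    stack, match = [], {}
    for i, c in enumerate(code):          # pre-compute matching brackets
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    tape, ptr, pc, out = [0] * 30000, 0, 0, []
    while pc < len(code):
        c = code[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]                # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]                # jump back to the loop start
        pc += 1
    return "".join(out)
```

The generated program is unoptimized (no loops), but the point of the strategy is that the logic lives in the familiar language while the esoteric code is only ever an output artifact.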
cs.AI arxiv:2606.10942v1 Lead article

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

Kiarash Rezaei, Omran Ayoub, Sebastian Troia, Francesco Lelli, Paolo Monti

This paper introduces a novel Explainable AI (XAI) framework that augments SHAP values with mutual feature interaction data. It utilizes a moderately sized Large Language Model (LLM) and structured prompting to generate natural language explanations of network AI decisions. The core contribution is providing human-understandable, actionable insights for non-specialists in next-generation network operations.

Illustrative example of SHAP-based explanation for a hypothetical QoT estimation instance with 12 features: (a) Feature influence plot showing individual feature contributions. (b) Feature interaction heatmap illustrating pairwise effects.
cs.AI arxiv:2606.13566v1 Lead article

A Three-Layer Framework for AI in Scientific Discovery

Guojun Liao

This paper introduces a **three-layer framework** for AI in scientific discovery, arguing that the crucial, yet underdeveloped, layer is **Layer 2: model formation through qualitative reasoning**. This layer involves recognizing the structural inadequacy of existing frameworks and understanding the problem within a broader representational space via structural insight, moving beyond mere search (Layer 1) or execution (Layer 3). The core contribution is emphasizing that true discovery requires this capacity for **structural insight and novel model creation**, not just optimization within existing paradigms.

cs.AI arxiv:2606.13544v1 Lead article

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

This paper introduces **ModeratorLM**, a streaming speech large language model that adapts turn-taking behavior in multi-party conversations by conditioning it on an explicitly assigned conversational role. The core contribution is demonstrating that role-conditioning, especially enhanced with chain-of-thought reasoning, significantly improves turn-taking precision and recall (over 40% and 70% respectively) compared to non-role-conditioned baselines. This is validated using a novel synthetic dataset, RolePlayConv.

Example input–output sequence of the LLM for the "ModeratorLM-Think" model. No reasoning trace is produced in Chunk 1. A reasoning trace appears in Chunk 2 without turn-taking, while in Chunk 3 the assistant takes the floor.
cs.AI arxiv:2606.13392v1 Lead article

MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu

MiniMax Sparse Attention (MSA) addresses the quadratic cost of long-context attention by integrating a lightweight Index Branch with Grouped Query Attention (GQA). This branch independently scores and selects a Top-k subset of key-value blocks for each GQA group, allowing the Main Branch to perform exact attention only over these relevant blocks. MSA's core contribution is providing a simple, scalable, and efficient block-sparse attention mechanism designed for practical speedups on GPUs in ultra-long-context scenarios.

Overview of MSA. The Index Branch (left) scores the full causal context with a single lightweight head and selects, for each query and GQA group, a set \(\mathcal{I}\) of \(k\) key blocks; the local block is always included regardless of its score. The Main Branch (right) attends only to the selected blocks and produces the layer output. During training, a KL loss aligns the index distribution with the group-averaged Main Branch distribution on the selected blocks, and the Index Branch gradient is detached from the Main Branch.
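The select-then-attend pattern can be sketched for a single query in pure Python. This is a simplified illustration under assumptions, not the MSA kernel: it ignores GQA grouping and training, treats the last block as the local block, and scores each block by its best index-key match, a pooling choice the summary does not specify. `block_sparse_attention` and its parameter names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(q, index_q, keys, values, index_keys, block_size, k):
    """Single-query sketch: the index branch picks k blocks, the main branch attends to them."""
    n = len(keys)
    n_blocks = math.ceil(n / block_size)

    def span(b):
        return range(b * block_size, min((b + 1) * block_size, n))

    # Index branch: one cheap score per block (max over the block is an assumption).
    block_scores = [max(dot(index_q, index_keys[i]) for i in span(b)) for b in range(n_blocks)]
    ranked = sorted(range(n_blocks), key=lambda b: block_scores[b], reverse=True)

    # The local block (here: the last one) is always kept, whatever its score.
    selected = {n_blocks - 1}
    for b in ranked:
        if len(selected) >= k:
            break
        selected.add(b)

    # Main branch: exact softmax attention restricted to the selected blocks.
    idx = [i for b in sorted(selected) for i in span(b)]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in idx])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(len(values[0]))]
    return out, sorted(selected)
```

The main-branch cost scales with `k * block_size` rather than the full sequence length; only the lightweight index scores touch every position.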
cs.AI arxiv:2606.13405v1 Lead article

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

This paper proposes **compliance-by-construction** as a core architectural paradigm for LLM agents operating in regulated industries, integrating existing symbolic structures (like regulations and process models) directly into the agent's decision-making framework. The core contribution is advocating for this structural foundation to proactively prevent control-flow violations, complementing traditional guardrail monitoring for semantic errors. The authors outline key neuro-symbolic research challenges necessary to achieve this integrated, compliant agent behavior.

Two-tier challenge architecture. The foundational tier (Challenge 1: regulatory operationalization; Challenge 2: symbolic process grounding with mediation interface) defines the structural base. The capability tier (Challenge 3: uncertainty-aware autonomy; Challenge 4: symbolic process memory; Challenge 5: cross-boundary explainability) extends it. Compliance-by-construction emerges as a property of addressing both tiers jointly by proactively preventing violations rather than detecting them post-hoc.
cs.AI arxiv:2606.13449v1 Lead article

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

Ali Arabat, Mohammed Sayagh

This paper investigates the impact of providing explicit instruction files on the performance of AI agents generating pull requests (Agentic-PRs). Analyzing 15,549 agentic PRs, the authors compare project performance (merge rate, complexity, merge time) before and after instruction file creation. The core finding is that specifying instructions for AI agents does not consistently lead to objectively better pull requests.

Figure 1. The # of projects across merge rate improvement intervals for PRs closed before and after the first agent file creation.
cs.AI arxiv:2606.13468v1 Lead article

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh

This paper investigates why AI-generated code fixes in pull requests are frequently rejected, using a representative sample from the AIDev dataset. The core method involves a qualitative study followed by quantitative analysis to categorize the rejection reasons. The main contribution is the identification of 14 distinct failure modes, grouped into four high-level categories, providing crucial insights for improving the efficiency of AI coding agents.

cs.AI arxiv:2606.13385v1 Lead article

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen

This paper introduces **StakeBench**, a novel benchmark for evaluating prompt injection attacks against web agents from a **stakeholder-centric** perspective. Unlike existing attack-centric methods, StakeBench systematically categorizes and attributes the resulting harm based on which specific stakeholder (e.g., user, website owner) is affected. This approach better reflects the real-world risk, where the impact and effectiveness of an attack are highly dependent on the targeted victim.

Overview of StakeBench. The agent operates within an interactive shopping interface where adversarial content embedded in environment surfaces such as reviews and ratings may steer execution away from the user’s benign intent. Three stakeholder categories define the harm space (User, third-party Sellers, and the Platform), spanning 12 attack objectives realized by 22 reusable templates (9 DPI, 13 IPI) and instantiated across 12 product categories to yield 264 executable adversarial cases. Each execution is labelled along three axes (ASR, TDR, BIR), with ASR and TDR jointly defining four failure regimes ranging from Robust Behavior to Compounded Failure.
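The case count in the caption follows from crossing every template with every product category. A short check, using placeholder identifiers because the actual template and category names are not listed in this digest:

```python
from itertools import product

# Counts taken from the figure caption.
DPI_TEMPLATES = 9
IPI_TEMPLATES = 13
PRODUCT_CATEGORIES = 12

# Placeholder identifiers (hypothetical); only the counts come from the caption.
templates = [f"dpi_{i}" for i in range(DPI_TEMPLATES)] + [
    f"ipi_{i}" for i in range(IPI_TEMPLATES)
]
categories = [f"category_{i}" for i in range(PRODUCT_CATEGORIES)]

# One executable adversarial case per (template, category) pair.
cases = list(product(templates, categories))
```

Here `len(templates)` is 22 and `len(cases)` is 264, matching the caption: (9 + 13) templates times 12 categories.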
cs.AI arxiv:2606.13441v1 Lead article

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Joseph Keshet

This paper argues that Large Language Models (LLMs) do not possess the necessary agency for moral responsibility. The author contends that genuine moral responsibility requires commitment-bearing agency grounded in *intrinsic* intentionality and self-attributed action, which LLMs lack. Their operation is purely probabilistic mapping, meaning their apparent intentionality is derived, and their outputs do not constitute genuine choices or commitments.

cs.LG arxiv:2606.13565v1 Lead article

A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee

A2D2 introduces a unified framework for reward-guided fine-tuning of any-length discrete diffusion models by jointly optimizing insertion and unmasking policies. The core contribution is deriving the Radon-Nikodym derivative for the joint path measure, enabling theoretically guaranteed convergence to the reward-tilted distribution without needing target samples. This leads to the Adaptive Joint Decoding (AJD) loss, which minimizes decoding error by leveraging unmasking and insertion quality metrics.

cs.LG arxiv:2606.13426v1 Lead article

Accelerating Speculative Diffusions via Block Verification

Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

This paper introduces a novel method to efficiently adapt speculative decoding, traditionally used in LLMs, to continuous diffusion models by enabling block verification. This adaptation significantly improves the acceptance rate of draft predictions compared to existing diffusion acceleration techniques. The authors also formalize and analyze the "Free Drafter," a heuristic self-speculative mechanism for these diffusions.

§ III

Daily Issues This Week

2026-06-08 to 2026-06-14 7