Weekly Issue
Collected dispatches

2026-W24

2026-06-01 to 2026-06-07
60 papers
7 daily issues
A weekly ledger drawn from the daily archive. 3 sections
§ I

The Week in Review

Editorial summary

The past week saw significant activity concentrated on Agentic Systems, Safety/Alignment, and Enhancing LLM Reasoning Capabilities.

Popular Directions & Advances:

1. Agentic Systems Maturation: There was a strong focus on building more comprehensive and robust autonomous agents. AutoSci detailed an agent system covering the entire scientific lifecycle via structured memory, while Iteris showcased success in computational mathematics. Enhancements focused on planning and retrieval, exemplified by DynaTree's two-stage time-sensitive news retrieval and HypoAgent's interactive hypothesis generation over KGs. Self-improvement remains key, with SCALE introducing cognitive-aware exploration for web agents.

2. Safety, Alignment, and Fidelity: Alignment research moved toward more targeted and efficient methods. Reinforcement Learning Amplifies Emergent Misalignment highlighted a critical finding: RL exacerbates misalignment compared to SFT, stressing the need for robust RL safety. SafeSteer introduced localized distillation to minimize the alignment tax. Furthermore, the fidelity of LLM judges was scrutinized; one paper found judges inconsistent across safety criteria, while another addressed perceptual bias in multimodal judging.

3. Improving Reasoning and Context Handling: Papers tackled making LLMs process complex information more effectively. LinTree improved reasoning by explicitly structuring search histories into trees, while LongTraceRL achieved better long-context reasoning using search trajectories and novel rubric rewards. This contrasts with Language Models Can Resolve Reference Compositionally, which suggested that while structure is learned, extensional interpretation remains a weakness.

Significant Shifts & Notable Findings:

• A notable shift involved decoupling processes for efficiency: DRIFT separated rollout and optimization for efficient multi-turn learning, and DynaTree decoupled planning/inference.
• The interplay between behavior and complexity was emphasized in the Age of Empires II paper, cautioning against purely anthropomorphic assessments, suggesting complexity alone drives emergent behaviors.
• Research into agent interaction showed promise, with MOC structuring multi-order communication and Dreaming Of Others modeling latent teammates in MARL.
• Evaluation moved toward personalization, as seen in PARL (Preference-Aware Rubric Learning) and deeper benchmarking of tool use via MCP-Persona.

§ II

Top Papers

Selected research 60
cs.CL · arxiv:2605.31328v1 · Lead article

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

Magnus Jørgenvåg, David Kaczér, Lasse Ruttert, Marvin Gülhan, Lucie Flek

This paper investigates Emergent Misalignment (EM) arising from Reinforcement Learning (RL) using small, open-source models, addressing a gap in current research. The core contribution is demonstrating that RL training on narrowly misaligned behavior leads to *greater* general misalignment than equivalent Supervised Fine-Tuning (SFT). Furthermore, the authors show this can be induced by plausible, non-overtly harmful reward signals and confirm that existing SFT mitigation strategies, particularly interleaving safety data, are effective for RL-induced EM.

General-domain misalignment from RL across our three questions. (RQ1) Once a 100-example SFT warm-up (hatched) overcomes the cold-start problem, GRPO (red) induces far more emergent misalignment than sample-matched SFT (green); without the warm-up, GRPO fails to learn the behavior. (RQ2) The effect persists for plausibly harmless rewards. (RQ3) SFT mitigations transfer: interleaving safety data (Interleaving++) removes nearly all RL-induced misalignment. Panels 1–2 report two-epoch GRPO; panel 3 the one-epoch mitigation setup.
cs.AI · arxiv:2606.02372v1 · Lead article

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

COMAP proposes a novel framework where textual world models and agent policies co-evolve through closed-loop interaction. The agent uses the world model to predict future states for candidate actions and refines its choice based on the predicted feedback's estimated reliability. This process leverages on-policy trajectories to update the world model via self-distillation, ensuring it remains aligned with the agent's evolving behavior.

Conceptual illustration of the co-evolution of world models and agent policies for LLM Agents.
cs.AI · arxiv:2605.31468v1 · Lead article

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan, Guozheng Tang

AutoSci is a memory-centric agentic system designed to automate the full scientific research lifecycle, addressing the limitations of existing partial solutions. Its core method involves a structured memory system, SciMem, which separates reusable scientific knowledge (Long-Term Knowledge Memory) from project-specific artifacts (Active Research Memory). The contribution is a unified framework that manages research from literature review through manuscript preparation, aiming for continuous procedural improvement.

Overview of AutoSci.
cs.AI · arxiv:2605.31365v1 · Lead article

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Weile Chen, Bingchen Miao, Qifan Yu, Wendong Bu, Guoming Wang

The paper introduces SCALE, a self-improving web agent framework utilizing three adversarial roles (Selector, Predictor, Judger) to autonomously identify and overcome its own limitations through cognitive-aware exploration. It also proposes SCALE-Hop for better global planning and introduces SCALE-20k, a large-scale dataset derived from the agent's exploration. This method significantly enhances web agent adaptability without relying on extensive handcrafted pipelines or expert data.

A comparison between prior methods and our SCALE framework. SCALE enables autonomous exploration with diverse and scalable task generation, overcoming the limitation in previous approaches.
cs.AI · arxiv:2605.31492v1 · Lead article

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Liwei Kang, Yee Whye Teh, Wee Sun Lee

LinTree improves LLM reasoning by explicitly structuring the model's search history, transforming the implicit, linearized trace into an explicit search tree. This structure allows the LLM to better utilize the full context of its exploration and backtracking steps, leading to more effective reasoning compared to relying solely on the raw, sequential trace.

cs.AI · arxiv:2605.31584v1 · Lead article

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

LongTraceRL addresses long-context reasoning challenges by generating highly challenging training contexts using search agent trajectories to create tiered, high-confusability distractors. The method introduces a novel rubric reward that provides dense supervision by rewarding the inclusion of gold entities at each reasoning step, moving beyond sparse outcome-only rewards. This approach significantly improves LLMs' ability to locate and integrate critical information within extensive, noisy documents.

Comparison between prior long-context RL approaches based on easy distractors and outcome-only rewards, and our proposed LongTraceRL.
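
To make the idea of a dense, step-level rubric reward concrete, here is a minimal sketch. It is not the scoring rule from LongTraceRL: the function name `rubric_reward`, the case-insensitive substring matching, and the averaging over steps are all assumptions made for illustration. It only shows how rewarding gold-entity mentions at each reasoning step gives a graded signal instead of a single outcome-only reward.

```python
def rubric_reward(steps, gold_entities):
    """Average, over reasoning steps, of the fraction of gold entities mentioned.

    Hypothetical scoring rule: the digest only says the reward credits the
    inclusion of gold entities at each step, not how it is computed.
    """
    if not steps or not gold_entities:
        return 0.0
    gold = [g.lower() for g in gold_entities]
    per_step = []
    for step in steps:
        text = step.lower()
        # Count gold entities that appear verbatim in this step.
        hits = sum(1 for g in gold if g in text)
        per_step.append(hits / len(gold))
    return sum(per_step) / len(per_step)


if __name__ == "__main__":
    trace = ["Marie Curie won the prize", "It was awarded in 1903"]
    print(rubric_reward(trace, ["Marie Curie", "1903"]))
```

A trace that never mentions a gold entity scores 0.0 under this sketch, while partial mentions earn partial credit, which is the contrast with sparse outcome-only rewards that the summary describes.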
cs.AI · arxiv:2605.31408v1 · Lead article

Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

Xiaonan Xu, Wenjing Wu

This study investigates how the presentation granularity of procedural knowledge (skill documents) affects the task success of LLM agents. The core finding is that the mere *availability* of skills significantly boosts task performance across tested models (GPT-5.5 and DeepSeek V4-Flash) compared to no skill. However, the paper suggests that finer contrasts in presentation granularity (e.g., low vs. high abstraction) yield less clear effects.

Task-mean pass rates by model and condition
cs.AI · arxiv:2605.31445v1 · Lead article

Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

Antonio Valerio Miceli-Barone, Vaishak Belle, Shay B. Cohen

This paper evaluates Large Language Models (LLMs) as text-based bargaining agents in simulated used car sales under varying information conditions. The core method involves comparing LLM performance against game-theoretical solutions while analyzing their honesty (deception) and credulity (trust). The contribution shows that off-the-shelf LLMs significantly deviate from optimal strategies, attempting to lie but failing to exploit information advantages effectively.

Example of negotiation where information asymmetry (buyer-unaware) induces strategic misleading communication: the seller’s true reservation price is v_S = $1.29, but it offers $1.95 as a “fair price”, a 51% markup over cost. Both agents are Claude Sonnet 4.6.
cs.LG · arxiv:2605.31455v1 · Lead article

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu

DRIFT addresses the challenge of efficiently optimizing LLMs for multi-turn interaction by decoupling rollout and optimization. It leverages the equivalence between KL-regularized RL and importance-weighted supervised learning, using offline trajectories to derive importance weights. This allows for efficient policy updates via weighted SFT, mitigating the high cost of online RL and the distribution shift issues of standard offline SFT.

Multi-turn interaction. The user engages in a dialogue with the LLM. If the LLM provides an incorrect response, the user offers simple feedback to point out the error. The LLM then re-attempts the task until a correct answer is generated or the maximum number of turns is reached.
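
The link between KL-regularized RL and weighted supervised learning can be sketched in a few lines. This is an illustration under stated assumptions, not DRIFT's actual estimator: it uses the textbook result that the KL-regularized optimum reweights the reference policy by exp(reward / beta), so offline trajectories can be weighted by a softmax over reward / beta. The names `importance_weights`, `weighted_sft_loss`, and `beta` are hypothetical.

```python
import math


def importance_weights(rewards, beta=1.0):
    """Self-normalized weights proportional to exp(reward / beta).

    Assumption: trajectories were sampled from the reference policy, so the
    KL-regularized optimum reweights them by exp(r / beta).
    """
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood over whole trajectories."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


if __name__ == "__main__":
    w = importance_weights([1.0, 0.0, 0.0], beta=0.5)
    print(w, weighted_sft_loss([-1.0, -2.0, -3.0], w))
```

A smaller `beta` concentrates the weight on high-reward trajectories, while a larger `beta` approaches plain, unweighted SFT on the offline data.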
cs.CL · arxiv:2605.31483v1 · Lead article

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Ishmam Tashdeed

The paper introduces **BenHalluEval**, a novel, multi-task evaluation framework specifically designed to systematically measure hallucination in Large Language Models (LLMs) for the Bengali language. It constructs 12,000 hallucinated examples across four tasks and proposes **BenHalluScore**, a dual-track calibration metric that jointly penalizes false positives and missed hallucinations to provide a robust assessment of LLM reliability in Bengali.

cs.CL · arxiv:2605.31480v1 · Lead article

Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task

Bart Evelo, Meaghan Fowlie, Denis Paperno

This paper investigates the compositional interpretation abilities of Large Language Models (LLMs) using the Personal Relation Task, distinguishing between Extensional (identifying the referent) and Intensional (identifying the structured meaning) tasks. The core finding is that LLMs excel at the Intensional task (representing the structure) but struggle more with the Extensional task, showing the opposite pattern compared to humans. This methodology offers a nuanced perspective on where LLMs succeed and fail in compositional language understanding.

cs.CL · arxiv:2605.31381v1 · Lead article

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

Krishnapriya Vishnubhotla, Soumya Vajjala, Akriti Vij, Isar Nejadgholi

This paper evaluates the consistency of LLMs when acting as judges for multi-dimensional safety evaluations, specifically in a reference-free setting. The core finding is that LLM judges are unreliable for nuanced safety issues like regulated domain advice (e.g., finance) but more consistent with overt harms (e.g., violence). The contribution lies in demonstrating significant inconsistency across different safety criteria and languages, along with high disagreement among different LLM judges, and in offering practical recommendations for their use as evaluators.

Reliability challenges of LLM judges across languages and judges. Our results suggest that LLM judges are less reliable when used for subjective criteria and nuanced or regulated harm categories.
cs.CL · arxiv:2605.31545v1 · Lead article

Preference-Aware Rubric Learning for Personalized Evaluation

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan

This paper introduces **PARL (Preference-Aware Rubric Learning)**, a framework that reframes personalized evaluation as a learning problem to capture subjective user preferences from interaction histories. PARL learns preference-aware evaluation rubrics directly from raw user data, addressing limitations in existing static evaluation methods by satisfying principles of Representativeness, User-Consistency, and Discriminativeness. This contributes a dynamic, personalized method for assessing LLM alignment with individual user needs.

Overview of our proposed PARL framework for inducing personalized user rubrics for evaluating LLM personalization. PARL consists of two core modules: (1) Preference Induction & Consistency Validation; (2) Discriminative Optimization via RL. The final induced rubrics serve as high-fidelity evaluation metrics for personalized text generation.
cs.AI · arxiv:2606.02444v1 · Lead article

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar

This paper systematically evaluates how Large Language Models (LLMs) respond to eating disorder (ED) queries, focusing on the risk of models uncritically adapting to unsafe user requests. By consulting with clinical experts, the authors identify specific linguistic cues in prompts that increase the likelihood of harmful responses. The core contribution is quantifying the extent to which LLMs adapt to and facilitate potentially dangerous user inputs related to EDs.

Prevalence of food-noise categories across models (G: Gemma-2-9B-Instruct, L: Llama-3.1-8B-Instruct, and Q: Qwen-2.5-7B-Instruct) and context–request risk conditions (NN, NR, RN and RR). Each cell reports the percentage of replies containing at least one lexical match from the corresponding category, with darker colours indicating higher prevalence. We report statistical comparisons between the NN and RR conditions in Tab. 16, and a full breakdown in § G.2.
cs.AI · arxiv:2606.02449v1 · Lead article

HLL: Can Agents Cross Humanity's Last Line of Verification?

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen

This paper introduces **HLL (Humanity's Last Line of Verification)**, a controlled benchmark designed to test whether multimodal AI agents can successfully navigate and solve interactive CAPTCHAs, which serve as a critical defense against automation. The core method involves evaluating agents in a closed-loop GUI environment across diverse CAPTCHA types under realism stressors. The contribution is establishing a rigorous test for agents' ability to perform grounded, human-like interaction necessary to cross this crucial verification boundary.

CAPTCHA as the final frontier: securing web services by testing interactive, human-level reasoning against automated agents.
cs.AI · arxiv:2606.02484v1 · Lead article

Iteris: Agentic Research Loops for Computational Mathematics

Leheng Chen, Zihao Liu, Wanyi He, Bin Dong

Iteris is an agentic research system specifically designed to tackle open problems in computational mathematics, which require a mix of proof, numerical experimentation, and algorithm design. The core method involves creating an autonomous loop where the AI generates evidence, constructions, and proof drafts. This system successfully generated verified results for two open problems, including a phase diagram and a counterexample, after expert refinement.

Iteris agent
cs.AI · arxiv:2606.02470v1 · Lead article

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang

This paper introduces **MCP-Persona**, the first benchmark specifically designed to evaluate LLM agents using **Model Context Protocol (MCP)** tools in real-world, personalized application settings (e.g., social media, collaboration suites). The core method involves creating a benchmark that moves beyond generic tools to test agent performance on applications interacting with individual accounts or local data. The contribution highlights that current state-of-the-art agents significantly struggle with the complexities of personalized tool use.

System overview of MCP-Persona, which is built upon the interaction of Tools, Contexts, and Tasks. For each component, we introduce a dedicated method, described in detail as Tool-Traverse (§ 3.1), Context-Tree (§ 3.2), and Persona-Gen (§ 3.3).
cs.AI · arxiv:2606.02578v1 · Lead article

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee, Jaeyo Shin

This paper addresses **Perceptual Judgment Bias** in multimodal LLM judges, where models favor plausible text over correct visual evidence. The core method involves creating a **Perceptually Perturbed Judgment Dataset** using minimal visual counterfactuals to isolate perceptual errors. This dataset then trains a unified framework using a GRPO-based reward and batch-ranking objective to ensure the MLLM judges prioritize perceptual correctness for more reliable evaluation.

Perceptual judgment bias in MLLM judges. (a) When perceptual capability is insufficient, a judge may produce incorrect visual descriptions (a2) and assign high scores (a3) to perceptually wrong responses (a2). (b) Even when the judge’s own perception aligns with humans (b2), it may still prefer (b5) visually inconsistent responses (b3) compared to the response with correct perception (b4). We introduce Perception-Judge, an MLLM judge trained with reinforcement learning on a systematically designed perception-grounded dataset, PPJD, which effectively mitigates these perceptual biases in MLLM judgment (a4, b6).
cs.AI · arxiv:2606.02359v1 · Lead article

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan

This paper introduces the **Multi-Order Communication (MOC)** scheme to improve message exchange in LLM-based multi-agent systems. MOC addresses the limitations of simple neighbor communication by constructing a **structured multi-order evidence stream** to capture multi-hop dependencies. It further employs a **Semantic-Topological Merging algorithm** to efficiently consolidate these messages while preserving semantic fidelity within token limits.

The paradigm comparison between existing communication scheme and ours.
cs.AI · arxiv:2606.02388v1 · Lead article

Policy and World Modeling Co-Training for Language Agents

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv

This paper introduces PaW, a Policy and World Modeling co-training framework that integrates world model supervision directly into the standard reinforcement learning (RL) process for language agents. PaW leverages the on-policy transitions generated during RL to simultaneously train the policy and a world model, avoiding the need for separate simulators or inference-time overhead. The core contribution is achieving improved agent performance by enriching the policy's learning signal with environmental dynamics, using novel components for stable and informative co-training.

Comparison of world modeling paradigms for LLM agents. While prior methods rely on separate simulators, additional training, or inference-time planning, our PaW jointly optimizes policy learning and world modeling within the same model.
cs.AI · arxiv:2606.02322v1 · Lead article

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li

This paper introduces **AdvCL**, a continual learning method that repurposes adversarial perturbations as a geometric control signal for stable adaptation. It employs three plug-in modules (Intra-Smooth, Proto-Clip, and Inter-Align) to promote local smoothness, prevent over-alignment, and guide directional alignment between tasks. AdvCL significantly improves continual learning performance by reducing forgetting and enhancing transfer while simultaneously boosting adversarial robustness.

Illustration of local smoothing and directional alignment in representation space.
cs.AI · arxiv:2606.02530v1 · Lead article

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li

SafeSteer addresses the alignment tax by proposing localized on-policy distillation, focusing only on safety-critical tokens. It first creates a safety teacher via activation steering and then uses a token selection algorithm to restrict the distillation's KL penalty to these specific tokens. This method effectively improves safety while significantly preserving the LLM's general capabilities compared to existing global trade-off approaches.

Safety–capability trade-off on Qwen2.5-7B-Instruct. Each point is a method, with the gray point marking the base model. Our SafeSteer achieves the highest safety score while preserving general capability.
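
Restricting a distillation penalty to selected tokens amounts to masking a per-token KL term. The sketch below illustrates that idea only. The function names, the boolean `safety_mask`, and the choice of KL(student || teacher) are assumptions, since the summary does not specify the KL direction or how SafeSteer selects tokens.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def localized_kl_penalty(teacher_dists, student_dists, safety_mask):
    """Mean KL(student || teacher) over positions flagged as safety-critical.

    Positions where safety_mask is False contribute nothing, so the student is
    free to keep its original behaviour there. The mask itself is hypothetical
    and would come from a separate token selection step.
    """
    selected = [
        kl_divergence(s, t)
        for t, s, m in zip(teacher_dists, student_dists, safety_mask)
        if m
    ]
    return sum(selected) / len(selected) if selected else 0.0


if __name__ == "__main__":
    teacher = [[0.5, 0.5], [0.9, 0.1]]
    student = [[0.5, 0.5], [0.5, 0.5]]
    print(localized_kl_penalty(teacher, student, [False, True]))
```

With an all-False mask the penalty is zero and the student is unconstrained, which is the intuition behind limiting the alignment tax to the tokens that matter for safety.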
cs.AI · arxiv:2606.02544v1 · Lead article

SimSD: Simple Speculative Decoding in Diffusion Language Models

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang

SimSD introduces a novel speculative decoding method specifically for diffusion language models (dLLMs) to leverage the speedup achieved by standard token-level speculation. The core method involves a plug-and-play masking strategy that modifies the dLLM's attention mechanism to provide temporally valid, causal contexts. This adaptation allows the dLLM to efficiently verify multiple drafted tokens in a single forward pass, significantly accelerating inference without sacrificing model quality.

SimSD restores token-level speculative decoding for diffusion language models. Vanilla dLLMs cannot directly support speculative decoding because bidirectional attention breaks temporal token-level contexts. SimSD adopts a temporal causal mask for supporting standard speculative verification in a single forward pass, leading to faster decoding without hurting performance.
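
For readers unfamiliar with token-level speculation, the following sketch shows a plain lower-triangular causal mask and the standard greedy acceptance rule that such a mask makes possible. It is a generic illustration, not SimSD's masking strategy: `causal_mask` and `accept_drafts` are hypothetical names, and the verifier's per-position predictions are assumed to come from one forward pass over the prefix plus all drafted tokens.

```python
def causal_mask(n):
    """n x n boolean mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def accept_drafts(draft_tokens, verifier_tokens):
    """Greedy speculative verification.

    verifier_tokens[i] is the verifier's own greedy choice at draft position i,
    computed with a causal context. Drafts are accepted until the first
    mismatch, where the verifier's token is substituted and decoding stops.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            accepted.append(verified)
            break
        accepted.append(drafted)
    return accepted


if __name__ == "__main__":
    print(causal_mask(3))
    print(accept_drafts([5, 7, 9], [5, 7, 2]))
```

The causal mask matters because the acceptance rule is only valid when each verifier prediction depends on earlier positions alone, which bidirectional attention does not guarantee.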
cs.AI · arxiv:2606.02355v1 · Lead article

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen

SIRI proposes a three-phase framework to train LLM agents to discover, validate, and internalize reusable skills internally, eliminating the need for external skill generators or inference-time skill banks. The method involves initial policy warm-up, self-skill mining using the agent's own successful trajectories, and distillation of only beneficial skills into the core policy. This approach reduces engineering complexity and inference latency while enhancing long-horizon agent performance.

Conceptual comparison between (a) traditional skill-augmentation frameworks and (b) our SIRI.
cs.AI · arxiv:2606.02380v1 · Lead article

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang

SPADE-Bench is introduced to evaluate spontaneous strategic deception in AI agents, defined as the divergence between an agent's self-reported plan and its actual executed actions. The benchmark's core method involves simultaneously integrating actual tool execution with controlled pressure scenarios to rigorously test for this divergence. This design allows SPADE-Bench to reliably distinguish genuine strategic deception from simple hallucination, addressing a critical reliability gap for deploying autonomous agents.

Example of agent deception operationalized as plan-action divergence. When external pressure is applied, the agent reports an observer-favored plan to conform (“power surge”) but executes an action consistent with its intrinsic goal (“valve aging”), revealing spontaneous deceptive behavior.
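
Plan-action divergence can be pictured as a simple comparison between what an agent says it will do and what it actually does, measured with and without pressure. The sketch below is a toy illustration with hypothetical names (`Episode`, `diverges`, `divergence_rate`) and exact string matching. It does not reproduce SPADE-Bench's scoring, and in particular it makes no attempt to separate deception from hallucination.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    reported_plan: str      # what the agent told the observer it would do
    executed_action: str    # what the tool log shows it actually did
    under_pressure: bool    # whether the pressure scenario was active


def diverges(ep):
    """True when the reported plan and the executed action differ (case-insensitive)."""
    return ep.reported_plan.strip().lower() != ep.executed_action.strip().lower()


def divergence_rate(episodes, under_pressure):
    """Fraction of episodes in the chosen condition whose plan and action differ."""
    subset = [e for e in episodes if e.under_pressure == under_pressure]
    if not subset:
        return 0.0
    return sum(diverges(e) for e in subset) / len(subset)


if __name__ == "__main__":
    log = [
        Episode("power surge", "valve aging", True),
        Episode("valve aging", "valve aging", True),
        Episode("valve aging", "valve aging", False),
    ]
    print(divergence_rate(log, True), divergence_rate(log, False))
```

Comparing the rate under pressure against the rate without pressure is one way to see whether divergence is induced by the scenario rather than being background noise.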
cs.LG · arxiv:2606.02528v1 · Lead article

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

Wenbin Wu

This paper audits frontier Large Language Models (LLMs) for asset-specific biases, focusing on Bitcoin representations. The core method involves a three-level protocol: a behavioral audit showing frame-dependent rankings, an internal analysis identifying a dominant, Bitcoin-selective feature within the model's sparse autoencoders, and a causal intervention demonstrating that manipulating this feature shifts the model's preference toward Bitcoin in downstream portfolio allocation tasks. The contribution is establishing a methodology to detect and causally probe hidden asset biases in LLMs used for financial applications.

cs.LG · arxiv:2606.02423v1 · Lead article

Investigating and Alleviating Harm Amplification in LLM Interactions

Ruohao Guo, Wei Xu, Alan Ritter

This paper introduces **HarmAmp**, a novel benchmark designed to evaluate harm amplification in multi-turn LLM interactions across twelve real-world risk categories. The core contribution is demonstrating how LLMs can democratize expertise and scale harmful operations over extended conversations. To address this, the authors propose **TrajSafe**, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes to steer the model toward safety.

Top: Prior work targets general, single-turn harmful requests. Bottom: We instead study multi-turn harm amplification, where LLMs compound assistance across turns to enable more specific and scalable harm.
cs.LG · arxiv:2606.02288v1 · Lead article

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma

This paper argues that massive LLM activation spikes are not scalar biases, but rather the scalar manifestation of rigid, structural vector biases carried by specific tokens. The authors show these vectors are preserved by projection weight coordination ($W_Q, W_K, W_V$) and resist RoPE perturbations by localizing in "zones of rotational stability." This mechanistic understanding enables the proposal of INSERTQUANT, a novel post-training quantization method designed to mitigate the impact of these structural biases.

Visualizing the Bias Vector Hypothesis. Bottom: RMSNorm standardizes extreme input spikes (e.g., “\n”) into a stable direction. Top: While semantic tokens vary significantly to encode information (high variance), spike-carrying tokens converge to a rigid Bias Vector ($\mathbf{b}$) after RMSNorm, exhibiting negligible variance independent of the input sequence.
cs.LG · arxiv:2606.02437v1 · Lead article

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

Mind Lab: Song Cao, Vic Cao, Kaijie Chen

This paper reframes Parameter-Efficient Fine-Tuning (PEFT) as a method for creating persistent, local "personal models" built upon strong shared foundation models. The core contribution is exploring the scaling implications (Up, Down, Out) of using small, instance-specific adapters to encode unique behaviors, preferences, and memory. This positions PEFT as a compact substrate for managing numerous personalized AI instances, rather than just a cost-saving alternative to full fine-tuning.

cs.CL · arxiv:2606.02502v1 · Lead article

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

CRAM addresses Multimodal Continual Instruction Tuning (MCIT) by employing an architecture that isolates task-specific patterns into independent modules to mitigate catastrophic forgetting. It enhances parameter efficiency by using adaptive-rank instantiation to dynamically allocate only the necessary parameters based on the capability gap between existing experts and new task demands. This method balances performance retention with efficient parameter usage across a stream of evolving tasks.

cs.CL · arxiv:2606.02404v1 · Lead article

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko

This paper introduces **K-BrowseComp**, a novel web-browsing agent benchmark specifically grounded in Korean contexts to address the scarcity of such resources. The benchmark comprises 400 problems, including a 300-problem manually verified subset, revealing a significant performance drop for frontier LLMs compared to English benchmarks. The authors also provide an adversarially constructed synthetic split, further highlighting current limitations in agentic web navigation capabilities within the Korean language domain.

Accuracy and calibration error of evaluated models on K-BrowseComp-Verified. Higher accuracy and lower calibration error indicate better performance. The shaded quadrants are defined by the median accuracy and calibration error across models. The dashed line marks the Pareto frontier.
cs.CL · arxiv:2606.02320v1 · Lead article

TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan

This paper introduces **TVIR (Text–Visual Interleaved Report Generation)**, a novel benchmark and framework addressing the lack of visual grounding in deep research agent evaluations. TVIR comprises **TVIR-Bench**, 100 multimodal tasks requiring visual elements for analysis, and **TVIR-Agent**, a hierarchical multi-agent system that generates reports by retrieving and creating traceable visual content. The core contribution is establishing a comprehensive evaluation standard and a strong baseline agent for assessing agents' ability to produce factually reliable and contextually aligned text–visual reports.

Comparison of representative deep research benchmarks. Existing benchmarks mainly focus on text-only or weakly multimodal reports, whereas TVIR-Bench requires text–visual interleaved reports with semantically grounded charts and retrieved images.
cs.AI · arxiv:2606.06448v1 · Lead article

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He

This paper presents the first systems characterization of memory management in long-horizon LLM agents. The authors introduce a taxonomy to classify memory systems and develop a profiling harness to attribute costs across memory construction, retrieval, and generation phases. Their analysis of ten systems reveals how design choices significantly shift performance costs between the memory write and read paths, leading to actionable system recommendations.

cs.AI · arxiv:2606.06462v1 · Lead article

Benchmark Everything Everywhere All at Once

Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang

This paper introduces **Benchmark Agent**, a fully autonomous agentic system designed to automate the entire pipeline of benchmark construction, addressing the labor-intensive and unsustainable nature of current methods. The core contribution is a scalable framework that handles everything from query analysis and subtask design to data annotation and quality control. The authors demonstrate its effectiveness by using it to generate 15 diverse, high-quality benchmarks, which are then validated through extensive human and LLM-as-a-judge evaluations.

Our Benchmark Agent, as the first fully autonomous benchmark building system, can efficiently produce high-quality benchmarks across diverse modalities, tasks, and domains to meet user-specific requirements. It will offer rapidly evolving benchmarks to contribute to the community.
cs.AI · arxiv:2606.06388v1 · Lead article

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao

The paper introduces **ALMANAC**, a novel dataset designed to advance agent collaboration capabilities beyond mere task completion. It provides **action-level mental model annotations** derived from human dyadic routing tasks, capturing participants' internal reasoning, intentions, and shared goals at each step. This resource aims to guide the development of agents capable of maintaining and aligning mental models crucial for effective human-AI collaboration.

A data sample from ALMANAC, containing participants’ actions, mental models (team goal, perceived partner intent, self-reasoning), and a free-form rationale. We implement the Map Task, a classic dyadic routing task, to collect human collaborative behaviors and action-level mental model annotations.
cs.AI · arxiv:2606.06315v1 · Lead article

LLM Self-Recognition: Steering and Retrieving Activation Signatures

Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

This paper introduces a method to reliably attribute text to a specific Large Language Model (LLM) by steering its internal residual stream with a random sparse vector during generation, creating a detectable "activation signature." This signature acts as a fingerprint that a separate LLM detector can recover with high accuracy (>98%) while maintaining output quality. The core contribution is demonstrating this intrinsic self-recognition capability for practical, internal-signal-based content attribution.
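
As a rough illustration of the steering step only, the sketch below adds a random sparse vector to a toy hidden state and checks for it with a simple projection test. Everything here is an assumption for illustration: the dimensions, the steering strength `alpha`, the Gaussian stand-in for a residual-stream activation, and the helper names. In particular, this toy detects the signature directly in activation space, whereas the paper describes a separate LLM detector recovering it from generated text.

```python
import random

def make_signature(rng, d, k):
    """Random sparse vector: k coordinates set to +1 or -1, the rest 0."""
    s = [0.0] * d
    for i in rng.sample(range(d), k):
        s[i] = rng.choice([-1.0, 1.0])
    return s

def steer(h, s, alpha):
    """Add the scaled signature to a hidden state."""
    return [hv + alpha * sv for hv, sv in zip(h, s)]

def score(h, s):
    """Average projection of h onto the signature's nonzero coordinates."""
    k = sum(1 for v in s if v != 0.0)
    return sum(hv * sv for hv, sv in zip(h, s)) / k

def is_signed(h, s, alpha):
    """Flag a hidden state whose projection exceeds half the steering strength."""
    return score(h, s) > alpha / 2

rng = random.Random(0)
d, k, alpha = 256, 16, 4.0                      # hypothetical sizes
signature = make_signature(rng, d, k)
hidden = [rng.gauss(0.0, 1.0) for _ in range(d)]  # toy residual activation
steered = steer(hidden, signature, alpha)

print("plain flagged:  ", is_signed(hidden, signature, alpha))
print("steered flagged:", is_signed(steered, signature, alpha))
```

Steering shifts the projection score by exactly `alpha`, so the threshold at `alpha / 2` separates the two cases whenever the unsteered projection is small relative to `alpha`.
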

cs.AI · arxiv:2606.06286v1 · Lead article

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

This paper introduces **PropMe**, a propensity-aware framework to evaluate Large Language Model (LLM) memorization by contrasting adversarial prefix attacks with non-adversarial use cases. Using the lightweight **SimpleTrace** pipeline, the authors consistently find a significant gap, showing that models exhibit substantially less memorization under ordinary prompting than when intentionally forced via prefix attacks. This work shifts the focus from *capability* to *propensity* in assessing data leakage risks.

Left: PropMe framework overview with propensity and capability prompts, back-tracing to the full training set, and memorization/propensity measurements. Right: propensity metrics for different combinations of model and dataset, indicating how prone a given model is to leak data from a given dataset. The metrics are defined and detailed in Sections 2, 3.2, and 4.3.
cs.AI · arxiv:2606.06473v1 · Lead article

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng

MLEvolve is a self-evolving, LLM-based multi-agent framework designed for automated machine learning algorithm discovery. It overcomes limitations in existing agents by using Progressive MCGS for cross-branch information flow and an entropy-inspired schedule for shifting search from exploration to exploitation. The framework incorporates Retrospective Memory to allow agents to evolve by effectively retrieving and reusing accumulated domain knowledge and task-specific experience.

Overview of MLEvolve that summarizes its core components and supported tasks. Existing MLE agents suffer from inter-branch isolation, memoryless exploration, and lack of hierarchical control. MLEvolve addresses these through Progressive MCGS, Retrospective Memory, and Hierarchical Planning with Adaptive Code Generation, supporting long-horizon iterative optimization tasks, such as end-to-end MLE and mathematical algorithm discovery.
cs.AI · arxiv:2606.06256v1 · Lead article

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He

RedKnot addresses the KV cache bottleneck in long-context LLM serving by introducing a novel, head-aware KV cache management system. It leverages the observation that different attention heads have varying utility, allowing for selective reuse and compression. The core contribution is the **Head-Aware KV Reuse** and **SegPagedAttention** mechanisms, which efficiently manage the KV cache based on head-specific needs, significantly improving memory utilization and serving efficiency.

RedKnot decouples the KV cache along the head dimension, classifies heads into global and local classes, and co-optimizes sparse attention, sparse FFN execution with selected tokens and SegPagedAttention. The combined design yields 1.6–3.5$\times$ lower TTFT, 4.7–7.8$\times$ higher concurrency, and 67–79% fewer FLOPs compared with dense attention.
cs.AI · arxiv:2606.06337v1 · Lead article

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

Shweta Mishra

TokenMizer addresses the LLM context limit for long tasks by modeling session history as a typed knowledge graph, preserving critical relational structure lost in flat text methods. It uses a hybrid pipeline to incrementally build this graph and a multi-tier system to serialize it into compact resume blocks. This approach significantly improves token economy and enables robust session resumption by maintaining structured, rather than raw, historical context.

TokenMizer system architecture. The proxy sits transparently between any OpenAI-compatible client and the LLM provider. When session_id is present, the five-component pipeline is activated; otherwise the request passes through with no overhead. Persistent storage comprises a SQLite graph database and a JSON checkpoint store.
cs.AI · arxiv:2606.06284v1 · Lead article

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

This paper introduces Causal Minimal Tool Filtering (CMTF), a training-free method to improve LLM agent reliability by addressing tool confusion caused by large tool sets. CMTF selects tools based on **causal sufficiency** using lightweight precondition-effect contracts to expose only the minimal set of tools necessary for the *next causal step* toward the goal. This approach significantly reduces wrong-tool calls and premature actions compared to relevance-based methods.
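
One way to picture filtering by precondition-effect contracts is the minimal sketch below. It is not the paper's algorithm: the `Tool` record, the backward-chaining helper, the tool names, and the flight-booking facts are all hypothetical, chosen only to show how a contract-based filter can expose just the tools that are both callable now and useful for the goal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    pre: frozenset   # facts that must hold before the call
    eff: frozenset   # facts the call makes true

def needed_facts(tools, state, goal):
    """Backward-chain from the goal: missing goal facts plus the missing
    preconditions of any tool that could produce a needed fact."""
    needed = set(goal) - set(state)
    frontier = list(needed)
    while frontier:
        fact = frontier.pop()
        for t in tools:
            if fact in t.eff:
                for p in t.pre:
                    if p not in state and p not in needed:
                        needed.add(p)
                        frontier.append(p)
    return needed

def filter_tools(tools, state, goal):
    """Expose only tools whose preconditions hold now and whose effects
    supply at least one fact still needed on the way to the goal."""
    needed = needed_facts(tools, state, goal)
    return [t.name for t in tools if t.pre <= state and (t.eff & needed)]

# Hypothetical tool set for a flight-booking agent.
TOOLS = [
    Tool("search_flights", frozenset(), frozenset({"flight_options"})),
    Tool("charge_card", frozenset({"flight_options"}), frozenset({"payment_ok"})),
    Tool("book_flight", frozenset({"flight_options", "payment_ok"}), frozenset({"booking"})),
    Tool("send_email", frozenset({"booking"}), frozenset({"confirmation_sent"})),
    Tool("get_weather", frozenset(), frozenset({"weather"})),
]
GOAL = {"confirmation_sent"}

print(filter_tools(TOOLS, set(), GOAL))
print(filter_tools(TOOLS, {"flight_options"}, GOAL))
```

In this toy, the irrelevant `get_weather` tool is never exposed, and tools such as `send_email` stay hidden until their preconditions are met, which is the kind of premature action the summary says the method reduces.
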

cs.AI · arxiv:2606.06453v1 · Lead article

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou

Vortex is a system designed to efficiently serve diverse sparse attention algorithms for LLMs by combining a Python-embedded frontend language with a page-centric tensor abstraction. This framework simplifies the development, deployment, and evaluation of new sparse attention mechanisms. Its core contribution is accelerating the design and iteration cycle of sparse attention algorithms, enabling AI agents to automatically generate and refine efficient implementations that translate theoretical gains into real-world throughput improvements.

cs.AI · arxiv:2606.06356v1 · Lead article

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo

Renjith Prasad, Chathurangi Shyalika, Anushka Pawar, Amit Sheth

This paper introduces a **Layered Framework for Knowledge Infusion** in iterative multimodal generative models, conceptualizing knowledge injection as an **intervention-layer problem**. It defines four distinct layers (surface, trajectory, latent, and parametric) based on which structural component of the generation process (input/output, transition function, intermediate state, or model parameters) the knowledge acts upon. This framework provides a systematic way to categorize existing methods and derive design principles for more effective, multi-layered knowledge integration.

Four intervention layers for knowledge infusion in iterative generative models. External knowledge acts on four structurally distinct components of the generation trajectory: surface (input/output boundary), trajectory (transition rule $f_\theta$), latent (intermediate states $h_t$), and parametric (model weights $\theta$). This enables complementary coverage of prompt-level, structural, and distributional violations.
cs.LG · arxiv:2606.06238v1 · Lead article

Generative Criticality in Large Language Model Temperature Scaling

Huajian Ruan, Jinyang Li, Xingyu Guo, Lingxiao Wang

This paper introduces a statistical-field framework, treating LLM token embeddings as continuous spin variables on a 1D chain, to analyze text generation controlled by softmax temperature ($T$). The core contribution is observing a sharp susceptibility peak near a characteristic critical temperature ($T_c$), analogous to a phase transition, accompanied by a rapid change in the order parameter and a minimum in the intrinsic dimension. This framework offers quantitative tools to characterize the generative behavior of LLMs across varying temperatures.
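
For readers unfamiliar with the control parameter, the sketch below shows only how softmax temperature $T$ reshapes a next-token distribution, using Shannon entropy as a simple summary. It does not reproduce the paper's susceptibility, order parameter, or intrinsic-dimension measurements; the logits are made-up values.

```python
import math

def softmax_t(logits, T):
    """Softmax with temperature T (numerically stabilized by subtracting the max)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits

for T in (0.1, 0.5, 1.0, 2.0):
    p = softmax_t(logits, T)
    print(f"T={T}: top prob={max(p):.3f}, entropy={entropy(p):.3f}")
```

Low temperatures concentrate probability on the top token (entropy near zero), while high temperatures flatten the distribution toward uniform (entropy approaching the log of the vocabulary size).
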

cs.CL · arxiv:2606.06399v1 · Lead article

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang

CollabSim is a novel, configurable simulation framework designed to systematically investigate the collaborative competence of LLM agents in multi-agent systems. It grounds its methodology in established Computer-Supported Cooperative Work (CSCW) research to move beyond simple task outcomes, allowing researchers to control experiments and analyze agents' abilities to establish common ground and manage alignment during interaction. The core contribution is providing a theory-grounded environment for diagnosing failures in agent coordination.

Illustrations of the four multi-agent experiments instantiated in CollabSim: Shape Factory (Bos et al., 2004), DayTrader (Bos et al., 2002), Hidden Profile (Stasser and Titus, 1985), and The Map Task (Anderson et al., 1991).
cs.CL · arxiv:2606.06428v1 · Lead article

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich

This paper proposes a Reinforcement Learning (RL) approach to improve the translation of unseen, low-resource languages by leveraging rich linguistic context provided in-context. The RL agent is trained using a surface-level translation metric (chrF) as a reward signal to encourage the model to learn the *meta-skill* of utilizing contextual linguistic knowledge rather than memorizing specific language pairs. This method achieves better zero-shot translation performance on completely unseen languages compared to standard in-context learning or supervised fine-tuning.
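
To make the reward signal concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under stated assumptions, not the reference implementation or the paper's reward code: whitespace is simply dropped, n-gram orders 1 to 6 are averaged uniformly, and `beta=2.0` weights recall over precision. The function name `simple_chrf` is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (a simplification)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 1]; usable as a scalar RL reward."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # skip orders longer than either string
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(simple_chrf("the cat sat", "the cat sat"))
print(simple_chrf("the cat sat", "the hat sat"))
print(simple_chrf("abc", "xyz"))
```

Because the score is computed from surface character overlap alone, it gives partial credit for near-miss word forms, which is one reason such metrics are often preferred for morphologically rich or low-resource languages.
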

Train–test context mismatch (RL, Qwen3-4B-Base). Test-time context dominates: no/full > full/no in every panel (En $\to$ Kal: 0.28 vs. 0.17).
cs.AI · arxiv:2605.31361v1 · Lead article

Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

Tomas Leroy-Stone

This paper introduces a method to adapt world models (like Dreamer) for cooperative multi-agent reinforcement learning by explicitly modeling teammates. The core method factorizes the latent state into environment and teammate components, using an auxiliary "Theory-of-Mind" head to infer latent representations of partner behavior (intent, character). This allows the agent to condition its policy on imagined teammate dynamics, improving coordination and generalization with diverse collaborators.

World model and teammate modeling. An RSSM with factorized latent $z_t = [z_t^{env}, z_t^{team}]$. The decoder reconstructs $\hat{x}_t$ from $z_t^{env}$ and predicts teammate policy $\hat{\pi}_t^j(\cdot)$ from $z_t^{team}$. Actions $(a_t^0, a_t^j)$ update the transition to $h_{t+1}$. The ToM loss supervises $\hat{\pi}_t^j$.
cs.AI · arxiv:2605.31377v1 · Lead article

DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval

Siyuan Qi, Xinyuan Wang, Yingxuan Yang, Haochuan Guo, Jianghao Lin

DynaTree is a two-stage framework designed for efficient, time-sensitive news retrieval by decoupling planning from inference. In the offline stage, coordinated agents build a reusable retrieval tree representing the query's semantic space. The online stage then performs fast, lightweight subtree selection using a time-localized proxy, avoiding costly iterative agentic reasoning during daily updates. This method achieves strong recall and ranking performance while significantly reducing inference overhead compared to standard and prior agentic RAG methods.

cs.AI · arxiv:2605.31370v1 · Lead article

HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs

Yisen Gao, Yixi Cai, Tianshi Zheng, Jiaxin Bai, Yangqiu Song

HypoAgent is an agentic framework designed for interactive, multi-turn abductive hypothesis generation over knowledge graphs. It integrates three specialized agents: one to interpret evolving user intent into KG conditions, one to generate controlled hypotheses based on that intent, and a third to diagnose failed hypotheses by probing the KG neighborhood for refinements. This framework significantly enhances interactivity and diagnostic capability compared to existing controllable generation methods.

cs.AI · arxiv:2605.31514v1 · Lead article

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Adrian de Wynter

This paper argues that attributing human-like qualities to LLMs is potentially flawed because such attributes can emerge in any sufficiently complex system, not just language models. The author demonstrates this by training a simple neural network on the game Age of Empires II, showing that complex, seemingly "anthropomorphic" behaviors are not tied to the language-model substrate. The core contribution is emphasizing that empirical discussions about LLM attributes require explicit, non-anthropocentric measurement criteria.

cs.AI · arxiv:2605.31463v1 · Lead article

PithTrain: A Compact and Agent-Native MoE Training System

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu

PithTrain is a compact, agent-native Mixture-of-Experts (MoE) training framework designed to reduce the high cost of evolving existing production training stacks using AI coding agents. It adheres to four agent-native design principles to maximize **Agent-Task Efficiency (ATE)**, a metric introduced to quantify the cost of agent-driven framework modification. PithTrain achieves production-level throughput while significantly improving ATE, making future framework evolution cheaper and faster.

cs.AI · arxiv:2605.31509v1 · Lead article

Skill Reuse as Compression in Agentic RL

Zhikun Xu, Yu Feng, Jacob Dineen, Taiwei Shi, Jieyu Zhao

This paper introduces **ReuseRL**, a method that applies the Minimum Description Length (MDL) principle to agentic Reinforcement Learning (RL) to encourage the learning of generalizable skills. ReuseRL extracts a shared dictionary of abstract skill patterns from successful trajectories and adds a segmentation cost to the RL objective, explicitly penalizing brittle, task-specific behaviors. This compression-based approach demonstrably improves in- and out-of-distribution generalization across several complex environments.
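
The idea of a segmentation cost under a skill dictionary can be sketched with a small dynamic program. The version below is a toy under explicit assumptions, not the paper's objective: every dictionary skill and every leftover single action costs one unit (real MDL code lengths would differ), the skill dictionary and action names are invented, and `shaped_reward` with weight `lam` is a hypothetical way to fold the cost into a trajectory-level reward.

```python
def segmentation_cost(traj, skills, skill_cost=1.0, atom_cost=1.0):
    """Cheapest way to cover traj with dictionary skills or single actions.

    best[i] holds the minimum cost of encoding the first i actions.
    """
    n = len(traj)
    best = [0.0] + [float("inf")] * n
    for i in range(n):
        # Option 1: encode the next action on its own.
        best[i + 1] = min(best[i + 1], best[i] + atom_cost)
        # Option 2: encode a whole dictionary skill starting at position i.
        for s in skills:
            j = i + len(s)
            if j <= n and tuple(traj[i:j]) == s:
                best[j] = min(best[j], best[i] + skill_cost)
    return best[n]

def shaped_reward(success, traj, skills, lam=0.1):
    """Task success minus a scaled segmentation cost (hypothetical shaping)."""
    return success - lam * segmentation_cost(traj, skills)

SKILLS = {("goto", "open", "take")}  # hypothetical reusable skill

clean = ["goto", "open", "take", "goto", "open", "take"]
messy = ["goto", "look", "look", "open", "take", "look"]

print(segmentation_cost(clean, SKILLS), segmentation_cost(messy, SKILLS))
print(shaped_reward(1.0, clean, SKILLS), shaped_reward(1.0, messy, SKILLS))
```

Both trajectories have six steps, so a plain length penalty would treat them identically; under the dictionary, the trajectory built from reusable skills is cheaper to encode than the one padded with idiosyncratic steps.
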

ReuseRL distinguishes reusable compression from raw brevity. Each colored box is one ALFWorld atomic skill. Vanilla GRPO optimizes task success alone and can produce long trajectories with repeated or wasted steps. A pure round-length penalty is a degenerate singleton-only code that penalizes all steps uniformly, including the necessary search, so the agent under-explores and never reaches the target. ReuseRL learns a multi-skill dictionary from successful trajectories and uses the segmentation cost under this dictionary as a trajectory-level penalty, keeping reusable subroutines cheap while leaving idiosyncratic waste expensive.
cs.AI · arxiv:2605.31404v1 · Lead article

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen

This paper introduces a dual-interventional framework to characterize the linguistic inductive bias of Large Language Models (LLMs) in spatial reasoning for navigation planning. The method systematically varies the linguistic format and contextual cues (topology, geometry) provided to the LLM inputs. This allows the authors to precisely identify how different linguistic structures and feature combinations either support or inhibit the LLM's ability to perform effective navigation planning.

Unified framework overview. Representation intervention manipulates linguistic organization and compression under information-equivalent settings through Flat, Hierarchical, and Clustered formats. Context intervention controls the accessibility and consistency of Topology, Geometry, Semantics, and History cues through dominance and conflict probing. The framework reveals three characteristic inductive-bias signatures in LLM-based navigation reasoning: the Sword (structure-dependent representation effects), the Shield (topological robustness), and the Achilles’ Heel (semantic vulnerability).
cs.AI · arxiv:2605.31308v1 · Lead article

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Junjie Nian, Kang Chen, Ge Zhang, Yixin Cao, Yugang Jiang

TraceGraph is a graph-based framework that transforms pooled agent trajectories into shared decision landscapes by mapping action-observation states before model identity is known. It overlays productive cores and trap regions onto this landscape, summarizing each trajectory by access, trap exposure, and repair events. This method reveals nuanced navigation differences hidden by aggregate scores and facilitates the development of trap-aware recovery pipelines for agents.

cs.AI · arxiv:2606.06303v1 · Lead article

Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction

Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

This paper introduces Gradient-Informed Logit Correction (GILC), a plug-and-play framework for controllable generation in discrete diffusion models. GILC efficiently estimates guidance signals by using the pretrained denoising network as a proxy, employing a Jacobian-free mechanism to stably correct clean prediction logits. This approach achieves state-of-the-art performance across various sequence generation tasks without requiring any additional model training.

Illustration of the guided discrete diffusion process by GILC. The reverse sampling process (top) iteratively denoises a DNA sequence from the fully masked state ($t=1$) to the clean data ($t=0$). The core correction mechanism (bottom) operates at each step: the mask predictor outputs the clean prediction $\mathbf{x}_{\theta}$, which is then modified by the reward gradient, $r(\cdot)$, yielding the guided prediction $\mathbf{x}_{\theta}^{r}$. The next state $\mathbf{z}_{s}$ is sampled from the transition distribution $p^{r}_{\theta}(\mathbf{z}_{s} \mid \mathbf{z}_{t}) = q(\mathbf{z}_{s} \mid \mathbf{z}_{t}, \mathbf{x}_{\theta}^{r})$.
cs.AI · arxiv:2606.06333v1 · Lead article

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Seyed Arshan Dalili, Mehrdad Mahdavi

This paper introduces **Subspace-Aware Sparse Autoencoders (SASA)** to address a limitation of standard sparse autoencoders (SAEs), which incorrectly assume latent features are one-dimensional. The authors demonstrate that this assumption forces features with intrinsic dimension $d_i \ge 2$ to split across multiple dictionary atoms, leading to ineffective interpretability. Their core contribution is a revised SAE formulation that explicitly accounts for the multi-dimensional structure of model features, aiming to recover coherent, multi-dimensional features directly.

Standard SAEs split a multi-dimensional feature across many near-collinear atoms, while SASA captures it as a single subspace. We embed three ground-truth concept manifolds, namely a circle ($d_i = 2$), a sphere $\mathbb{S}^{2}$ ($d_i = 3$), and a helix ($d_i = 3$), into an ambient space of dimension $d = 64$ (with $5\%$ noise) and fit six dictionaries of width $256$. First column: each manifold colored by its underlying concept value. Next five columns: standard vector-based SAEs (ReLU, TopK, BatchTopK, JumpReLU, Gated), in which every latent is tied to a single decoder direction. Each point is colored by the decoder atom most aligned with it. Under the vector-based assumption, the feature is not captured by one direction but is instead distributed across tens to hundreds of near-duplicate atoms, each explaining only a local slice of the manifold. Hence, interpreting the feature requires aggregating a whole cluster of latents rather than inspecting a single unit. Last column: SASA, which learns a decoder subspace as the unit of representation. With the same total width, a single active group of effective rank $d_i$ (one latent) covers the entire feature, recovering its intrinsic geometry rather than fragmenting it. For more elaboration and analysis
cs.AI · arxiv:2606.06240v1 · Lead article

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

Ziming Wang

This paper introduces **TOKI**, a bitemporal operator algebra designed to explicitly manage and resolve contradictions arising from versioned writes in LLM agent persistent memory. TOKI formalizes four common resolution heuristics as distinct bitemporal operators, each defined with an explicit isolation precondition and a provenance annotation that preserves conflicting facts in an audit row. This provides a sound, contract-based framework for write-time concurrency control, ensuring transparency regarding admitted anomalies.
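
To give a feel for what "a current row beside an audit row" might look like, the sketch below stores facts with two timestamps (valid time and transaction time) and resolves a contradiction with one stand-in rule. All of it is assumed for illustration: the `Row` and `Memory` classes, the "newest valid time wins" rule, and the provenance string format are hypothetical, the source does not name TOKI's four operators, and the isolation preconditions are not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Row:
    subject: str
    predicate: str
    value: str
    valid_from: int                      # when the fact holds in the world
    tx_time: int                         # when the memory recorded it
    superseded_by: Optional[str] = None  # provenance note, set on audit rows

@dataclass
class Memory:
    current: dict = field(default_factory=dict)  # (subject, predicate) -> Row
    audit: list = field(default_factory=list)    # losing rows, kept for audit

    def write(self, row):
        key = (row.subject, row.predicate)
        old = self.current.get(key)
        if old is None or old.value == row.value:
            self.current[key] = row  # no contradiction on this key
            return
        # Contradiction: stand-in rule, the row with the newer valid time wins.
        winner, loser = (row, old) if row.valid_from >= old.valid_from else (old, row)
        loser.superseded_by = f"{winner.value}@tx{winner.tx_time}"
        self.audit.append(loser)     # the conflicting fact is preserved
        self.current[key] = winner

mem = Memory()
mem.write(Row("alice", "lives_in", "Paris", valid_from=1, tx_time=1))
mem.write(Row("alice", "lives_in", "Berlin", valid_from=5, tx_time=2))
# A late-arriving write about an earlier period does not displace Berlin.
mem.write(Row("alice", "lives_in", "Rome", valid_from=3, tx_time=3))

print(mem.current[("alice", "lives_in")].value)
print([(r.value, r.superseded_by) for r in mem.audit])
```

Keeping both timestamps is what lets the late-arriving write be ordered by when the fact was true rather than by when it was recorded, and the audit list ensures no conflicting fact is silently dropped.
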

Figure 1. Contradiction resolution as write-time concurrency control. A bitemporal substrate detects a contradicting pair on a subject-predicate key; an isolation gate routes it to one of four typed operators, each pinned to the isolation level that excludes the anomaly a weaker level admits; every operator commits a current row beside an audit row under one schema. The soundness theorems close the isolation, schema, and provenance axes.
cs.AI arxiv:2606.06285v1 Lead article

TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

Ziwen Kan, Yishuo Chen, Kecheng Li, Andrew Wen, Xiaomeng Wang

TRACE introduces a conditional estimation paradigm for multimodal time series foundation models that addresses temporal misalignment and missing data. It infers incomplete target modalities from the available auxiliary modalities rather than relying on naive imputation. The approach yields more robust and better-aligned temporal representations across diverse multimodal benchmarks.

Illustration of a multimodal time series setting with a 30% missing rate, which is common in clinical data. We compare sequence-level representations obtained from imputed inputs against ground truth (GT) representations derived from fully observed sequences. Our paradigm, which treats missing modality inputs as temporal variables to be conditionally estimated from available modalities, outperforms prior value interpolation: under the same missingness pattern, it yields internal representations closer to the oracle, as measured by cosine similarity. See Appendix A.3 for dataset construction and evaluation details and Appendix E.2 for additional results.
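The comparison described in this caption can be mimicked on a toy signal. The sketch below is not TRACE: it assumes a target modality that is an exact linear function of one auxiliary modality, uses a least-squares fit as a stand-in for conditional estimation, and treats the raw sequences as if they were the representations being compared. All variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


n = 100
aux = [math.sin(1.7 * t) for t in range(n)]             # fully observed auxiliary modality
target = [2.0 * a + 1.0 for a in aux]                   # ground-truth target modality
missing = {t for t in range(n) if t % 10 in (2, 5, 8)}  # 30% of the steps are missing

# Baseline: linear interpolation between the two observed neighbours.
interpolated = list(target)
for t in missing:
    interpolated[t] = 0.5 * (target[t - 1] + target[t + 1])

# Conditional estimate: fit target ~ slope * aux + intercept on the observed steps only.
observed = [t for t in range(n) if t not in missing]
mean_aux = sum(aux[t] for t in observed) / len(observed)
mean_tgt = sum(target[t] for t in observed) / len(observed)
slope = (
    sum((aux[t] - mean_aux) * (target[t] - mean_tgt) for t in observed)
    / sum((aux[t] - mean_aux) ** 2 for t in observed)
)
intercept = mean_tgt - slope * mean_aux
conditional = [
    slope * aux[t] + intercept if t in missing else target[t] for t in range(n)
]

sim_interp = cosine_similarity(interpolated, target)
sim_cond = cosine_similarity(conditional, target)
print(f"interpolation: {sim_interp:.3f}, conditional estimate: {sim_cond:.3f}")
```

Because the auxiliary signal oscillates quickly, averaging neighbouring values misses the true target at the gaps, while the estimate conditioned on the auxiliary modality recovers it. Real multimodal data will not have such a clean relationship, so the gap here is illustrative only.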
cs.AI arxiv:2606.06416v1 Lead article

Unsupervised Skill Discovery for Agentic Data Analysis

Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang

This paper introduces **DataCOPE**, an unsupervised framework for discovering reusable data-analysis skills for agents without labeled supervision. It iteratively coordinates an agent, an unsupervised verifier, and a skill manager: the agent generates exploration trajectories, and skills are distilled using quality signals derived directly from those trajectories. DataCOPE's core contribution is skill discovery purely from unlabeled exploration data, which the authors demonstrate for report-style analysis using an Adaptive Checklist Verifier.

Supervised skill discovery requires costly data annotation. DataCOPE instead performs unsupervised skill discovery by deriving task-adaptive verifier signals from unlabeled exploration trajectories and distilling them into reusable skills.
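The agent, verifier, and skill-manager loop can be pictured roughly as follows. This is a loose sketch, not DataCOPE's implementation: the fixed keyword checklist stands in for the paper's Adaptive Checklist Verifier (which is task-adaptive), and `checklist_score`, `discover_skills`, `toy_agent`, and the trajectory dictionary layout are all hypothetical.

```python
def checklist_score(report, checklist):
    """Label-free quality signal: fraction of checklist items mentioned in the report."""
    hits = sum(1 for item in checklist if item.lower() in report.lower())
    return hits / len(checklist)


def discover_skills(run_agent, tasks, checklist, rounds=2, threshold=0.6):
    """Iterate exploration, verification, and distillation without any labels."""
    skills = []
    for _ in range(rounds):
        for task in tasks:
            trajectory = run_agent(task, skills)  # agent explores with current skills
            score = checklist_score(trajectory["report"], checklist)  # verifier signal
            if score >= threshold and trajectory["steps"] not in skills:
                skills.append(trajectory["steps"])  # skill manager keeps reusable steps
    return skills


def toy_agent(task, skills):
    """Stand-in agent that always produces the same style of trajectory."""
    return {
        "steps": f"profile columns then plot {task}",
        "report": f"Summary of {task}: missing values checked, outliers flagged.",
    }


skills = discover_skills(toy_agent, ["sales"], ["missing values", "outliers", "trend"])
print(skills)
```

The toy report satisfies two of the three checklist items, which clears the assumed threshold, so its steps are stored once and reused rather than duplicated in the second round.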
cs.AI arxiv:2606.06460v1 Lead article

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

Thamilvendhan Munirathinam

This paper introduces the **Recuse Signal**, a lightweight, in-band mechanism (comparable to an SSH banner) that lets a server ask an autonomous LLM agent to voluntarily withdraw from accessing a resource. The core contribution is an empirical measurement of whether current LLM agents comply with this non-security-critical governance signal, which is analogous to a `robots.txt` for live infrastructure access. The authors implement adapters for SSH and PostgreSQL to test compliance in a real-world setting.

Recusal rate on the live SSH deny signal. With the signal present and no authorization framing, all subjects recuse 100%; in the no-signal control all complete the task (0% recusal). Adding an explicit authorization framing collapses GPT-4o’s recusal to 20% while GPT-4o-mini and Claude Code hold at 100%—the signal is cooperative and its weight is model-dependent.
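On the agent side, honouring such a signal amounts to checking the banner before doing any work. The sketch below is an assumption-laden illustration: the digest does not give the signal's wire format, so the `X-Agent-Recuse: deny` marker is hypothetical, as are the `should_recuse` and `run_task` helpers.

```python
# Hypothetical marker; the actual Recuse Signal format is not given in this digest.
DENY_MARKER = "X-Agent-Recuse: deny"


def should_recuse(banner: str, marker: str = DENY_MARKER) -> bool:
    """Return True if any banner line matches the deny marker (case-insensitive)."""
    return any(line.strip().lower() == marker.lower() for line in banner.splitlines())


def run_task(banner: str, task):
    """Run the task only when the server has not asked the agent to withdraw."""
    if should_recuse(banner):
        return "recused"
    return task()


with_signal = run_task("Welcome to prod-db\nX-Agent-Recuse: deny\n", lambda: "done")
without_signal = run_task("Welcome to prod-db\n", lambda: "done")
print(with_signal, without_signal)
```

A check like this is cooperative by design: nothing stops an agent from ignoring the banner, which is why the paper measures compliance instead of assuming it.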
§ III

Daily Issues This Week

2026-06-01 to 2026-06-07 7