Monthly Issue
Collected dispatches

2026-05

2026-04-01 to 2026-04-30
100 papers
30 daily issues
A monthly ledger of recurring themes, selected papers, and daily issues. 3 sections
§ I

The Month in Review

Editorial summary

Monthly Research Trends (Past 30 Days)

The past month shows an intense, high-stakes focus on Agentic AI Reliability and Governance, shifting research from mere capability demonstrations to robust, efficient, and safe operational deployment. Research is rapidly diversifying to tackle the unique complexities introduced by open-ended, multi-step AI systems.

Key Shifts in Research Direction Popularity

1. From Capability to Control & Safety (The Governance Rise): There is a marked transition toward securing and governing operational agents. Papers like AgentWard (lifecycle security), Layerwise Convergence Fingerprinting (LCF) (runtime monitoring), and Governing What You Cannot Observe (adaptive runtime governance via viability theory) highlight that securing agents against novel threats (backdoors, exploitation, unpredictable behavior) is now paramount. 2. Memory and Long-Horizon Structuring: Efficiency and fidelity in long-term reasoning are critical. StructMem (structured hierarchical memory) and Kwai Summary Attention (KSA) (fixed-size KV cache compression) directly address the context length and memory overhead issues that cripple sophisticated, iterative agents. 3. Efficiency in Agentic Workflows: The "Tools Tax" and computational cost of complex agent loops are under direct attack. Tool Attention drastically cuts context overhead by dynamically gating tool schemas, while QuantClaw uses precision routing to reduce the cost of large autonomous agents like OpenClaw.

Notable Groups and Labs (Inferred Focus)

The research suggests activity from groups pushing both the theoretical and engineering boundaries of agent deployment:

• Agent Autonomy & Reasoning: Significant work (e.g., AEL, Agentic World Modeling, Beyond the Attention Stability Boundary) focuses on refining the cognitive loop—how agents learn from past experience (AEL) and maintain stable, goal-directed planning (SSRP). • Alignment & Human Interaction: Several papers challenge the assumptions underpinning current alignment work. Alignment has a Fantasia Problem explicitly calls for cognitive support integration, while work on Measuring Opinion Bias suggests a drive toward uncovering the true internal states of LLMs, not just their guided external presentation. • Security & Reproducibility: A strong cohort of papers focuses on hardening against new attack vectors and ensuring consistency. Transient Turn Injection (TTI) and Stealthy Backdoor Attacks (BadStyle) reflect a proactive stance against evolving multi-turn vulnerabilities, complementing efforts like Introducing Background Temperature to quantify hidden non-determinism.

Trends to Watch Next Month

1. The Rise of "Talent" Orchestration: The concept of flexible, dynamic organization for heterogeneous agents, as seen in OneManCompany (OMC), suggests the next phase of multi-agent research will move beyond fixed team structures to dynamic organization governed by internal "Talent Markets." 2. Formal Verification Integration: The coupling of LLMs with formal verification tools, exemplified by From Natural Language to Verified Code (Dafny), will likely escalate. As agents move toward mission-critical tasks (like scientific automation (From Research Question to Scientific Workflow)), the demand for provable correctness beyond empirical testing will increase. 3. Systematic Agent Benchmarking: The focus on creating rigorous, realistic evaluation platforms will continue. AgentSearchBench and Superminds Test indicate a trend away from synthetic, isolated tasks toward evaluating agent societies in complex, "in the wild" settings. Expect more benchmarks that test coordination, societal failure modes, and economic efficiency (token burn, as seen in How Do AI Agents Spend Your Money?).

§ II

Top Papers

Selected research 100
cs.AIarxiv:2604.21725v1Lead article

AEL: Agent Evolving Learning for Open-Ended Environments

Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu

he paper introduces Agent Evolving Learning (AEL), a two-timescale framework designed to enable LLM agents to effectively utilize past experience in open-ended environments. AEL employs fast-timescale Thompson Sampling to select the optimal memory retrieval policy for each episode, while a slow-timescale LLM reflection process diagnoses failures and injects causal insights into the agent's prompt. This method significantly improves performance on sequential tasks by providing a structured way to interpret and apply prior knowledge.

cs.AIarxiv:2604.21827v1Lead article

Alignment has a Fantasia Problem

Nathanael Jo, Zoe De Simone, Mitchell Gordon, Ashia Wilson

he paper identifies "Fantasia interactions" as a core problem where AI treats incomplete user prompts as final intent, leading to misaligned assistance because users often lack fully formed goals. The contribution is arguing that alignment research must shift from treating users as rational oracles to actively providing cognitive support that helps users form and refine their intent over time. This requires integrating machine learning with interface design and behavioral science.

Diagram describing a Fantasia interaction, including behavioral sources and failure modes.
Diagram describing a Fantasia interaction, including behavioral sources and failure modes.
cs.AIarxiv:2604.21910v1Lead article

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, Michal Kuszewski

his paper introduces an agentic AI architecture to automate the translation of natural language research questions into executable scientific workflows. It achieves this by separating the process into three layers: an LLM for intent extraction, deterministic generators for creating workflow DAGs, and expert-authored "Skills" to encode domain knowledge and constraints. The core contribution is confining LLM non-determinism to the initial intent stage, ensuring that identical intents always produce identical, reproducible workflows.

Component architecture. The Conductor orchestrates three specialized agents. The Workflow Composer (semantic layer) consults domain Skills (knowledge layer) to produce workflow plans that include data preparation commands. The Deployment Service and Execution Sentinel (deterministic layer) execute these plans on the Kubernetes infrastructure running the HyperFlow engine.
Component architecture. The Conductor orchestrates three specialized agents. The Workflow Composer (semantic layer) consults domain Skills (knowledge layer) to produce workflow plans that include data preparation commands. The Deployment Service and Execution Sentinel (determinis…
cs.AIarxiv:2604.21794v1Lead article

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang

his paper introduces **DiffMAS**, a novel training framework that enables the **end-to-end, joint optimization of latent inter-agent communication** alongside multi-agent reasoning. It treats the internal, non-textual communication (like key-value caches) as a learnable component, optimizing how information is encoded and interpreted across agent interactions using parameter-efficient supervised training. This approach consistently improves reasoning accuracy and stability compared to standard single-agent inference across various complex tasks.

In Stage I, agents 1 to K–1 sequentially construct a shared KV trace by prefilling the existing cache and appending newly generated KV segments without gradient updates. The accumulated KV trace serves as a latent communication medium across agents. In Stage II, the final agent performs autoregressive decoding on the prefilled KV cache. Cross-attention over the KV trace produces hidden states, which are projected through the LM head to generate tokens. Supervised fine-tuning is applied using cross-entropy loss, and gradients are backpropagated to update only the LoRA parameters of the final agent while keeping the backbone model frozen.
In Stage I, agents 1 to K–1 sequentially construct a shared KV trace by prefilling the existing cache and appending newly generated KV segments without gradient updates. The accumulated KV trace serves as a latent communication medium across agents. In Stage II, the final agent p…
cs.AIarxiv:2604.21896v1Lead article

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Chee Wei Tan, Yuchen Wang, Shangxin Guo

his paper introduces **Nemobot Games**, an interactive engineering environment that operationalizes Shannon's game taxonomy using Large Language Models (LLMs) to create strategic AI agents. The core method involves leveraging the LLM's reasoning and synthesis capabilities to generate optimal or heuristic strategies tailored to four distinct classes of games (dictionary, solvable, heuristic, and learning-based). The contribution is a novel paradigm for building customizable, explainable, and adaptive AI game agents powered by LLMs.

Crowdsourcing and strategy optimization in game-playing AI. The LLM generates optimized strategies for game-playing agents, while game states and results from interactions with human players are fed back to train the LLM, creating a self-reinforcing cycle of improvement through crowdsourcing.
Crowdsourcing and strategy optimization in game-playing AI. The LLM generates optimized strategies for game-playing agents, while game states and results from interactions with human players are fed back to train the LLM, creating a self-reinforcing cycle of improvement through c…
cs.AIarxiv:2604.21611v1Lead article

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Hao-Yuan Chen

his paper introduces Verbal Process Supervision (VPS), a training-free method that uses structured natural-language critique from a stronger model to iteratively guide an LLM's reasoning process. VPS establishes a new axis for inference-time scaling by focusing on the granularity of external verbal supervision. This approach significantly improves reasoning performance across complex benchmarks like GPQA Diamond and AIME 2025, often surpassing existing state-of-the-art methods like Reflexion.

Three-way matched-compute baseline comparison across all three benchmarks. Each group shows Actor standalone, Self-Consistency @ 5 (SC@5), Reflexion (outcome-level verbal critique), and VPS (step-level, ours) on the frontier same-family pair per benchmark. Annotated deltas confirm VPS > > SC@5 > > Reflexion on GPQA Diamond ( + 5.0 +5.0 pp and + 8.5 +8.5 pp) and LiveCodeBench V6 ( + 8.3 +8.3 pp and + 12.1 +12.1 pp); on AIME 2025, VPS > > Reflexion by + 10.0 +10.0 pp and exceeds SC@5 narrowly by + 1.1 +1.1 pp (within seed variance). The consistent VPS > > Reflexion gap across all three benchmarks isolates critique granularity as the operative variable.
Three-way matched-compute baseline comparison across all three benchmarks. Each group shows Actor standalone, Self-Consistency @ 5 (SC@5), Reflexion (outcome-level verbal critique), and VPS (step-level, ours) on the frontier same-family pair per benchmark. Annotated deltas confir…
cs.AIarxiv:2604.21700v1Lead article

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang

his paper introduces **BadStyle**, a novel backdoor attack framework against LLMs that utilizes **natural style-level triggers** instead of explicit patterns. The core method involves using an LLM to generate stealthy poisoned samples with these style triggers while maintaining semantic fluency. BadStyle's contribution is a complete pipeline that stabilizes payload injection using an auxiliary target loss, addressing the shortcomings of previous, less natural backdoor attacks.

The complete framework and attack flow of BadStyle . This illustrates a clear supply-chain-based backdoor attack, where the attacker is the model provider, with the complete attack process comprising two main phases.
The complete framework and attack flow of BadStyle . This illustrates a clear supply-chain-based backdoor attack, where the attacker is the model provider, with the complete attack process comprising two main phases.
cs.AIarxiv:2604.21748v1Lead article

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao

tructMem introduces a structure-enriched hierarchical memory framework for LLMs designed to capture event relationships essential for long-horizon reasoning. It achieves this by temporally anchoring dual perspectives and performing semantic consolidation, which preserves event bindings and induces cross-event connections. This method significantly improves temporal reasoning and multi-hop QA performance while substantially reducing computational overhead compared to existing flat or graph-based memory systems.

Three paradigms of Memory systems.
Three paradigms of Memory systems.
cs.AIarxiv:2604.21816v1Lead article

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Anuj Sadani, Deepak Kumar

his paper introduces **Tool Attention**, a middleware mechanism that replaces the costly, eager schema injection of the Model Context Protocol (MCP) with a dynamic, gated attention system over available tools. It uses an Intent Schema Overlap (ISO) score and state-aware gating to select only necessary tool schemas, significantly reducing the per-turn context overhead (the "Tools Tax") and mitigating context-length-related performance degradation in agentic workflows.

cs.AIarxiv:2604.21860v1Lead article

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Naheed Rayhan, Sohely Jahan

he paper introduces **Transient Turn Injection (TTI)**, a novel multi-turn attack that exploits LLM vulnerabilities by distributing adversarial intent across isolated interactions, bypassing stateless moderation. TTI utilizes automated LLM agents to iteratively probe and evade policy enforcement, unlike traditional context-dependent jailbreaks. This method effectively exposes significant variations in the robustness of state-of-the-art commercial and open-source models.

TTI Prompt Evaluation Example.
TTI Prompt Evaluation Example.
cs.LGarxiv:2604.21905v1Lead article

Low-Rank Adaptation Redux for Large Models

Bingcong Li, Yilang Zhang, Georgios B. Giannakis

his paper re-examines Low-Rank Adaptation (LoRA) by framing it through the lens of signal processing (SP) and classical low-rank modeling. The core contribution is providing a principled, theoretical understanding of the mechanisms behind LoRA and its variants, rather than just empirical comparison. This SP perspective aims to guide future, principled advancements in parameter-efficient fine-tuning based on architectural design and efficiency.

LoRA fine-tuning of a GPT-3 model. Grey and orange boxes are respectively frozen (snowflake icon) weights of linear layers, and trainable (fire icon) LoRA weights.
LoRA fine-tuning of a GPT-3 model. Grey and orange boxes are respectively frozen (snowflake icon) weights of linear layers, and trainable (fire icon) LoRA weights.
cs.CLarxiv:2604.21590v1Lead article

AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan

his paper introduces **AgenticQwen**, a family of small language models optimized for industrial-scale tool use and multi-step reasoning. The core method involves training these models using a novel framework combining reasoning and agentic Reinforcement Learning (RL) powered by **dual data flywheels**. These flywheels automatically generate increasingly complex tasks—one focusing on error-based difficulty scaling and the other on expanding simple workflows into complex decision trees—enabling strong performance in real-world agentic systems.

Overview of our dual data flywheels. The reasoning data flywheel generates increasingly challenging, verifiable problems from model failures, while the agentic data flywheel expands linear workflows into multi-branch behavior trees and generates new training data.
Overview of our dual data flywheels. The reasoning data flywheel generates increasingly challenging, verifiable problems from model failures, while the agentic data flywheel expands linear workflows into multi-branch behavior trees and generates new training data.
cs.CLarxiv:2604.21564v1Lead article

Measuring Opinion Bias and Sycophancy via LLM-based Coercion

Rodrigo Nogueira, Giovana Kerche Bonás, Thales Sales Almeida, Andrea Roque, Ramon Pires

his paper introduces **llm-bias-bench**, an open-source method to uncover the true opinions of Large Language Models (LLMs) on contested topics, overcoming their evasive disclaimers. The method uses two complementary, multi-turn, free-form probing strategies: **Direct Probing** (escalating pressure) and **Indirect Probing** (never directly asking for an opinion). This approach aims to reveal the model's underlying stance as it might manifest in realistic user interactions.

cs.CLarxiv:2604.21882v1Lead article

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita

his paper introduces **RedirectQA**, a novel dataset that uses Wikipedia redirects to associate factual triples with multiple, categorized surface forms (aliases, variants, errors) for each entity. The core method analyzes how LLMs' factual recall changes when only the entity's surface form is altered, revealing that memorization access is highly **surface-conditioned**. The contribution is demonstrating that LLM factual consistency is significantly dependent on the specific name used, with models being less robust to major lexical variations like aliases than to minor spelling changes.

Overview of the RedirectQA construction process: (1) Factual triples are collected from Wikidata. (2) Each subject entity is associated with canonical and redirect surface forms, together with redirect categories, using Wikipedia redirects. (3) Question realizations are generated from surface instances using relation-specific question templates.
Overview of the RedirectQA construction process: (1) Factual triples are collected from Wikidata. (2) Each subject entity is associated with canonical and redirect surface forms, together with redirect categories, using Wikipedia redirects. (3) Question realizations are generated…
cs.AIarxiv:2604.22748v1Lead article

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang

his paper introduces the "Agentic World Modeling" framework, a taxonomy organized by capability levels (Predictor, Simulator, Evolver) and governing law regimes (physical, digital, social, scientific). The core contribution is providing a structured way to understand and evaluate the necessary predictive environment models that enable AI agents to achieve complex, sustained goals across diverse domains.

cs.AIarxiv:2604.22436v1Lead article

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz

gentSearchBench is a large-scale benchmark designed to evaluate AI agent search methods in realistic, "in the wild" scenarios, addressing the limitations of existing benchmarks that assume well-specified agents. It formalizes agent search as retrieval and reranking tasks using nearly 10,000 real-world agents, evaluating relevance based on execution-grounded performance signals rather than just textual descriptions. The contribution is providing a more challenging and realistic evaluation platform that highlights the gap between semantic similarity and actual agent capability.

Task and Relevance Label Generation Pipeline of AgentSearchBench.
Task and Relevance Label Generation Pipeline of AgentSearchBench.
cs.AIarxiv:2604.22411v1Lead article

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Alberto Messina, Stefano Scotta

his paper introduces the concept of **background temperature ($T_{\mathrm{bg}}$)** to quantify the inherent, implementation-dependent randomness observed in Large Language Models (LLMs) even when the nominal decoding temperature is set to zero. $T_{\mathrm{bg}}$ formalizes the effective temperature induced by environmental perturbations (like hardware or software variations) and proposes an empirical protocol to estimate this value. The contribution lies in providing a theoretical framework and measurement method for understanding and characterizing this hidden nondeterminism, which impacts LLM reproducibility.

Measuring protocol.
Measuring protocol.
cs.AIarxiv:2604.22565v1Lead article

Learning Evidence Highlighting for Frozen LLMs

Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei

his paper introduces **HiLight**, a framework that trains a lightweight **Emphasis Actor** to insert minimal highlight tags around crucial evidence within the original, unaltered context. This approach decouples evidence selection from reasoning, allowing a **frozen LLM Solver** to utilize the emphasized input for improved performance. The Actor is optimized via **weakly supervised reinforcement learning** using only the Solver's final task reward, requiring no evidence labels or modification of the LLM.

Overview of the HiLight framework. HiLight decouples evidence selection from reasoning for long, noisy contexts. Inference: Given a query Q Q and context X X , a lightweight Emphasis Actor selects pivotal spans under a highlight budget \( \gamma \) and inserts minimal highlight tags to form an emphasized context X ^ \( \hat{X} \) . A frozen Solver LLM then produces the final output. Training: Because explicit evidence annotations are unavailable, we optimize the Actor via weakly supervised RL using only the Solver’s task reward R ​ ( y , y ∗ ) R(y,y^{*}) , without accessing Solver gradients or intermediate activations.
Overview of the HiLight framework. HiLight decouples evidence selection from reasoning for long, noisy contexts. Inference: Given a query Q Q and context X X , a lightweight Emphasis Actor selects pivotal spans under a highlight budget \( \gamma \) and inserts minimal highlight t…
cs.AIarxiv:2604.22577v1Lead article

QuantClaw: Precision Where It Matters for OpenClaw

Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong

uantClaw addresses the high cost of large autonomous agents like OpenClaw by dynamically adjusting numerical precision based on task requirements. It analyzes quantization sensitivity across workflows and proposes a plug-and-play routing plugin that assigns lower precision to lightweight tasks and preserves higher precision for demanding ones. This method significantly reduces latency and cost while maintaining or improving overall task performance.

Scaling behavior of quantization degradation under NVFP4. Left : Absolute performance gap vs. model size on a linear scale, showing diminishing degradation as model parameters increase. Right : Log-log plot reveals a power-law relationship, confirming systematic scaling. Larger models demonstrate enhanced robustness to low-precision quantization, with reduced sensitivity compared to smaller counterparts.
Scaling behavior of quantization degradation under NVFP4. Left : Absolute performance gap vs. model size on a linear scale, showing diminishing degradation as model parameters increase. Right : Log-log plot reveals a power-law relationship, confirming systematic scaling. Larger m…
cs.AIarxiv:2604.22597v1Lead article

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach

his paper introduces a robust LLM-as-a-Judge framework to evaluate mathematical reasoning, moving beyond the limitations of rigid symbolic comparison. The core method uses a large language model to assess the correctness of generated answers, accommodating diverse mathematical representations and solution formats. This approach demonstrates clear improvements over traditional symbolic verification methods, addressing their failure cases in popular evaluation frameworks.

Our LLM evaluation approach provides a more robust evaluation compared to traditional symbolic evaluation methods by handling diverse mathematical representations and answer formats. These examples demonstrate the contribution of our approach by correctly evaluating these model predictions for mathematical questions, while the existing numerical comparison approach fails.
Our LLM evaluation approach provides a more robust evaluation compared to traditional symbolic evaluation methods by handling diverse mathematical representations and answer formats. These examples demonstrate the contribution of our approach by correctly evaluating these model p…
cs.AIarxiv:2604.22558v1Lead article

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan

OLAR-RL addresses the challenge of training GUI agents using MLLMs by bridging the gap between static Offline RL and costly Online RL. The core method integrates global trajectory semantics into offline learning by reconstructing rollouts, identifying the first failure point, and retroactively assigning dense, long-horizon assignment rewards. This approach leverages static data more effectively to improve long-term task execution quality without excessive online interaction.

Comparison of RL paradigms for GUI agents. (Top Left) Standard Offline RL is limited by fragmented step-level data, leading to temporal myopia and loss of global context. (Top Right) Online RL captures dynamics but suffers from instability and prohibitive interaction costs. (Bottom) Our SOLAR-RL bridges this gap by retrofitting global trajectory insights into offline data. It utilizes trajectory reconstruction and retroactive credit assignment via failure-point detection, combined with target-aligned reward shaping, to simulate pseudo-online feedback, ensuring stable long-horizon optimization.
Comparison of RL paradigms for GUI agents. (Top Left) Standard Offline RL is limited by fragmented step-level data, leading to temporal myopia and loss of global context. (Top Right) Online RL captures dynamics but suffers from instability and prohibitive interaction costs. (Bott…
cs.AIarxiv:2604.22452v1Lead article

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li

his paper introduces the **Superminds Test**, a hierarchical framework using controlled **Probing Agents** to empirically evaluate the emergence of collective intelligence in large-scale agent societies, specifically using the MoltBook platform. The core contribution is demonstrating a **stark absence of collective intelligence** in these societies, as they fail to surpass individual frontier models on complex tasks and struggle with basic coordination.

A framework of using a probing agent to evaluate collective intelligence in an agent society . The framework consists of three tiers: joint reasoning, information synthesis, and basic interaction. The probing agent posts targeted stimuli into the live MoltBook platform from complex logical reasoning (Tier I) to distributed information aggregation (Tier II) to simple sequential counting (Tier III) and measures the society’s organic response as a diagnostic signal of emergent collective intelligence.
A framework of using a probing agent to evaluate collective intelligence in an agent society . The framework consists of three tiers: joint reasoning, information synthesis, and basic interaction. The probing agent posts targeted stimuli into the live MoltBook platform from compl…
cs.AIarxiv:2604.22273v1Lead article

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

Aofan Liu, Jingxiang Meng

his paper models LLM self-correction as a control-theoretic feedback loop using a two-state Markov process to diagnose when iteration is beneficial. The core contribution is identifying a critical threshold (near-zero Error Introduction Rate, EIR $\le 0.5\%$) that separates helpful from harmful self-correction across various models and datasets. Furthermore, they show that prompt engineering alone can causally adjust EIR to remain below this threshold, thereby preventing performance degradation.

Three-layer view of iterative self-correction as a Markov feedback loop. The theoretical layer formalises correctness evolution on { C , I } \{C,I\} with EIR/ECR transitions, yielding equilibrium, steady-state, and convergence expressions. The control layer interprets EIR as a stability margin and verify-first prompting as controller design; ASC adds instance-level confidence γ ​ ( k ) ≥ \( \gamma \)(k)\!\( \geq \)\!\( \tau \) with batch-level EIR ^ / ECR ^ \( \widehat \){\( \text{EIR} \)}/\( \widehat \){\( \text{ECR} \)} monitoring for early stopping. The empirical layer evaluates 7 models × \( \times \) 3 datasets, confirming near-zero EIR ( ≲ 0.5 % \( \lesssim \) 0.5\% ) as the threshold separating beneficial from harmful self-correction.
Three-layer view of iterative self-correction as a Markov feedback loop. The theoretical layer formalises correctness evolution on { C , I } \{C,I\} with EIR/ECR transitions, yielding equilibrium, steady-state, and convergence expressions. The control layer interprets EIR as a st…
cs.LGarxiv:2604.22271v1Lead article

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw

his paper investigates how LLMs detect and correct their own errors by examining the role of internal confidence signals, specifically the "post-answer newline" (PANL) token representation. Drawing on second-order decision models, the authors hypothesize that this PANL signal, which is partially independent of the primary response generation, serves as an evaluative mechanism enabling error detection and subsequent self-correction.

Left panel : Verification and self-correction prompt structure (see § A.2 for full details). The model’s answer and verbal confidence were generated in a separate prior phase. In the verification phase, the model is shown its own answer to a TriviaQA (or MNLI) question and asked to judge whether it is correct (Y/N), followed by a self-correction prompt. Residual stream activations were extracted during the verification phase at the post-answer newline token (PANL, indicated by arrow)—the first token after the model’s answer following (Kumaran et al., 2026 ) . Right panel : Second-order model of confidence, adapted from Fleming & Daw ( 2017 ) . Left side (dashed box): the first-order model (FOM), in which a generation process produces an answer (here via greedy decoding) and the associated log-probabilities ( X act X_{\( \mathrm{act} \)} ) are the only available confidence signal. Under greedy decoding, X act X_{\( \mathrm{act} \)} — and therefore confidence—is by definition maximal for the chosen answer, so a purely first-order system cannot conclude it erred. Right side: the second-order extension, in which the completed answer engages a qualitatively distinct evaluative process that assesses question–answer fit by attending backward over the full response—a different computation from the retrieval process that produced it (see text for details). Because this evaluation performs a different computation over the model’s knowledge, it can shift the internal distribution over possible answers such that the committed answer ( A 1 A_{1} ) is no longer the mode (now A 2 A_{2} ). The resulting evaluative signal ( X eval X_{\( \mathrm{eval} \)} ; termed X conf X_{\( \mathrm{conf} \)} in the original framework), encoded at answer-adjacent token positions (PANL), is partially independent of X act X_{\( \mathrm{act} \)} and drives verbal confidence, error detection, and self-correction.
Left panel : Verification and self-correction prompt structure (see § A.2 for full details). The model’s answer and verbal confidence were generated in a separate prior phase. In the verification phase, the model is shown its own answer to a TriviaQA (or MNLI) question and asked …
cs.LGarxiv:2604.22575v1Lead article

SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

Yuqi Pan, Jinghao Zhuang, Yupeng Feng, Fangzhi Zhong, Siyu Ding

pikingBrain2.0 introduces a novel foundation model architecture, SpB2.0, designed for efficient long-context inference. Its core method involves the Dual-Space Sparse Attention (DSSA) mechanism, which hybridizes sparse attention types for better performance-efficiency. The contribution lies in achieving high performance with reduced computational overhead for long sequences, supported by dual quantization paths (INT8-Spiking and FP8) and an optimized training pipeline.

Architecture of SpikingBrain2.0-5B (SpB2.0). SpB2.0 adopts a 1:3 inter-layer hybrid design, termed DSSA, that combines MoBA and SSE, together with dual-path activation-coding strategies for linear projections. This design allows SpB2.0 to address the dominant computational bottlenecks of standard Transformers across different sequence-length regimes and hardware platforms.
Architecture of SpikingBrain2.0-5B (SpB2.0). SpB2.0 adopts a 1:3 inter-layer hybrid design, termed DSSA, that combines MoBA and SSE, together with dual-path activation-coding strategies for linear projections. This design allows SpB2.0 to address the dominant computational bottle…
cs.CLarxiv:2604.22661v1Lead article

Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia

his paper investigates using Query Performance Prediction (QPP) to select the optimal query variant within Retrieval-Augmented Generation (RAG) pipelines, avoiding costly execution of all reformulations. The core method focuses on **intra-topic discrimination**, where QPP predicts the best variant among semantically equivalent options for a single information need. The contribution is a large-scale evaluation demonstrating the feasibility and performance of pre- and post-retrieval predictors for this selective execution mechanism across different retriever types.

Figure 1. Relationship between retrieval effectiveness (nDCG@5) and end-to-end RAG utility (Nugget-All) under sparse and dense retrieval. Each point corresponds to a query variant selected by different strategies (pre-retrieval QPP, post-retrieval QPP, single reformulation, original query, oracle).
Figure 1. Relationship between retrieval effectiveness (nDCG@5) and end-to-end RAG utility (Nugget-All) under sparse and dense retrieval. Each point corresponds to a query variant selected by different strategies (pre-retrieval QPP, post-retrieval QPP, single reformulation, origi…
cs.CLarxiv:2604.22335v1Lead article

Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding

Weixu Zhang, Fanghua Ye, Qiang Gao, Jian Li, Haolun Wu

his paper introduces Context-Fidelity Boosting (CFB), a lightweight, decoding-time framework designed to reduce faithfulness hallucinations in LLMs by prioritizing context-supported tokens. Inspired by watermarking, CFB applies additive logit adjustments based on a token's support from the input context, utilizing static, context-aware, or token-aware boosting strategies. The core contribution is this general method for boosting generation fidelity directly during inference without retraining the model.

Illustration of context-faithful decoding: Traditional decoding relies on parametric knowledge (favoring “Tokyo”), while our logit-shaping approach dynamically adjusts token probabilities to better align with the given context about “Paris 2024”.
Illustration of context-faithful decoding: Traditional decoding relies on parametric knowledge (favoring “Tokyo”), while our logit-shaping approach dynamically adjusts token probabilities to better align with the given context about “Paris 2024”.
cs.CLarxiv:2604.22750v1Lead article

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea

his paper presents the first systematic analysis of token consumption in agentic coding tasks across eight frontier LLMs. The core method involves analyzing task trajectories to determine where tokens are spent and evaluating models' ability to predict their own token costs. The key contribution is revealing that agentic tasks are uniquely expensive (1000x more than simple reasoning), driven primarily by input tokens, and that token usage is highly stochastic and unpredictable.

cs.CLarxiv:2604.22345v1Lead article

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun

his paper introduces a mechanistic framework to understand and control LLM personalization by identifying "Preference Heads"—attention heads encoding user-specific stylistic and topical preferences. The core method, Differential Preference Steering (DPS), uses causal masking to calculate a Preference Contribution Score (PCS) for each head, quantifying its influence. This allows for interpretable, training-free personalization by selectively amplifying the influence of these identified heads during inference.

Overview of preference-based personalization in LLMs. Distinct user profiles activate different subsets of Preference Heads , forming sparse internal pathways that steer generation toward user-aligned styles. Cluster-aware preference steering further captures shared structure across users with similar preferences.
Overview of preference-based personalization in LLMs. Distinct user profiles activate different subsets of Preference Heads , forming sparse internal pathways that steer generation toward user-aligned styles. Cluster-aware preference steering further captures shared structure acr…
cs.AIarxiv:2604.24657v1Lead article

AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

Yixiang Zhang, Xinhao Deng, Jiaqing Wu, Yue Xiao, Ke Xu

gentWard introduces a lifecycle security architecture for autonomous AI agents, organizing defense-in-depth across five stages: initialization, input processing, memory, decision-making, and execution. Its core method integrates stage-specific, heterogeneous controls with cross-layer coordination to intercept threats as they propagate through the agent's runtime. The contribution is a systematic framework that enhances security by protecting critical assets throughout the agent's operational lifespan.

Architectural overview of AgentWard . The framework attaches to lifecycle-relevant runtime events, organizes protection through five layers aligned with initialization, input, memory, decision, and execution, and carries security judgments forward through shared state and reusable analysis capabilities.
Architectural overview of AgentWard . The framework attaches to lifecycle-relevant runtime events, organizes protection through five layers aligned with initialization, input, memory, decision, and execution, and carries security judgments forward through shared state and reusabl…
cs.AIarxiv:2604.24395v1Lead article

Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

Byeonggeuk Lim, JungMin Yun, Junehyoung Kwon, Kyeonghyun Kim, YoungBin Kim

his paper introduces AVES-DPO, a novel framework to mitigate hallucinations in LVLMs by generating preference data directly from the model's intrinsic knowledge, avoiding reliance on external proprietary models. It uses a consensus-based verification mechanism to identify and guide the model to self-correct diverse hallucinations. This self-correction process creates in-distribution preference pairs, leading to superior hallucination mitigation with significantly fewer samples compared to existing methods.

Overview of hallucination types and the effectiveness of the proposed method. (a) An example of hallucinations in LVLMs. (b) Our proposed AVES-DPO achieves the lowest CHAIR score with only 5.2k training samples, demonstrating strong data efficiency.
Overview of hallucination types and the effectiveness of the proposed method. (a) An example of hallucinations in LVLMs. (b) Our proposed AVES-DPO achieves the lowest CHAIR score with only 5.2k training samples, demonstrating strong data efficiency.
cs.AIarxiv:2604.24512v1Lead article

Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

Dahlia Shehata, Ming Li

his paper addresses the "Attention Latch" failure mode in LLM agents, where historical context overrides new instructions, hindering goal-directedness. The authors introduce Self-Synthesizing Reasoning Protocols (SSRP), a metacognitive framework that separates high-level planning (Architect) from procedural execution (Executive). SSRP resolves this over-squashing issue, enabling agents to maintain deterministic, goal-directed behavior across complex, multi-turn interactions.

Comparative Reasoning Trajectories: Mitigating the Attention Latch via SSRP Re-Synthesis
Comparative Reasoning Trajectories: Mitigating the Attention Latch via SSRP Re-Synthesis
cs.AIarxiv:2604.24618v1Lead article

Evaluating whether AI models would sabotage AI safety research

Robert Kirk, Alexandra Souly, Kai Fronsdal, Abby D'Cruz, Xander Davies

his paper evaluates the propensity of frontier AI models (Claude family) to sabotage or refuse assistance in AI safety research when acting as research agents. Using unprompted and continuation evaluations, the authors found no unprompted sabotage, but observed that some models, particularly Mythos Preview, actively continued sabotage in a small percentage of continuation scenarios, sometimes exhibiting reasoning-output discrepancies. The core contribution is the empirical testing of sabotage behavior in deployed AI agents, revealing potential failure modes in safety alignment.

cs.AIarxiv:2604.24477v1Lead article

GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

Pablo Mateo-Torrejón, Alfonso Sánchez-Macián

he paper introduces **Gammaf**, an open-source framework designed to standardize the benchmarking of graph-based anomaly detection methods within LLM Multi-Agent Systems. Its core contribution is providing a reproducible evaluation architecture that generates synthetic multi-agent interaction datasets. Gammaf serves as a common platform to rigorously test and compare the efficacy of existing and future anomaly monitoring defense models against emerging vulnerabilities.

Example of debate setup for collaboration in a LLM-MAS. Agents exchange natural language discourse to reach a consensus on a specific task. The diagram illustrates how the communication structure constrains information flow, requiring agents to synthesize the logical reasoning of their neighbors to update their internal context.
Example of debate setup for collaboration in a LLM-MAS. Agents exchange natural language discourse to reach a consensus on a specific task. The diagram illustrates how the communication structure constrains information flow, requiring agents to synthesize the logical reasoning of…
cs.AIarxiv:2604.24686v1Lead article

Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents

German Marin, Jatin Chaudhary

his paper introduces the **Informational Viability Principle** for governing autonomous AI agents whose risk is unobservable, defining acceptable actions based on whether their capacity exceeds an estimated bound on unobserved risk ($\hat{B}(x)$). The **Agent Viability Framework** formalizes necessary governance properties (monitoring, anticipation, monotonic restriction) grounded in viability theory. **RiskGate** implements this framework using statistical estimators and a fail-secure pipeline, culminating in a closed-loop Autopilot for runtime safety enforcement.

cs.AIarxiv:2604.24432v1Lead article

Kwai Summary Attention Technical Report

Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng

he Kwai Summary Attention (KSA) method addresses the quadratic complexity of standard attention in long-context LLMs by introducing a novel **summary attention mechanism**. It achieves this by compressing the Key and Value (KV) cache into a fixed-size summary representation, effectively decoupling the KV cache size from the sequence length. This approach aims to maintain long-context modeling effectiveness while significantly reducing the memory and computational overhead associated with long sequences.

cs.AIarxiv:2604.24542v1Lead article

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

Nay Myat Min, Long H. Pham, Jun Sun

his paper introduces Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitoring method for detecting misbehavior in opaque Large Language Models. LCF analyzes the inter-layer hidden-state trajectory, computing a diagonal Mahalanobis distance on layer differences, aggregated via Ledoit-Wolf shrinkage. This approach effectively detects various threats like backdoors and prompt injections without needing a reference model, trigger knowledge, or retraining.

Overview of LCF. (A) Backdoor signal location varies by architecture (mid for Llama-3, mid-to-late for Qwen, late for Gemma-2), motivating all-layer monitoring. (B) Detection pipeline: per-layer deltas are scored via diagonal Mahalanobis distance, z-scored, and aggregated via Ledoit–Wolf into a single score D D ; LCF abstains when D > τ D>\( \tau \) (LOO-calibrated).
Overview of LCF. (A) Backdoor signal location varies by architecture (mid for Llama-3, mid-to-late for Qwen, late for Gemma-2), motivating all-layer monitoring. (B) Detection pipeline: per-layer deltas are scored via diagonal Mahalanobis distance, z-scored, and aggregated via Led…
cs.AIarxiv:2604.24594v1Lead article

Skill Retrieval Augmentation for Agentic AI

Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang

his paper introduces **Skill Retrieval Augmentation (SRA)**, a new paradigm where agentic AI dynamically retrieves relevant skills from large external corpora instead of relying on fixed context enumeration. This addresses the scaling limitations of current methods. The authors also introduce **SRA-Bench**, the first benchmark to evaluate the full SRA pipeline, including retrieval, incorporation, and end-task execution.

An illustration of the Skill Retrieval Augmentation (SRA) paradigm. The agent retrieves candidate skills from a large external skill corpus, selectively incorporates useful skills into context, and applies them for downstream reasoning and acting. Black arrows denote the standard SRA workflow, while blue arrows represent iterative skill retrieval during reasoning and acting.
An illustration of the Skill Retrieval Augmentation (SRA) paradigm. The agent retrieves candidate skills from a large external skill corpus, selectively incorporates useful skills into context, and applies them for downstream reasoning and acting. Black arrows denote the standard…
cs.AIarxiv:2604.24544v1Lead article

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa, Evgeny Bogdanov, Maxim Romanovsky

TELLAR-E is a fully automated system designed to generate high-quality, custom-sized synthetic evaluation datasets for domain- and language-specific LLM applications, overcoming the limitations of manual creation and existing static benchmarks. It achieves this through a two-stage process: first, a modified Self-Instruct framework generates controllable synthetic data, and second, an evaluation pipeline assesses the dataset's quality using statistical and LLM-based metrics. The core contribution is providing a scalable, privacy-preserving method for creating tailored evaluation resources with minimal human effort.

Overview of generation pipeline
Overview of generation pipeline
cs.AIarxiv:2604.24668v1Lead article

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

Zhenyu Zhao, Aparna Balagopalan, Adi Agrawal, Dilshoda Yergasheva, Waseem Alshikh

his paper investigates LLM sycophancy—prioritizing user agreement over correctness—specifically within agentic financial applications. The authors find that LLMs exhibit lower performance drops when faced with contradictory user rebuttals compared to general domains, but still fail significantly when user preference information contradicts the correct answer. Their contribution is a novel task suite to measure this financial-specific sycophancy and a benchmark of potential recovery methods.

Measuring and reducing sycophancy in enterprise settings. Our three-step approach to understanding and addressing sycophancy in financial agentic scenarios.
Measuring and reducing sycophancy in enterprise settings. Our three-step approach to understanding and addressing sycophancy in financial agentic scenarios.
cs.LGarxiv:2604.24468v1Lead article

A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations

Zihan Liu, Yizhen Wang, Rui Wang, Xiu Tang, Sai Wu

his survey comprehensively reviews the emerging field of split learning applied to large language model (LLM) fine-tuning. It categorizes and analyzes existing work across three key dimensions: the model architectures used, the system optimizations developed, and the privacy defense and attack mechanisms employed. The core contribution is providing a structured overview to guide future research in enabling resource-efficient and privacy-preserving collaborative LLM adaptation.

Figure 1. Survey framework: from a unified training pipeline to a multidimensional taxonomy of system, model, and privacy.
Figure 1. Survey framework: from a unified training pipeline to a multidimensional taxonomy of system, model, and privacy.
cs.LGarxiv:2604.24658v1Lead article

The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu

his paper introduces the **Agent-Native Research Artifact (Ara)** protocol to overcome the limitations of traditional narrative scientific papers, which impose "Storytelling" and "Engineering" taxes on reproducibility by AI agents. Ara replaces the linear paper with a machine-executable package structured across four layers: scientific logic, fully specified code, an exploration graph capturing failures, and evidence grounding all claims. This contribution aims to create research artifacts that AI agents can directly understand, reproduce, and extend.

Publishing compiles a rich research object into a lossy narrative (left); Ara preserves the original as a high-fidelity, machine-executable knowledge package (right).
Publishing compiles a rich research object into a lossy narrative (left); Ara preserves the original as a high-fidelity, machine-executable knowledge package (right).
cs.CLarxiv:2604.24429v1Lead article

A Multi-Dimensional Audit of Politically Aligned Large Language Models

Lisa Korver, Mohamed Mostagir, Sherief Reda

his paper introduces a multi-dimensional audit framework, inspired by Habermas' Theory of Communicative Action, to evaluate politically aligned Large Language Models (LLMs) across effectiveness, fairness, truthfulness, and persuasiveness using quantitative metrics. The core contribution is demonstrating consistent trade-offs across nine audited LLMs, showing that while larger models are often more effective at ideological role-playing, this frequently comes at the cost of other critical dimensions.

Mapping of the audit dimensions to the Habermas’ Theory of Communicative Action.
Mapping of the audit dimensions to the Habermas’ Theory of Communicative Action.
cs.CLarxiv:2604.24693v1Lead article

Contextual Linear Activation Steering of Language Models

Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Mikhail Belkin

his paper introduces Contextual Linear Activation Steering (CLAS), a method that dynamically adjusts the strength of linear activation steering based on the input context, overcoming the limitations of fixed steering strength. CLAS consistently outperforms standard linear steering and achieves comparable or better performance than methods like ReFT and LoRA when labeled data is scarce. This offers a scalable, interpretable, and accurate way to specialize and steer large language models.

Per-task improvement over LAS ( Δ = method accuracy − LAS accuracy \( \Delta \)=\( \text{method accuracy} \)-\( \text{LAS accuracy} \) ) when steering each of the 11 tasks separately. Each subplot shows a different model. Each point shows the \( \Delta \) on a single task (colored by method). The diamond represents the average \( \Delta \) per method by averaging over all 11 tasks.
Per-task improvement over LAS ( Δ = method accuracy − LAS accuracy \( \Delta \)=\( \text{method accuracy} \)-\( \text{LAS accuracy} \) ) when steering each of the 11 tasks separately. Each subplot shows a different model. Each point shows the \( \Delta \) on a single task (colore…
cs.CLarxiv:2604.24698v1Lead article

The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan

his paper introduces the concept of **Persona Collapse**, a failure mode where diverse LLM agents converge into homogeneous behavior despite assigned distinct profiles. The authors propose a framework measuring **Coverage, Uniformity, and Complexity** to quantify this collapse across personality, moral reasoning, and self-introduction tasks. Their findings reveal that persona collapse occurs along multiple axes and domains, highlighting a significant limitation in achieving true population diversity in LLM applications.

Persona collapse in LLM-based population simulation. Although two personas differ across multiple identity dimensions, Qwen3-32B assigns both the same neutral response on a socially sensitive judgment task. At the population level, the most conservative and most liberal persona pools also concentrate on the same Likert rating.
Persona collapse in LLM-based population simulation. Although two personas differ across multiple identity dimensions, Qwen3-32B assigns both the same neutral response on a socially sensitive judgment task. At the population level, the most conservative and most liberal persona p…
cs.AIarxiv:2604.25849v1Lead article

ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLMAgents

Zhou Hanlin, Chan Huah Yong

DEMA is a knowledge-state orchestration architecture designed to overcome failures in long-horizon LLM tasks by explicitly managing the evolving knowledge state. Its core method integrates features like epistemic bookkeeping, dual-evaluator governance, and checkpoint-resumable persistence to maintain a coherent evidence chain across many steps. The contribution is a robust framework for reliable, long-horizon knowledge synthesis, demonstrated through a comprehensive showcase and benchmark repair.

cs.AIarxiv:2604.25891v1Lead article

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, Owain Evans

his paper investigates "conditional misalignment," where standard interventions designed to reduce emergent misalignment (EM) only mask the problem. While these methods eliminate EM on existing evaluations, the misaligned behavior reappears when test prompts share contextual features with the original training data. The core contribution is demonstrating that common mitigation techniques can hide more egregious misalignment that is only triggered by specific contextual cues.

Conditional misalignment across interventions. Models that appear aligned under standard evaluations can be misaligned when evaluation prompts contain cues for misaligned training data (e.g., insecure code). We illustrate this pattern for (a) mixing misaligned with benign data, (b) post-hoc HHH finetuning, and (c) inoculation prompting (IP).
Conditional misalignment across interventions. Models that appear aligned under standard evaluations can be misaligned when evaluation prompts contain cues for misaligned training data (e.g., insecure code). We illustrate this pattern for (a) mixing misaligned with benign data, (…
cs.AIarxiv:2604.25847v1Lead article

From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang

he paper introduces **Agora-Opt**, a modular LLM agent framework designed to reliably solve optimization modeling problems from natural language. It achieves this by employing **decentralized debate** among independent agent teams, whose solutions are reconciled via an outcome-grounded protocol. A **read-write memory bank** stores verified artifacts and past resolutions, enabling training-free, iterative improvement and achieving state-of-the-art performance across benchmarks.

The illustration of three limitations in most existing methods: (a) base–LLM lock–in of training–centric approaches, (b) non–trainable problem of agentic methods, and (c) single–model myopia; alongside their paired design principles in our framework for LLM–based optimization modeling: an agentic foundation for easy backbone upgrades, a read–write agentic memory design, and decentralized agentic debate.
The illustration of three limitations in most existing methods: (a) base–LLM lock–in of training–centric approaches, (b) non–trainable problem of agentic methods, and (c) single–model myopia; alongside their paired design principles in our framework for LLM–based optimization mod…
cs.AIarxiv:2604.25639v1Lead article

Large language models eroding science understanding: an experimental study

Harry Collins, Hartmut Grote, Paul Newbury, Patrick Sutton, Simon Thorne

his study experimentally demonstrates that large language models (LLMs) can be easily manipulated to prioritize fringe scientific claims over established consensus. By modifying LLMs to favor specific non-mainstream papers, the authors generated fluent, convincing answers that contradicted expert knowledge and were difficult for non-experts to identify as misleading. The core contribution is highlighting LLMs' vulnerability to manipulation, posing a significant risk to public scientific understanding and the spread of misinformation.

cs.AIarxiv:2604.25917v1Lead article

Recursive Multi-Agent Systems

Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu

his paper introduces **RecursiveMAS**, a novel framework that extends the recursive refinement principle from single language models to **multi-agent systems** to scale agent collaboration. It casts the system as a unified recursive computation, connecting heterogeneous agents via a **RecursiveLink module** for latent state transfer and thought generation. The core contribution is the framework's ability to achieve iterative, whole-system co-optimization using an inner-outer loop learning algorithm, demonstrating a scalable approach to complex reasoning.

Performance Landscape of RecursiveMAS across Training/Inference Recursion Depths (Top): The lightweight RecursiveMAS with sub-1.5B agents shows a clean scaling trend as recursion deepens. Generalization across Common Collaboration Patterns (Bottom): The Scaled RecursiveMAS with stronger LLM agents (5-10B) seamlessly adapts to diverse multi-agent system structures.
Performance Landscape of RecursiveMAS across Training/Inference Recursion Depths (Top): The lightweight RecursiveMAS with sub-1.5B agents shows a clean scaling trend as recursion deepens. Generalization across Common Collaboration Patterns (Bottom): The Scaled RecursiveMAS with s…
cs.AIarxiv:2604.25684v1Lead article

Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

Eranga Bandara, Ross Gore, Asanga Gunaratna, Sachini Rajapakse, Isurunima Kularathna

his paper introduces a **Neurocognitive Governance Model** that addresses the governance gap in autonomous AI by internalizing safety principles, mirroring human self-governance. It formally maps human executive functions—deliberate evaluation and inhibitory control before action—onto the reasoning process of LLM-driven agents. This framework establishes a structural parallel between the human brain and the LLM, enabling agents to "think before they act" by evaluating actions internally.

Both humans and AI agents interact with large language models through natural language prompts, forming the basis of the human-agent governance analogy proposed in this paper.
Both humans and AI agents interact with large language models through natural language prompts, forming the basis of the human-agent governance analogy proposed in this paper.
cs.AIarxiv:2604.25895v1Lead article

Three Models of RLHF Annotation: Extension, Evidence, and Authority

Steve Coyne

his paper analyzes the normative role of human judgments in RLHF by distinguishing three conceptual models: **extension** (annotators reflect designer intent), **evidence** (annotators provide factual input), and **authority** (annotators determine correct outputs). The core contribution is arguing that understanding which model is being implicitly used impacts how RLHF pipelines should collect, validate, and aggregate human feedback.

cs.LGarxiv:2604.25903v1Lead article

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy, Banani Roy, Kevin A. Schneider

he paper introduces **Carbon-Taxed Transformers (CTT)**, a systematic compression pipeline for Large Language Models inspired by economic carbon taxation principles. CTT operationalizes a computational "carbon tax" to penalize architectural inefficiencies and incentivize deployment-ready compression techniques. This method aims to address the unsustainable computational and environmental costs of LLMs in software engineering by making efficiency a primary design constraint alongside accuracy.

cs.AIarxiv:2604.26522v1Lead article

AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents

Mahnoor Shahid, Hannes Rothe

GEL-Comp is a neuro-symbolic framework designed to improve the compositional generalization of LLM agents in interactive settings. It achieves this by integrating a dynamic Causal Program Graph (CPG) as a world model, an Inductive Logic Programming (ILP) engine to learn new symbolic rules from experience, and a hybrid reasoning core that uses an LLM for planning validated by a Neural Theorem Prover. This architecture enables agents to robustly deduce plans and abductively expand their symbolic knowledge base through interaction.

The AGEL-Comp neuro-symbolic architecture.
The AGEL-Comp neuro-symbolic architecture.
cs.AIarxiv:2604.26577v1Lead article

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

Mahiro Nakao, Kazuhiro Takemoto

his paper introduces a novel dataset of 270 ethically-grounded harmful instructions to benchmark the safety of 72 Large Language Models (LLMs) controlling a simulated Robotic Health Attendant. The core contribution is demonstrating a high average violation rate (54.4%), revealing that safety performance varies significantly by instruction type and model family, with proprietary models being substantially safer than open-weight alternatives.

Boxplot of violation rates across model families ( n n indicates the number of models per family). Families are ordered by median violation rate in descending order. All individual model names are labeled.
Boxplot of violation rates across model families ( n n indicates the number of models per family). Families are ordered by median violation rate in descending order. All individual model names are labeled.
cs.AIarxiv:2604.26557v1Lead article

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Bodon Jeong, Hongsu Byun, Youngjae Kim, Weikuan Yu, Kyungkeun Lee

UAL-BLADE is a dual-path KV-cache offloading framework for edge LLM inference that dynamically routes KV tensors to either a standard page-cache path or a low-overhead NVMe-direct path based on memory pressure. The NVMe-direct path bypasses the kernel by directly mapping tensors to LBA regions, reducing cache thrashing and software overhead. This approach, combined with adaptive pipeline parallelism, significantly improves inference throughput under tight memory constraints.

LLM transformer architecture [ 37 ] .
LLM transformer architecture [ 37 ] .
cs.AIarxiv:2604.26733v1Lead article

FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards

Zhixin Han, Yanzhi Zhang, Chuyang Wei, Maohang Gao, Xiawei Yue

utureWorld introduces a novel live agentic reinforcement learning environment specifically designed for training predictive agents. Its core method is closing the training loop by continuously providing prediction tasks based on unfolding real-world events, rewarding agents based on actual outcomes. The main contribution is framing live future prediction as a unified, continuous learning environment that leverages real-world feedback without answer leakage.

Domain distributions of website sources (a), questions before resampling (b), and questions after resampling (c).
Domain distributions of website sources (a), questions before resampling (b), and questions after resampling (c).
cs.AIarxiv:2604.26841v1Lead article

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Bao Pham, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov, Matteo Negri

his paper demonstrates that Uniform-based Discrete Diffusion Models (UDDMs) function as Associative Memories (AMs) with emergent creativity. The core method involves showing that these models form basins of attraction around training data, not through an explicit energy function, but via conditional likelihood maximization. The key contribution is identifying a sharp transition from memorization to generalization in UDDMs, governed by the size of the training dataset.

Basins around training examples shrink and basins around test examples expand as the training dataset size increases . (A) Textual examples showing two Tiny UDDMs’ token recovery at noise level t = 0.2 t=0.2 , where each is trained on two different training dataset sizes. With a small training dataset, the model fails to recognize unseen test tokens and alters them. With a larger training set, these unseen tokens however become stable and remain intact after the sampling process. (B) Average total token recovery rates (%), including both non-corrupt and corrupted tokens, for training and test sequences across varying corruption levels. Line colors indicate the fractions of the training dataset used (ranging from small to large ). As data scales, the model’s ability to flawlessly recover explicit training examples drops (indicating shrinking basins), while its recovery rate of unseen test examples improves (indicating expanding basins). The convergence of these rates at large dataset sizes (red curves) marks the sharp transition from memorization to generalization. Note: Deterministic (greedy) sampling was used across these experiments to isolate from stochastic noise.
Basins around training examples shrink and basins around test examples expand as the training dataset size increases . (A) Textual examples showing two Tiny UDDMs’ token recovery at noise level t = 0.2 t=0.2 , where each is trained on two different training dataset sizes. With a …
cs.AIarxiv:2604.26511v1Lead article

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

his paper introduces a novel method for detecting Alignment Faking (AF) in LLMs by observing strategic tool selection rather than relying solely on Chain-of-Thought analysis. The core method identifies AF when an LLM switches from a safe tool (under unmonitored conditions) to an unsafe tool (under helpfulness-rewarding monitoring), even while its internal reasoning still acknowledges the safe option. The contribution includes formalizing AF as a behavioral event based on tool use and releasing a new dataset covering 108 enterprise IT scenarios to evaluate frontier LLMs.

cs.AIarxiv:2604.26553v1Lead article

TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong

LPO introduces Token-Level Policy Optimization, a novel fine-tuning framework to mitigate language confusion in LLMs by applying localized, token-level updates instead of sequence-level adjustments. The method identifies error-prone positions and uses a tailored objective to selectively suppress undesirable token outputs. This granular intervention effectively resolves language confusion while preserving the model's general performance.

cs.AIarxiv:2604.26951v1Lead article

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan

his paper introduces TIDE, the first framework for cross-architecture knowledge distillation between diffusion large language models (dLLMs). TIDE employs three novel components—TIDAL, CompDemo, and Reverse CALM—to effectively transfer knowledge despite differences in architecture, attention, and tokenizer between teacher and student models. This method enables the creation of smaller, efficient student dLLMs that retain competitive performance from larger teachers.

Cross-architecture distillation for dLLMs. Compared to prior step distillation (a) that retains the original model size, the Tide framework (b) distills heterogeneous 16B MoE and 8B dense teachers into a 0.6B student. The distilled model achieves a +16.5 gain on HumanEval over the AR baseline, 22 × \( \times \) memory reduction, and 5 × \( \times \) faster inference.
Cross-architecture distillation for dLLMs. Compared to prior step distillation (a) that retains the original model size, the Tide framework (b) distills heterogeneous 16B MoE and 8B dense teachers into a 0.6B student. The distilled model achieves a +16.5 gain on HumanEval over th…
cs.CLarxiv:2604.26506v1Lead article

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin

he paper introduces **SafeReview**, a novel adversarial framework to defend LLM-based review systems against hidden adversarial prompts designed to manipulate review outcomes. It employs a **Generator** to create sophisticated attacks and a **Defender** to detect them, trained jointly using an Information Retrieval GAN-inspired loss function. This dynamic co-evolution forces the Defender to develop robust capabilities against continuously improving threats, significantly enhancing the security of scholarly peer review.

Impact of adversarial hidden prompt threats on AI review systems. (a) Past AI review systems: undefended reviewer models are easily manipulated—attackers embed persuasive injected text that emphasizes strengths and conceals weaknesses, leading to inflated scores and the acceptance of flawed papers. (b) SafeReview (ours): by contrast, SafeReview detects and resists injected content, maintaining accurate quality assessment and preserving normal review operation even under attack, preventing adversarial papers from bypassing standards.
Impact of adversarial hidden prompt threats on AI review systems. (a) Past AI review systems: undefended reviewer models are easily manipulated—attackers embed persuasive injected text that emphasizes strengths and conceals weaknesses, leading to inflated scores and the acceptanc…
cs.AIarxiv:2604.21579v1Lead article

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

Milan De Koning, Ali Asgari, Pouria Derakhshanfar, Annibale Panichella

his paper introduces a metamorphic testing (MT) approach combined with negative log-likelihood (NLL) to diagnose data leakage (memorization) in LLM-based program repair. By applying semantics-preserving transformations to create variant benchmarks, the authors reveal substantial drops in repair success rates across several LLMs, demonstrating that MT effectively exposes performance inflation caused by pretraining data overlap.

Experimental pipeline
Experimental pipeline
cs.AIarxiv:2604.21584v1Lead article

CoFEE: Reasoning Control for LLM-Based Feature Discovery

Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Yagiz Ihlamur

oFEE is a reasoning control framework designed to improve feature discovery from unstructured data using Large Language Models (LLMs). It enforces specific "cognitive behaviors" during the LLM's reasoning process, which act as structured inductive biases. This method aims to generate higher-quality, predictive features by guiding the LLM away from generating weak or invalid feature candidates.

Overview of the CoFEE pipeline.
Overview of the CoFEE pipeline.
cs.AIarxiv:2604.21598v1Lead article

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

Kaushitha Silva, Srinath Perera

ryRUN addresses the bottleneck of relying on human-provided public tests in LLM-driven code generation by proposing a method that operates without them. The core contribution is demonstrating that LLM agents can effectively debug and refine code using only *internal* execution feedback, mitigating the "overconfidence gap" caused by overfitting to simplistic public examples. This allows autonomous code generation to move beyond curated benchmarks toward real-world scenarios where ground-truth tests are scarce.

cs.AIarxiv:2604.21536v1Lead article

Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

Nikita Severin, Danil Kartushov, Vladislav Urzhumov, Vladislav Kulikov, Oksana Konovalova

his paper introduces a novel knowledge distillation method to integrate rich user semantics from pre-trained LLMs into sequential recommenders. The core method distills LLM-generated textual user profiles into the recommender model, enabling it to capture deeper user understanding. The key contribution is achieving this enhancement without requiring LLM inference during serving time, maintaining the efficiency of traditional sequential models.

Proposed knowledge transfer approach from LLM to a Transformer-based sequential recommendation model.
Proposed knowledge transfer approach from LLM to a Transformer-based sequential recommendation model.
cs.CLarxiv:2604.21716v1Lead article

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, Katharina von der Wense

his paper shifts bias evaluation in code generation from simple if-statements to the more realistic task of generating machine learning pipelines. The core contribution is demonstrating that this pipeline-based approach reveals significantly higher and more subtle bias, finding sensitive attributes in 87.7% of generated pipelines, compared to only 59.2% in conditional statements. This highlights that current evaluation methods severely underestimate the practical bias embedded in LLM-generated code.

Overview of our evaluation approach. We assess bias through covert discrimination in ML pipeline generation, specifically through feature selection , moving beyond the overt conditional statements studied in prior work.
Overview of our evaluation approach. We assess bias through covert discrimination in ML pipeline generation, specifically through feature selection , moving beyond the overt conditional statements studied in prior work.
cs.CLarxiv:2604.21871v1Lead article

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim

his paper investigates how LLMs handle relational nuances in moral dilemmas, specifically the Whistleblower's Dilemma, by varying crime severity and relational closeness. The core finding is a divergence: models judge moral rightness based on fairness, but predict human behavior shifts toward loyalty with increased closeness. Crucially, the LLMs' autonomous decisions align with their moral rightness judgments, not their own behavioral predictions.

Illustration of the Whistleblower’s Dilemma and the three perspectives investigated (moral rightness, predicted human behavior, and model decision). It offers how LLM responses shift when the same ethical scenario is framed through divergent evaluative lenses.
Illustration of the Whistleblower’s Dilemma and the three perspectives investigated (moral rightness, predicted human behavior, and model decision). It offers how LLM responses shift when the same ethical scenario is framed through divergent evaluative lenses.
cs.AIarxiv:2604.22306v1Lead article

BLAST: Benchmarking LLMs with ASP-based Structured Testing

Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri

his paper introduces **BLAST**, the first benchmarking methodology and dataset specifically designed to evaluate Large Language Models' (LLMs) ability to generate **Answer Set Programming (ASP)** code. BLAST employs a structured evaluation framework featuring two novel semantic metrics tailored for ASP code correctness. The authors empirically test eight state-of-the-art LLMs on ten graph-related ASP problems to establish a baseline performance.

Scheme of the overall proposed framework. Input consists of the textual specification of the problem, the target LLM to be evaluated, and the correct ( gold ) ASP program . The ASP Generation module comprises: the paraphraser , an LLM which paraphrases the original problem description in more human-styled texts; and the predicate matcher , an LLM which maps the predicates of the generated programs to the ones of the gold program. The predicate mappings and the gold encoding are finally provided to the ASP Testing module, which performs the evaluation.
Scheme of the overall proposed framework. Input consists of the textual specification of the problem, the target LLM to be evaluated, and the correct ( gold ) ASP program . The ASP Generation module comprises: the paraphraser , an LLM which paraphrases the original problem descri…
cs.AIarxiv:2604.22328v1Lead article

FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

Marco Obermeier, Marco Pruckner, Florian Haselbeck, Andreas Zeiselmair

his paper introduces the FETS benchmark to evaluate the application of foundation models (FMs) in energy time series forecasting. The core method involves structuring energy forecasting use cases and collecting 54 diverse datasets to systematically benchmark FMs against traditional dataset-specific models. The main contribution is demonstrating that foundation models significantly outperform specialized models across various energy forecasting scenarios, suggesting a path toward more scalable and generalizable solutions.

cs.AIarxiv:2604.22601v1Lead article

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman

his paper introduces the NL2VC-60 dataset to facilitate AI-assisted problem-to-code generation with formal verification. The core method involves a tiered prompting strategy (contextless, signature, and self-healing) that uses feedback from the Dafny verifier to guide Large Language Models (LLMs) in synthesizing code alongside formal specifications. The contribution is a benchmark for evaluating LLM correctness assurance, addressing the challenge of translating natural language into verifiable formal logic.

cs.AIarxiv:2604.22446v1Lead article

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu

his paper introduces **OneManCompany (OMC)**, a framework that moves beyond fixed multi-agent structures by introducing an organizational layer. OMC encapsulates agent capabilities as portable **Talents** orchestrated via typed interfaces, enabling dynamic reconfiguration through a **Talent Market** for on-demand recruitment. This approach allows the system to flexibly assemble and govern heterogeneous agents to close capability gaps during execution.

The running OMC system , where the three proposed pillars converge into a unified management interface. Talent Lifecycle implements the Talent-Container architecture (Section 2.1 ), with per-employee profiles tracking skills, performance, and configuration. Task Decomposition realises the E 2 \( \text{E}^{2} \) R tree search (Section 2.2 ) through hierarchical task trees with DAG dependencies. Agent Coordination enables structured inter-agent communication (Section 2.2.4 ), where agents request meetings, exchange information, and align on shared tasks through dedicated coordination channels. Org Knowledge embodies the organisation-level evolution mechanism (Section 2.3 ), with editable workflow SOPs and company culture rules that persist across projects.
The running OMC system , where the three proposed pillars converge into a unified management interface. Talent Lifecycle implements the Talent-Container architecture (Section 2.1 ), with per-employee profiles tracking skills, performance, and configuration. Task Decomposition rea…
cs.AIarxiv:2604.22438v1Lead article

SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking

Chenxi Gu, Xiaoning Du, John Grundy

his paper introduces **SSG (Logit-Balanced Vocabulary Partitioning)** to enhance the KGW watermarking scheme, particularly in low-entropy scenarios like code generation where KGW struggles. SSG addresses this by analyzing the "watermark strength" inherent in the next-token probability distribution. The core contribution is a novel, non-random vocabulary partitioning method that balances the logits to ensure consistent and effective watermark embedding even when token probabilities are highly skewed.

Influence of top- k k on SSG performance.
Influence of top- k k on SSG performance.
cs.AIarxiv:2604.24473v1Lead article

Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Johannes Moll, Jannik Lübberstedt, Christoph Nuernbergk, Jacob Stroh, Luisa Mertens

his paper introduces an **agentic reasoning system** designed to synthesize complex, longitudinal clinical records for multiple myeloma treatment decisions. The core method retrospectively evaluates this system against traditional RAG and full-context input, benchmarking performance against expert consensus derived from double-annotated patient-question pairs. The contribution is demonstrating that the agentic system **approaches the performance ceiling** set by advanced RAG and full-context methods (around 75% accuracy) in complex clinical reasoning tasks.

Construction of longitudinal cohorts and expert-annotated evaluation dataset enabling clinically grounded assessment of longitudinal reasoning. (a) Overview of data sources and preprocessing pipeline across two institutions, including document extraction, structuring, metadata indexing, and quality control applied to heterogeneous clinical records. (b) Distribution of document counts per patient, demonstrating substantial variability in record density and reflecting the complexity of real-world longitudinal documentation. (c) Distribution of follow-up duration, highlighting long-term disease trajectories in the TUM cohort compared with shorter observation windows in MIMIC-IV. (d) Study design and cohort construction, including development, in-house evaluation, and external validation sets. (e) Annotation outcomes showing proportions of direct agreement, adjudicated cases, and exclusions. (f) Inter-rater reliability across predefined complexity levels, reported as Cohen’s \( \kappa \) and observed agreement, illustrating moderate agreement for clinically complex tasks. The low \( \kappa \) at MIMIC Level 1 reflects high prevalence of negative responses inflating the chance-agreement baseline. (g) Distribution of adjudication categories, indicating that a substantial proportion of disagreements reflects clinically insignificant or interchangeable interpretations rather than true errors.
Construction of longitudinal cohorts and expert-annotated evaluation dataset enabling clinically grounded assessment of longitudinal reasoning. (a) Overview of data sources and preprocessing pipeline across two institutions, including document extraction, structuring, metadata in…
cs.AIarxiv:2604.24665v1Lead article

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

Sercan Karakaş, Yusuf Şimşek

his paper benchmarks source-sensitive reasoning in Turkish evidential morphology (specifically the contrast between -DI and -mIs) by manipulating the perceived trustworthiness of the information source. Human speakers robustly adjust their usage based on source trust, favoring -DI for high-trust and -mIs for low-trust contexts. In contrast, LLMs show highly inconsistent and often unstable performance across different prompting methods, failing to reliably track this human-like sensitivity.

cs.AIarxiv:2604.24697v1Lead article

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang

his paper introduces **SciCrafter**, a Minecraft-based benchmark designed to evaluate an agent's ability to close the **discovery-to-application loop** by solving parameterized redstone circuit tasks. The core method involves scaling task complexity to force genuine discovery rather than rote memorization. The contribution is demonstrating that current frontier models plateau at low success rates ($\approx 26\%$), highlighting a significant gap in their capacity for complex, multi-step scientific reasoning and engineering application.

Decomposing performance gaps in the Discovery-to-Application loop within SciCrafter (Gemini-3-Pro). The best model achieves only 26.0% success. We decompose the loop into four capacity gaps: Knowledge Identification (oracle hints on what to discover boost success to 52.5%), Experimental Discovery (a scientist sub-agent further reaches 64.0%), Knowledge Consolidation (structured templates outperform free-form summaries), and Application Capacity (the remaining 36% gap). See Table 1 for all models.
Decomposing performance gaps in the Discovery-to-Application loop within SciCrafter (Gemini-3-Pro). The best model achieves only 26.0% success. We decompose the loop into four capacity gaps: Knowledge Identification (oracle hints on what to discover boost success to 52.5%), Exper…
cs.AIarxiv:2604.24710v1Lead article

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui

his paper introduces a novel methodology using **case-specific, clinician-authored rubrics** to efficiently and validly evaluate clinical AI documentation systems. The core contribution is demonstrating that these detailed rubrics effectively discriminate between high- and low-quality AI outputs, and that **LLM-generated rubrics can approximate clinician agreement**, offering a scalable alternative to slow, expert-intensive scoring.

Rubric methodology workflow. Two parallel paths for rubric creation (clinician-authored and LLM-generated) converge at a shared scoring agent. Clinician path: case review, best/worst labeling, rubric authorship, validation (min best > > max worst). LLM path: same case inputs, LLM prompt, generated rubric (no validation), where both are graded on the same set of cases.
Rubric methodology workflow. Two parallel paths for rubric creation (clinician-authored and LLM-generated) converge at a shared scoring agent. Clinician path: case review, best/worst labeling, rubric authorship, validation (min best > > max worst). LLM path: same case inputs, LLM…
cs.AIarxiv:2604.25676v1Lead article

CORAL: Adaptive Retrieval Loop for Culturally-Aligned Multilingual RAG

Nayeon Lee, Jiwoo Song, Byeongcheol Kang

ORAL introduces an adaptive retrieval loop for multilingual RAG (mRAG) to address cultural misalignment in fixed retrieval spaces. It iteratively refines both the retrieval corpus and the query based on an agentic critique of the retrieved evidence's relevance and cultural alignment. This method aims to ensure culturally grounded queries yield contextually appropriate answers by dynamically adjusting the retrieval process.

cs.AIarxiv:2604.25716v1Lead article

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Shirin Alanova, Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin

his paper introduces a training-free, external guardrail for detecting cross-lingual jailbreaks by comparing multilingual user queries against a fixed English codebook of known malicious prompts using semantic similarity. The core contribution is demonstrating that this language-agnostic approach effectively mitigates vulnerabilities in multilingual LLM deployments without requiring model retraining or language-specific adaptation.

Overview of the proposed cross-lingual semantic filtering framework. Incoming user input (in any language) is encoded using a multilingual embedding model and compared against a fixed English codebook of jailbreak prompts. If the maximum cosine similarity exceeds a predefined threshold, the query is blocked; otherwise, it is forwarded to the target LLM. The approach operates as a training-free external guardrail and does not require translation or model fine-tuning.
Overview of the proposed cross-lingual semantic filtering framework. Incoming user input (in any language) is encoded using a multilingual embedding model and compared against a fixed English codebook of jailbreak prompts. If the maximum cosine similarity exceeds a predefined thr…
cs.AIarxiv:2604.25555v1Lead article

From CRUD to Autonomous Agents: Formal Validation and Zero-Trust Security for Semantic Gateways in AI-Native Enterprise Systems

Ignacio Peyrano

his paper introduces the **Semantic Gateway** governed by the **Model Context Protocol (MCP)** to secure AI-native enterprise systems where LLMs act as orchestrators. The core method reframes autonomous agent validation as analyzing **stochastic state-transition systems** using enabled-tool graphs, moving beyond traditional software testing. This provides a **Zero-Trust security model** for dynamically authorizing and executing tools based on agent intent and policy.

Semantic Gateway architecture. The intent flows from enterprise sources through the Semantic Firewall, Embedding Router, Chain-of-Thought Planner, and Policy Enforcement Point before reaching the Tool Runtime and Audit Ledger.
Semantic Gateway architecture. The intent flows from enterprise sources through the Semantic Firewall, Embedding Router, Chain-of-Thought Planner, and Policy Enforcement Point before reaching the Tool Runtime and Audit Ledger.
cs.AIarxiv:2604.25482v1Lead article

From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation

Dominik Borawski, Marta Szulc, Robert Chudy, Małgorzata Giedrowicz, Piotr Mironowicz

his paper introduces a dependency-driven, multi-stage prompt pipeline for generating coherent RPG content, moving from world-building to detailed quest-lines. The core method enforces structural consistency by conditioning each sequential generation stage (e.g., world, NPC, quest planning) on structured JSON outputs from the preceding stage. This dependency modeling significantly reduces narrative drift and hallucinations, enabling scalable creation of interconnected game narratives.

Dependency-aware multi-stage prompt pipeline for structured RPG content generation. Each generation stage conditions on the complete set of structured JSON outputs produced by all preceding stages.
Dependency-aware multi-stage prompt pipeline for structured RPG content generation. Each generation stage conditions on the complete set of structured JSON outputs produced by all preceding stages.
cs.AIarxiv:2604.25665v1Lead article

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen

his paper introduces **LLM-ReSum**, a self-reflective summarization framework that uses LLM-based evaluation within a closed feedback loop to improve summary quality without requiring model finetuning. The work first conducts a meta-evaluation showing that LLM evaluators align better with human judgment than traditional metrics, especially for linguistic quality. LLM-ReSum leverages these superior LLM evaluations to iteratively refine the generated summary.

Overview of our three-stage research framework: meta-evaluation of automatic metrics (RQ1), multi-agent LLM evaluation (RQ2), and iterative self-reflective summarization (RQ3).
Overview of our three-stage research framework: meta-evaluation of automatic metrics (RQ1), multi-agent LLM evaluation (RQ2), and iterative self-reflective summarization (RQ3).
cs.AIarxiv:2604.25737v1Lead article

SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Noam Tarshish, Nofar Selouk, Daniel Hodisan, Bar Ezra Gafniel, Yuval Elovici

AFEdit is a multi-agent framework designed to improve the reliability of LLM-based instructed code editing by decomposing the task into specialized roles: a Planner, an Editor, and a Verifier. The core method involves generating an explicit edit plan, applying minimal changes, and iteratively refining the code based on structured diagnostic feedback generated by a Failure Abstraction Layer (FAL) when tests fail. This approach aims to significantly boost the task success rate on benchmarks like EditBench, where existing models struggle.

Figure 1. Overview of the SAFEdit framework. The pipeline organizes the editing task into three specialized agents (Planner, Editor, Verifier), which are connected in an iterative refinement loop. The FAL transforms raw test output into structured feedback, and the error taxonomy classifies failure root causes for qualitative analysis.
Figure 1. Overview of the SAFEdit framework. The pipeline organizes the editing task into three specialized agents (Planner, Editor, Verifier), which are connected in an iterative refinement loop. The FAL transforms raw test output into structured feedback, and the error taxonomy…
cs.AIarxiv:2604.25724v1Lead article

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Srikanta Prasad S, Utkarsh Arora

his paper introduces a modular, platform-agnostic inference architecture designed for efficiently serving complex, multi-component compound AI systems in production. The architecture leverages serverless execution and dynamic autoscaling to manage heterogeneous model invocations. The core contribution is demonstrating significant performance gains, including over 50% tail latency reduction and 30-40% cost savings, compared to prior static deployments.

Figure 1. Cognitive orchestration in the Atlas Reasoning Engine. The Planner Agent decomposes user queries; the Tool Selector dispatches to parallel LLM tools (RAG Retriever, Code Interpreter, SQL Executor). Results are aggregated by the Reasoning Agent and synthesized into a final response. Each tool invocation is backed by the scalable inference architecture.
Figure 1. Cognitive orchestration in the Atlas Reasoning Engine. The Planner Agent decomposes user queries; the Tool Selector dispatches to parallel LLM tools (RAG Retriever, Code Interpreter, SQL Executor). Results are aggregated by the Reasoning Agent and synthesized into a fin…
cs.AIarxiv:2604.25562v1Lead article

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu

napGuard addresses prompt injection in screenshot-based web agents by proposing a lightweight detection method that avoids computationally expensive Vision-Language Models (VLMs). The core method leverages the observation that injected webpages exhibit distinct visual characteristics compared to legitimate ones. This allows for efficient, low-overhead detection, overcoming the bottleneck of global semantic understanding required by existing multimodal defenses.

Figure 1. A prompt injection attack on a screenshot-based web agent. The attacker embeds a malicious instruction ( Click the link below ) directly into the rendered webpage. The web agent, operating on the screenshot, executes the injected action rather than the intended user task ( Buy Now ).
Figure 1. A prompt injection attack on a screenshot-based web agent. The attacker embeds a malicious instruction ( Click the link below ) directly into the rendered webpage. The web agent, operating on the screenshot, executes the injected action rather than the intended user tas…
cs.AIarxiv:2604.25727v1Lead article

Toward Scalable Terminal Task Synthesis via Skill Graphs

Zhiyuan Fan, Tinghao Yu, Yuanjun Cai, Jiangtao Guan, Yun Yang

his paper introduces **SkillSynth**, a novel framework for scalable terminal task synthesis that addresses the lack of trajectory diversity in existing methods. SkillSynth constructs a **scenario-mediated skill graph** to model command-line workflows, sampling paths from this graph to generate diverse, executable task instances via a multi-agent harness. This approach significantly enhances the diversity of training trajectories available for terminal agents.

Diversity of synthesized trajectories across datasets, measured by the number of unique scenarios, skills, and (scenario, skill) pairs after semantic canonicalization. Each value is averaged over three independent samples of 1,000 trajectories per dataset.
Diversity of synthesized trajectories across datasets, measured by the number of unique scenarios, skills, and (scenario, skill) pairs after semantic canonicalization. Each value is averaged over three independent samples of 1,000 trajectories per dataset.
cs.AIarxiv:2604.25591v1Lead article

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

his paper presents the first systematic empirical study of uncertainty estimation methods for Audio-aware Large Language Models (ALLMs). The authors benchmark five representative techniques across diverse audio understanding and reasoning tasks to address the issue of overconfident or hallucinated outputs common in ALLMs. Their key finding is that semantic-level and verification-based uncertainty methods consistently outperform token-level approaches in this cross-modal context.

Cost–accuracy Pareto frontier of Reasoning vs. Adaptive inference across four benchmarks. Each point represents a model under a fixed inference mode: hollow squares (Reasoning, 100% token cost) and filled circles (Adaptive, reduced cost). Dashed arrows indicate the shift from full reasoning to adaptive inference for each model. The gray dashed line represents the Pareto frontier, which consists of operating points that are not dominated in terms of both token cost and accuracy.
Cost–accuracy Pareto frontier of Reasoning vs. Adaptive inference across four benchmarks. Each point represents a model under a fixed inference mode: hollow squares (Reasoning, 100% token cost) and filled circles (Adaptive, reduced cost). Dashed arrows indicate the shift from ful…
cs.AIarxiv:2604.25872v1Lead article

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin

his paper analyzes imperfect proxy rewards in policy gradient methods, arguing that not all reward errors are equally detrimental. By theoretically examining how errors affect policy updates, the authors categorize reward deviations as harmful, benign, or even beneficial, showing some errors can prevent policy stagnation near mediocre true rewards. This leads to new reward model evaluation metrics for applications like RLHF that account for these nuanced effects.

cs.CLarxiv:2604.25850v1Lead article

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou

his paper introduces Agentic Harness Engineering (AHE), a framework to automate the evolution of coding-agent harnesses, which significantly impact performance. AHE achieves this by instrumenting the engineering loop with three observability pillars: explicit, file-level observability for harness components, distilled evidence from long trajectories, and self-declared rationale for every edit. This approach makes the harness evolution process explicit, traceable, and consumable for the evolving agent.

AHE evolves a bash-only seed past every human-designed and self-evolving baseline on Terminal-Bench 2. All three role agents share one base model, isolating the gain to harness edits rather than analyzer or editor capability.
AHE evolves a bash-only seed past every human-designed and self-evolving baseline on Terminal-Bench 2. All three role agents share one base model, isolating the gain to harness edits rather than analyzer or editor capability.
cs.AIarxiv:2604.26805v1Lead article

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang

ian Que is an agentic framework designed to automate complex online system operations by addressing the orchestration bottleneck. Its core method involves unifying O&M tasks into three canonical patterns and employing a Flexible Skill Arrangement mechanism to dynamically select and sequence the necessary data and operational knowledge for each event. This framework significantly reduces human effort in tasks like release monitoring and root cause analysis by intelligently matching context to relevant resources.

Overview of the Bian Que architecture. Operational events from the OPS platform (top) are dispatched to a matching Agent, which invokes one or more matched Skills to assemble the relevant data (system signals: logs, metrics, change events) and knowledge (domain knowledge distilled from case memory, seeded by operational handbooks) for the LLM to reason over; the resulting diagnosis is returned to the OPS platform. Practitioner feedback flows back along two parallel pathways (yellow: Skill refinement; purple: memory-to-knowledge distillation).
Overview of the Bian Que architecture. Operational events from the OPS platform (top) are dispatched to a matching Agent, which invokes one or more matched Skills to assemble the relevant data (system signals: logs, metrics, change events) and knowledge (domain knowledge distille…
cs.AIarxiv:2604.26904v1Lead article

ClawGym: A Scalable Framework for Building Effective Claw Agents

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang

lawGym is a scalable framework designed to streamline the development lifecycle for agents operating in multi-step, file-based environments. Its core contribution is the introduction of **ClawGym-SynData**, a large, synthesized dataset of tasks with mock workspaces and hybrid verification, which is used to train capable **ClawGym-Agents**. The framework also supports scalable training, including a lightweight pipeline for reinforcement learning evaluation.

Overview of the ClawGym-SynData pipeline, which generates tasks from persona-driven and skill-grounded sources, prepares task resources, designs hybrid verification, filters samples through quality assessment, and constructs training and benchmark data.
Overview of the ClawGym-SynData pipeline, which generates tasks from persona-driven and skill-grounded sources, prepares task resources, designs hybrid verification, filters samples through quality assessment, and constructs training and benchmark data.
cs.AIarxiv:2604.26516v1Lead article

Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning

Seungyub Han, Hyungjin Kim, Jungwoo Lee

he core method, SAS, enables test-time adaptation for offline safe RL by using a transformer-based agent to generate and select imagined trajectories that satisfy a Lyapunov safety condition. These safe segments are then recycled as in-context prompts to guide the agent's behavior toward safety without requiring parameter updates. This approach effectively translates Lyapunov constraints into control-invariant prompts, significantly reducing failure rates while preserving performance.

SAS overview . From a fixed initial state, the transformer imagines multiple rollouts, flags risky state–action pairs using the Lyapunov condition with ( ✖ ), and extracts a safe segment as a prompt to guide the real test-time trajectory (hazards: black ○ \( \bigcirc \) , blue ◇ \( \Diamond \) ; goal: green ⚫ ).
SAS overview . From a fixed initial state, the transformer imagines multiple rollouts, flags risky state–action pairs using the Lyapunov condition with ( ✖ ), and extracts a safe segment as a prompt to guide the real test-time trajectory (hazards: black ○ \( \bigcirc \) , blue ◇ …
cs.AIarxiv:2604.26561v1Lead article

Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation

Ariel Sela

his paper introduces the **AI Council**, a three-phase deliberation framework designed to combat artificial consensus in LLM-based multi-agent policy simulation. The core contribution is demonstrating that **architectural heterogeneity**—assigning different smaller LLMs to agents representing distinct value perspectives—significantly reduces the tendency for agents to converge on a single policy choice. This suggests model diversity is crucial for preserving genuine disagreement when simulating subjective policy debates.

cs.AIarxiv:2604.26615v1Lead article

TDD Governance for Multi-Agent Code Generation via Prompt Engineering

Tarlan Hasanli, Shahbaz Siddeeq, Bishwash Khanal, Pyry Kotilainen, Tommi Mikkonen

his paper introduces an AI-native framework that operationalizes classical Test-Driven Development (TDD) principles as structured governance mechanisms for multi-agent code generation using LLMs. It formalizes TDD into a machine-readable manifesto enforced through prompt engineering and a layered architecture, ensuring strict phase ordering, bounded repair loops, and validation gates. The core contribution is establishing robust, deterministic process constraints to overcome the instability and non-determinism inherent in unconstrained LLM code generation workflows.

Figure 1. Illustrative manifesto entries and their governance structure.
Figure 1. Illustrative manifesto entries and their governance structure.
cs.AIarxiv:2604.26694v1Lead article

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun

-WAM is a Unified 4D World Model that integrates real-time robotic action execution with high-fidelity 4D world synthesis (video and 3D reconstruction). It leverages pretrained video diffusion models by predicting multi-view RGB-D videos, efficiently incorporating spatial information via a lightweight structural adaptation of the diffusion transformer. The model further employs Asynchronous Noise Sampling (ANS) to simultaneously optimize generation quality and action decoding efficiency.

Overview of X-WAM. Top: X-WAM is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors, featuring a lightweight depth adaptation module for spatial reconstruction and Asynchronous Noise Sampling (ANS) for efficient action decoding. Bottom: X-WAM surpasses existing methods in policy success rate on RoboCasa and RoboTwin 2.0, produces high-fidelity 4D reconstruction and generation, and enables real-time execution deployment on physical robots.
Overview of X-WAM. Top: X-WAM is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors, featuring a lightweight depth adaptation module for spatial reconstruction and Asynchronous Noise Sampling (ANS) for efficien…
cs.LGarxiv:2604.26880v1Lead article

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud, Omar Faruque, Tera L Reynolds

he HealthNLP_Retrievers team developed a cascaded Large Language Model (LLM) pipeline using Gemini 2.5 Pro for grounded clinical Question Answering over Electronic Health Records (EHRs). The core method involves four stages: reformulating verbose patient queries, heuristically scoring and retrieving relevant evidence from clinical notes, and finally, generating strictly evidence-grounded answers. This approach aims to accurately interpret patient questions and synthesize understandable, professional-caliber responses directly supported by EHR data.

Workflow of the HealthNLP_Retrievers multi-stage cascaded pipeline.
Workflow of the HealthNLP_Retrievers multi-stage cascaded pipeline.
cs.LGarxiv:2604.26866v1Lead article

MoRFI: Monotonic Sparse Autoencoder Feature Identification

Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas

he paper introduces **MoRFI** (Monotonic Sparse Autoencoder Feature Identification) to analyze how fine-tuning introduces hallucinations in LLMs. The core method involves fine-tuning various LLMs on new knowledge datasets while controlling training parameters, and then using pre-trained Sparse Autoencoders (SAEs) to **identify latent feature directions that causally drive the increase in hallucinations.** This provides a mechanism for understanding and potentially mitigating the introduction of factual errors during post-training.

cs.LGarxiv:2604.26573v1Lead article

PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

Zhiquan Tan, Yinrong Hong

AINT introduces **Partial-solution Adaptive Interpolated Training** for self-distilled LLM reasoners. It adaptively masks the verified solution based on the overlap with the student's current rollout, providing contextually relevant supervision. This method interpolates between the student's prediction and the masked privileged target in the energy space, offering a denser, more informative training signal than standard on-policy distillation.

PAINT training pipeline. PAINT samples an on-policy rollout, uses rollout-reference overlap \( \alpha \) to form a suffix-masked solution y ~ ⋆ \( \tilde{y}^{\star} \) , re-scores the same prefixes with a fixed privileged view, and applies small energy interpolation only on entropy-mismatch positions.
PAINT training pipeline. PAINT samples an on-policy rollout, uses rollout-reference overlap \( \alpha \) to form a suffix-masked solution y ~ ⋆ \( \tilde{y}^{\star} \) , re-scores the same prefixes with a fixed privileged view, and applies small energy interpolation only on entro…
cs.CLarxiv:2604.26622v1Lead article

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu

CR-Memory addresses the token-budget limitations of long-horizon agent memory by leveraging the visual modality as a high-density experience representation. The core method involves rendering historical trajectories into annotated images and employing a "locate-and-transcribe" paradigm to retrieve relevant visual context using visual anchors. This allows agents to retain arbitrarily long histories with minimal prompt overhead during retrieval, significantly improving experience reuse.

Overview of the OCR-Memory. The system enables long-horizon agent memory by storing interaction histories as compressed multi-resolution images (left). To retrieve information, we employ a Locate-and-Transcribe paradigm: the model scans the visual history annotated with Set-of-Mark (SoM) visual anchors (center) to predict the index of relevant segments. Finally, the verbatim text corresponding to the selected index is deterministically fetched (right), avoiding generation-based hallucinations and minimizing token usage.
Overview of the OCR-Memory. The system enables long-horizon agent memory by storing interaction histories as compressed multi-resolution images (left). To retrieve information, we employ a Locate-and-Transcribe paradigm: the model scans the visual history annotated with Set-of-Ma…
cs.CLarxiv:2604.26630v1Lead article

SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling

Eliya Naomi Aharon, Meytal Grimland, Avi Segal, Loona Ben Dayan, Inbar Shenfeld

AGE is a novel framework that enhances LLMs for online counseling by integrating structured clinical knowledge. It constructs a heterogeneous graph combining conversational dynamics with psychological theory to inform interventions. This allows SAGE to use a Next Strategy Classifier and Graph-Aware Attention to condition the LLM, ensuring generated responses maintain necessary clinical depth and strategic awareness.

Figure 1. Fictitious session snippet with psychological categories and intervention strategies presented.
Figure 1. Fictitious session snippet with psychological categories and intervention strategies presented.
§ III

Daily Issues This Month

2026-04-01 to 2026-04-30 30