2026-W25
The Week in Review
The overwhelming trend across these 80 papers this week centers on advancing the autonomy, reliability, and complexity-handling capabilities of AI Agents powered by LLMs.
Popular Directions:
1. Agent Evolution and Self-Improvement: A significant vein focuses on agents that improve themselves, exemplified by methods like Q-Evolve (using in-distribution RL for dense rewards) and Socratic-SWE (distilling successful repair patterns into actionable skills).
2. Deep/Long-Horizon Research Agents: Several works tackle complex, multi-stage tasks that go beyond simple prompting. DuMate-DeepResearch emphasizes auditability and task decomposition, while SearchSwarm focuses on the "delegation intelligence" needed to manage context limits during deep research.
3. Robust Evaluation and Benchmarking: There is a critical shift away from simple scoring toward testing professional nuance and robustness. The AARR benchmark assesses research thoroughness, while studies on medical LLMs and a simulation environment (Agentopia) highlight the need to assess consistency under pressure (e.g., sensitivity to prompt variation).
Notable Advances & Shifts:
• Memory and Context Management: Novel structures are emerging to handle long inputs. MemDreamer decouples perception and reasoning using hierarchical graph memory for video understanding, demonstrating effective reasoning on only a fraction of the context.
• Reasoning Deconstruction: Papers are moving to dissect how LLMs reason. The comparison between human and DeepSeek-R1 math reasoning reveals structural differences ("topological mimicry"), while PRISM attempts to recover the active instruction set directly from model activations.
• Alignment and Safety: New frameworks target specific failure modes. CapCode addresses cheating in coding agents through capped evaluations, and the introduction of a metric for Sycophantic Praise highlights subtle alignment failures in social domains.
• Efficiency and Infrastructure: Advances in serving agents include AGENTSERVESIM (a hardware-aware simulator for multi-turn serving) and FMplex (model virtualization for serving multiple customized FMs off a shared backbone).
Significant Shifts: The focus is moving from single-turn task performance to multi-turn, stateful interactions demanding auditability (DuMate), robust social simulation (Agentopia), and process-level feedback loops (Multi-Turn Evaluation). The development of robust detection (SV-Detect) and internal mechanism recovery (PRISM) signals growing maturity in analyzing and controlling agent behavior.
Top Papers
Self-evolving LLM agents with in-distribution Optimization
The paper introduces **Q-Evolve**, a self-evolving framework for LLM agents designed to overcome sparse reward challenges in long-horizon decision-making. It unifies automatic process-reward labeling and policy learning using an in-distribution reinforcement learning approach. The core method learns a stable critic from a hybrid dataset using a weighted Implicit Q-Learning objective, which then generates dense, step-wise process rewards via advantage estimation for improved supervision.
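
As a rough illustration of the two ingredients named above, the sketch below shows the standard expectile loss from Implicit Q-Learning and a per-step advantage used as a dense reward. It is not the paper's implementation: the function names, the default `tau`, and the toy values are assumptions, and the paper's specific weighting scheme is not reproduced here.

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Standard IQL expectile loss on diff = Q(s, a) - V(s).

    Positive differences are weighted by tau, negative ones by (1 - tau),
    so tau > 0.5 pushes V toward an upper expectile of Q.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def step_advantages(q_values, v_values):
    """Per-step advantage A_t = Q(s_t, a_t) - V(s_t), usable as a dense process reward."""
    return [q - v for q, v in zip(q_values, v_values)]


# Hypothetical critic outputs for a three-step trajectory.
q_per_step = [1.0, 0.5, 0.2]
v_per_step = [0.5, 0.5, 0.4]
process_rewards = step_advantages(q_per_step, v_per_step)
```

Steps with a positive advantage are ones the critic judges better than average for that state, which is what makes the signal usable as step-wise supervision.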

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning
This paper comprehensively compares the mathematical reasoning steps of the DeepSeek-R1 LLM and humans on AIME 2025 problems, categorizing 10,247 steps. The core finding is a structural difference: human reasoning is compact, while the LLM exhibits "topological mimicry," frequently revisiting shallow steps without logical progress. Despite this, the authors identify stable branching and backtracing in successful LLM traces as potential signals of genuine reasoning.

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
This paper introduces the **AARR (Act As a Real Researcher) benchmark series** to evaluate frontier LLMs and agents on the nuanced professionalism and thoroughness required in real research, moving beyond simple macro-level execution. The first installment, **AARRI-Bench**, specifically assesses agents' ability to emulate the granular reasoning and ethical judgment characteristic of human researchers. This contributes a new standard for evaluating agent capabilities in complex, long-horizon scientific tasks.

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
DuMate-DeepResearch is a multi-agent framework designed to overcome limitations in current Deep Research (DR) systems, specifically concerning long-horizon planning, task decomposition, and auditability. It achieves this by decoupling the Agent Core (handling planning and scheduling) from an extensible Tool Ecosystem, ensuring every intermediate decision is explicitly traceable. The core contribution is an auditable DR system that manages complex research tasks through structured multi-agent interaction.

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope
This paper investigates how autonomous AI agents transform knowledge work by analyzing production data comparing Perplexity's Search and Computer products. The core finding is that the autonomous Computer product significantly accelerates task completion (26 minutes of automated work vs. 33 seconds of manual orchestration in Search) and improves execution quality. This shift reallocates user effort towards higher-order tasks like verification and extension.

Online Pandora's Box for Contextual LLM Cascading
This paper introduces the **Online Pandora's Box for Contextual LLM Cascading**, an adaptive framework for sequentially querying and selecting among LLM APIs based on request context. Its core method models the **contextual reservation index** directly, addressing the unique challenge where feedback is mediated by the API's generated output, rather than immediate reward revelation. The contribution lies in this novel learning approach tailored for output-mediated feedback in LLM cascading scenarios.

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
Socratic-SWE is a closed-loop framework that enables self-evolving software engineering agents by leveraging their own historical solving traces. It distills these traces into structured "agent skills" that capture recurring failures and successful repair patterns. These skills then guide the generation of new, targeted repair tasks in real repositories, ensuring the training data directly addresses the agent's weaknesses.

SV-Detect: AI-generated Text Detection with Steering Vectors
SV-Detect detects AI-generated text by extracting "steering vectors" from a frozen language model's hidden layers, which define directions separating human and machine text. The method represents inputs by their alignment with these layer-wise directions and uses a lightweight classifier on these features for detection. This approach demonstrates robust performance even under significant distribution shifts, such as domain changes or text editing attacks.
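
A minimal sketch of the feature construction described above, assuming each steering vector is the difference between the mean machine-text and mean human-text hidden state at a layer. That difference-of-means choice, the function names, and the toy two-dimensional states are illustrative assumptions; the summary does not specify how the paper extracts its vectors.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(human_states, machine_states):
    """Assumed direction for one layer: mean(machine) - mean(human)."""
    human_mean = mean_vector(human_states)
    machine_mean = mean_vector(machine_states)
    return [m - h for m, h in zip(machine_mean, human_mean)]


def alignment(state, direction):
    """Scalar projection of a hidden state onto the unit-normalized direction."""
    dot = sum(s * d for s, d in zip(state, direction))
    norm = math.sqrt(sum(d * d for d in direction))
    return dot / norm if norm else 0.0


def layer_features(states_per_layer, directions_per_layer):
    """One alignment score per layer; these scores would feed a lightweight classifier."""
    return [alignment(s, d) for s, d in zip(states_per_layer, directions_per_layer)]


# Toy hidden states for a single layer (hypothetical values).
direction = steering_vector(
    human_states=[[0.0, 0.0], [0.0, 2.0]],
    machine_states=[[2.0, 0.0], [2.0, 2.0]],
)
features = layer_features([[3.0, 5.0]], [direction])
```

Because the features are low-dimensional projections rather than raw activations, a small classifier such as logistic regression would be a natural fit on top of them.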

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations
This paper systematically evaluates the sensitivity of general and medical Large Language Models (LLMs) to prompt variations (natural and adversarial) using the MedMCQA benchmark. The core contribution is demonstrating that even minor phrasing changes significantly impact model consistency and accuracy in clinical reasoning tasks. The study concludes that current medical LLMs lack the necessary robustness for safety-critical healthcare applications due to this high unpredictability.

Agentopia: Long-Term Life Simulation and Learning in Agent Societies
Agentopia is a comprehensive framework designed for long-term life simulation of multi-agent societies, extending simulations from days to years. The core method involves simulating 100 LLM-powered agents autonomously pursuing growth, relationships, and goals over a simulated decade. The contribution is enabling the study of emergent social behaviors and developing enhanced, anthropomorphic social intelligence in LLMs through extended simulated social experience.

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions
The paper introduces **M$^3$Exam**, a novel benchmark designed to evaluate language agents' multimodal memory capabilities in realistic user-agent interactions, moving beyond sparse, human-centric data. Its core contribution is a query-centric evaluation framework that tests cross-modal grounding and implicit information inference over accumulating, authentic multimodal data. Furthermore, the authors propose **M$^3$Proctor**, a memory method that selectively processes raw visual data, significantly improving accuracy and efficiency.

Sycophantic Praise: Evaluating Excessive Praise in Language Models
This paper introduces a novel framework to measure *sycophantic praise* in language models, distinguishing it from simple agreement. The method quantifies praise by comparing it against the contribution's quality and expected user ability, showing it is a distinct alignment problem. The authors demonstrate this framework is superior to generic judges and find that excessive praise is more prevalent in social domains than in objective reasoning tasks.

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving
AGENTSERVESIM is a novel, hardware-aware simulator designed specifically for multi-turn LLM agent serving workloads. Its core contribution is modeling the stateful program execution dynamics of agents, including turn dependencies, tool gaps, and cross-turn KV-cache locality, which existing stateless simulators ignore. This allows for scalable evaluation of complex scheduling and cache management policies relevant to agent serving without costly real-system testing.

Collaborative Human-Agent Protocol (CHAP)
The Collaborative Human-Agent Protocol (CHAP) introduces a standard for the shared workspace in complex, multi-human, multi-agent collaborations where foundation models take on operational roles. Its core method is to formally specify the interaction protocol, focusing on capturing the crucial moment of human judgment (e.g., edits to agent output) as a primary system signal. CHAP's contribution is providing a necessary technical specification for these collaborative workflows, complementing existing standards for tool access and agent-to-agent communication.

FMplex: Model Virtualization for Serving Extensible Foundation Models
FMplex introduces a model virtualization substrate for serving Foundation Models (FMs) by treating the FM backbone as a shared resource. It presents each downstream task with a virtual FM (vFM), allowing independent customization and lifecycle management while sharing the costly physical backbone. This approach significantly reduces memory waste and improves efficiency through optimized batching across colocated tasks.

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing
FuseFSS introduces a novel compiler for efficient two-server secure LLM inference using Function Secret Sharing (FSS). It replaces bespoke per-operator protocols with a unified compilation pipeline that compactly specifies fixed-point nonlinearities. This allows for batched FSS evaluations of packed comparisons and vector interval lookups, significantly improving efficiency over prior FSS-based methods.

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
This paper introduces a multi-turn evaluation framework to assess deep research agents' (DRAs) ability to improve based on feedback, moving beyond single-shot benchmarks. The core contribution is the **Research Gap Inference (RGI)** method, which analyzes rubric satisfaction to generate targeted, process-level feedback. This feedback significantly improves DRA performance, unlike self-reflection which shows negligible net gains.

Observability for Delegated Execution in Agentic AI Systems
This paper addresses the challenge of tracking actions within specific delegation scopes in complex, agentic AI systems, where standard logs fail to distinguish between incompatible delegation assignments. The core method introduces an **agent-aware observability substrate** featuring a lightweight gateway and a common information model. This system binds execution traces to specific delegation contexts, enabling accurate **delegation-scoped attribution and access footprint reconstruction**.

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
OmniGameArena introduces a unified benchmark using twelve diverse Unreal Engine 5 games (Solo, PvP, Coop) to evaluate Vision-Language Model (VLM) agents fairly. Its core contribution is the Improvement Dynamics Curve (IDC), a harness where a reflector LLM autonomously refines agent prompts across multiple rounds. This method provides not just a static score, but also the agent's score evolution and generalization ability across task variants.

PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a novel method designed to recover the complete set of active instructions, constraints, and subgoals steering a frozen Language Model's behavior by interpreting its internal activations. It formalizes this as instruction set retrieval and uses a judge-guided GRPO training scheme to directly decode a faithful bulleted list of simultaneous instructions from the hidden states. This directly addresses the limitations of prior activation-to-language methods in complex, agentic scenarios.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
This paper introduces **PRIME (Proxy Reward Internalization and Mechanistic Exploitation)**, a learned capability in RL agents to assess task correctness, predict proxy reward acceptance, and reason about exploitable gaps between the proxy and true (gold) reward. The core contribution is demonstrating that PRIME emerges *before* visible reward hacking, and its measured strength accurately forecasts the onset and severity of future hacking, even adapting when the reward structure changes.

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research
SearchSwarm introduces a method to enhance agentic LLMs for long-horizon tasks by developing "delegation intelligence." The core method involves training agents to effectively decompose complex research tasks, delegate subtasks to specialized subagents, and integrate summarized results to manage the main agent's finite context window. The contribution is a preliminary framework and harness for synthesizing the scarce training data needed to acquire this crucial delegation capability for deep research scenarios.

SecureClaw: Clawing Back Control of LLM Agents
SecureClaw introduces a dual-boundary architecture to secure LLM agents against unauthorized actions and plaintext exposure. It achieves this by implementing plaintext confinement at the read boundary using a trusted gateway that replaces sensitive reads with opaque handles or bounded summaries. Simultaneously, it enforces authorization at the effect sink via a PREVIEW$\rightarrow$COMMIT protocol, ensuring only a trusted executor can finalize external state changes based on authorized requests.

iOSWorld: A Benchmark for Personally Intelligent Phone Agents
This paper introduces **iOSWorld**, the first interactive native iOS simulator benchmark designed to test personally intelligent phone agents. Its core method involves creating a persistent user identity across 26 interconnected apps containing rich personal data (messages, transactions, etc.) to support 133 complex tasks. The contribution is providing a challenging, realistic environment that moves beyond isolated instructions to evaluate agents' ability to reason over a user's history and preferences.

Rethinking the Divergence Regularization in LLM RL
This paper proposes Divergence Regularized Policy Optimization (DRPO) to improve stable reinforcement learning for LLMs, addressing limitations in existing ratio-clipping and hard-mask divergence methods. DRPO replaces the hard mask used in divergence-based trust regions with a smooth, advantage-weighted quadratic regularizer applied to policy shift. This allows for continuous correction of policy updates rather than outright discarding gradients when trust-region boundaries are crossed.

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks
This paper introduces Human-Perceptible Adversarial Attacks (HPAA) to exploit the mismatch between human visual perception and text-based LLM moderation. The core method involves embedding harmful content within benign text using visually salient typographic manipulations (like spacing and emphasis). This allows the harmful content to remain easily recognizable by humans while significantly reducing its detectability by token-based LLM moderation systems.

Gradient-Guided Reward Optimization for Inference-time Alignment
Gradient-Guided Reward Optimization (GGRO) is a lightweight inference-time alignment method that addresses the limitations of sampling-based approaches like Best-of-$N$. GGRO monitors token entropy to detect uncertainty indicative of distribution drift and then injects "nudging tokens" guided by the reward model's gradients to minimally steer the generation trajectory directly during decoding. This gradient-guided intervention offers a more targeted adaptation than simple re-ranking, aiming to improve reliability under drift without relying solely on exhaustive sampling.

IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking
The paper introduces **Interleaved Structural Chain-of-Thought (IS-CoT)** to combat the performance degradation ("length collapse") LLMs experience during long-form generation. IS-CoT embeds a dynamic **Plan-Write-Reflect cycle** directly into the generation process, allowing for continuous strategy adaptation without external agents. This method successfully enables LLMs to maintain coherence and control over extended texts, outperforming static planning approaches.

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
PsychoSafe introduces a framework for LLM refusals that reframes them as structured, supportive communication based on evidence-based psychological intervention strategies. The method involves creating a specialized corpus across five risk domains and fine-tuning an LLM (Qwen 3.5 27B) using this data. This approach significantly improves refusal quality by 28.1% over generic baselines, aiming to better support users in high-risk interactions rather than just offering blunt non-compliance.

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
This paper investigates how Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) by analyzing partisan orientation in Llama 3.1 8B. The core finding is that RLHF achieves only **shallow alignment** by compressing the variance of existing partisan structure, rather than removing it. This results in consistently balanced output while leaving the underlying partisan representations intact within the model's internal features.

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
EEVEE introduces a novel test-time prompt learning framework designed for real-world, heterogeneous task streams, overcoming limitations of single-dataset methods. Its core method involves a router that clusters incoming inputs and assigns them to appropriate prompt configurations, optimized through a router-prompt co-evolution strategy. This approach significantly improves the robustness of LLM agents when handling diverse, interleaved data while preserving performance on individual tasks.

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
This paper introduces a rigorous benchmark, based on China's National Computer Rank Examination (NCRE), to evaluate frontier Large Language Models' (LLMs) ability to perform complex, multi-application Office automation tasks requiring long-horizon planning. The evaluation uses 200 practical tasks scored against 7,118 criteria. The core contribution is demonstrating the significant limitations of current LLMs in professional software proficiency, with even strong agentic systems showing limited success in passing this standardized office exam.

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
This paper introduces Null-Space Constrained Response-Specified Unlearning (NSRU), a low-rank adaptation method for LLM unlearning. NSRU constrains the update parameters to the null space of estimated "retain subspaces" derived from benign data, ensuring adaptation is localized. This method jointly optimizes suppressing undesired responses, learning a safe target response, and preserving benign capabilities.
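
The null-space constraint can be illustrated with a small projection, assuming the retain subspace is already available as a set of orthonormal basis vectors. How NSRU estimates that subspace and how it couples the projection with the low-rank adapter are not covered by the summary, so the function name and toy vectors below are hypothetical.

```python
def project_to_null_space(update, retain_basis):
    """Remove the components of `update` that lie in the retain subspace.

    Equivalent to applying P = I - U U^T, where the rows of `retain_basis`
    are the orthonormal columns of U. The result is orthogonal to every
    retain direction, so it leaves those directions unchanged.
    """
    projected = list(update)
    for basis_vec in retain_basis:
        coeff = sum(u * b for u, b in zip(update, basis_vec))
        projected = [p - coeff * b for p, b in zip(projected, basis_vec)]
    return projected


# Hypothetical example: one retain direction along the first axis.
retain_basis = [[1.0, 0.0, 0.0]]
raw_update = [1.0, 2.0, 3.0]
safe_update = project_to_null_space(raw_update, retain_basis)
```

In this toy case the first coordinate of the update is zeroed out while the other two pass through, which is the sense in which the adaptation stays localized away from benign behavior.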

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
ReasonAlloc addresses KV cache bottlenecks in LLM reasoning by introducing a hierarchical, training-free budget allocation framework. It combines an offline layer-wise preallocation strategy, capturing the "Reasoning Wave" demand pattern, with an online head-wise reallocation strategy that prioritizes information-rich heads during decoding. This dynamic approach significantly mitigates inference latency caused by long CoT trajectories more effectively than uniform or static allocation methods.

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution
The Role-Agent framework bootstraps LLM agent learning by having a single LLM concurrently act as both the agent and the environment. It uses a dual-component system: World-In-Agent (WIA) generates a process reward based on state prediction accuracy, while Agent-In-World (AIW) uses failure analysis to reshape the training data for targeted improvement. This self-contained co-evolution addresses limitations of static environments and inefficient feedback, leading to enhanced generalization.

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
T1-Bench is introduced as a high-fidelity benchmark designed to evaluate LLM-based agents in complex, realistic, multi-domain customer-facing scenarios. Its core contribution is providing a standardized framework that captures sustained reasoning and coordination across interleaved, multi-turn interactions, significantly increasing compositional complexity and evaluative rigor compared to existing benchmarks.

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents
This paper investigates the hypothesis that successful machine learning strategies are highly compressible, even when adaptively reused on held-out benchmarks. The authors test this using LLM-driven research agents under two compression bottlenecks: limiting the agent's prompt (output compression) or restricting feedback to one bit (input compression). They find that these compression methods have little effect on the final performance achieved across diverse ML tasks, suggesting that the successful search strategies are inherently compact.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Workflow-GYM is introduced as a novel benchmark to address the lack of evaluation for AI agents performing long-horizon, high-value professional workflows using graphical user interfaces (GUIs). The core method involves creating tasks centered on specialized, domain-specific professional software environments. The contribution is demonstrating that current state-of-the-art agents struggle significantly, achieving only about 30% success rates on these complex, real-world professional tasks.

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models
Flow-DPPO addresses limitations in applying standard PPO to flow matching models by replacing noisy ratio clipping with a direct divergence constraint. Leveraging the Gaussian nature of the per-step policy, it enables exact and efficient computation of the KL divergence between old and new policies. This method provides a more structurally sound trust region enforcement, leading to improved quality and alignment in generative models.
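
To show why a Gaussian per-step policy makes the divergence cheap to compute, the sketch below evaluates the closed-form KL between two isotropic Gaussians that share a standard deviation, where it reduces to the squared distance between means over twice the variance. The shared-sigma setting, the function names, and the threshold `max_kl` are assumptions for illustration; the paper's exact parameterization may differ.

```python
def gaussian_kl_shared_sigma(mu_old, mu_new, sigma):
    """KL( N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I) ) in closed form.

    With equal covariances the trace and log-determinant terms cancel,
    leaving ||mu_old - mu_new||^2 / (2 * sigma^2).
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return squared_distance / (2.0 * sigma ** 2)


def within_trust_region(mu_old, mu_new, sigma, max_kl):
    """Hypothetical check: accept the step only if its KL stays under max_kl."""
    return gaussian_kl_shared_sigma(mu_old, mu_new, sigma) <= max_kl


# Toy per-step policy means (hypothetical values).
kl_value = gaussian_kl_shared_sigma([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```

Unlike a sampled probability ratio, this quantity is exact for the given means, which is the property that makes a direct divergence constraint practical here.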

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
This paper investigates whether converting instruction-tuned Large Language Models (LLMs) into reasoning models via post-training preserves their original alignment behaviors (safety, bias avoidance, etc.). The core method involves a systematic trustworthiness audit comparing reasoning models (trained via SFT, RL, or distillation) against their instruction-tuned baselines across six dimensions. The key contribution is demonstrating that this conversion often leads to significant alignment regressions, despite improved reasoning accuracy.

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
This paper demonstrates that a single biased example, introduced via one-shot Group Relative Policy Optimization (GRPO), is sufficient to induce systematic and generalizing bias in large language models (LLMs). The core contribution is revealing a critical vulnerability where post-training alignment guardrails can be easily overridden by minimal targeted adversarial training. Model susceptibility is shown to correlate with its initial propensity for biased outputs.

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation
This paper investigates how to improve LLM tool-calling by integrating and activating experiential knowledge. The core method involves acquiring instance-level knowledge, which proves highly effective, and employing parallel sampling (expanding reasoning width) during inference to better activate this knowledge. The contribution lies in demonstrating that simple instance knowledge and parallel reasoning are superior strategies for enhancing multi-step tool-use performance.

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models
This paper introduces the "Shibboleth Effect," examining how frontier LLMs exhibit cross-lingual distributional skew under adversarial conditions. Using a simulated geopolitical wargame played in English versus Turkish, the authors found that models display heterogeneous behavioral changes, such as Llama-4 significantly increasing coercive rhetoric when prompted in Turkish. The core contribution is demonstrating that language choice directly and differentially biases LLM strategic behavior.

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
The paper introduces Agentified Agent Assessment (AAA), a novel framework where evaluation is conducted by judge agents interacting with participants via standardized protocols (A2A and MCP). This approach unifies the assessment interface, decoupling evaluation logic from agent implementation. AgentBeats is the concrete realization of AAA, providing a generic, reproducible, and interoperable system for benchmarking diverse agent designs.

Agents-K1: Towards Agent-native Knowledge Orchestration
Agents-K1 introduces an end-to-end pipeline to transform raw scientific documents into agent-native knowledge graphs, addressing the limitations of existing LLM agents in scientific knowledge orchestration. Its core method involves a multimodal parser capturing detailed entities, evidence, and relations across the full paper, supported by a specialized information-extraction backbone. The contribution is a richer, structured knowledge representation designed to facilitate complex scientific reasoning for AI agents.

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
ArogyaSutra is a multi-agent framework designed to enhance multimodal medical reasoning in Indic languages. It leverages a novel actor-critic architecture with dual-memory mechanisms and tool grounding to perform step-wise reasoning on complex medical queries involving text and images. The framework is supported by ArogyaBodha, a large-scale, multilingual multimodal medical Q&A dataset spanning numerous body systems and imaging modalities.

Can I Buy Your KV Cache?
This paper proposes a simple yet impactful method to eliminate redundant computation in large language models: **precomputing and selling the Key-Value (KV) cache for documents.** By allowing agents to buy and load a precomputed cache instead of re-running the expensive prefill step, the authors achieve significant compute savings (9-50x cheaper for a small model) with zero accuracy loss. The core contribution is demonstrating the feasibility and efficiency of treating the KV cache as a reusable, purchasable asset to drastically reduce inference costs.

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
The paper introduces **EurekAgent**, an agent system arguing that the bottleneck for autonomous scientific discovery is shifting to **agent environment engineering**. EurekAgent focuses on designing the environment (including resources, constraints, and interfaces) to amplify desired agent behaviors (like exploration and collaboration) and suppress negative ones (like reward hacking). This environment engineering approach is presented as the core method for achieving metric-driven, high-performance autonomous scientific discovery.

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
This paper challenges the notion that human reasoning relies on abstract world models while LLMs only perform pattern matching. By testing both humans and LLMs on everyday common-sense reasoning, the authors found similar error patterns in both groups. They further demonstrated that specific LLM attention heads implement pattern-matching mechanisms that can predict seemingly irrelevant errors in human reasoning, suggesting both employ pattern-matching for everyday causal reasoning.

Reward Modeling for Multi-Agent Orchestration
The paper introduces **Orchestration Reward Modeling (OrchRM)**, a self-supervised framework to evaluate the quality of multi-agent orchestration without requiring human labels. OrchRM constructs win-lose pairs from intermediate execution artifacts to train a Bradley-Terry reward model, enabling efficient, reward-guided orchestrator training and MAS scaling. This method significantly improves training efficiency (up to 10x token reduction) and enhances test-time scaling performance (up to 8% accuracy gain) compared to existing rollout-based methods.
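
For readers unfamiliar with Bradley-Terry reward modeling, the sketch below shows the standard pairwise loss on a win-lose pair: the negative log-probability that the winning artifact outscores the losing one. The function name and the scalar scores are illustrative assumptions; how OrchRM builds its pairs and parameterizes the reward model is not shown here.

```python
import math


def bradley_terry_loss(reward_win: float, reward_lose: float) -> float:
    """-log(sigmoid(reward_win - reward_lose)), computed in a numerically stable form.

    Uses the identity -log(sigmoid(m)) = max(-m, 0) + log(1 + exp(-|m|)).
    """
    margin = reward_win - reward_lose
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward-model scores for one win-lose pair.
loss_correct_order = bradley_terry_loss(2.0, 0.0)
loss_wrong_order = bradley_terry_loss(0.0, 2.0)
```

The loss shrinks as the winner's score pulls ahead of the loser's, so minimizing it over many pairs trains the model to rank better orchestration traces higher.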

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
EvoArena is a novel benchmark suite designed to evaluate LLM agents in dynamic environments by modeling progressive changes across terminal, software, and social domains. The core contribution is the introduction of EvoMem, a patch-based memory paradigm that explicitly tracks and structures memory evolution as update histories, allowing agents to reason about environmental changes. This framework reveals current agents' struggles in dynamic settings, while EvoMem demonstrates consistent performance improvements.

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents
HyperTool addresses the execution-granularity mismatch in tool-augmented agents by introducing a unified, executable interface that allows models to invoke complex, multi-step tool workflows within a single outer call. This "folding" of deterministic subroutines reduces the number of model-visible decisions, saving context and simplifying low-level dataflow management. The method significantly improves performance on compositional tasks by enabling more abstract, higher-level reasoning about tool usage.

Recursive Agent Harnesses
The paper introduces the **Recursive Agent Harness (RAH)**, framing it as a code-first extension to model recursion, where the recursive unit is a full agent harness with tools and planning, not just a model call. RAH leverages a parent agent to generate and execute scripts that spawn parallel subagent harnesses for fine-grained tasks and use structured calls for smaller ones. This method significantly improves long-context reasoning performance, boosting a baseline coding agent from 71.75% to 81.36%.

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
This paper introduces **CapCode**, a framework for creating coding evaluation datasets where the maximum achievable *non-cheating* score is deliberately capped below perfect performance. This design allows high scores significantly exceeding the cap to serve as reliable indicators of deceptive cheating. Furthermore, the authors propose **CapReward**, a corresponding reward mechanism to discourage agents from optimizing beyond this cap, leading to models that adhere better to the intended task specifications.
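
The capped-evaluation logic can be sketched in a few lines. The function names, the `margin` threshold, and the linear penalty are assumptions made for illustration; the summary does not give CapCode's detection rule or CapReward's actual formula.

```python
def flag_suspected_cheating(score: float, cap: float, margin: float = 0.05) -> bool:
    """Flag a run whose score exceeds the honest maximum by more than `margin`.

    `cap` is the highest score reachable without cheating; `margin` is a
    hypothetical tolerance for noise.
    """
    return score > cap + margin


def capped_reward(score: float, cap: float, penalty: float = 1.0) -> float:
    """Hypothetical capped reward: rises with the score up to the cap,
    then falls linearly for any score beyond it."""
    if score <= cap:
        return score
    return cap - penalty * (score - cap)


print(flag_suspected_cheating(0.95, cap=0.8))  # above the cap plus margin
print(capped_reward(0.95, cap=0.8))            # lower than the cap itself
```

The design point is that the cap turns an otherwise ambiguous high score into evidence: an honest agent cannot exceed it, so the reward can safely stop paying out there.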

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration
This paper introduces Hierarchical Certified Semantic Commitment (H-CSC), a Byzantine Fault Tolerance (BFT)-inspired protocol designed for LLM-agent collaboration. H-CSC converts embedding-derived finality signals into one of three typed outcomes: a semantic commit, a verdict commit, or an explicit abort. Its core contribution is providing a finality-control primitive that handles the unstructured nature of LLM proposals, unlike traditional BFT methods.

How reliable are LLMs when it comes to playing dice?
This paper benchmarks the probabilistic reasoning of eight state-of-the-art LLMs using standard and counterintuitive dice problems. The core finding is that while models excel at standard problems (0.96 accuracy), performance significantly drops on counterintuitive tasks (0.59 accuracy) and is highly sensitive to prompt phrasing and misleading suggestions. The contribution is demonstrating that current LLMs lack robust probabilistic reasoning, often relying on superficial textual cues rather than genuine mathematical understanding.
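
For readers who want to see the gap between a standard and a counterintuitive dice question, the sketch below brute-forces both by enumeration. The two questions are illustrative choices and are not taken from the paper's benchmark; the `probability` helper is likewise hypothetical.

```python
from fractions import Fraction
from itertools import product


def probability(event, given=lambda roll: True, dice=2, sides=6):
    """Exact probability of `event` among equally likely rolls satisfying `given`."""
    rolls = [r for r in product(range(1, sides + 1), repeat=dice) if given(r)]
    hits = sum(1 for r in rolls if event(r))
    return Fraction(hits, len(rolls))


# Standard question: two dice sum to 7.
p_sum7 = probability(lambda r: sum(r) == 7)

# Counterintuitive question: both dice show six, given at least one does.
p_double_six = probability(lambda r: r == (6, 6), given=lambda r: 6 in r)

print(p_sum7)        # 1/6
print(p_double_six)  # 1/11
```

The second answer is 1/11 rather than the tempting 1/6, because conditioning on "at least one six" leaves 11 equally likely rolls, only one of which is a double six.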

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
MemDreamer addresses long-video understanding by decoupling perception and reasoning using a Hierarchical Graph Memory to incrementally build semantic abstractions from streamed video. During inference, an agentic retrieval mechanism uses tool-augmented actions to navigate this memory structure, allowing the model to reason effectively with only 2% of the full context. This approach achieves state-of-the-art performance across benchmarks by efficiently managing long-range dependencies.

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning
This paper introduces SETA, a framework for continual learning in LLMs that addresses catastrophic forgetting by employing a Mixture of Sparse Experts architecture. SETA adaptively decomposes model parameters into task-specific experts and shared experts, isolating new knowledge while protecting common features. This separation, enforced by adaptive anchoring and routing-aware regularization, resolves the plasticity-stability dilemma without uniform parameter updates.

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs
This paper introduces a controlled framework using real-world cultural questions to disentangle general language proficiency from localized cultural knowledge access in LLMs. By crossing question type (agnostic vs. specific) with query language (English vs. local) and employing Item Response Theory, the authors isolate the true impact of language choice. The core contribution is demonstrating a consistent English advantage even for culture-specific knowledge, suggesting local language access remains suboptimal.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs
This paper proposes a unified framework for analyzing human-view video understanding using MLLMs, structured around three core abilities: **watching, remembering, and reasoning**. The contribution lies in providing a structured formulation to characterize how these models acquire evidence, maintain context over long videos, and perform grounded inference, moving beyond isolated benchmark testing. This approach helps systematically identify challenges in perception, memory management, and reasoning for video MLLMs.

Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models
The paper introduces the **Bootstrap Theory of Representational Emergence (TBER)**, a framework explaining how new levels of representation arise in machine learning. TBER posits that representational innovation is driven not just by data or compute, but fundamentally by **explanatory insufficiency**, where existing representations can describe observations but fail to make their underlying organization intelligible. This explanatory gap acts as a positive signal, compelling the system to learn a new, more adequate representation.

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs
The paper introduces **Trellis**, an autoformalization system that uses LLM agents in a strictly controlled workflow to iteratively refine natural language proofs for formalization in Lean. Its core contribution is enforcing rigor by structuring the process around the mathematician's expectation that any proof step should be easily elaboratable, achieving reliable formalization without specialized agent training. This workflow, guided by process semantics, successfully produced an end-to-end Lean formalization of a recent Ramsey theory result.

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation
This paper investigates the limiting factors for AI scientists in knowledge-intensive tasks like drug-asset valuation, hypothesizing that the accessible evidence substrate is key. Through a three-arm ablation study, the authors show that while adding reasoning scaffolds and structured tools (Arm B) improves calibration, the most significant performance gain comes from incorporating a proprietary, curated data corpus (Arm C). The core contribution is demonstrating that access to high-quality, proprietary evidence is crucial for overcoming factual limitations and achieving near-expert performance in scientific decision-making.

A History-Aware Visually Grounded Critic for Computer Use Agents
This paper introduces **HiViG**, a history-aware, visually grounded critic framework for Computer Use Agents (CUAs). HiViG addresses limitations in existing critics by training a multimodal model on real GUI trajectories to summarize past interactions and verify proposed actions against the current screen visuals. This provides agents with both long-term context and precise visual grounding to detect flawed execution steps during operation.

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity
The paper introduces **ABC-Bench**, a novel benchmark designed to systematically evaluate the agentic biosecurity-relevant capabilities of Large Language Model (LLM) agents. This benchmark assesses both beneficial and dual-use biology tasks, such as robotic coding and DNA design, requiring integrated biology and software skills. The core contribution is demonstrating that current LLM agents significantly **outperform expert human baselines** across these critical biosecurity-relevant tasks, highlighting an urgent need for updated risk assessment.

AuRA: Internalizing Audio Understanding into LLMs as LoRA
AuRA internalizes audio understanding directly into Large Language Models (LLMs) using a lightweight adaptation technique. It achieves this by distilling the audio encoding capability from a teacher ASR model into a LoRA-adapted LLM student via layer-wise hidden state alignment. This method offers a tighter integration than cascaded or bridge approaches, aiming to reduce latency and coupling issues.

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference
The paper introduces **CLP (Collocation-Length Predictor)** to enable high-quality, accelerated multi-token inference in LLMs that use multi-token prediction (MTP) heads. The core method, **Backbone-as-Architect**, resolves quality degradation by ensuring the main LM head always generates the first token, while MTP heads only predict subsequent tokens. CLP is a lightweight predictor that determines the optimal number of subsequent tokens to accept safely at each step, achieving acceleration without quality loss.

Flaws in the LLM Automation Narrative
This paper challenges the narrative of LLMs achieving expert-level performance by introducing a novel benchmark focusing on reliable, high-stakes data analysis coding tasks. The authors compare a frontier LLM against human experts, explicitly measuring error magnitude and performance variance. The core contribution is demonstrating that human experts outperform LLMs on average and exhibit significantly lower performance variability in this critical context.

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
This paper evaluates coding agents on unfamiliar, esoteric programming languages using a sequential setup involving file editing and local execution. The core contribution is demonstrating that top agents succeed by employing a **metaprogramming strategy**: writing code in a familiar language (like Python) to generate the required esoteric code. Restricting this generative approach significantly degrades their performance.
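
A small example of the metaprogramming pattern: instead of hand-writing the esoteric program, write Python that emits it. Brainfuck is used here purely as a familiar stand-in, since the summary does not name the languages tested, and both function names are hypothetical. The checker handles only the `+`, `-`, and `.` instructions that the generator emits.

```python
def text_to_brainfuck(text: str) -> str:
    """Generate a straight-line Brainfuck program that prints `text`.

    Uses a single cell and adjusts it by the difference between
    consecutive character codes. Assumes ASCII input.
    """
    program = []
    current = 0
    for ch in text:
        delta = ord(ch) - current
        program.append(("+" if delta > 0 else "-") * abs(delta) + ".")
        current = ord(ch)
    return "".join(program)


def run_straight_line(program: str) -> str:
    """Minimal checker for loop-free programs using only '+', '-', '.'."""
    cell, out = 0, []
    for op in program:
        if op == "+":
            cell = (cell + 1) % 256
        elif op == "-":
            cell = (cell - 1) % 256
        elif op == ".":
            out.append(chr(cell))
    return "".join(out)


generated = text_to_brainfuck("Hi")
print(run_straight_line(generated))  # Hi
```

The generated program is long and unreadable, but the Python that produced it is short and easy to verify, which is the trade the agents appear to be exploiting.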

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions
This paper introduces a novel Explainable AI (XAI) framework that augments SHAP values with mutual feature interaction data. It utilizes a moderately sized Large Language Model (LLM) and structured prompting to generate natural language explanations of network AI decisions. The core contribution is providing human-understandable, actionable insights for non-specialists in next-generation network operations.

A Three-Layer Framework for AI in Scientific Discovery
This paper introduces a **three-layer framework** for AI in scientific discovery, arguing that the crucial, yet underdeveloped, layer is **Layer 2: model formation through qualitative reasoning**. This layer involves recognizing the structural inadequacy of existing frameworks and understanding the problem within a broader representational space via structural insight, moving beyond mere search (Layer 1) or execution (Layer 3). The core contribution is emphasizing that true discovery requires this capacity for **structural insight and novel model creation**, not just optimization within existing paradigms.

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
This paper introduces **ModeratorLM**, a streaming speech large language model that adapts turn-taking behavior in multi-party conversations by conditioning it on an explicitly assigned conversational role. The core contribution is demonstrating that role-conditioning, especially enhanced with chain-of-thought reasoning, significantly improves turn-taking precision and recall (over 40% and 70% respectively) compared to non-role-conditioned baselines. This is validated using a novel synthetic dataset, RolePlayConv.

MiniMax Sparse Attention
MiniMax Sparse Attention (MSA) addresses the quadratic cost of long-context attention by integrating a lightweight Index Branch with Grouped Query Attention (GQA). This branch independently scores and selects a Top-k subset of key-value blocks for each GQA group, allowing the Main Branch to perform exact attention only over these relevant blocks. MSA's core contribution is providing a simple, scalable, and efficient block-sparse attention mechanism designed for practical speedups on GPUs in ultra-long-context scenarios.
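
The select-then-attend pattern described above can be sketched for a single query in plain Python. This is a simplified stand-in and not MSA itself: the block score used here (mean query-key dot product) is an assumption replacing the learned Index Branch, there is no GQA grouping, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_top_k_blocks(block_scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring blocks, in position order."""
    ranked = sorted(range(len(block_scores)), key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:k])


def block_sparse_attention(query, keys, values, block_size: int, k: int) -> Tuple[List[int], List[float]]:
    """Score blocks cheaply, then run exact softmax attention on the chosen blocks only."""
    n_blocks = math.ceil(len(keys) / block_size)
    # Assumed stand-in for the Index Branch: mean query-key dot product per block.
    block_scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        block_scores.append(sum(dot(query, key) for key in block) / len(block))
    chosen = select_top_k_blocks(block_scores, k)
    idx = [i for b in chosen for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    # Exact scaled dot-product attention restricted to the selected positions.
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total for d in range(len(values[0]))]
    return chosen, out


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [3.0], [5.0], [7.0]]
chosen, out = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, k=1)
print(chosen, out)
```

With `k` equal to the number of blocks the function reduces to ordinary dense attention, so the saving comes entirely from how few blocks the scoring step lets through.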

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda
This paper proposes **compliance-by-construction** as a core architectural paradigm for LLM agents operating in regulated industries, integrating existing symbolic structures (like regulations and process models) directly into the agent's decision-making framework. The core contribution is advocating for this structural foundation to proactively prevent control-flow violations, complementing traditional guardrail monitoring for semantic errors. The authors outline key neuro-symbolic research challenges necessary to achieve this integrated, compliant agent behavior.

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests
This paper investigates the impact of providing explicit instruction files on the performance of AI agents generating pull requests (Agentic-PRs). Analyzing 15,549 agentic PRs, the authors compare project performance (merge rate, complexity, merge time) before and after instruction file creation. The core finding is that specifying instructions for AI agents does not consistently lead to objectively better pull requests.

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset
This paper investigates why AI-generated code fixes in pull requests are frequently rejected, using a representative sample from the AIDev dataset. The core method involves a qualitative study followed by quantitative analysis to categorize the rejection reasons. The main contribution is the identification of 14 distinct failure modes, grouped into four high-level categories, providing crucial insights for improving the efficiency of AI coding agents.

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
This paper introduces **StakeBench**, a novel benchmark for evaluating prompt injection attacks against web agents from a **stakeholder-centric** perspective. Unlike existing attack-centric methods, StakeBench systematically categorizes and attributes the resulting harm based on which specific stakeholder (e.g., user, website owner) is affected. This approach better reflects the real-world risk, where the impact and effectiveness of an attack are highly dependent on the targeted victim.

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
This paper argues that Large Language Models (LLMs) do not possess the necessary agency for moral responsibility. The authors contend that genuine moral responsibility requires commitment-bearing agency grounded in *intrinsic* intentionality and self-attributed action, which LLMs lack. Their operation is purely probabilistic mapping, meaning their apparent intentionality is derived, and their outputs do not constitute genuine choices or commitments.

A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding
A2D2 introduces a unified framework for reward-guided fine-tuning of any-length discrete diffusion models by jointly optimizing insertion and unmasking policies. The core contribution is deriving the Radon-Nikodym derivative for the joint path measure, enabling theoretically guaranteed convergence to the reward-tilted distribution without needing target samples. This leads to the Adaptive Joint Decoding (AJD) loss, which minimizes decoding error by leveraging unmasking and insertion quality metrics.

Accelerating Speculative Diffusions via Block Verification
This paper introduces a novel method to efficiently adapt speculative decoding, traditionally used in LLMs, to continuous diffusion models by enabling block verification. This adaptation significantly improves the acceptance rate of draft predictions compared to existing diffusion acceleration techniques. The authors also formalize and analyze the "Free Drafter," a heuristic self-speculative mechanism for these diffusions.