From the arXiv
Friday, 12 June 2026 · 20 papers
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
The paper introduces Agentified Agent Assessment (AAA), a framework in which judge agents conduct evaluation by interacting with participants via standardized protocols (A2A and MCP). This approach unifies the assessment interface and decouples evaluation logic from agent implementation. AgentBeats is the concret…
Agents-K1: Towards Agent-native Knowledge Orchestration
Agents-K1 introduces an end-to-end pipeline to transform raw scientific documents into agent-native knowledge graphs, addressing the limitations of existing LLM agents in scientific knowledge orchestration. Its core method involves a multimodal parser capturing detailed entities, evidence, and relations across the full…
ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
ArogyaSutra is a multi-agent framework designed to enhance multimodal medical reasoning in Indic languages. It leverages a novel actor-critic architecture with dual-memory mechanisms and tool grounding to perform step-wise reasoning on complex medical queries involving text and images. The framework is supported by Aro…
Can I Buy Your KV Cache?
This paper proposes a simple yet impactful method to eliminate redundant computation in large language models: **precomputing and selling the Key-Value (KV) cache for documents.** By allowing agents to buy and load a precomputed cache instead of re-running the expensive prefill step, the authors achieve significant com…
EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
The paper introduces **EurekAgent**, an agent system, and argues that the bottleneck for autonomous scientific discovery is shifting to **agent environment engineering**. EurekAgent focuses on designing the environment, including resources, constraints, and interfaces, to amplify desired agent behaviors (like exploration and…
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
This paper challenges the notion that human reasoning relies on abstract world models while LLMs only perform pattern matching. Testing both humans and LLMs on everyday common-sense reasoning, the authors find similar error patterns in both groups. They further demonstrate that specific LLM attention heads impleme…
Reward Modeling for Multi-Agent Orchestration
The paper introduces **Orchestration Reward Modeling (OrchRM)**, a self-supervised framework to evaluate the quality of multi-agent orchestration without requiring human labels. OrchRM constructs win-lose pairs from intermediate execution artifacts to train a Bradley-Terry reward model, enabling efficient, reward-guide…
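As a rough illustration of the Bradley-Terry objective mentioned above, the sketch below fits a linear reward on win-lose pairs. It is not the paper's implementation: the feature vectors, the linear reward form, and the names `bradley_terry_loss` and `fit_linear_reward` are assumptions made for this example only.

```python
import math

def bradley_terry_loss(score_win: float, score_lose: float) -> float:
    """Negative log-likelihood that the 'win' item beats the 'lose' item."""
    # Bradley-Terry: P(win beats lose) = sigmoid(score_win - score_lose)
    margin = score_win - score_lose
    # log(1 + exp(-margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def fit_linear_reward(pairs, dim, lr=0.1, epochs=200):
    """Fit weights w so that reward(x) = w . x ranks winners above losers.

    `pairs` is a list of (win_features, lose_features); the features are
    hypothetical stand-ins for whatever describes an orchestration run.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for win, lose in pairs:
            margin = sum(wi * (a - b) for wi, a, b in zip(w, win, lose))
            # d(loss)/d(margin) = -sigmoid(-margin)
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                w[i] -= lr * grad_scale * (win[i] - lose[i])
    return w

# Toy pairs: winners have a larger first feature than losers.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.9, 0.2], [0.1, 0.8])]
weights = fit_linear_reward(pairs, dim=2)
```

Once fitted, the reward can rank candidate runs by `sum(w * x)` without further labels, which is the general idea behind reward-guided selection.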
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
EvoArena is a novel benchmark suite designed to evaluate LLM agents in dynamic environments by modeling progressive changes across terminal, software, and social domains. The core contribution is the introduction of EvoMem, a patch-based memory paradigm that explicitly tracks and structures memory evolution as update h…
HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents
HyperTool addresses the execution-granularity mismatch in tool-augmented agents by introducing a unified, executable interface that allows models to invoke complex, multi-step tool workflows within a single outer call. This "folding" of deterministic subroutines reduces the number of model-visible decisions, saving con…
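The "folding" idea can be pictured with a small sketch: several deterministic tool steps run inside one outer call, so the model sees a single decision instead of one per step. This is an illustration only, not HyperTool's interface; the workflow format, the `$name` reference convention, and the `run_folded` helper are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Tool = Callable[..., Any]

def run_folded(workflow: List[Dict[str, Any]], tools: Dict[str, Tool]) -> Dict[str, Any]:
    """Execute a multi-step workflow in one call; later steps may reuse earlier results."""
    results: Dict[str, Any] = {}
    for step in workflow:
        # An argument written as "$name" is replaced by the result saved under that name.
        args = {
            key: results[val[1:]] if isinstance(val, str) and val.startswith("$") else val
            for key, val in step["args"].items()
        }
        results[step["save_as"]] = tools[step["tool"]](**args)
    return results

# Hypothetical toy tools standing in for real ones.
tools = {
    "search": lambda query: [f"{query}-doc1", f"{query}-doc2"],
    "count": lambda items: len(items),
}
workflow = [
    {"tool": "search", "args": {"query": "kv-cache"}, "save_as": "hits"},
    {"tool": "count", "args": {"items": "$hits"}, "save_as": "n"},
]
out = run_folded(workflow, tools)
```

Here the two tool invocations cost one model-visible call rather than two, and the intermediate `hits` list never has to pass through the model's context.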
Recursive Agent Harnesses
The paper introduces the **Recursive Agent Harness (RAH)**, framing it as a code-first extension of model recursion in which the recursive unit is a full agent harness with tools and planning, not just a model call. RAH leverages a parent agent to generate and execute scripts that spawn parallel subagent harnesses for fi…
A Three-Layer Framework for AI in Scientific Discovery
This paper introduces a **three-layer framework** for AI in scientific discovery, arguing that the crucial, yet underdeveloped, layer is **Layer 2: model formation through qualitative reasoning**. This layer involves recognizing the structural inadequacy of existing frameworks and understanding the problem within a bro…
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
This paper introduces **ModeratorLM**, a streaming speech large language model that adapts turn-taking behavior in multi-party conversations by conditioning it on an explicitly assigned conversational role. The core contribution is demonstrating that role-conditioning, especially when enhanced with chain-of-thought reasonin…
MiniMax Sparse Attention
MiniMax Sparse Attention (MSA) addresses the quadratic cost of long-context attention by integrating a lightweight Index Branch with Grouped Query Attention (GQA). This branch independently scores and selects a Top-k subset of key-value blocks for each GQA group, allowing the Main Branch to perform exact attention only…
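A toy sketch of Top-k block selection followed by exact attention over the chosen blocks may help make the two-branch idea concrete. It is not MSA itself: the block score used here (query dotted with the block's mean key) is an assumed proxy for the Index Branch, and a single query vector stands in for a whole GQA group.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Score each key block with an assumed proxy (query . mean key) and keep the top_k."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    # Keep the highest-scoring blocks, then return their start offsets in positional order.
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in chosen)

def sparse_attention(query, keys, values, block_size, top_k):
    """Exact softmax attention restricted to the selected blocks."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total for d in range(dim)]

keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[float(i)] for i in range(6)]
query = [1.0, 0.0]
out = sparse_attention(query, keys, values, block_size=2, top_k=1)
```

With `top_k` equal to the number of blocks the result matches dense attention; smaller values trade exactness over the full context for fewer key-value reads.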
Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda
This paper proposes **compliance-by-construction** as a core architectural paradigm for LLM agents operating in regulated industries, integrating existing symbolic structures (like regulations and process models) directly into the agent's decision-making framework. The core contribution is advocating for this structura…
Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests
This paper investigates the impact of providing explicit instruction files on the performance of AI agents generating pull requests (agentic PRs). Analyzing 15,549 agentic PRs, the authors compare project performance (merge rate, complexity, merge time) before and after instruction file creation. The core finding is th…
Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset
This paper investigates why AI-generated code fixes in pull requests are frequently rejected, using a representative sample from the AIDev dataset. The core method involves a qualitative study followed by quantitative analysis to categorize the rejection reasons. The main contribution is the identification of 14 distin…
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
This paper introduces **StakeBench**, a novel benchmark for evaluating prompt injection attacks against web agents from a **stakeholder-centric** perspective. Unlike existing attack-centric methods, StakeBench systematically categorizes and attributes the resulting harm based on which specific stakeholder (e.g., user, …
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
This paper argues that Large Language Models (LLMs) do not possess the necessary agency for moral responsibility. The authors contend that genuine moral responsibility requires commitment-bearing agency grounded in *intrinsic* intentionality and self-attributed action, which LLMs lack. Their operation is purely probabi…
A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding
A2D2 introduces a unified framework for reward-guided fine-tuning of any-length discrete diffusion models by jointly optimizing insertion and unmasking policies. The core contribution is deriving the Radon-Nikodym derivative for the joint path measure, enabling theoretically guaranteed convergence to the reward-tilted …
Accelerating Speculative Diffusions via Block Verification
This paper introduces a novel method to efficiently adapt speculative decoding, traditionally used in LLMs, to continuous diffusion models by enabling block verification. This adaptation significantly improves the acceptance rate of draft predictions compared to existing diffusion acceleration techniques. The authors a…