From the arXiv
Tuesday, 12 May 2026 · 20 papers
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
The paper introduces **ComplexMCP**, a benchmark for evaluating LLM agents in complex, real-world software automation scenarios involving interdependent tools and environmental noise. It uses a seed-driven architecture across 300+ tools derived from 7 stateful sandboxes to simulate dynamic an…
DataMaster: Towards Autonomous Data Engineering for Machine Learning
DataMaster introduces an autonomous data engineering framework to improve machine learning models by optimizing the data pipeline while keeping the learning algorithm fixed. It addresses the complex search space using a tree-structured search mechanism, shared candidate data, and a refinement process that incorporates …
MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study
MATRA is a pragmatic threat modeling framework designed to systematically assess the risks in agentic AI systems by adapting established risk assessment methodologies. It begins with an asset-based impact assessment and uses attack trees to quantify the likelihood of known LLM threats causing harm within a specific dep…
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch introduces a multi-agent framework designed to personalize research automation by addressing the need for accumulated procedural knowledge, retained user experience, and internalized implicit preferences. It achieves this through a "tri-level co-evolution" mechanism involving a skill bank for reusable proc…
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
This paper investigates the trade-off between reasoning capability and cost when using LLMs as judges, finding that explicit reasoning boosts accuracy for complex tasks but increases cost. The core contribution is the **Robust Adaptive Cost-Efficient Routing (RACER)** framework, which formulates dynamic judge selection…
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
This paper reframes agent memory as a **decision-centric rate-distortion problem**, arguing that memory should preserve distinctions crucial for future actions rather than descriptive accuracy. The core contribution is a framework that measures memory quality by the **loss in achievable decision quality** due to compre…
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
This paper argues that the current engineering-driven development of LLM-based foundation agents lacks a theoretical foundation. It proposes **Agent Cybernetics**, which maps the six canonical laws of classical cybernetics onto the design and analysis of these complex, long-horizon agents. The contribu…
The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
This paper investigates the impact of misleading information (hard distractors) on LLM performance in long-context reasoning. The core finding is the "First Drop of Ink" effect: performance drops sharply with only a small initial proportion of distractors, after which further increases yield only marginal decline. This…
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
This paper introduces DISCA (Disagreement-Informed Steering for Cultural Alignment), a training-free, black-box method to align Large Language Models (LLMs) with diverse cultural values. DISCA leverages sociodemographic disagreement within a country, modeled via World Values Survey-grounded personas, to generate a boun…
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
ConQuR proposes a lightweight, post-training method to improve low-bit activation quantization in LLMs by learning optimal orthogonal rotations. These rotations align normalized activations with the corners of an inscribed hypercube, effectively distributing activation energy to minimize quantization error. This is ach…
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
This paper introduces SLIM, a framework for dynamic Skill Lifecycle Management in agentic reinforcement learning. SLIM treats the set of active external skills as a dynamic optimization variable, jointly updated with policy learning. Its core contribution is estimating each skill's marginal external contribution via le…
DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
DynaMiCS frames multi-domain LLM fine-tuning as a constrained optimization problem to balance target domain improvement with performance preservation on constrained domains. It achieves this by dynamically estimating the local cross-domain effects (a slope matrix) via short probing runs at each update. These estimates …
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
MASS-DPO introduces an active sample selection method for Multi-negative DPO that addresses the cost of using large negative pools. It uses a PL-specific Fisher-information objective to select compact, informative negative subsets by favoring samples whose gradients offer complementary information for policy updates. T…
Conformity Generates Collective Misalignment in AI Agents Societies
This paper investigates how interacting AI agents can collectively become misaligned, even if individually aligned. The core method involves simulating opinion dynamics where agents conform to the majority while maintaining an intrinsic bias, using statistical physics to derive a theory predicting when populations beco…
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
DGPO introduces a novel framework for aligning Large Language Models (LLMs) by moving beyond traditional pairwise preferences to **Directional-Groupwise Optimization**. It achieves this by structuring forward and reverse question-answer instances into groups and optimizing a margin-based objective that enforces **direc…
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
LITMUS is a benchmark for testing the behavioral safety of LLM agents operating in real OS environments against dangerous "behavioral jailbreaks." Its core contribution is a semantic-physical dual verification mechanism and OS-level state rollback, ensuring accurate testing by preventing contam…
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench is a benchmark that evaluates real-world, long-horizon agent performance by running tasks in actual command-line interface (CLI) harnesses inside reproducible Docker containers. Its core contribution is moving beyond synthetic sandboxes to test agents on 60 complex, multimoda…
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is presented as the first standard benchmark for evaluating Large Language Models (LLMs) and agents on **assay-level virtual cell prediction**. It leverages 1,920 publicly available CRISPR screens to test a model's ability to predict diverse cellular phenotypic outcomes from heterogeneous textual inputs. Th…
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
This paper introduces DRAPE (Dynamic Cross-Modal Prompt Generation), a novel framework for Multimodal Continual Instruction Tuning (MCIT). DRAPE moves beyond fixed, task-level prompts by dynamically synthesizing continuous, instance-specific soft prompts tailored to each individual query-image pair. This approach enabl…
ELF: Embedded Language Flows
ELF introduces a class of continuous diffusion models for language generation, operating primarily in the continuous embedding space until the final tokenization step. This approach, based on continuous-time Flow Matching, allows for straightforward adaptation of successful image-domain diffusion techniques like classi…
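For readers unfamiliar with flow matching in an embedding space, the idea in the ELF summary can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the common straight-line path between noise and data, and it assumes the final tokenization step is a nearest-neighbor lookup in the embedding table. The function names (`flow_matching_pair`, `euler_sample`, `nearest_tokens`) and the oracle velocity used in the demo are hypothetical stand-ins for a trained network.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Straight-line flow matching (an assumed path choice).

    x0: noise, x1: data embeddings, both shaped (batch, seq, dim).
    t:  times in [0, 1], shaped (batch,).
    Returns the point on the path and the regression target for the velocity.
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1, 1)  # broadcast over seq and dim
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # velocity of a straight line is constant in t
    return x_t, v_target


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


def nearest_tokens(x, embedding_table):
    """Assumed tokenization step: map each vector to its closest embedding row."""
    # (batch, seq, 1, dim) - (vocab, dim) -> (batch, seq, vocab, dim)
    sq_dist = ((x[..., None, :] - embedding_table) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=-1)


# Toy demo with an oracle velocity in place of a trained model.
rng = np.random.default_rng(0)
table = rng.normal(size=(5, 4))        # hypothetical vocab of 5 tokens, dim 4
tokens = np.array([[0, 3, 1]])         # one sequence of three token ids
x1 = table[tokens]                     # target embeddings, shape (1, 3, 4)
x0 = rng.normal(size=x1.shape)         # starting noise

x_end = euler_sample(lambda x, t: x1 - x0, x0)
decoded = nearest_tokens(x_end, table)
```

Everything stays continuous until `nearest_tokens` is called, which is the property the summary highlights; in practice the oracle lambda would be replaced by a network trained to regress `v_target` at points `x_t`.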