№01
cs.AI arxiv:2605.10787v1

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li, Xue Yang, Longyue Wang et al.

The paper introduces **ComplexMCP**, a benchmark for evaluating LLM agents in complex, real-world software automation scenarios involving interdependent tools and environmental noise. It uses a seed-driven architecture across 300+ tools derived from 7 stateful sandboxes to simulate dynamic an…

9
№02
cs.AI arxiv:2605.10906v1

DataMaster: Towards Autonomous Data Engineering for Machine Learning

Yaxin Du, Xiyuan Yang, Zhifan Zhou et al.

DataMaster introduces an autonomous data engineering framework to improve machine learning models by optimizing the data pipeline while keeping the learning algorithm fixed. It addresses the complex search space using a tree-structured search mechanism, shared candidate data, and a refinement process that incorporates …

9
№03
cs.AI arxiv:2605.10763v1

MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study

Tim Van hamme, Thomas Vissers, Javier Carnerero-Cano et al.

MATRA is a pragmatic threat modeling framework designed to systematically assess the risks in agentic AI systems by adapting established risk assessment methodologies. It begins with an asset-based impact assessment and uses attack trees to quantify the likelihood of known LLM threats causing harm within a specific dep…

9
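
To make the attack-tree idea in the MATRA entry above concrete, here is a minimal sketch of how a likelihood can be propagated through a tree of AND/OR nodes. It is a textbook-style illustration, not MATRA's actual procedure: the node names, the leaf probabilities, and the independence assumption behind the AND/OR formulas are all hypothetical.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class Node:
    """One node of a toy attack tree (structure and names are illustrative)."""
    name: str
    kind: str = "leaf"          # "leaf", "and", or "or"
    p: float = 0.0              # likelihood, used only by leaves
    children: List["Node"] = field(default_factory=list)


def likelihood(node: Node) -> float:
    """Propagate leaf likelihoods upward, assuming independent events."""
    if node.kind == "leaf":
        return node.p
    probs = [likelihood(child) for child in node.children]
    if node.kind == "and":
        # Every child step must succeed.
        return prod(probs)
    if node.kind == "or":
        # At least one child step succeeds.
        return 1.0 - prod(1.0 - q for q in probs)
    raise ValueError(f"unknown node kind: {node.kind}")


# Hypothetical tree: harm occurs if (an injected prompt reaches the agent AND
# the agent holds a sensitive tool) OR a credential leaks by another route.
tree = Node("harm", "or", children=[
    Node("injection path", "and", children=[
        Node("prompt injection reaches agent", p=0.5),
        Node("agent holds sensitive tool", p=0.4),
    ]),
    Node("credential leak", p=0.1),
])

root_likelihood = likelihood(tree)   # roughly 0.28 under these made-up numbers
```

The root value is only as meaningful as the leaf estimates and the independence assumption, which is why a framework of this kind pairs the tree with a deployment-specific assessment.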
№04
cs.AI arxiv:2605.10813v1

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

Jinhang Xu, Qiyuan Zhu, Yujun Wu et al.

NanoResearch introduces a multi-agent framework designed to personalize research automation by addressing the need for accumulated procedural knowledge, retained user experience, and internalized implicit preferences. It achieves this through a "tri-level co-evolution" mechanism involving a skill bank for reusable proc…

9
№05
cs.AI arxiv:2605.10805v1

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Wenbo Zhang, Lijinghua Zhang, Liner Xiang et al.

This paper investigates the trade-off between reasoning capability and cost when using LLMs as judges, finding that explicit reasoning boosts accuracy for complex tasks but increases cost. The core contribution is the **Robust Adaptive Cost-Efficient Routing (RACER)** framework, which formulates dynamic judge selection…

9
№06
cs.AI arxiv:2605.10870v1

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Mingxi Zou, Zhihan Guo, Langzhang Liang et al.

This paper reframes agent memory as a **decision-centric rate-distortion problem**, arguing that memory should preserve distinctions crucial for future actions rather than descriptive accuracy. The core contribution is a framework that measures memory quality by the **loss in achievable decision quality** due to compre…

9
№07
cs.AI arxiv:2605.10754v1

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

Xinrun Wang, Chang Yang, He Zhao et al.

This paper argues that the current engineering-driven development of LLM-based foundation agents lacks a theoretical foundation. The core method is to introduce **Agent Cybernetics**, mapping the six canonical laws of classical cybernetics onto the design and analysis of these complex, long-horizon agents. The contribu…

9
№08
cs.AI arxiv:2605.10828v1

The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning

Muhan Gao, Zih-Ching Chen, Kuan-Hao Huang

This paper investigates the impact of misleading information (hard distractors) on LLM performance in long-context reasoning. The core finding is the "First Drop of Ink" effect: performance drops sharply with only a small initial proportion of distractors, after which further increases yield only marginal decline. This…

9
№09
cs.AI arxiv:2605.10843v1

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen et al.

This paper introduces DISCA (Disagreement-Informed Steering for Cultural Alignment), a training-free, black-box method to align Large Language Models (LLMs) with diverse cultural values. DISCA leverages sociodemographic disagreement within a country, modeled via World Values Survey-grounded personas, to generate a boun…

9
№10
cs.LG arxiv:2605.10793v1

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

Chayne Thrash, Ali Abbasi, Soheil Kolouri

ConQuR proposes a lightweight, post-training method to improve low-bit activation quantization in LLMs by learning optimal orthogonal rotations. These rotations align normalized activations with the corners of an inscribed hypercube, effectively distributing activation energy to minimize quantization error. This is ach…

9
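
The intuition behind the ConQuR entry above (rotating activations so their energy is spread evenly before quantization) can be illustrated with a fixed rotation. The sketch below swaps in a normalized Hadamard transform rather than a learned rotation, so it is not ConQuR's method. The activation vector, the 4-bit symmetric quantizer, and all function names are assumptions chosen for illustration.

```python
import math


def hadamard_rotate(v):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two.

    The normalized transform is its own inverse, so the same function
    rotates forward and back.
    """
    out = list(v)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(len(out))
    return [x / norm for x in out]


def fake_quantize(v, bits=4):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / qmax
    return [round(x / scale) * scale for x in v]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy activation vector with one large outlier channel.
x = [8.0, 0.3, -0.2, 0.1, 0.25, -0.15, 0.2, -0.1]

# Quantize directly: the outlier sets the scale and the small entries collapse.
plain_error = squared_error(x, fake_quantize(x))

# Rotate, quantize, rotate back: entries now have similar magnitudes.
rotated = hadamard_rotate(x)
recovered = hadamard_rotate(fake_quantize(rotated))
rotated_error = squared_error(x, recovered)
```

On this vector the rotated path has the smaller reconstruction error, because the rotation moves the vector close to a hypercube corner where all coordinates share nearly the same magnitude. A learned rotation, as the summary describes, would aim for that alignment on real activation statistics rather than relying on a fixed transform.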
№11
cs.LG arxiv:2605.10923v1

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Junhao Shen, Teng Zhang, Xiaoyan Zhao et al.

This paper introduces SLIM, a framework for dynamic Skill Lifecycle Management in agentic reinforcement learning. SLIM treats the set of active external skills as a dynamic optimization variable, jointly updated with policy learning. Its core contribution is estimating each skill's marginal external contribution via le…

9
№12
cs.LG arxiv:2605.10770v1

DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

Eleonora Gualdoni, Sonia Laguna, Louis Bethune et al.

DynaMiCS frames multi-domain LLM fine-tuning as a constrained optimization problem to balance target domain improvement with performance preservation on constrained domains. It achieves this by dynamically estimating the local cross-domain effects (a slope matrix) via short probing runs at each update. These estimates …

9
№13
cs.LG arxiv:2605.10784v1

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

Rohan Surana, Xintong Li, Sheldon Yu et al.

MASS-DPO introduces an active sample selection method for Multi-negative DPO that addresses the cost of using large negative pools. It uses a Fisher-information objective specific to the Plackett-Luce (PL) model to select compact, informative negative subsets by favoring samples whose gradients offer complementary information for policy updates. T…

9
№14
cs.CL arxiv:2605.10721v1

Conformity Generates Collective Misalignment in AI Agents Societies

Giordano De Marzo, Alessandro Bellina, Claudio Castellano et al.

This paper investigates how interacting AI agents can collectively become misaligned, even if individually aligned. The core method involves simulating opinion dynamics where agents conform to the majority while maintaining an intrinsic bias, using statistical physics to derive a theory predicting when populations beco…

9
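
As a rough illustration of the dynamics described in the entry above (agents conform to the majority while keeping an intrinsic bias), the following sketch iterates a deterministic mean-field update. It is a toy model, not the paper's: the three-agent majority rule, the `conformity` and `bias` parameters, and their values are all assumptions.

```python
def majority_prob(x):
    """Probability that at least 2 of 3 independently sampled agents hold opinion 1."""
    return 3 * x ** 2 - 2 * x ** 3


def step(x, conformity, bias):
    """One mean-field update of the fraction x of agents holding opinion 1.

    With probability `conformity` an agent copies the majority of three
    sampled peers; otherwise it follows its intrinsic bias toward opinion 1.
    """
    return (1 - conformity) * bias + conformity * majority_prob(x)


def run(x0, conformity, bias, steps=200):
    x = x0
    for _ in range(steps):
        x = step(x, conformity, bias)
    return x


# Opinion 1 stands for the "aligned" choice; each agent alone prefers it (bias 0.8).
# Start from a population where only 5% hold the aligned opinion.
strong_conformity = run(x0=0.05, conformity=0.9, bias=0.8)
weak_conformity = run(x0=0.05, conformity=0.2, bias=0.8)
```

Under these assumed parameters, strong conformity keeps the population stuck below one half despite every agent's individual preference for the aligned opinion, while weak conformity lets the intrinsic bias win. This is the qualitative pattern the summary describes: individually aligned agents can still end up collectively misaligned.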
№15
cs.CL arxiv:2605.10863v1

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Mengyi Deng, Zhiwei Li, Xin Li et al.

DGPO introduces a novel framework for aligning Large Language Models (LLMs) by moving beyond traditional pairwise preferences to **Directional-Groupwise Optimization**. It achieves this by structuring forward and reverse question-answer instances into groups and optimizing a margin-based objective that enforces **direc…

9
№16
cs.CL arxiv:2605.10779v1

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang, Bendong Jiang et al.

LITMUS is a novel benchmark designed to rigorously test the behavioral safety of LLM agents operating in real OS environments against dangerous "behavior jailbreaks." Its core contribution lies in a semantic-physical dual verification mechanism and OS-level state rollback, ensuring accurate testing by preventing contam…

9
№17
cs.CL arxiv:2605.10912v1

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing et al.

WildClawBench is a benchmark for real-world, long-horizon agent evaluation that runs tasks within actual command-line interface (CLI) harnesses inside reproducible Docker containers. Its core contribution is moving beyond synthetic sandboxes to test agents on 60 complex, multimoda…

9
№18
cs.AI arxiv:2605.10876v1

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Edward De Brouwer, Carl Edwards, Alexander Wu et al.

AssayBench is introduced as the first standard benchmark for evaluating Large Language Models (LLMs) and agents on **assay-level virtual cell prediction**. It leverages 1,920 publicly available CRISPR screens to test a model's ability to predict diverse cellular phenotypic outcomes from heterogeneous textual inputs. Th…

8
№19
cs.AI arxiv:2605.10765v1

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Tao Hu, Da-Wei Zhou

This paper introduces DRAPE (Dynamic Cross-Modal Prompt Generation), a novel framework for Multimodal Continual Instruction Tuning (MCIT). DRAPE moves beyond fixed, task-level prompts by dynamically synthesizing continuous, instance-specific soft prompts tailored to each individual query-image pair. This approach enabl…

8
№20
cs.AI arxiv:2605.10938v1

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu et al.

ELF introduces a class of continuous diffusion models for language generation, operating primarily in the continuous embedding space until the final tokenization step. This approach, based on continuous-time Flow Matching, allows for straightforward adaptation of successful image-domain diffusion techniques like classi…

8