From the arXiv
Monday, 11 May 2026 · 20 papers
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
AgentEscapeBench is a benchmark for evaluating LLM agents' ability to perform complex, out-of-domain, tool-grounded reasoning. It uses escape-room-style tasks with long-range dependencies, requiring agents to infer and execute multi-step procedures involving real external tools and state tracking. The benchm…
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
This paper introduces **Graph Direct Preference Optimization (GraphDPO)**, a principled generalization of DPO that moves beyond simple pairwise comparisons. GraphDPO leverages richer preference data structured as directed acyclic graphs (induced by ranked rollouts) to enforce transitivity and aggregate supervision acro…
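To make the "pairs to graph" idea concrete, here is a minimal sketch, not the paper's actual objective. It assumes the preference graph is the full set of (better, worse) edges implied by a best-first ranking of rollouts, and that the loss is the average standard DPO logistic loss over those edges. The function names and the `beta` default are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dag_edges_from_ranking(n):
    """All (winner, loser) index pairs for n rollouts sorted best-first.

    Assumption: a total ranking induces every edge i -> j with i < j.
    """
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def graph_preference_loss(logps_policy, logps_ref, beta=0.1):
    """Average DPO-style loss over all edges of the ranking-induced DAG.

    logps_policy / logps_ref: sequence log-probabilities of each rollout
    under the policy and the reference model, ordered best-first.
    """
    if len(logps_policy) < 2:
        raise ValueError("need at least two ranked rollouts")
    # DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    rewards = [beta * (p - r) for p, r in zip(logps_policy, logps_ref)]
    edges = dag_edges_from_ranking(len(rewards))
    return -sum(log_sigmoid(rewards[w] - rewards[l]) for w, l in edges) / len(edges)
```

With exactly two rollouts this reduces to the ordinary pairwise DPO loss; with more, every rollout contributes to several comparisons at once.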
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
This paper introduces **CyBiasBench**, a benchmark that quantifies the attack-selection bias exhibited by LLM agents in cyber-attack scenarios. The authors systematically test five agents across various targets and prompts and find that each agent disproportionately favors a narrow subset of …
Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
This paper investigates whether frontier Large Reasoning Models (LRMs) can mimic human learning and planning in novel game environments. The core method involves jointly evaluating LRMs against RL agents using human gameplay data, concurrent fMRI recordings, and a Bayesian model. The key contribution is demonstrating t…
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
This paper introduces the "memory curse," demonstrating that expanding the context window for LLM agents systematically *erodes* cooperation in multi-agent social dilemmas. The core mechanism identified is not increased paranoia, but the degradation of forward-looking intent within the agent's reasoning traces. Restori…
Tool Calling is Linearly Readable and Steerable in Language Models
Analyzing internal activations across various models, this paper demonstrates that tool selection in language models is **linearly readable and steerable**. By manipulating the mean difference between tool activation vectors, the authors can reliably **switch the model's chosen tool** (up to 100% accuracy) an…
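A toy sketch of what mean-difference reading and steering can look like, using plain Python lists in place of real model activations. Everything here is an assumption for illustration: the helper names, the "source"/"target" labels, and the choice to steer by adding the scaled direction to a hidden vector. The paper's actual procedure may differ.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference(acts_target, acts_source):
    """Direction pointing from the source-tool mean to the target-tool mean."""
    mt, ms = mean_vector(acts_target), mean_vector(acts_source)
    return [t - s for t, s in zip(mt, ms)]


def readout(hidden, direction, midpoint):
    """Linear read-out: project onto the direction, relative to the midpoint."""
    score = sum((h - m) * d for h, m, d in zip(hidden, midpoint, direction))
    return "target" if score > 0 else "source"


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden vector along the mean-difference direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

The same vector serves both purposes: its dot product with a hidden state reads out which tool is favored, and adding it moves the state toward the other tool.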
RelAgent: LLM Agents as Data Scientists for Relational Learning
RelAgent is an LLM-based autonomous agent designed for relational learning, operating in two phases. First, the agent uses tools to autonomously construct feature-generating SQL programs and select a predictive model. The core contribution is that the final predictor relies solely on the executed SQL queries and a clas…
Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
This paper introduces SPEAR (Self-Play Enhancement via Advantage-Weighted Refinement), an efficient online learning algorithm for federated LLM fine-tuning. SPEAR enables a self-improvement loop by using incoming real-time feedback to generate naturally contrastive self-play pairs for training, without requiring offlin…
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
This paper introduces **LANCE** to combat rigid rejection in LLMs by moving beyond binary refusal. LANCE uses variational inference to enhance safety labels, predicting a continuous distribution across multiple rejection categories. This fine-grained distribution provides textual gradients that guide a refinement model…
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard reframes LLM content moderation as a schema-conditioned classification task, moving away from slow, large autoregressive models. It uses a small (0.3B parameter) bidirectional encoder that encodes task definitions and label semantics directly into the input sequence as structured schemas. This allows for the s…
How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
This paper introduces the Latent Diffusion Language Model (LDLM), which jointly trains a latent encoder, diffusion model, and decoder for non-autoregressive text generation. The core method involves constructing a suitable latent space by reshaping pre-trained language model representations via a trainable encoder. The…
How Value Induction Reshapes LLM Behaviour
This paper investigates the unintended consequences of value induction (fine-tuning LLMs with value-laden language) on model behavior. The authors fine-tune models using curated value subsets and measure the impact on related values, safety, anthropomorphism, and QA performance. They find that inducing specific values …
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
This paper introduces **AutoTTS**, an environment-driven framework that automates the discovery of optimal Test-Time Scaling (TTS) strategies for Large Language Models (LLMs). Instead of manual heuristic design, AutoTTS creates a tractable discovery environment where a controller learns when to allocate computation (br…
Abductive Reasoning with Probabilistic Commonsense
This paper introduces **PACS (Probabilistic Abductive CommonSense)**, a novel framework for abductive reasoning that explicitly models the variation in human commonsense beliefs. It combines an LLM and a formal solver to sample proofs representing individual perspectives, aggregating these conclusions to determine the …
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD introduces a novel post-training framework for Flow Matching text-to-image models to overcome multi-task alignment issues like reward sparsity and gradient interference. It employs a two-stage strategy: first training specialized teacher models via single-reward fine-tuning, and then using On-Policy Distillati…
KL for a KL: On-Policy Distillation with Control Variate Baseline
This paper introduces **vOPD (On-Policy Distillation with a control variate baseline)** to stabilize On-Policy Distillation (OPD) for LLMs by framing it as policy-gradient Reinforcement Learning. The core contribution is deriving a **closed-form control variate baseline** directly from the per-token negative reverse KL…
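The summary is cut off before the baseline is stated, so the following is only a sketch under assumptions. It takes the per-token reward for a sampled token to be `log teacher - log student`, and uses as baseline the expectation of that reward under the student, which equals the negative reverse KL and can be computed in closed form from the two next-token distributions. All function names are hypothetical.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) over one next-token distribution."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student, teacher) if p > 0)


def token_reward(token, student, teacher):
    """Assumed per-token reward: log-ratio of teacher to student probability."""
    return math.log(teacher[token]) - math.log(student[token])


def advantage(token, student, teacher):
    """Reward minus a closed-form baseline (the expected reward under the student)."""
    baseline = -reverse_kl(student, teacher)
    return token_reward(token, student, teacher) - baseline
```

Because the baseline is exactly the expected reward under the student, the advantage has zero mean over sampled tokens, which is the usual reason a control variate lowers gradient variance without changing the expected gradient.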
Learning CLI Agents with Structured Action Credit under Selective Observation
This paper introduces a method for training command-line interface (CLI) agents that exploits the inherent structure of CLI actions for better credit assignment. It contributes two mechanisms: $\sigma$-Reveal, which selectively extracts task-relevant context from partial observations, and Action A…
Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
This paper argues that mechanistic interpretability research, which frequently employs causal language, often fails to explicitly state the necessary identification assumptions underpinning its causal claims. The authors audit existing literature, finding a pervasive pattern where validation metrics are presented as ca…
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix is a verification-first pipeline that uses the TLA+ model checker to iteratively repair LLM-generated coordination protocols for multi-agent systems. The method synthesizes a protocol topology, generates PlusCal logic, and uses TLA+ counterexamples to drive repairs until formal verification succeeds. This ensu…
ADKO: Agentic Decentralized Knowledge Optimization
ADKO is a framework for sample-efficient, privacy-preserving collaborative black-box optimization among autonomous agents. Agents use private Gaussian Processes and communicate only via compact "knowledge tokens" summarizing directional signals and advantage scores, avoiding raw data sharing. The paper's core contribut…