From the arXiv
Wednesday, 20 May 2026 · 20 papers
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
This paper introduces the **Stochastic-Deterministic Boundary (SDB)** as the core architectural primitive for production LLM agents, defining it as a four-part contract governing how LLM outputs become system actions. The authors organize agent runtime design around this SDB across three concerns (Coordination, State, …
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw introduces a self-reinforcing, iterative autonomous research pipeline that moves beyond linear execution. Its core method involves structured multi-agent debate, a self-healing execution loop that learns from failures, and cross-run evolution to accumulate knowledge. This system significantly contribut…
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
CopT reformulates Chain-of-Thought reasoning by producing a draft answer first and only then engaging in "on-policy thinking" for reflection and correction. Its core method uses continuous embeddings as inference-time contrastive verifiers, comparing the model's support for generated tokens under discrete an…
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
This paper introduces **CPD Online (CPD)**, a novel, training-free method for detecting fluent adversarial prompts by framing the problem as **online change-point detection** on the token-level next-token entropy stream. By establishing a baseline using the LLM's system prompt and applying a CUSUM statistic to standard…
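The general idea of running CUSUM over an entropy stream can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the one-sided (upward) form of the statistic, the z-scoring against the baseline, and the `drift` and `threshold` values are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cusum_alarm(baseline, stream, drift=0.5, threshold=4.0):
    """Return the index of the first alarm in `stream`, or None.

    baseline: entropies observed on trusted text (here, assumed to be
              the system prompt), used to estimate mean and spread.
    stream:   entropies of the incoming prompt tokens, in order.
    drift, threshold: hypothetical tuning constants, not from the paper.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1.0  # guard against a constant baseline
    s = 0.0
    for i, h in enumerate(stream):
        z = (h - mu) / sigma          # z-score against the baseline
        s = max(0.0, s + z - drift)   # one-sided (upward) CUSUM update
        if s > threshold:
            return i
    return None


baseline = [1.0, 1.2, 0.8, 1.1, 0.9]
calm = [1.0, 1.1, 0.9, 1.0]
shifted = [1.0, 1.1, 0.9, 2.0, 2.1, 2.2]
print(cusum_alarm(baseline, calm))     # no alarm
print(cusum_alarm(baseline, shifted))  # alarm at the first shifted token
```

Because the statistic is updated one token at a time, the check runs online and needs no training, only a baseline to standardize against.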
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
PEEK introduces a novel method for LLM agents operating on recurring long contexts by caching reusable orientation knowledge as a "context map." This small, constant-sized artifact, maintained via a programmable cache policy (Distiller, Cartographer, Prioritizer), acts as an orientation cache within the agent's prompt.…
Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving
This paper investigates how observation fidelity impacts embodied LLM agents solving a complex mechanical puzzle called the Lockbox. The core method involves testing LLMs with varying observation types (RGB, RGB-D, and ground-truth) on a physical robot and in simulation. The key contribution is the counterintuitive fin…
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions
The paper introduces **ThoughtTrace**, the first large-scale dataset pairing real-world multi-turn human-AI conversations with users' self-reported thoughts (reasons for prompts and reactions to responses). The core contribution is providing this crucial "what they think" layer, which analysis shows is distinct from sp…
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
This paper introduces **AutoTool**, a method that enables Multimodal Large Language Models (MLLMs) to **adaptively decide whether to invoke external tools** during reasoning, addressing the issue that unnecessary tool use can hinder performance. It employs a **dual-mode reasoning strategy within a reinforcement learnin…
ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
ClinSeekAgent is an automated agentic framework designed to shift clinical reasoning from passive evidence consumption to active evidence acquisition. It dynamically seeks, plans for, and synthesizes multimodal evidence from heterogeneous sources like knowledge bases, EHRs, and imaging tools based only on a clinical qu…
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
The paper introduces **MixRea**, a benchmark designed to test Large Language Models (LLMs) on **explicit-implicit reasoning**, inspired by human inattentional blindness. It evaluates whether LLMs fail to use subtle contextual cues when explicit instructions are present, revealing widespread "inattentional blindness" ac…
Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory
This paper introduces **TriMem**, a novel memory system for lifelong LLM agents that moves beyond purely atomic facts. TriMem maintains three coexisting representation granularities—raw dialogue segments, atomic facts, and synthesized profiles—to ensure both storage fidelity and deep, holistic reasoning over accumulate…
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE proposes an efficient and lossless inference method for Mixture-of-Experts (MoE) Diffusion Large Language Models (dLLMs) by exploiting the temporal stability of expert activations during the diffusion process. It introduces an interval-based expert refresh strategy that manages expert placement in an I/O-aware man…
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL
This paper introduces **Agentic Tuning** via **PerfEvolve**, shifting system tuning from static documentation to dynamic action. PerfEvolve translates expert tuning methodologies into executable skills for LLM agents, enabling them to perform version verification, workload profiling, and joint optimization. This approa…
BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation
BalanceRAG addresses the challenge of setting risk thresholds in cascaded RAG systems, where decisions are made sequentially by an LLM-only branch and a RAG fallback. The core method frames threshold pairs as operating points on a 2D lattice and uses sequential graphical testing to identify "safe" pairs that meet a tar…
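To make the cascade and the threshold lattice concrete, here is a minimal sketch. It only illustrates the routing and the grid of candidate threshold pairs; it does not implement the paper's sequential graphical testing. The names `cascaded_answer` and `threshold_lattice`, the confidence-score interface, and the explicit abstain outcome are assumptions made for the example.

```python
from itertools import product


def cascaded_answer(question, llm_only, rag, tau_llm, tau_rag):
    """Route a question through an LLM-only branch, then a RAG fallback.

    llm_only, rag: hypothetical callables returning (answer, confidence).
    tau_llm, tau_rag: one threshold pair, i.e. one operating point.
    """
    answer, conf = llm_only(question)
    if conf >= tau_llm:
        return answer, "llm"
    answer, conf = rag(question)
    if conf >= tau_rag:
        return answer, "rag"
    return None, "abstain"  # assumed outcome when both branches are unsure


def threshold_lattice(llm_grid, rag_grid):
    """All candidate (tau_llm, tau_rag) pairs on a 2D grid."""
    return list(product(llm_grid, rag_grid))


# Toy branches with fixed confidences, for illustration only.
def llm_stub(question):
    return "draft", 0.6


def rag_stub(question):
    return "grounded", 0.9


for tau_llm, tau_rag in threshold_lattice([0.5, 0.8], [0.7, 0.95]):
    print((tau_llm, tau_rag), cascaded_answer("q", llm_stub, rag_stub, tau_llm, tau_rag))
```

A calibration procedure would then decide which of these lattice points keep the overall risk under the target level.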
Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study
This paper investigates whether code cleanliness affects the performance of coding agents, using a controlled evaluation protocol built on minimal pairs. The two members of each pair are functionally identical and differ only in code quality (style and complexity). The study found that while code cleanliness did not significan…
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
This paper introduces **Graft**, a hybrid tree construction method for speculative decoding that overcomes the trade-off between dense, high-overhead trees and pruned, lower-coverage trees. Graft couples **pruning** (to save budget) with **retrieval** (to recover lost coverage) as mutually reinforcing operations. This …
Less Back-and-Forth: A Comparative Study of Structured Prompting
This paper compares how structured prompting affects Large Language Model (LLM) output quality and user effort across different tasks and models. The core finding is that **checklist-improved prompts significantly outperform raw and clarifying-question prompts**, achieving the highest quality scores while …
Probabilistic Tiny Recursive Model
The paper introduces Probabilistic Tiny Recursive Models (PTRM) to overcome the deterministic convergence issue in standard Tiny Recursive Models (TRMs). PTRM achieves this by injecting Gaussian noise during each recursive step, enabling parallel exploration of diverse solution paths. This task-agnostic method signific…
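The effect of per-step noise on a recursive refinement loop can be seen in a toy sketch. The update rule `refine`, the scalar state, and the seeding scheme below are stand-ins chosen for illustration, not the PTRM architecture.

```python
import random


def refine(z):
    """Toy stand-in for one deterministic recursive step (assumption)."""
    return [0.5 * x + 1.0 for x in z]


def rollout(z0, steps, sigma, seed):
    """Run the recursion, adding Gaussian noise after every step."""
    rng = random.Random(seed)
    z = list(z0)
    for _ in range(steps):
        z = [x + rng.gauss(0.0, sigma) for x in refine(z)]
    return z


def parallel_rollouts(z0, steps, sigma, n):
    """n independent rollouts from the same start, one seed each."""
    return [rollout(z0, steps, sigma, seed) for seed in range(n)]


# With sigma = 0 every rollout follows the same path; with sigma > 0
# the rollouts spread out and explore different trajectories.
print(parallel_rollouts([0.0], steps=3, sigma=0.0, n=4))
print(parallel_rollouts([0.0], steps=3, sigma=0.1, n=4))
```

In this toy setting, setting `sigma` to zero recovers the deterministic recursion, which makes the noise scale the single knob separating convergence to one path from exploration of several.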
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
This paper reframes safety guardrails for foundation models in sensitive domains as a problem of **runtime behavioral control over interaction trajectories**, inspired by robotics. The core method introduces the **Grounded Observer framework** to enforce formal constraints during closed-loop interactions, moving beyond…
What Do Evolutionary Coding Agents Evolve?
This paper investigates what LLM-driven evolutionary coding agents actually evolve, beyond achieving a high final score. The authors introduce **EvoTrace**, a dataset of evolutionary coding traces, and **EvoReplay**, a replay-based methodology for analyzing these traces. This allows the authors to…