№01
cs.AI arxiv:2605.20173v1

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

This paper introduces the **Stochastic-Deterministic Boundary (SDB)** as the core architectural primitive for production LLM agents, defining it as a four-part contract governing how LLM outputs become system actions. The author organizes agent runtime design around the SDB across three concerns (Coordination, State, …

9
№02
cs.AI arxiv:2605.20025v1

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Jiaqi Liu, Shi Qiu, Mairui Li et al.

AutoResearchClaw introduces a self-reinforcing, iterative autonomous research pipeline that moves beyond linear execution. Its core method involves structured multi-agent debate, a self-healing execution loop that learns from failures, and cross-run evolution to accumulate knowledge. This system significantly contribut…

9
№03
cs.AI arxiv:2605.20075v1

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Dachuan Shi, Hanlin Zhu, Xiangchi Yuan et al.

CopT reformulates Chain-of-Thought reasoning by producing a draft answer first and then engaging in "on-policy thinking" for reflection and correction. Its core method involves using continuous embeddings as inference-time contrastive verifiers, comparing the model's support for generated tokens under discrete an…

9
№04
cs.AI arxiv:2605.19966v1

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Mohammed Alshaalan, Miguel R. D. Rodrigues

This paper introduces **CPD Online (CPD)**, a novel, training-free method for detecting fluent adversarial prompts by framing the problem as **online change-point detection** on the token-level next-token entropy stream. By establishing a baseline using the LLM's system prompt and applying a CUSUM statistic to standard…
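To make the change-point idea concrete, here is a minimal sketch of a one-sided CUSUM detector over a stream of per-token entropies. It is an illustration under stated assumptions, not the paper's implementation: the function name `cusum_alarm`, the `drift` and `threshold` values, the z-scoring against baseline statistics, and the choice to watch only for upward shifts are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def cusum_alarm(
    baseline: Sequence[float],
    stream: Sequence[float],
    drift: float = 0.5,       # hypothetical slack per step, in standard deviations
    threshold: float = 4.0,   # hypothetical alarm level
) -> Optional[int]:
    """Return the index in `stream` where the CUSUM statistic first exceeds
    `threshold`, or None if it never does."""
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1.0  # guard against a constant baseline
    s = 0.0
    for i, entropy in enumerate(stream):
        z = (entropy - mu) / sigma       # entropy relative to the baseline
        s = max(0.0, s + z - drift)      # accumulate only sustained upward shifts
        if s > threshold:
            return i
    return None


# Baseline entropies would come from the system prompt tokens (toy values here).
system_prompt_entropy = [1.0, 1.2, 0.8, 1.0]
calm_prompt = [1.0, 1.1, 0.9, 1.0]
shifted_prompt = [1.0, 1.1, 0.9, 3.0, 3.0]
calm_alarm = cusum_alarm(system_prompt_entropy, calm_prompt)
shifted_alarm = cusum_alarm(system_prompt_entropy, shifted_prompt)
```

Because the statistic resets to zero whenever the evidence goes negative, isolated noisy tokens do not accumulate, while a sustained shift crosses the threshold within a few tokens.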

9
№05
cs.AI arxiv:2605.19932v1

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Zhuohan Gu, Qizheng Zhang, Omar Khattab et al.

PEEK introduces a novel method for LLM agents operating on recurring long contexts by caching reusable orientation knowledge as a "context map." This small, constant-sized artifact, maintained via a programmable cache policy (Distiller, Cartographer, Prioritizer), acts as an orientation cache within the agent's prompt.…

9
№06
cs.AI arxiv:2605.20072v1

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oussama Zenkri, Oliver Brock

This paper investigates how observation fidelity impacts embodied LLM agents solving a complex mechanical puzzle called the Lockbox. The core method involves testing LLMs with varying observation types (RGB, RGB-D, and ground-truth) on a physical robot and in simulation. The key contribution is the counterintuitive fin…

9
№07
cs.AI arxiv:2605.20087v1

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Chuanyang Jin, Binze Li, Haopeng Xie et al.

The paper introduces **ThoughtTrace**, the first large-scale dataset pairing real-world multi-turn human-AI conversations with users' self-reported thoughts (reasons for prompts and reactions to responses). The core contribution is providing this crucial "what they think" layer, which analysis shows is distinct from sp…

9
№08
cs.CL arxiv:2605.19852v1

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Qinghe Ma, Zhen Zhao, Yiming Wu et al.

This paper introduces **AutoTool**, a method that enables Multimodal Large Language Models (MLLMs) to **adaptively decide whether to invoke external tools** during reasoning, addressing the issue that unnecessary tool use can hinder performance. It employs a **dual-mode reasoning strategy within a reinforcement learnin…

9
№09
cs.CL arxiv:2605.20176v1

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Juncheng Wu, Letian Zhang, Yuhan Wang et al.

ClinSeekAgent is an automated agentic framework designed to shift clinical reasoning from passive evidence consumption to active evidence acquisition. It dynamically seeks, plans for, and synthesizes multimodal evidence from heterogeneous sources like knowledge bases, EHRs, and imaging tools based only on a clinical qu…

9
№10
cs.CL arxiv:2605.20128v1

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

Yuanqing Cai, Ziyi Huang, Minhao Liu et al.

The paper introduces **MixRea**, a benchmark designed to test Large Language Models (LLMs) on **explicit-implicit reasoning**, inspired by human inattentional blindness. It evaluates whether LLMs fail to use subtle contextual cues when explicit instructions are present, revealing widespread "inattentional blindness" ac…

9
№11
cs.CL arxiv:2605.19952v1

Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Jingwei Sun, Jianing Zhu, Jiangchao Yao et al.

This paper introduces **TriMem**, a novel memory system for lifelong LLM agents that moves beyond purely atomic facts. TriMem maintains three coexisting representation granularities—raw dialogue segments, atomic facts, and synthesized profiles—to ensure both storage fidelity and deep, holistic reasoning over accumulate…
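The three coexisting granularities can be pictured as one store with three views. The sketch below is only an illustrative data structure: the class name `TriStore`, its methods, and the substring-based lookup are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TriStore:
    """Toy memory holding three granularities side by side."""
    segments: List[str] = field(default_factory=list)       # raw dialogue segments
    facts: List[str] = field(default_factory=list)          # atomic facts
    profiles: Dict[str, str] = field(default_factory=dict)  # synthesized profiles

    def lookup(self, term: str) -> Dict[str, List[str]]:
        """Return matches for `term` from every granularity (case-insensitive)."""
        t = term.lower()
        return {
            "segments": [s for s in self.segments if t in s.lower()],
            "facts": [f for f in self.facts if t in f.lower()],
            "profiles": [p for p in self.profiles.values() if t in p.lower()],
        }


store = TriStore()
store.segments.append("User: I moved to Lisbon last spring and love the trams.")
store.facts.append("User lives in Lisbon.")
store.profiles["user"] = "Recently relocated; enjoys public transit in Lisbon."
hits = store.lookup("lisbon")
```

Keeping the raw segment alongside the derived fact and profile means a query can fall back to the original wording when the distilled forms have dropped a detail.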

9
№12
cs.CL arxiv:2605.20179v1

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Zhiben Chen, Youpeng Zhao, Yang Sui et al.

TIDE proposes an efficient and lossless inference method for Mixture-of-Experts (MoE) Diffusion Large Language Models (dLLMs) by exploiting the temporal stability of expert activations during the diffusion process. It introduces an interval-based expert refresh strategy that manages expert placement in an I/O-aware man…
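A rough way to see why temporal stability helps is to simulate a resident expert set that is refreshed only every few steps. The following sketch is a simplified assumption-laden model, not TIDE's algorithm: the function `count_misses`, the frequency-based refresh rule, and the `capacity` and `interval` parameters are hypothetical, and a "miss" here simply stands for an expert that would have to be fetched on demand so that outputs stay unchanged.

```python
from collections import Counter
from typing import List, Set


def count_misses(activations: List[List[int]], capacity: int, interval: int) -> int:
    """Count on-demand expert fetches when the resident set is refreshed
    every `interval` steps from the activations seen since the last refresh."""
    resident: Set[int] = set()
    window: Counter = Counter()
    misses = 0
    for step, experts in enumerate(activations):
        if step % interval == 0 and window:
            # Refresh: keep the most frequently activated experts resident.
            resident = {e for e, _ in window.most_common(capacity)}
            window.clear()
        # Experts not resident must still be loaded, so results are unaffected.
        misses += sum(1 for e in experts if e not in resident)
        window.update(experts)
    return misses


# Perfectly stable activations: experts 0 and 1 at every step.
stable = [[0, 1]] * 4
misses_every_step = count_misses(stable, capacity=2, interval=1)
misses_every_other = count_misses(stable, capacity=2, interval=2)
```

In this toy setting a longer interval delays the first refresh and costs a few extra fetches, which is the kind of trade-off an interval-based policy has to balance against refresh overhead.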

9
№13
cs.AI arxiv:2605.19988v1

A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

Hongyu Lin, Mingyu Li, Weichen Zhang et al.

This paper introduces **Agentic Tuning** via **PerfEvolve**, shifting system tuning from static documentation to dynamic action. PerfEvolve translates expert tuning methodologies into executable skills for LLM agents, enabling them to perform version verification, workload profiling, and joint optimization. This approa…

8
№14
cs.AI arxiv:2605.20084v1

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

Zijun Jia, Yuanchang Ye, Sen Jia et al.

BalanceRAG addresses the challenge of setting risk thresholds in cascaded RAG systems, where decisions are made sequentially by an LLM-only branch and a RAG fallback. The core method frames threshold pairs as operating points on a 2D lattice and uses sequential graphical testing to identify "safe" pairs that meet a tar…
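The cascade and its threshold lattice can be sketched as follows. This covers only the routing step and the enumeration of candidate operating points; it does not implement the sequential graphical testing procedure. The function names, the confidence scores in [0, 1], and the "accept if score meets threshold" rule are assumptions made for illustration.

```python
from itertools import product
from typing import List, Tuple


def route(llm_conf: float, rag_conf: float, t_llm: float, t_rag: float) -> str:
    """Decide which branch answers, given one (t_llm, t_rag) threshold pair."""
    if llm_conf >= t_llm:
        return "llm"       # the LLM-only branch is confident enough
    if rag_conf >= t_rag:
        return "rag"       # fall back to the retrieval-augmented branch
    return "abstain"       # neither branch clears its threshold


def lattice(steps: int) -> List[Tuple[float, float]]:
    """Enumerate threshold pairs on an evenly spaced 2D grid over [0, 1]^2."""
    grid = [i / steps for i in range(steps + 1)]
    return list(product(grid, grid))


pairs = lattice(4)                       # 5 x 5 candidate operating points
decision = route(0.3, 0.7, t_llm=0.8, t_rag=0.5)
```

Each point in `pairs` is one candidate operating point; a calibration procedure would then decide which of them can be certified as safe.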

8
№15
cs.AI arxiv:2605.20049v1

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

Priyansh Trivedi, Olivier Schmitt

This paper investigates whether code cleanliness affects the performance of coding agents by introducing a controlled evaluation protocol using minimal pairs. The two members of each pair are functionally identical and differ only in code quality (style and complexity). The study found that while code cleanliness did not significan…

8
№16
cs.AI arxiv:2605.20104v1

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Yuhao Shen, Tianyu Liu, Xinyi Hu et al.

This paper introduces **Graft**, a hybrid tree construction method for speculative decoding that overcomes the trade-off between dense, high-overhead trees and pruned, lower-coverage trees. Graft couples **pruning** (to save budget) with **retrieval** (to recover lost coverage) as mutually reinforcing operations. This …

8
№17
cs.AI arxiv:2605.20149v1

Less Back-and-Forth: A Comparative Study of Structured Prompting

Saurav Ghosh, Gabriella Polach, Abdou Sow

This paper compares how structured prompting affects Large Language Model (LLM) output quality and user effort across different tasks and models. The core finding is that **checklist-improved prompts significantly outperform raw and clarifying-question prompts**, achieving the highest quality scores while …

8
№18
cs.AI arxiv:2605.19943v1

Probabilistic Tiny Recursive Model

Amin Sghaier, Ali Parviz, Alexia Jolicoeur-Martineau

The paper introduces Probabilistic Tiny Recursive Models (PTRM) to overcome the deterministic convergence issue in standard Tiny Recursive Models (TRMs). PTRM achieves this by injecting Gaussian noise during each recursive step, enabling parallel exploration of diverse solution paths. This task-agnostic method signific…
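The effect of per-step noise on a recursive update can be shown with a toy scalar recursion. This is a sketch under assumptions, not the PTRM architecture: the update `0.5 * z + 1.0` stands in for a learned recursive step, and `recurse`, `explore`, `sigma`, and the seeding scheme are hypothetical names and choices.

```python
import random
from typing import List


def recurse(z0: float, steps: int, sigma: float, rng: random.Random) -> float:
    """Apply a toy recursive update, adding Gaussian noise at every step."""
    z = z0
    for _ in range(steps):
        z = 0.5 * z + 1.0            # stand-in for the learned recursive update
        z += rng.gauss(0.0, sigma)   # sigma = 0 recovers the deterministic model
    return z


def explore(z0: float, steps: int, sigma: float, n_samples: int, seed: int = 0) -> List[float]:
    """Run several independent noisy trajectories from the same start state."""
    return [recurse(z0, steps, sigma, random.Random(seed + k)) for k in range(n_samples)]


deterministic = explore(0.0, steps=8, sigma=0.0, n_samples=4)
noisy = explore(0.0, steps=8, sigma=0.1, n_samples=4)
```

With `sigma=0.0` every sample follows the same trajectory, so extra samples add nothing; with a nonzero `sigma` the samples end at different states, which is what makes drawing several of them in parallel worthwhile.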

8
№19
cs.AI arxiv:2605.19940v1

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Rebecca Ramnauth, Drazen Brscic, Brian Scassellati

This paper reframes safety guardrails for foundation models in sensitive domains as a problem of **runtime behavioral control over interaction trajectories**, inspired by robotics. The core method introduces the **Grounded Observer framework** to enforce formal constraints during closed-loop interactions, moving beyond…

8
№20
cs.AI arxiv:2605.20086v1

What Do Evolutionary Coding Agents Evolve?

Nico Pelleriti, Sree Harsha Nelaturu, Zhanke Zhou et al.

This paper investigates what evolutionary coding agents, driven by LLMs, actually evolve beyond just achieving a high final score. The core method involves introducing **EvoTrace**, a dataset of evolutionary coding traces, and **EvoReplay**, a replay-based methodology to analyze these traces. This allows the authors to…

8