From the arXiv
Monday, 15 June 2026 · 20 papers
From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI
This paper conceptualizes the evolution of LLMs from simple "Chatbots" to "Digital Colleagues" by analyzing two core dimensions. The cognitive core shifts from fast next-token prediction to deliberate reasoning via techniques like Chain-of-Thought and reflection. Concurrently, task execution moves from ad hoc tool-cal…
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails
This paper introduces a novel Denial-of-Service (DoS) attack targeting LLM-based agent guardrails by exploiting their reasoning capabilities. The core method involves crafting natural-language payloads using a beam-search optimization framework to force the guardrail into extended reasoning loops, thereby consuming exc…
GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge
GitOfThoughts addresses the ephemeral nature of LLM reasoning by treating the agent's thought process as a Git repository, where each thought is a commit, allowing reasoning to be version-controlled, replayed, and audited. The core contribution is demonstrating that while this system makes reasoning manageable, extensi…
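As a rough illustration of the "each thought is a commit" idea, here is a minimal sketch of a content-addressed thought log with parent pointers. It is not the GitOfThoughts implementation: the `ThoughtLog` class, its methods, and the example thoughts are all hypothetical, and the sketch uses an in-memory dictionary rather than a real Git repository.

```python
import hashlib
import json


class ThoughtLog:
    """Hypothetical in-memory log where every thought is a hashed commit."""

    def __init__(self):
        self.commits = {}  # commit id -> {"parent": id or None, "thought": str}
        self.head = None   # id of the most recent commit

    def commit(self, thought):
        """Record a thought on top of the current head and return its id."""
        entry = {"parent": self.head, "thought": thought}
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        commit_id = hashlib.sha256(payload).hexdigest()
        self.commits[commit_id] = entry
        self.head = commit_id
        return commit_id

    def replay(self, commit_id=None):
        """Return the thoughts leading up to commit_id, oldest first."""
        chain = []
        current = self.head if commit_id is None else commit_id
        while current is not None:
            entry = self.commits[current]
            chain.append(entry["thought"])
            current = entry["parent"]
        return list(reversed(chain))


log = ThoughtLog()
first = log.commit("The test fails because the cache is stale.")
second = log.commit("Invalidate the cache before rerunning the test.")
history = log.replay()
prefix = log.replay(first)
```

Because each id is a hash of the thought plus its parent id, any earlier state of the reasoning can be replayed from its id alone, which is the property that makes auditing possible in this toy version.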
SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model
SIMMER introduces a benchmark for evaluating "latent failures" in LLM-generated plans for household agents: errors that do not immediately halt execution but silently undermine the final goal. The method uses a human-curated symbolic world model, grounded in the kitchen domain with extensive actions and obj…
StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance
StreamMemBench is a novel streaming benchmark designed to evaluate how well agent memory systems utilize observations and interactions over time to provide future-oriented assistance. It constructs two-step task sequences based on egocentric data streams, testing both initial evidence use and the subsequent reuse of fe…
Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows
This paper introduces **Parallel-Synthesis**, a framework that enables a Large Language Model (LLM) synthesizer to directly consume the KV caches generated by parallel worker agents, bypassing sequential text concatenation. The core method involves a **cache mapper** to align independent branch caches and a **fine-tune…
When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime
This paper presents a longitudinal study of "silent failures" in a production LLM agent runtime, where errors occur without actionable human notification. The core contribution is a five-class, mechanism-oriented taxonomy for these failures, highlighting that LLM-specific issues like "chained hallucination and fabricat…
When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
This paper investigates whether LLM agents truly exercise judgment when using Graph Neural Network (GNN) tools. The core finding is that agents overwhelmingly defer blindly to the GNN's raw output, acting as "GNN parrots" rather than selectively using the tool. Furthermore, this blind deference increases with the LLM's…
Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry
This paper investigates whether code correctness is encoded in the hidden states of a large language model (LLM) before generation and during repair. The core method involves linearly probing the prompt-final hidden states to predict correctness, controlling for prompt length via residualization. The contribution is de…
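The probing setup described above can be sketched on synthetic data. This is an assumption-laden toy, not the paper's pipeline: the "hidden states" are random vectors, the functions `residualize`, `fit_probe`, and `accuracy` are hypothetical helpers, and the probe is a plain logistic regression trained by gradient descent. It shows only the two steps named in the summary: regress prompt length out of each feature, then fit a linear probe on what remains.

```python
import math
import random


def residualize(column, confound):
    """Return column minus its least-squares fit on the confound (with intercept)."""
    n = len(column)
    mean_x = sum(confound) / n
    mean_y = sum(column) / n
    var_x = sum((x - mean_x) ** 2 for x in confound)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(confound, column))
    slope = cov / var_x
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(confound, column)]


def fit_probe(rows, labels, lr=0.1, steps=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    d, n = len(rows[0]), len(rows)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(rows, labels, w, b):
    """Fraction of rows whose predicted class matches the label."""
    hits = sum(
        (b + sum(wi * xi for wi, xi in zip(w, row)) > 0) == (y == 1)
        for row, y in zip(rows, labels)
    )
    return hits / len(rows)


# Synthetic stand-ins: prompt length leaks into dim 0, correctness into dim 1.
rng = random.Random(0)
n, d = 400, 8
lengths = [rng.randint(20, 200) for _ in range(n)]
labels = [rng.randint(0, 1) for _ in range(n)]
hidden = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
for row, length, y in zip(hidden, lengths, labels):
    row[0] += 0.05 * length
    row[1] += 2.0 * (y - 0.5)

# Remove the length-predictable part of every feature, then probe.
columns = [residualize([row[j] for row in hidden], lengths) for j in range(d)]
clean = [list(values) for values in zip(*columns)]
w, b = fit_probe(clean[:300], labels[:300])
test_acc = accuracy(clean[300:], labels[300:], w, b)
```

After residualization, no feature is linearly correlated with prompt length, so any accuracy the probe retains cannot be attributed to a linear length shortcut.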
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
This paper introduces **GauntletBench**, a novel web-based benchmark designed to rigorously evaluate agent generalization beyond familiar, simple tasks. Its core contribution is focusing on three underexplored, vision-intensive capabilities—temporal perception, graphical understanding, and 3D reasoning—across five chal…
AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition
AgentSpec is a modular specification framework that standardizes the interfaces between components (like perception, memory, and reasoning) in complex LLM agent scaffolds. This allows researchers to systematically swap and recombine these typed policy components under controlled conditions. The core contribution is ena…
Abstracting Cross-Domain Action Sequences into Interpretable Workflows
This paper introduces **WorkflowView**, a framework that leverages Large Language Models (LLMs) to abstract noisy, low-level user action sequences from interaction logs into **interpretable, high-level workflows**. This method addresses the limitations of prior deep learning approaches by offering better generalization…
ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
ClinHallu is a novel benchmark designed to diagnose the *source* of hallucinations in medical MLLMs by decomposing the reasoning process into three stages: Visual Recognition, Knowledge Recall, and Reasoning Integration. It provides 7,031 instances with structured reasoning traces and uses stage-replacement interventio…
No Accidental Software Agent First Canonical Code for Human Code Entropy Reduction and 30 to 500 times Lower Frontier Model Requirements
This paper introduces **agent-first canonical code**, a proof-carrying substrate that transforms routine software into structured behavioral profiles and typed change algebras. The core method involves **quotienting software by behavior equivalence** under a declared oracle to collapse redundant encodings into governed…
tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration
The paper introduces **tap**, a novel file-based protocol enabling heterogeneous LLM agents (like Claude and Codex) to collaborate on a shared codebase without requiring a common runtime or central server. Its core method relies on using **markdown files with embedded metadata as the primary communication mechanism**, …
Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success
This paper theoretically justifies the empirical success of non-Euclidean optimization methods like Muon in the heavy-tailed, non-convex regime where stochastic gradients have bounded $p$-th moments ($p \in (1,2]$). The core contribution is showing that Muon achieves optimal sample complexity by effectively absorbing h…
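For orientation, one common way to formalize a heavy-tailed noise condition of this kind is shown below. This is a standard assumption from the heavy-tailed optimization literature, not a statement copied from the paper: the symbols $g(x,\xi)$, $\nabla f(x)$, and $\sigma$, as well as the choice of norm, are assumptions, and the paper may state the condition differently.

```latex
$$
\mathbb{E}_{\xi}\!\left[\, \lVert g(x,\xi) - \nabla f(x) \rVert^{p} \,\right] \le \sigma^{p},
\qquad p \in (1, 2].
$$
```

With $p = 2$ this reduces to the familiar bounded-variance assumption; for $p < 2$ the variance of the stochastic gradient may be infinite.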
When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs
This paper applies causal-geometric analysis to multilingual LLMs to investigate how different languages are represented internally. The analysis shows that language concepts form stable, largely separable linear directions when adjusted for covariance. The key contribution is demonstrating that this separability …
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
This paper addresses the semantic inconsistency between the reasoning steps and the final answer in Multimodal Reinforcement Learning with Verifiable Rewards (RLVR). The core method, CORA, introduces a lightweight, plug-and-play consistency reward model to align the thinking process with the answer during RLVR training…
Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning
This paper introduces Preference Coordinated Multi-agent Policy Optimization (PCMA) to address cooperative multi-objective multi-agent reinforcement learning (MOMARL). PCMA learns coordinated, agent-specific preferences to manage trade-offs arising from conflicting objectives and diverse agent contributions. The core c…
Regulating the Machine Contributor: Governance and Policy Alignment in Open Source
This paper investigates the governance challenges arising from the increasing use of autonomous AI agents in open-source software development. The core method involves comparing contribution policies across several major open-source organizations to map their alignment with emerging international AI governance framewor…