№01
cs.AI arxiv:2606.14502v1

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Yongheng Zhang, Ziang Liu, Jiaxuan Zhu et al.

This paper conceptualizes the evolution of LLMs from simple "Chatbots" to "Digital Colleagues" by analyzing two core dimensions. The cognitive core shifts from fast, next-token prediction to deliberate reasoning via techniques like Chain-of-Thought and reflection. Concurrently, task execution moves from ad hoc tool-cal…

9
№02
cs.AI arxiv:2606.14517v1

From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

Yuguang Zhou, Xunguang Wang, Pingchuan Ma et al.

This paper introduces a novel Denial-of-Service (DoS) attack targeting LLM-based agent guardrails by exploiting their reasoning capabilities. The core method involves crafting natural-language payloads using a beam-search optimization framework to force the guardrail into extended reasoning loops, thereby consuming exc…

9
№03
cs.AI arxiv:2606.14470v1

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

GitOfThoughts addresses the ephemeral nature of LLM reasoning by treating the agent's thought process as a Git repository, where each thought is a commit, allowing reasoning to be version-controlled, replayed, and audited. The core contribution is demonstrating that while this system makes reasoning manageable, extensi…
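
The "each thought is a commit" idea can be illustrated with a toy in-memory store. This is a minimal sketch under stated assumptions, not the GitOfThoughts implementation: the names `ThoughtCommit`, `ThoughtRepo`, `commit`, `checkout`, `replay`, and `diff` are hypothetical, and merging is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ThoughtCommit:
    """One reasoning step stored like a commit (hypothetical structure)."""
    sha: str
    parent: Optional[str]
    thought: str


class ThoughtRepo:
    """Toy commit chain for agent thoughts; illustrative only."""

    def __init__(self) -> None:
        self.commits: Dict[str, ThoughtCommit] = {}
        self.head: Optional[str] = None

    def commit(self, thought: str) -> str:
        # The id hashes the parent id together with the content,
        # so a commit is tied to the history that precedes it.
        payload = f"{self.head or ''}\n{thought}".encode("utf-8")
        sha = hashlib.sha1(payload).hexdigest()
        self.commits[sha] = ThoughtCommit(sha, self.head, thought)
        self.head = sha
        return sha

    def checkout(self, sha: str) -> None:
        # Move HEAD to an earlier thought so a new branch can start there.
        if sha not in self.commits:
            raise KeyError(sha)
        self.head = sha

    def _ancestry(self, sha: Optional[str]) -> List[str]:
        # Walk parent links back to the root, then return oldest first.
        chain: List[str] = []
        while sha is not None:
            chain.append(sha)
            sha = self.commits[sha].parent
        chain.reverse()
        return chain

    def replay(self, sha: Optional[str] = None) -> List[str]:
        """Thoughts from the root up to `sha` (default: HEAD), oldest first."""
        start = sha if sha is not None else self.head
        return [self.commits[s].thought for s in self._ancestry(start)]

    def diff(self, old: str, new: str) -> List[str]:
        """Thoughts in the history of `new` that are not in the history of `old`."""
        old_history = set(self._ancestry(old))
        return [self.commits[s].thought
                for s in self._ancestry(new) if s not in old_history]
```

Because every commit records its parent, replaying a line of reasoning is a walk back to the root, and comparing two branches reduces to a set difference over their ancestries.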

9
№04
cs.AI arxiv:2606.14574v1

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

SIMMER introduces a novel benchmark to evaluate "latent failures" in LLM-generated plans for household agents, which are errors that don't immediately halt execution but silently undermine the final goal. The method uses a human-curated symbolic world model, grounded in the kitchen domain with extensive actions and obj…

9
№05
cs.AI arxiv:2606.14571v1

StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance

Guanming Liu, Yuqi Ren, Hansu Gu et al.

StreamMemBench is a novel streaming benchmark designed to evaluate how well agent memory systems utilize observations and interactions over time to provide future-oriented assistance. It constructs two-step task sequences based on egocentric data streams, testing both initial evidence use and the subsequent reuse of fe…

9
№06
cs.AI arxiv:2606.14672v1

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu et al.

This paper introduces Parallel-Synthesis, a framework that enables a Large Language Model (LLM) synthesizer to directly consume the KV caches generated by parallel worker agents, bypassing sequential text concatenation. The core method involves a cache mapper to align independent branch caches and a fine-tune…

9
№07
cs.AI arxiv:2606.14589v1

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

Wei Wu

This paper presents a longitudinal study of "silent failures" in a production LLM agent runtime, where errors occur without actionable human notification. The core contribution is a five-class, mechanism-oriented taxonomy for these failures, highlighting that LLM-specific issues like "chained hallucination and fabricat…

9
№08
cs.AI arxiv:2606.14476v1

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Zhongyuan Wang, Pratyusha Vemuri

This paper investigates whether LLM agents truly exercise judgment when using Graph Neural Network (GNN) tools. The core finding is that agents overwhelmingly defer blindly to the GNN's raw output, acting as "GNN parrots" rather than selectively using the tool. Furthermore, this blind deference increases with the LLM's…

9
№09
cs.LG arxiv:2606.14530v1

Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

Carlo Di Cicco

This paper investigates whether code correctness is encoded in the hidden states of a large language model (LLM) before generation and during repair. The core method involves linearly probing the prompt-final hidden states to predict correctness, controlling for prompt length via residualization. The contribution is de…
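
Residualization on prompt length can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: it assumes the hidden-state features (rather than the labels) are residualized, that the adjustment is linear with an intercept, and the function name `residualize` is hypothetical. A linear probe would then be fit on the returned residuals.

```python
from typing import List, Sequence


def residualize(features: Sequence[Sequence[float]],
                lengths: Sequence[float]) -> List[List[float]]:
    """Remove the linear effect of prompt length from each feature dimension.

    For every dimension j, fit x_j ~ a_j + b_j * length by ordinary least
    squares and keep the residual x_j - (a_j + b_j * length).
    """
    n = len(lengths)
    mean_len = sum(lengths) / n
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    if var_len == 0:
        raise ValueError("prompt lengths must not all be equal")

    dims = len(features[0])
    residuals = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in features]
        mean_x = sum(col) / n
        # Closed-form OLS slope and intercept for a single covariate.
        slope = sum((l - mean_len) * (x - mean_x)
                    for l, x in zip(lengths, col)) / var_len
        intercept = mean_x - slope * mean_len
        for i in range(n):
            residuals[i][j] = col[i] - (intercept + slope * lengths[i])
    return residuals
```

After this step each feature dimension has zero sample covariance with prompt length, so a probe trained on the residuals cannot score well merely by reading length off the hidden state through a linear relationship.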

9
№10
cs.LG arxiv:2606.14397v1

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel et al.

This paper introduces GauntletBench, a novel web-based benchmark designed to rigorously evaluate agent generalization beyond familiar, simple tasks. Its core contribution is focusing on three underexplored, vision-intensive capabilities—temporal perception, graphical understanding, and 3D reasoning—across five chal…

9
№11
cs.CL arxiv:2606.14674v1

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

Jixuan Chen, Jianzhi Shen, Haoqiang Kang et al.

AgentSpec is a modular specification framework that standardizes the interfaces between components (like perception, memory, and reasoning) in complex LLM agent scaffolds. This allows researchers to systematically swap and recombine these typed policy components under controlled conditions. The core contribution is ena…

9
№12
cs.AI arxiv:2606.14654v1

Abstracting Cross-Domain Action Sequences into Interpretable Workflows

Gaurav Verma, Scott Counts

This paper introduces WorkflowView, a framework that leverages Large Language Models (LLMs) to abstract noisy, low-level user action sequences from interaction logs into interpretable, high-level workflows. This method addresses the limitations of prior deep learning approaches by offering better generalization…

8
№13
cs.AI arxiv:2606.14697v1

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Sicheng Yang, Hangjie Yuan, Wenjun Zhang et al.

ClinHallu is a novel benchmark designed to diagnose the *source* of hallucinations in medical MLLMs by decomposing the reasoning process into three stages: Visual Recognition, Knowledge Recall, and Reasoning Integration. It provides 7,031 instances with structured reasoning traces and uses stage-replacement interventio…

8
№14
cs.AI arxiv:2606.14357v1

No Accidental Software Agent First Canonical Code for Human Code Entropy Reduction and 30 to 500 times Lower Frontier Model Requirements

Jepson Taylor

This paper introduces agent-first canonical code, a proof-carrying substrate that transforms routine software into structured behavioral profiles and typed change algebras. The core method involves quotienting software by behavior equivalence under a declared oracle to collapse redundant encodings into governed…

8
№15
cs.AI arxiv:2606.14445v1

tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

Minseo Kim

The paper introduces tap, a novel file-based protocol enabling heterogeneous LLM agents (like Claude and Codex) to collaborate on a shared codebase without requiring a common runtime or central server. Its core method relies on using markdown files with embedded metadata as the primary communication mechanism, …

8
№16
cs.LG arxiv:2606.14560v1

Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

Florian Hübler, Thomas Pethick, Suvrit Sra

This paper theoretically justifies the empirical success of non-Euclidean optimization methods like Muon in the heavy-tailed, non-convex regime where stochastic gradients have bounded $p$-th moments ($p \in (1,2]$). The core contribution is showing that Muon achieves optimal sample complexity by effectively absorbing h…
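
For reference, the bounded $p$-th moment condition is usually written as below. This is the standard form of the assumption, given here as a sketch: the symbols $F$, $\xi$, $\sigma$, and the choice of norm $\|\cdot\|$ are notation assumed for illustration, and the paper's exact statement may differ.

$$
\mathbb{E}_{\xi}\left[\,\|\nabla f(x,\xi) - \nabla F(x)\|^{p}\,\right] \le \sigma^{p}
\quad \text{for all } x, \qquad p \in (1,2].
$$

At $p = 2$ this reduces to the familiar bounded-variance assumption. For $p < 2$ the gradient noise may have infinite variance, which is what makes the regime heavy-tailed.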

8
№17
cs.LG arxiv:2606.14347v1

When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs

Boris Marinov, Angira Sharma, Christian Schroeder de Witt et al.

This paper applies causal-geometric analysis to multilingual LLMs to investigate how different languages are represented internally. The core method reveals that language concepts form stable, largely separable linear directions when adjusted for covariance. The key contribution is demonstrating that this separability …

8
№18
cs.CL arxiv:2606.14691v1

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Jiayue Cao, Zhicong Lu, Xuehan Sun et al.

This paper addresses the semantic inconsistency between the reasoning steps and the final answer in Multimodal Reinforcement Learning with Verifiable Rewards (RLVR). The core method, CORA, introduces a lightweight, plug-and-play consistency reward model to align the thinking process with the answer during RLVR training…

8
№19
cs.AI arxiv:2606.14693v1

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

Pengxin Wang, Lihao Guo, Yi Xie et al.

This paper introduces Preference Coordinated Multi-agent Policy Optimization (PCMA) to address cooperative multi-objective multi-agent reinforcement learning (MOMARL). PCMA learns coordinated, agent-specific preferences to manage trade-offs arising from conflicting objectives and diverse agent contributions. The core c…

7
№20
cs.AI arxiv:2606.14594v1

Regulating the Machine Contributor: Governance and Policy Alignment in Open Source

Jassem Manita, Aziz Amari

This paper investigates the governance challenges arising from the increasing use of autonomous AI agents in open-source software development. The core method involves comparing contribution policies across several major open-source organizations to map their alignment with emerging international AI governance framewor…

7