From the arXiv
Friday, 5 June 2026 · 20 papers
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
This paper presents the first systems characterization of memory management in long-horizon LLM agents. The authors introduce a taxonomy to classify memory systems and develop a profiling harness to attribute costs across memory construction, retrieval, and generation phases. Their analysis of ten systems reveals how d…
Benchmark Everything Everywhere All at Once
This paper introduces **Benchmark Agent**, a fully autonomous agentic system designed to automate the entire pipeline of benchmark construction, addressing the labor-intensive and unsustainable nature of current methods. The core contribution is a scalable framework that handles everything from query analysis and subta…
Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration
The paper introduces **ALMANAC**, a novel dataset designed to advance agent collaboration capabilities beyond mere task completion. It provides **action-level mental model annotations** derived from human dyadic routing tasks, capturing participants' internal reasoning, intentions, and shared goals at each step. This r…
LLM Self-Recognition: Steering and Retrieving Activation Signatures
This paper introduces a method to reliably attribute text to a specific Large Language Model (LLM) by steering its internal residual stream with a random sparse vector during generation, creating a detectable "activation signature." This signature acts as a fingerprint that a separate LLM detector can recover with high…
LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
This paper introduces **PropMe**, a propensity-aware framework to evaluate Large Language Model (LLM) memorization by contrasting adversarial prefix attacks with non-adversarial use cases. Using the lightweight **SimpleTrace** pipeline, the authors consistently find a significant gap, showing that models exhibit substa…
MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery
MLEvolve is a self-evolving, LLM-based multi-agent framework designed for automated machine learning algorithm discovery. It overcomes limitations in existing agents by using Progressive MCGS for cross-branch information flow and an entropy-inspired schedule for shifting search from exploration to exploitation. The fra…
RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
RedKnot addresses the KV cache bottleneck in long-context LLM serving by introducing a novel, head-aware KV cache management system. It leverages the observation that different attention heads have varying utility, allowing for selective reuse and compression. The core contribution is the **Head-Aware KV Reuse** and **…
TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
TokenMizer addresses the LLM context limit for long tasks by modeling session history as a typed knowledge graph, preserving critical relational structure lost in flat text methods. It uses a hybrid pipeline to incrementally build this graph and a multi-tier system to serialize it into compact resume blocks. This appro…
ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents
This paper introduces Causal Minimal Tool Filtering (CMTF), a training-free method to improve LLM agent reliability by addressing tool confusion caused by large tool sets. CMTF selects tools based on **causal sufficiency** using lightweight precondition-effect contracts to expose only the minimal set of tools necessary…
Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents
Vortex is a system designed to efficiently serve diverse sparse attention algorithms for LLMs by combining a Python-embedded frontend language with a page-centric tensor abstraction. This framework simplifies the development, deployment, and evaluation of new sparse attention mechanisms. Its core contribution is accele…
Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo
This paper introduces a **Layered Framework for Knowledge Infusion** in iterative multimodal generative models, conceptualizing knowledge injection as an **intervention-layer problem**. It defines four distinct layers—surface, trajectory, latent, and parametric—based on which structural component of the generation proc…
Generative Criticality in Large Language Model Temperature Scaling
This paper introduces a statistical-field framework, treating LLM token embeddings as continuous spin variables on a 1D chain, to analyze text generation controlled by softmax temperature ($T$). The core contribution is observing a sharp susceptibility peak near a characteristic critical temperature ($T_c$), analogous …
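For reference, the softmax temperature mentioned in this summary is the standard one; the block below is the textbook definition, not the paper's field-theoretic model. The logits $z_i$ and vocabulary size $V$ are notation assumed here for illustration.

$$
p_T(i) = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}, \qquad T > 0
$$

As $T \to 0^{+}$ the distribution concentrates on the largest logit, and as $T \to \infty$ it approaches the uniform distribution over the $V$ tokens.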
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments
CollabSim is a novel, configurable simulation framework designed to systematically investigate the collaborative competence of LLM agents in multi-agent systems. It grounds its methodology in established Computer-Supported Cooperative Work (CSCW) research to move beyond simple task outcomes, allowing researchers to con…
Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation
This paper proposes a Reinforcement Learning (RL) approach to improve the translation of unseen, low-resource languages by leveraging rich linguistic context provided in-context. The RL agent is trained using a surface-level translation metric (chrF) as a reward signal to encourage the model to learn the *meta-skill* o…
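To make the reward signal concrete, here is a minimal sketch of a chrF-style score: a character n-gram F-score that weights recall more heavily than precision. It is a simplified illustration, not the paper's reward code or a reference chrF implementation; the function names, the whitespace handling, and the defaults (`max_n=6`, `beta=2.0`) are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One side is too short to contain any n-gram of this order.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical use as a scalar reward for one sampled translation.
reward = simple_chrf("the cat sat on the mat", "the cat sat on a mat")
```

Because the score depends only on surface character overlap, it can be computed without a tokenizer or any learned model for the target language, which is one plausible reason such a metric suits low-resource settings.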
Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction
This paper introduces Gradient-Informed Logit Correction (GILC), a plug-and-play framework for controllable generation in discrete diffusion models. GILC efficiently estimates guidance signals by using the pretrained denoising network as a proxy, employing a Jacobian-free mechanism to stably correct clean prediction lo…
Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
This paper introduces **Subspace-Aware Sparse Autoencoders** to address a limitation of standard sparse autoencoders (SAEs), which incorrectly assume latent features are one-dimensional. The authors demonstrate that this assumption forces features with intrinsic dimension $d_i \ge 2$ to split across multiple dictionary atoms, leadi…
TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory
This paper introduces **TOKI**, a bitemporal operator algebra designed to explicitly manage and resolve contradictions arising from versioned writes in LLM agent persistent memory. TOKI formalizes four common resolution heuristics as distinct bitemporal operators, each defined with an explicit isolation precondition an…
TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models
TRACE introduces a novel conditional estimation paradigm for multimodal time series foundation models to address temporal misalignment and missing data. It systematically infers incomplete target modalities using available auxiliary modalities, overcoming limitations of naive imputation methods. This approach yields mo…
Unsupervised Skill Discovery for Agentic Data Analysis
This paper introduces **DataCOPE**, an unsupervised framework for discovering reusable data-analysis skills for agents without relying on labeled supervision. It iteratively coordinates an agent, an unsupervised verifier, and a skill manager to generate trajectories and distill skills based on quality signals derived d…
Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
This paper introduces the **Recuse Signal**, a lightweight, in-band communication mechanism (like an SSH banner) that lets servers ask an autonomous LLM agent to voluntarily withdraw from a resource. The core contribution is empirically measuring whether current LLM agents comply with this non-security-cri…