№01
cs.AI arxiv:2606.11182v1

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Weixian Xu, Shilong Liu, Mengdi Wang

EEVEE introduces a novel test-time prompt learning framework designed for real-world, heterogeneous task streams, overcoming limitations of single-dataset methods. Its core method involves a router that clusters incoming inputs and assigns them to appropriate prompt configurations, optimized through a router-prompt co-…

9
№02
cs.AI arxiv:2606.10956v1

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding et al.

This paper introduces a rigorous benchmark, based on China's National Computer Rank Examination (NCRE), to evaluate the ability of frontier Large Language Models (LLMs) to perform complex, multi-application Office automation tasks requiring long-horizon planning. The evaluation uses 200 practical tasks scored against 7,118 c…

9
№03
cs.AI arxiv:2606.10989v1

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

Bocheng Ju, Jianhua Wang, Chengliang Liu et al.

This paper introduces Null-Space Constrained Response-Specified Unlearning (NSRU), a low-rank adaptation method for LLM unlearning. NSRU constrains the update parameters to the null space of estimated "retain subspaces" derived from benign data, ensuring adaptation is localized. This method jointly optimizes suppressin…

9
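A minimal sketch of the null-space idea in the NSRU entry above, not the paper's implementation. It assumes the "retain subspace" is spanned by a handful of example input vectors and uses Gram-Schmidt to build a basis; how NSRU actually estimates that subspace is not stated in the summary. The names `orthonormal_basis` and `project_to_null_space` are hypothetical. For a LoRA update ΔW = BA, applying the same projection to the rows of A is enough, because ΔW(I − QQᵀ) = B(A(I − QQᵀ)).

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(delta_w, retain_inputs):
    """Remove from each row of `delta_w` its component in the retain subspace.

    Assumption: `retain_inputs` are illustrative vectors spanning the
    subspace to protect. After projection, delta_w @ x == 0 for every x
    in that span, so the update leaves those inputs untouched.
    """
    basis = orthonormal_basis(retain_inputs)
    projected = []
    for row in delta_w:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        projected.append(r)
    return projected


def matvec(matrix, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, x) for row in matrix]


retain = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
safe_update = project_to_null_space(update, retain)
```

Here `safe_update` maps both retain vectors to zero while still acting on directions outside their span, which is the sense in which such an update stays localized.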
№04
cs.AI arxiv:2606.11164v1

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Wenhao Liu, Hao Shi, Yunhe Li et al.

ReasonAlloc addresses KV cache bottlenecks in LLM reasoning by introducing a hierarchical, training-free budget allocation framework. It combines an offline layer-wise preallocation strategy, capturing the "Reasoning Wave" demand pattern, with an online head-wise reallocation strategy that prioritizes information-rich …

9
№05
cs.AI arxiv:2606.10917v1

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Xucong Wang, Ziyu Ma, Shidong Yang et al.

The Role-Agent framework bootstraps LLM agent learning by having a single LLM concurrently act as both the agent and the environment. It uses a dual-component system: World-In-Agent (WIA) generates a process reward based on state prediction accuracy, while Agent-In-World (AIW) uses failure analysis to reshape the train…

9
№06
cs.AI arxiv:2606.11070v1

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin et al.

T1-Bench is a high-fidelity benchmark for evaluating LLM-based agents in complex, realistic, multi-domain customer-facing scenarios. Its core contribution is a standardized framework that captures sustained reasoning and coordination across interleaved, multi-turn interactions, significant…

9
№07
cs.AI arxiv:2606.11045v1

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

Martin Andres Bertran, Aaron Roth, Zhiwei Steven Wu

This paper investigates the hypothesis that successful machine learning strategies are highly compressible, even when adaptively reused on held-out benchmarks. The authors test this using LLM-driven research agents under two compression bottlenecks: limiting the agent's prompt (output compression) or restricting feedba…

9
№08
cs.AI arxiv:2606.11042v1

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang et al.

Workflow-GYM is a benchmark that addresses the lack of evaluations for AI agents performing long-horizon, high-value professional workflows through graphical user interfaces (GUIs). Its tasks center on specialized, domain-specific professional software environments. The cont…

9
№09
cs.LG arxiv:2606.11025v1

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Bowen Ping, Xiangxin Zhou, Penghui Qi et al.

Flow-DPPO addresses limitations in applying standard PPO to flow matching models by replacing noisy ratio clipping with a direct divergence constraint. Leveraging the Gaussian nature of the per-step policy, it enables exact and efficient computation of the KL divergence between old and new policies. This method provide…

9
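The Flow-DPPO entry above relies on the fact that the KL divergence between two Gaussians has a closed form. The sketch below shows that formula for diagonal Gaussians. The function names, the KL direction (old relative to new), and the threshold check `within_divergence_budget` are assumptions for illustration; the summary does not specify the paper's exact constraint.

```python
import math


def gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL( N(mu_old, diag(sigma_old^2)) || N(mu_new, diag(sigma_new^2)) ).

    Per dimension:
        log(s_new / s_old) + (s_old^2 + (m_old - m_new)^2) / (2 s_new^2) - 1/2
    """
    if not (len(mu_old) == len(sigma_old) == len(mu_new) == len(sigma_new)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for m0, s0, m1, s1 in zip(mu_old, sigma_old, mu_new, sigma_new):
        total += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2 * s1 ** 2) - 0.5
    return total


def within_divergence_budget(kl_value, delta):
    """Hypothetical check: is the policy update inside a KL budget `delta`?"""
    return kl_value <= delta


# Same standard deviation, means differ by 2 in one dimension:
# KL = 2^2 / (2 * 2^2) = 0.5
kl = gaussian_kl([0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [2.0, 2.0])
```

When the two policies share the same standard deviation, the expression reduces to the squared mean difference divided by twice the variance, so the divergence can be computed exactly from the predicted means without sampling.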
№10
cs.CL arxiv:2606.11046v1

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

Prajakta Kini, Avinash Reddy, Souradip Chakraborty et al.

This paper investigates whether converting instruction-tuned Large Language Models (LLMs) into reasoning models via post-training preserves their original alignment behaviors (safety, bias avoidance, etc.). The core method involves a systematic trustworthiness audit comparing reasoning models (trained via SFT, RL, or d…

9
№11
cs.CL arxiv:2606.10931v1

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Naihao Deng, Yilun Zhu, Naichen Shi et al.

This paper demonstrates that a single biased example, introduced via one-shot Group Relative Policy Optimization (GRPO), is sufficient to induce systematic and generalizing bias in large language models (LLMs). The core contribution is revealing a critical vulnerability where post-training alignment guardrails can be e…

9
№12
cs.CL arxiv:2606.10875v1

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

Yupu Hao, Zhuoran Jin, Huanxuan Liao et al.

This paper investigates how to improve LLM tool-calling by integrating and activating experiential knowledge. The core method involves acquiring instance-level knowledge, which proves highly effective, and employing parallel sampling (expanding reasoning width) during inference to better activate this knowledge. The co…

9
№13
cs.CL arxiv:2606.11082v1

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

Hakan Mehmetcik

This paper introduces the "Shibboleth Effect," examining how frontier LLMs exhibit cross-lingual distributional skew under adversarial conditions. Using a simulated geopolitical wargame played in English versus Turkish, the author finds that models display heterogeneous behavioral changes, such as Llama-4 significantl…

9
№14
cs.AI arxiv:2606.11078v1

A History-Aware Visually Grounded Critic for Computer Use Agents

Jaewoo Lee, Zaid Khan, Archiki Prasad et al.

This paper introduces **HiViG**, a history-aware, visually grounded critic framework for Computer Use Agents (CUAs). HiViG addresses limitations in existing critics by training a multimodal model on real GUI trajectories to summarize past interactions and verify proposed actions against the current screen visuals. This…

8
№15
cs.AI arxiv:2606.11150v1

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Andrew Bo Liu, Samira Nedungadi, Bryce Cai et al.

The paper introduces **ABC-Bench**, a novel benchmark designed to systematically evaluate the agentic biosecurity-relevant capabilities of Large Language Model (LLM) agents. This benchmark assesses both beneficial and dual-use biology tasks, such as robotic coding and DNA design, requiring integrated biology and softwa…

8
№16
cs.AI arxiv:2606.11033v1

AuRA: Internalizing Audio Understanding into LLMs as LoRA

Bo Cheng, Lei Shi, Zhanyu Ma et al.

AuRA internalizes audio understanding directly into Large Language Models (LLMs) using a lightweight adaptation technique. It achieves this by distilling the audio encoding capability from a teacher ASR model into a LoRA-adapted LLM student via layer-wise hidden state alignment. This method offers a tighter integration…

8
№17
cs.AI arxiv:2606.10935v1

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Xuezhen Xie, Zhiqiang Zhou

The paper introduces **CLP (Collocation-Length Predictor)** to enable high-quality, accelerated multi-token prediction (MTP) inference in LLMs. The core method, **Backbone-as-Architect**, resolves quality degradation by ensuring the main LM head always generates the first token, while MTP heads only predict subsequent tokens. CLP…

8
№18
cs.AI arxiv:2606.11166v1

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill et al.

This paper challenges the narrative of LLMs achieving expert-level performance by introducing a novel benchmark focusing on reliable, high-stakes data analysis coding tasks. The authors compare a frontier LLM against human experts, explicitly measuring error magnitude and performance variance. The core contribution is …

8
№19
cs.AI arxiv:2606.10933v1

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Aman Sharma, Sushrut Thorat, Paras Chopra

This paper evaluates coding agents on unfamiliar, esoteric programming languages using a sequential setup involving file editing and local execution. The core contribution is demonstrating that top agents employ a **metaprogramming strategy**—writing code in a familiar language (like Python) to generate the required es…

8
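To illustrate the metaprogramming strategy described in the entry above, here is a small sketch in which Python generates a program in another language instead of that program being written by hand. Brainfuck is used purely as an assumed stand-in; the summary does not say which languages the paper evaluates. `generate_brainfuck` and `run_brainfuck` are hypothetical helper names, and the tiny interpreter exists only so the generated program can be checked locally.

```python
def generate_brainfuck(text):
    """Emit a Brainfuck program that prints `text` using a single cell."""
    program = []
    current = 0  # value the cell holds at this point in the program
    for ch in text:
        target = ord(ch)
        if target > 255:
            raise ValueError("only single-byte characters are supported")
        delta = target - current
        program.append(("+" if delta >= 0 else "-") * abs(delta))
        program.append(".")
        current = target
    return "".join(program)


def run_brainfuck(program, tape_len=30000):
    """Minimal interpreter (no input command) used to verify generated code."""
    # Pre-compute matching bracket positions.
    stack, jump = [], {}
    for i, c in enumerate(program):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i

    tape, ptr, pc, out = [0] * tape_len, 0, 0, []
    while pc < len(program):
        c = program[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jump[pc]  # repeat the loop body
        pc += 1
    return "".join(out)


source = generate_brainfuck("Hi")
output = run_brainfuck(source)
```

The generator carries the bookkeeping (tracking the cell value and computing deltas) in the familiar language, and local execution of the result closes the loop, mirroring the edit-and-run setup the entry describes.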
№20
cs.AI arxiv:2606.10942v1

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

Kiarash Rezaei, Omran Ayoub, Sebastian Troia et al.

This paper introduces a novel Explainable AI (XAI) framework that augments SHAP values with mutual feature interaction data. It utilizes a moderately sized Large Language Model (LLM) and structured prompting to generate natural language explanations of network AI decisions. The core contribution is providing human-unde…

8