№01
cs.AI arxiv:2605.02661v1

AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu, Pengrui Lu, Weiye Si et al.

AcademiClaw is a bilingual benchmark sourced from real, complex, long-horizon academic workflows that students find current AI agents cannot solve. It comprises 80 challenging tasks across 25+ professional domains, including GPU-intensive work, executed in isolated sandboxes and scored using mul…

9
№02
cs.AI arxiv:2605.02741v1

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby

This paper systematically audits technical debt in AI-generated software, revealing that LLMs introduce a distinct "machine signature" of defects rather than eliminating flaws. The core finding is a **Reasoning-Complexity Trade-off**: more capable models produce increasingly bloated and coupled code, establishing a **V…

9
№03
cs.AI arxiv:2605.02592v1

Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges

Vincent Henkel, Felix Gehlhoff, David Kube et al.

This paper systematically surveys the literature to examine the current state, capabilities, and challenges of foundation-model-based agents in industrial automation. The core contribution is synthesizing findings from 88 relevant studies, revealing that most deployed systems are still in early validation stages (TRL 4…

9
№04
cs.AI arxiv:2605.02751v1

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Lui et al.

This paper investigates "misalignment contagion," the spread of undesirable behavior between language models (LMs) in multi-agent, multi-turn interactions, observing that LMs become more anti-social after playing social dilemma games. The core contribution is proposing and demonstrating the effectiveness of **steering …

9
№05
cs.AI arxiv:2605.02572v1

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

Sunghwan Kim, Junhee Cho, Beong-woo Kwak et al.

This paper empirically investigates the impact of task horizon length on training Large Language Models (LLMs) for long-horizon tasks. By controlling for decision rules and reasoning structures, the authors demonstrate that increasing horizon length alone significantly hinders training stability due to exploration and …

9
№06
cs.AI arxiv:2605.02728v1

ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling

Guangrui Xie

ORPilot is an agentic LLM system designed to translate ambiguous, real-world business problems with raw data into solver-ready optimization models for production use. Its core contribution lies in novel components such as a conversational interview agent, independent data retrieval, and a solver-agnostic Intermediate Repr…

9
№07
cs.AI arxiv:2605.02545v1

Strategy-Aware Optimization Modeling with Reasoning LLMs

Ruiqing Zhao, Fengzhi Li, Yuan Zuo et al.

This paper introduces SAGE, a framework that explicitly incorporates modeling strategies into the training of Large Language Models (LLMs) for optimization programming. SAGE utilizes a solver-verified, multi-strategy dataset and a Segment-Weighted GRPO fine-tuning approach with a composite reward focused on correctness…

9
№08
cs.LG arxiv:2605.02620v1

Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race

Andreas Maier, Moritz Zaiss, Siming Bayer

This paper demonstrates the efficiency of modern agentic research tools by reproducing and extending a recent NLP study in just three hours, with the human acting only as a reviewer. The core contribution is showing that state-of-the-art LLMs (GPT-5.5 and Claude Opus 4.7) significantly close the style gap in text post-…

9
№09
cs.LG arxiv:2605.02626v1

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

Inoussa Mouiche

The paper introduces **Gradient-Gated Preference Optimization (Gate-DPO)** to stabilize Direct Preference Optimization (DPO) training, which suffers from a "squeezing effect" causing probability collapse. Gate-DPO achieves this by introducing a gating mechanism that attenuates harmful gradients applied to rejected resp…

9
№10
cs.CL arxiv:2605.02647v1

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin et al.

ContextualJailbreak introduces an evolutionary red-teaming strategy to automatically discover multi-turn jailbreak attacks that exploit contextual priming in LLMs. It performs evolutionary search over simulated conversational dialogues, using a two-level harm scoring system to guide the mutation process toward elicitin…

9
№11
cs.CL arxiv:2605.02801v1

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Chenchen Zhang

This paper introduces "orchestration traces" (temporal interaction graphs) as a framework for applying reinforcement learning (RL) to the coordination of teams of LLM agents. The core method involves designing RL rewards and credit signals that specifically address complex orchestration decisions, such as spawning, delegation, a…

9
№12
cs.AI arxiv:2605.02709v1

An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Gelei Xu, Ningzhi Tang, Xueyang Li et al.

This paper presents the first empirical analysis of agent skills for healthcare by examining 557 public skills, annotated across ten dimensions. The core finding is that existing public skills primarily focus on workflow automation and monitoring, showing uneven coverage of the full clinical lifecycle and failing to ad…

8
№13
cs.AI arxiv:2605.02584v1

Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences

Purna Sai Garigipati, Onur Ayan, Kishor Chandra Joshi et al.

This paper explores using LLM-based AI agents to execute complex network procedures via sequences of tool calls, moving beyond traditional state machines. The core contribution is investigating and comparing four different approaches for distributing execution control between the agent and the underlying tools. Results…

8
№14
cs.AI arxiv:2605.02829v1

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

Jingze Ge, Yun Liu, Xue Geng et al.

This paper introduces JACTUS, a unified framework that jointly performs parameter compression and task adaptation, overcoming the limitations of sequential "compress then adapt" methods. JACTUS estimates gradient covariances from a calibration set to form a task-aware union of subspaces, then performs a globally rank-a…

8
№15
cs.AI arxiv:2605.02600v1

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Berk Çiçek, Mert K. Er, Özgür S. Öğüz

CoRAL is a modular framework that enables zero-shot control for contact-rich robotic manipulation by decoupling high-level reasoning from low-level control. It uses an LLM as a "cost designer" to synthesize context-aware objective functions for a sampling-based motion planner (MPPI). The system further incorporates a n…

8
№16
cs.AI arxiv:2605.02740v1

Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Fan Ma, Yuntian Liu, Xiang Lan et al.

This paper introduces **ReClaim**, a large-scale generative transformer foundation model trained on 43.8 billion medical events from nationwide claims data. ReClaim models complex, longitudinal patient trajectories across diagnoses, procedures, medications, and costs. Its core contribution is demonstrating that this fo…

8
№17
cs.AI arxiv:2605.02682v1

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

Majed El Helou, Benjamin Ryder, Chiara Troiani et al.

This paper introduces Continuous Agent Semantic Authorization (CASA), a hybrid runtime enforcement model to secure LLM-driven agents interacting with tools and resources. It employs a zero-trust interception layer combining five deterministic controls for structural integrity with a semantic inspection layer to validat…

8
№18
cs.AI arxiv:2605.02888v1

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

Shikhar Shukla

SpecKV introduces a lightweight, adaptive controller to dynamically select the optimal speculation length ($\gamma$) at each step during speculative decoding. This selection is based on signals extracted directly from the draft model, addressing the limitation of fixed $\gamma$ values. The core contribution is demonstr…

8
№19
cs.AI arxiv:2605.02640v1

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin et al.

This paper argues that conflicts among trustworthy AI objectives (fairness, robustness, etc.) stem from incompatible invariance requirements under different data-generating process changes. The core contribution is proposing that **causality** provides a unifying framework to understand, manage, and potentially resolve…

8
№20
cs.LG arxiv:2605.02735v1

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Xin Zhang, Qiqi Tao, Jiawei Du et al.

This paper introduces the "Silenced Visual Latents" phenomenon, where multimodal models suppress the rich reasoning embedded in continuous visual latents in favor of direct visual input during autoregressive training. To counteract this, the authors propose a method that freezes the backbone and explicitly optimizes th…

8