From the arXiv
Tuesday, 5 May 2026 · 20 papers
AcademiClaw: When Students Set Challenges for AI Agents
AcademiClaw introduces a new bilingual benchmark sourced from real, complex, long-horizon academic workflows that students find current AI agents fail to solve. This benchmark features 80 challenging tasks across 25+ professional domains, including GPU-intensive work, executed in isolated sandboxes and scored using mul…
AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
This paper systematically audits technical debt in AI-generated software, revealing that LLMs introduce a distinct "machine signature" of defects rather than eliminating flaws. The core finding is a **Reasoning-Complexity Trade-off**: more capable models produce increasingly bloated and coupled code, establishing a **V…
Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges
This paper systematically surveys the literature to examine the current state, capabilities, and challenges of foundation-model-based agents in industrial automation. The core contribution is synthesizing findings from 88 relevant studies, revealing that most deployed systems are still in early validation stages (TRL 4…
Mitigating Misalignment Contagion by Steering with Implicit Traits
This paper investigates "misalignment contagion," the spread of undesirable behavior between language models (LMs) in multi-agent, multi-turn interactions, observing that LMs become more anti-social after playing social dilemma games. The core contribution is proposing and demonstrating the effectiveness of **steering …
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
This paper empirically investigates how task horizon length affects the training of Large Language Models (LLMs) for long-horizon tasks. By controlling for decision rules and reasoning structures, the authors demonstrate that increasing horizon length alone significantly degrades training stability due to exploration and …
ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling
ORPilot is an agentic LLM system designed to translate ambiguous, real-world business problems with raw data into solver-ready optimization models for production use. Its core contribution lies in novel components such as a conversational interview agent, independent data retrieval, and a solver-agnostic Intermediate Repr…
Strategy-Aware Optimization Modeling with Reasoning LLMs
This paper introduces SAGE, a framework that explicitly incorporates modeling strategies into the training of Large Language Models (LLMs) for optimization programming. SAGE utilizes a solver-verified, multi-strategy dataset and a Segment-Weighted GRPO fine-tuning approach with a composite reward focused on correctness…
Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
This paper demonstrates the efficiency of modern agentic research tools by reproducing and extending a recent NLP study in just three hours, with the human acting only as a reviewer. The core contribution is showing that state-of-the-art LLMs (GPT-5.5 and Claude Opus 4.7) significantly close the style gap in text post-…
Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
The paper introduces **Gradient-Gated Preference Optimization (Gate-DPO)** to stabilize Direct Preference Optimization (DPO) training, which suffers from a "squeezing effect" causing probability collapse. Gate-DPO achieves this by introducing a gating mechanism that attenuates harmful gradients applied to rejected resp…
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak introduces an evolutionary red-teaming strategy to automatically discover multi-turn jailbreak attacks that exploit contextual priming in LLMs. It performs evolutionary search over simulated conversational dialogues, using a two-level harm scoring system to guide the mutation process toward elicitin…
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
This paper introduces "orchestration traces" (temporal interaction graphs) as a framework for applying reinforcement learning (RL) to the coordination of teams of LLM agents. The core method involves designing RL rewards and credit signals that specifically address the complex orchestration decisions, such as spawning, delegation, a…
An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance
This paper presents the first empirical analysis of agent skills for healthcare by examining 557 public skills, annotated across ten dimensions. The core finding is that existing public skills primarily focus on workflow automation and monitoring, showing uneven coverage of the full clinical lifecycle and failing to ad…
Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences
This paper explores using LLM-based AI agents to execute complex network procedures via sequences of tool calls, moving beyond traditional state machines. The core contribution is a comparison of four approaches for distributing execution control between the agent and the underlying tools. Results…
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
This paper introduces JACTUS, a unified framework that jointly performs parameter compression and task adaptation, overcoming the limitations of sequential "compress then adapt" methods. JACTUS estimates gradient covariances from a calibration set to form a task-aware union of subspaces, then performs a globally rank-a…
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
CoRAL is a modular framework that enables zero-shot control for contact-rich robotic manipulation by decoupling high-level reasoning from low-level control. It uses an LLM as a "cost designer" to synthesize context-aware objective functions for a sampling-based motion planner (MPPI). The system further incorporates a n…
Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
This paper introduces **ReClaim**, a large-scale generative transformer foundation model trained on 43.8 billion medical events from nationwide claims data. ReClaim models complex, longitudinal patient trajectories across diagnoses, procedures, medications, and costs. Its core contribution is demonstrating that this fo…
Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI
This paper introduces Continuous Agent Semantic Authorization (CASA), a hybrid runtime enforcement model to secure LLM-driven agents interacting with tools and resources. It employs a zero-trust interception layer combining five deterministic controls for structural integrity with a semantic inspection layer to validat…
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
SpecKV introduces a lightweight, adaptive controller to dynamically select the optimal speculation length ($\gamma$) at each step during speculative decoding. This selection is based on signals extracted directly from the draft model, addressing the limitation of fixed $\gamma$ values. The core contribution is demonstr…
Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
This paper argues that conflicts among trustworthy AI objectives (fairness, robustness, etc.) stem from incompatible invariance requirements under different data-generating process changes. The core contribution is proposing that **causality** provides a unifying framework to understand, manage, and potentially resolve…
Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
This paper introduces the "Silenced Visual Latents" phenomenon, where multimodal models suppress the rich reasoning embedded in continuous visual latents in favor of direct visual input during autoregressive training. To counteract this, the authors propose a method that freezes the backbone and explicitly optimizes th…