№01
cs.AI arxiv:2605.22763v1

Advancing Mathematics Research with AI-Driven Formal Proof Search

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov et al.

This paper introduces and evaluates a method where Large Language Models (LLMs) generate formal proofs in languages like Lean to overcome their inherent unreliability in mathematical reasoning. The core contribution is the first large-scale demonstration of this AI-driven formal proof search, showing agents autonomousl…

9
№02
cs.AI arxiv:2605.22608v1

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

Agentic CLEAR is an automatic, dynamic evaluation framework designed to address the challenges of assessing complex LLM agent behavior. It provides multi-level textual insights into agent actions at the system, trace, and node levels, moving beyond basic observability tools. The framework's core contribution is offerin…

9
№03
cs.AI arxiv:2605.22714v1

AMEL: Accumulated Message Effects on LLM Judgments

Sid-ali Temkit

This paper introduces the "Accumulated Message Effect on LLM Judgments" (AMEL), demonstrating that the polarity of prior conversation history biases subsequent evaluations made by Large Language Models. Across numerous tests, models shifted their judgments toward the prevailing sentiment of the preceding messages, part…

9
№04
cs.AI arxiv:2605.22720v1

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

Andrii Kryshtal

This paper investigates the risk of Large Language Models (LLMs) exacerbating armed conflicts by generating harmful outputs like false equivalencies or genocide denial. The author tested nine model configurations across 90 multi-turn conflict scenarios, finding failure rates ranging from 6% to 47%. The core contributi…

9
№05
cs.AI arxiv:2605.22662v1

Claw AI Lab: An Autonomous Multi-Agent Research Team

Fan Wu, Cheng Chen, Zhenshan Tan et al.

Claw AI Lab introduces an autonomous research platform that moves beyond single-agent pipelines by enabling users to instantiate and manage a customizable, multi-agent research team from a single prompt. Its core contribution is providing an interactive, laboratory-like environment with real-time monitoring, collaborat…

9
№06
cs.AI arxiv:2605.22634v1

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Ting Liu

This paper introduces **Contractual Skills**, a design framework inspired by GovernSpec, to structure agent skills as inspectable, readable task contracts within enterprise AI systems. The core method organizes `SKILL.md` files to explicitly define goals, boundaries, contracts, and verification steps, clarifying the bo…

9
№07
cs.AI arxiv:2605.22781v1

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Yunpeng Dong, Jingkai He, Yuze Hou et al.

DeltaBox addresses the bottleneck of slow state checkpoint/rollback (C/R) for stateful AI agents by proposing a change-based transactional C/R mechanism instead of full state duplication. The core method introduces **DeltaState**, a new OS-level abstraction featuring **DeltaFS** (layered filesystem C/R) and a mechanism…

9
№08
cs.AI arxiv:2605.22731v1

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

This paper characterizes post-training methods like SFT and RL not just by their loss functions but by how they shape the **state distribution** used for learning. The core contribution is formalizing post-training as **state-distribution shaping**, demonstrating that the states induced by the learner (as in RL and on-policy distillation, OPD) versus…

9
№09
cs.AI arxiv:2605.22771v1

Reducing Political Manipulation with Consistency Training

Long Phan, Devin Kim, Alexander Pan et al.

This paper addresses covert political bias in LLMs, where models handle opposing political topics asymmetrically. The authors introduce two metrics, Sentiment Consistency and Helpfulness Consistency, to quantify this bias. They propose Political Consistency Training (PCT), an RL method combining these two consistency p…

9
№10
cs.AI arxiv:2605.22642v1

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Banghao Chi, Yining Xie, Mingyuan Wu et al.

Spreadsheet-RL is a reinforcement learning fine-tuning framework designed to train specialized AI agents for complex, multi-step tasks within a realistic Microsoft Excel environment. The core method involves using RL to overcome the limitations of simple prompting methods for real-world spreadsheet workflows. Its contr…

9
№11
cs.AI arxiv:2605.22602v1

Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

Minghui Ma, Bin Guo, Runze Yang et al.

This paper introduces **TTBYS (Think Thrice Before You Speak)**, a novel framework that enhances the Theory of Mind (ToM) reasoning of Large Language Models (LLMs) for persuasive dialogue. TTBYS uses a **dual knowledge enhancement** approach within a stepwise reasoning process to explicitly model the sequential dependencies …

9
№12
cs.AI arxiv:2605.22769v1

Understanding Data Temporality Impact on Large Language Models Pre-training

Pilchen Hippolyte, Fabre Romain, Signe Talla Franck et al.

This paper investigates how data ordering during pre-training affects the temporal knowledge of Large Language Models (LLMs). The authors introduce a benchmark of over 7,000 temporally grounded questions to assess time-sensitive factual recall. They demonstrate that training LLMs on chronologically ordered data, rather…

9
№13
cs.AI arxiv:2605.22664v1

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen, Julian Poeltl, Harshith Srinivas Gear et al.

This paper introduces **WorkstreamBench**, a novel benchmark designed to evaluate Large Language Model (LLM) agents on complex, end-to-end spreadsheet creation tasks relevant to finance, such as financial modeling. The core contribution is moving beyond simple formula edits to assess agents' ability to produce complete…

9
№14
cs.LG arxiv:2605.22566v1

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Ao Li, Shangpeng Yang, Fahao Chen et al.

GraphFlow is a novel graph-based workflow management system for efficient LLM-agent serving. It represents workflows as a unified graph structure, wGraph, allowing for dynamic instantiation of task-specific workflows based on semantic understanding. This approach overcomes the limitations of static templates by…

9
№15
cs.CL arxiv:2605.22643v1

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Piercosma Bisconti, Matteo Prandi, Federico Pierucci et al.

This paper introduces "Boiling the Frog," a novel benchmark designed to evaluate the safety of tool-using AI agents in office environments against **incremental attacks**. The core method involves multi-turn scenarios where benign initial requests gradually escalate to risk-bearing actions within a persistent workspace…

9
№16
cs.CL arxiv:2605.22567v1

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li et al.

LANG is a novel reinforcement learning framework designed to improve multilingual reasoning in LLMs by using language-conditioned hints to guide exploration in non-English tasks. It prevents over-reliance on these hints through a progressive decay schedule and a language-adaptive switch tailored to specific language di…

9
№17
cs.AI arxiv:2605.22645v1

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Hanjun Luo, Zhimu Huang, Sylvia Chung et al.

AtelierEval is presented as the first unified benchmark for quantifying how well both humans and MLLMs write text-to-image prompts, spanning 360 expert-crafted tasks. The core method involves using AtelierJudge, a skill-based, memory-augmented agentic evaluator, to produce reliable subjective and …

8
№18
cs.AI arxiv:2605.22732v1

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Juergen Dietrich

This paper compares acoustic emotion models and LLMs for analyzing the Pathos dimension in political speech, using the TRUST LLM pipeline as a benchmark. The core finding is that the Gemini LLM, analyzing both audio and transcript, correlates strongly with the benchmark Pathos scores, while a standard acoustic SER mode…

8
№19
cs.AI arxiv:2605.22579v1

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias et al.

This paper investigates "Hyperfitting," a phenomenon where extreme fine-tuning enhances LLM generation quality beyond simple distribution sharpening. The authors demonstrate that hyperfitting is fundamentally distinct from temperature scaling, as entropy-matched controls fail to replicate its diversity gains. Their cor…

8
№20
cs.AI arxiv:2605.22786v1

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al.

LCGuard is a framework designed to ensure safe latent communication via shared Key-Value (KV) caches in multi-agent LLM systems. It addresses the risk of sensitive information leakage by learning representation-level transformations on the KV caches before they are transmitted between agents. This acts as a "guard" to …

8