From the arXiv
Friday, 22 May 2026 · 20 papers
Advancing Mathematics Research with AI-Driven Formal Proof Search
This paper introduces and evaluates a method where Large Language Models (LLMs) generate formal proofs in languages like Lean to overcome their inherent unreliability in mathematical reasoning. The core contribution is the first large-scale demonstration of this AI-driven formal proof search, showing agents autonomousl…
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR is an automatic, dynamic evaluation framework designed to address the challenges of assessing complex LLM agent behavior. It provides multi-level textual insights into agent actions at the system, trace, and node levels, moving beyond basic observability tools. The framework's core contribution is offerin…
AMEL: Accumulated Message Effects on LLM Judgments
This paper introduces the "Accumulated Message Effect on LLM Judgments" (AMEL), demonstrating that the polarity of prior conversation history biases subsequent evaluations made by Large Language Models. Across numerous tests, models shifted their judgments toward the prevailing sentiment of the preceding messages, part…
Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts
This paper investigates the risk of Large Language Models (LLMs) exacerbating armed conflicts by generating harmful outputs like false equivalencies or genocide denial. The authors tested nine model configurations across 90 multi-turn conflict scenarios, finding failure rates ranging from 6% to 47%. The core contributi…
Claw AI Lab: An Autonomous Multi-Agent Research Team
Claw AI Lab introduces an autonomous research platform that moves beyond single-agent pipelines by enabling users to instantiate and manage a customizable, multi-agent research team from a single prompt. Its core contribution is providing an interactive, laboratory-like environment with real-time monitoring, collaborat…
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents
This paper introduces **Contractual Skills**, a design framework inspired by GovernSpec that structures agent skills as inspectable, readable task contracts within enterprise AI systems. The core method organizes `SKILL.md` files to explicitly define goals, boundaries, contracts, and verification steps, clarifying the bo…
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
DeltaBox addresses the bottleneck of slow state checkpoint/rollback (C/R) for stateful AI agents by proposing a change-based transactional C/R mechanism instead of full state duplication. The core method introduces **DeltaState**, a new OS-level abstraction featuring **DeltaFS** (layered filesystem C/R) and a mechanism…
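To make the contrast between change-based checkpointing and full state duplication concrete, here is a minimal Python sketch. It is not DeltaBox's implementation: the summary describes OS-level abstractions, whereas this toy uses an in-memory key-value store. The names `DeltaStore`, `checkpoint`, and `rollback` are hypothetical and chosen only for illustration.

```python
from typing import Any

_MISSING = object()  # sentinel: the key did not exist before the change


class DeltaStore:
    """Toy key-value state with change-based checkpoint/rollback.

    Illustrative only: every write records the previous value in an undo
    log, so a checkpoint is just a position in that log, not a copy.
    """

    def __init__(self) -> None:
        self._state: dict[str, Any] = {}
        self._undo_log: list[tuple[str, Any]] = []

    def set(self, key: str, value: Any) -> None:
        # Record the old value (or the sentinel) before overwriting.
        self._undo_log.append((key, self._state.get(key, _MISSING)))
        self._state[key] = value

    def delete(self, key: str) -> None:
        if key in self._state:
            self._undo_log.append((key, self._state.pop(key)))

    def checkpoint(self) -> int:
        # O(1): no state is duplicated, only the log position is returned.
        return len(self._undo_log)

    def rollback(self, token: int) -> None:
        # Undo changes in reverse order until the log is back at `token`.
        while len(self._undo_log) > token:
            key, old = self._undo_log.pop()
            if old is _MISSING:
                self._state.pop(key, None)
            else:
                self._state[key] = old

    def snapshot(self) -> dict[str, Any]:
        return dict(self._state)


if __name__ == "__main__":
    store = DeltaStore()
    store.set("cwd", "/work")
    cp = store.checkpoint()
    store.set("cwd", "/tmp")
    store.set("scratch", "draft")
    store.rollback(cp)
    print(store.snapshot())
```

In this sketch, taking a checkpoint costs constant time and rollback costs time proportional to the number of changes since the checkpoint, which is the general trade-off a change-based design makes against copying the whole state.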
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
This paper reframes post-training methods like SFT and RL not only by their loss functions but also by how they shape the **state distribution** used for learning. The core contribution is formalizing post-training as **state-distribution shaping**, demonstrating that the states induced by the learner (as in RL and on-policy distillation, OPD) versus…
Reducing Political Manipulation with Consistency Training
This paper addresses covert political bias in LLMs, where models handle opposing political topics asymmetrically. The authors introduce two metrics, Sentiment Consistency and Helpfulness Consistency, to quantify this bias. They propose Political Consistency Training (PCT), an RL method combining these two consistency p…
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
Spreadsheet-RL is a reinforcement learning fine-tuning framework designed to train specialized AI agents for complex, multi-step tasks within a realistic Microsoft Excel environment. The core method uses RL to overcome the limitations of simple prompting methods on real-world spreadsheet workflows. Its contr…
Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents
This paper introduces **TTBYS (Think Thrice Before You Speak)**, a novel framework that enhances Large Language Models' (LLMs) Theory of Mind (ToM) reasoning for persuasive dialogue. TTBYS uses a **dual knowledge enhancement** approach within a stepwise reasoning process to explicitly model the sequential dependencies …
Understanding Data Temporality Impact on Large Language Models Pre-training
This paper investigates how data ordering during pre-training affects the temporal knowledge of Large Language Models (LLMs). The authors introduce a benchmark of over 7,000 temporally grounded questions to assess time-sensitive factual recall. They demonstrate that training LLMs on chronologically ordered data, rather…
WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
This paper introduces **WorkstreamBench**, a novel benchmark designed to evaluate Large Language Model (LLM) agents on complex, end-to-end spreadsheet creation tasks relevant to finance, such as financial modeling. The core contribution is moving beyond simple formula edits to assess agents' ability to produce complete…
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
GraphFlow is a graph-based workflow management system for efficient LLM-agent serving. It represents workflows as a unified graph structure, wGraph, which allows task-specific workflows to be instantiated dynamically based on semantic understanding. This approach overcomes the limitations of static templates by…
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
This paper introduces "Boiling the Frog," a novel benchmark designed to evaluate the safety of tool-using AI agents in office environments against **incremental attacks**. The core method involves multi-turn scenarios where benign initial requests gradually escalate to risk-bearing actions within a persistent workspace…
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
LANG is a novel reinforcement learning framework designed to improve multilingual reasoning in LLMs by using language-conditioned hints to guide exploration in non-English tasks. It prevents over-reliance on these hints through a progressive decay schedule and a language-adaptive switch tailored to specific language di…
AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
AtelierEval is presented as the first unified benchmark for quantifying how well humans and multimodal LLMs (MLLMs) write text-to-image prompts, across 360 expert-crafted tasks. The core method uses AtelierJudge, a skill-based, memory-augmented agentic evaluator, to produce reliable subjective and …
Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
This paper compares acoustic emotion models and LLMs for analyzing the Pathos dimension in political speech, using the TRUST LLM pipeline as a benchmark. The core finding is that the Gemini LLM, analyzing both audio and transcript, correlates strongly with the benchmark Pathos scores, while a standard acoustic SER mode…
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
This paper investigates "Hyperfitting," a phenomenon where extreme fine-tuning enhances LLM generation quality beyond simple distribution sharpening. The authors demonstrate that hyperfitting is fundamentally distinct from temperature scaling, as entropy-matched controls fail to replicate its diversity gains. Their cor…
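For readers unfamiliar with the idea of an entropy-matched temperature control, the sketch below shows one way such a control could be built for a single next-token distribution: pick the softmax temperature at which the base distribution has a given target entropy. This is an assumption about what "entropy-matched" means in practice, not the paper's procedure, and the function names (`softmax`, `entropy`, `match_temperature`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def match_temperature(logits, target_entropy, lo=1e-3, hi=100.0, iters=60):
    """Bisection for the temperature whose softmax entropy hits the target.

    Relies on the fact that softmax entropy is non-decreasing in temperature.
    Assumes the target lies between the entropies at `lo` and `hi`.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if entropy(softmax(logits, mid)) < target_entropy:
            lo = mid  # too sharp: raise the temperature
        else:
            hi = mid  # too flat: lower the temperature
    return (lo + hi) / 2


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1, -1.0]
    # Pretend a sharper model produced this entropy; find the matching T.
    target = entropy(softmax(logits, 0.5))
    print(match_temperature(logits, target))
```

A control built this way matches the sharpness of the output distribution by construction, so any remaining difference in generation behavior cannot be attributed to entropy alone.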
LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
LCGuard is a framework designed to ensure safe latent communication via shared Key-Value (KV) caches in multi-agent LLM systems. It addresses the risk of sensitive information leakage by learning representation-level transformations on the KV caches before they are transmitted between agents. This acts as a "guard" to …