From the arXiv
Monday, 8 June 2026 · 20 papers
Self-evolving LLM agents with in-distribution Optimization
The paper introduces **Q-Evolve**, a self-evolving framework for LLM agents designed to overcome sparse reward challenges in long-horizon decision-making. It unifies automatic process-reward labeling and policy learning using an in-distribution reinforcement learning approach. The core method learns a stable critic fro…
A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning
This paper comprehensively compares the mathematical reasoning steps of the DeepSeek-R1 LLM and humans on AIME 2025 problems, categorizing 10,247 steps. The core finding is a structural difference: human reasoning is compact, while the LLM exhibits "topological mimicry," frequently revisiting shallow steps without logi…
Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
This paper introduces the **AARR (Act As a Real Researcher) benchmark series** to evaluate frontier LLMs and agents on the nuanced professionalism and thoroughness required in real research, moving beyond simple macro-level execution. The first installment, **AARRI-Bench**, specifically assesses agents' ability to emul…
DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
DuMate-DeepResearch is a multi-agent framework designed to overcome limitations in current Deep Research (DR) systems, specifically concerning long-horizon planning, task decomposition, and auditability. It achieves this by decoupling the Agent Core (handling planning and scheduling) from an extensible Tool Ecosystem, …
How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope
This paper investigates how autonomous AI agents transform knowledge work by analyzing production data comparing Perplexity's Search and Computer products. The core finding is that the autonomous Computer product significantly accelerates task completion (26 minutes of automated work vs. 33 seconds of manual orchestrat…
Online Pandora's Box for Contextual LLM Cascading
This paper introduces the **Online Pandora's Box for Contextual LLM Cascading**, an adaptive framework for sequentially querying and selecting among LLM APIs based on request context. Its core method models the **contextual reservation index** directly, addressing the unique challenge where feedback is mediated by the …
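For readers unfamiliar with the underlying problem, here is a minimal sketch of the classical (offline, non-contextual) Pandora's Box index rule that the title alludes to. It is not the paper's online contextual method. The mapping of a "box" to an LLM API, of the opening cost to a query price, and of the reward to response quality is an assumption made for illustration, and the names `reservation_value` and `cascade` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def reservation_value(values: Sequence[float], probs: Sequence[float],
                      cost: float, tol: float = 1e-9) -> float:
    """Solve E[max(X - sigma, 0)] = cost by bisection for a discrete reward X."""
    def surplus(s: float) -> float:
        return sum(p * max(v - s, 0.0) for v, p in zip(values, probs))

    # surplus() is non-increasing in s, >= cost at lo and 0 at hi.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if surplus(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def cascade(sigmas: Sequence[float],
            draw: Callable[[int], float]) -> Tuple[float, List[int]]:
    """Query boxes in decreasing index order; stop once the best reward
    seen so far is at least the index of the next unopened box."""
    order = sorted(range(len(sigmas)), key=lambda i: sigmas[i], reverse=True)
    best, opened = float("-inf"), []
    for i in order:
        if best >= sigmas[i]:
            break
        opened.append(i)
        best = max(best, draw(i))  # draw(i) stands in for calling model i
    return best, opened


# A reward of 0 or 1 with equal probability and a query cost of 0.1
# gives an index of about 0.8, since 0.5 * (1 - 0.8) = 0.1.
sigma = reservation_value([0.0, 1.0], [0.5, 0.5], cost=0.1)
```

In this classical setting the index depends only on each box's known reward distribution and cost; an online contextual variant would have to estimate it from feedback per request.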
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
Socratic-SWE is a closed-loop framework that enables self-evolving software engineering agents by leveraging their own historical solving traces. It distills these traces into structured "agent skills" that capture recurring failures and successful repair patterns. These skills then guide the generation of new, targete…
SV-Detect: AI-generated Text Detection with Steering Vectors
SV-Detect detects AI-generated text by extracting "steering vectors" from a frozen language model's hidden layers, which define directions separating human and machine text. The method represents inputs by their alignment with these layer-wise directions and uses a lightweight classifier on these features for detection…
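The pipeline described in this summary can be sketched roughly as follows, under loud assumptions: the summary does not say how the steering vectors are computed, so this sketch uses a difference of class means per layer, replaces real hidden states with synthetic Gaussian vectors, and uses a hand-rolled logistic regression as the lightweight classifier. All function names are hypothetical, and none of this is claimed to match the paper's implementation.

```python
import math
import random


def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_steering(human, machine):
    """Each sample is a list of per-layer vectors. Returns, per layer,
    a unit direction (machine mean minus human mean) and the midpoint."""
    layers = []
    for l in range(len(human[0])):
        mu_h = mean_vec([s[l] for s in human])
        mu_m = mean_vec([s[l] for s in machine])
        diff = [m - h for m, h in zip(mu_m, mu_h)]
        norm = math.sqrt(sum(d * d for d in diff))
        layers.append(([d / norm for d in diff],
                       [(m + h) / 2 for m, h in zip(mu_m, mu_h)]))
    return layers


def features(sample, layers):
    """One feature per layer: projection of the centered state on the direction."""
    return [sum((x - c) * d for x, c, d in zip(vec, mid, direction))
            for vec, (direction, mid) in zip(sample, layers)]


def fit_logreg(X, y, steps=200, lr=0.1):
    """Plain batch gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1 / (1 + math.exp(-z)) - t
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def predict(x, w, b):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


def synthetic(rng, n, shift):
    """Stand-in for hidden states: unit Gaussian noise plus a per-layer shift."""
    return [[[rng.gauss(0, 1) + s for s in layer] for layer in shift]
            for _ in range(n)]


rng = random.Random(0)
zero = [[0.0] * 8 for _ in range(3)]   # "human" text: no shift (assumed)
shift = [[1.0] * 8 for _ in range(3)]  # "machine" text: shifted (assumed)
h_tr, m_tr = synthetic(rng, 100, zero), synthetic(rng, 100, shift)
h_te, m_te = synthetic(rng, 100, zero), synthetic(rng, 100, shift)
labels = [0] * 100 + [1] * 100

layers = fit_steering(h_tr, m_tr)
w, b = fit_logreg([features(s, layers) for s in h_tr + m_tr], labels)
preds = [predict(features(s, layers), w, b) for s in h_te + m_te]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(preds)
```

Because the language model stays frozen, only the per-layer directions and the small classifier are fitted, which is what keeps a detector of this shape lightweight.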
When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations
This paper systematically evaluates the sensitivity of general and medical Large Language Models (LLMs) to prompt variations (natural and adversarial) using the MedMCQA benchmark. The core contribution is demonstrating that even minor phrasing changes significantly impact model consistency and accuracy in clinical reas…
Agentopia: Long-Term Life Simulation and Learning in Agent Societies
Agentopia is a framework for long-term life simulation of multi-agent societies, extending simulations from days to years. The core method simulates 100 LLM-powered agents that autonomously pursue growth, relationships, and goals over a simulated decade. The contribution is enabling the st…
M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions
The paper introduces **M$^3$Exam**, a novel benchmark designed to evaluate language agents' multimodal memory capabilities in realistic user-agent interactions, moving beyond sparse, human-centric data. Its core contribution is a query-centric evaluation framework that tests cross-modal grounding and implicit informati…
Sycophantic Praise: Evaluating Excessive Praise in Language Models
This paper introduces a novel framework to measure *sycophantic praise* in language models, distinguishing it from simple agreement. The method quantifies praise by comparing it against the contribution's quality and expected user ability, showing it is a distinct alignment problem. The authors demonstrate this framewo…
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
This paper introduces **CapCode**, a framework for creating coding evaluation datasets where the maximum achievable *non-cheating* score is deliberately capped below perfect performance. This design allows high scores significantly exceeding the cap to serve as reliable indicators of deceptive cheating. Furthermore, th…
Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration
This paper introduces Hierarchical Certified Semantic Commitment (H-CSC), a Byzantine Fault Tolerance (BFT)-inspired protocol designed for LLM-agent collaboration. H-CSC converts embedding-derived finality signals into one of three typed outcomes: a semantic commit, a verdict commit, or an explicit abort. Its core cont…
How reliable are LLMs when it comes to playing dice?
This paper benchmarks the probabilistic reasoning of eight state-of-the-art LLMs using standard and counterintuitive dice problems. The core finding is that while models excel at standard problems (0.96 accuracy), performance drops significantly on counterintuitive tasks (0.59 accuracy) and is highly sensitive to promp…
MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
MemDreamer addresses long-video understanding by decoupling perception and reasoning using a Hierarchical Graph Memory to incrementally build semantic abstractions from streamed video. During inference, an agentic retrieval mechanism uses tool-augmented actions to navigate this memory structure, allowing the model to r…
Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning
This paper introduces SETA, a framework for continual learning in LLMs that addresses catastrophic forgetting by employing a Mixture of Sparse Experts architecture. SETA adaptively decomposes model parameters into task-specific experts and shared experts, isolating new knowledge while protecting common features. This s…
The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs
This paper introduces a controlled framework using real-world cultural questions to disentangle general language proficiency from localized cultural knowledge access in LLMs. By crossing question type (agnostic vs. specific) with query language (English vs. local) and employing Item Response Theory, the authors isolate…
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
This paper proposes a unified framework for analyzing human-view video understanding using MLLMs, structured around three core abilities: **watching, remembering, and reasoning**. The contribution lies in providing a structured formulation to characterize how these models acquire evidence, maintain context over long vi…
Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models
The paper introduces the **Bootstrap Theory of Representational Emergence (TBER)**, a framework explaining how new levels of representation arise in machine learning. TBER posits that representational innovation is driven not just by data or compute, but fundamentally by **explanatory insufficiency**, where existing re…