From the arXiv
Monday, 27 April 2026 · 20 papers
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
This paper introduces the "Agentic World Modeling" framework, a taxonomy organized by capability levels (Predictor, Simulator, Evolver) and governing law regimes (physical, digital, social, scientific). The core contribution is providing a structured way to understand and evaluate the necessary predictive environment m…
AgentSearchBench: A Benchmark for AI Agent Search in the Wild
AgentSearchBench is a large-scale benchmark designed to evaluate AI agent search methods in realistic, "in the wild" scenarios, addressing the limitations of existing benchmarks that assume well-specified agents. It formalizes agent search as retrieval and reranking tasks using nearly 10,000 real-world agents, evaluati…
Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
This paper introduces the concept of **background temperature ($T_{\mathrm{bg}}$)** to quantify the inherent, implementation-dependent randomness observed in Large Language Models (LLMs) even when the nominal decoding temperature is set to zero. $T_{\mathrm{bg}}$ formalizes the effective temperature induced by environm…
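As a rough illustration of why a small effective temperature matters, the sketch below uses the standard temperature-scaled softmax. It does not reproduce how the paper defines or estimates $T_{\mathrm{bg}}$; the function name and the example logits are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Standard temperature-scaled softmax over a list of logits.

    `temperature` must be strictly positive; the zero-temperature limit
    corresponds to always picking the largest logit.
    """
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


# Illustrative logits only (not taken from the paper).
clear_winner = softmax_with_temperature([2.0, 1.0], temperature=0.05)
near_tie = softmax_with_temperature([2.0, 1.99], temperature=0.05)
```

With a clear winner, a small temperature leaves almost no probability on the runner-up. With a near tie, the same small temperature still leaves the runner-up a substantial share, which is one way nominally greedy decoding could vary between runs.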
Learning Evidence Highlighting for Frozen LLMs
This paper introduces **HiLight**, a framework that trains a lightweight **Emphasis Actor** to insert minimal highlight tags around crucial evidence within the original, unaltered context. This approach decouples evidence selection from reasoning, allowing a **frozen LLM Solver** to utilize the emphasized input for imp…
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw addresses the high cost of large autonomous agents like OpenClaw by dynamically adjusting numerical precision based on task requirements. It analyzes quantization sensitivity across workflows and proposes a plug-and-play routing plugin that assigns lower precision to lightweight tasks and preserves higher pre…
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
This paper introduces a robust LLM-as-a-Judge framework to evaluate mathematical reasoning, moving beyond the limitations of rigid symbolic comparison. The core method uses a large language model to assess the correctness of generated answers, accommodating diverse mathematical representations and solution formats. Thi…
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
SOLAR-RL addresses the challenge of training GUI agents using MLLMs by bridging the gap between static Offline RL and costly Online RL. The core method integrates global trajectory semantics into offline learning by reconstructing rollouts, identifying the first failure point, and retroactively assigning dense, long-ho…
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
This paper introduces the **Superminds Test**, a hierarchical framework using controlled **Probing Agents** to empirically evaluate the emergence of collective intelligence in large-scale agent societies, specifically using the MoltBook platform. The core contribution is demonstrating a **stark absence of collective in…
When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention
This paper models LLM self-correction as a control-theoretic feedback loop using a two-state Markov process to diagnose when iteration is beneficial. The core contribution is identifying a critical threshold (near-zero Error Introduction Rate, EIR $\le 0.5\%$) that separates helpful from harmful self-correction across …
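A minimal sketch of a generic two-state (correct / incorrect) Markov chain shows why a low EIR matters. The correction-rate parameter `ecr` and all function names are assumptions, not notation confirmed by the summary, and the 0.5% threshold is the paper's reported finding rather than something this sketch derives.

```python
def accuracy_after(rounds, a0, eir, ecr):
    """Probability the answer is correct after `rounds` self-correction steps.

    a0  : initial probability that the answer is correct
    eir : P(correct -> incorrect) per round (Error Introduction Rate)
    ecr : P(incorrect -> correct) per round (hypothetical name, an assumption)
    """
    a = a0
    for _ in range(rounds):
        # Stay correct with prob (1 - eir), or get fixed with prob ecr.
        a = a * (1.0 - eir) + (1.0 - a) * ecr
    return a


def stationary_accuracy(eir, ecr):
    """Fixed point of the update above; requires eir + ecr > 0."""
    return ecr / (eir + ecr)


def iteration_helps(a0, eir, ecr):
    """One round raises accuracy exactly when the fixed point lies above a0."""
    return stationary_accuracy(eir, ecr) > a0
```

For example, with illustrative values `a0=0.8`, `eir=0.005`, `ecr=0.1`, one round moves accuracy to 0.816 and the chain drifts toward roughly 0.952. Raising `eir` to 0.05 with `a0=0.9` puts the fixed point near 0.667, so iterating in that regime lowers accuracy.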
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
This paper investigates how LLMs detect and correct their own errors by examining the role of internal confidence signals, specifically the "post-answer newline" (PANL) token representation. Drawing on second-order decision models, the authors hypothesize that this PANL signal, which is partially independent of the pri…
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 introduces a novel foundation model architecture, SpB2.0, designed for efficient long-context inference. Its core method is the Dual-Space Sparse Attention (DSSA) mechanism, which hybridizes sparse attention types for a better performance-efficiency trade-off. The contribution lies in achieving high performan…
Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
This paper investigates using Query Performance Prediction (QPP) to select the optimal query variant within Retrieval-Augmented Generation (RAG) pipelines, avoiding costly execution of all reformulations. The core method focuses on **intra-topic discrimination**, where QPP predicts the best variant among semantically e…
Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
This paper introduces Context-Fidelity Boosting (CFB), a lightweight, decoding-time framework designed to reduce faithfulness hallucinations in LLMs by prioritizing context-supported tokens. Inspired by watermarking, CFB applies additive logit adjustments based on a token's support from the input context, utilizing sta…
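The general idea of an additive, context-dependent logit adjustment can be sketched as follows. This is only an illustration: "support" is approximated here by whether a candidate token literally occurs in the context, and the fixed bonus `delta` is a hypothetical parameter. The actual support measure used by CFB is not recoverable from the summary.

```python
import math


def boost_context_tokens(logits, context_tokens, delta=2.0):
    """Add `delta` to the logit of every candidate token found in the context.

    logits         : dict mapping candidate token -> raw logit
    context_tokens : iterable of tokens from the input context
    delta          : additive bonus (hypothetical value, an assumption)
    """
    support = set(context_tokens)
    return {
        tok: (score + delta if tok in support else score)
        for tok, score in logits.items()
    }


def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


# Illustrative values: the context mentions "Paris", the model prefers "London".
raw = {"Paris": 1.0, "London": 1.5}
boosted = boost_context_tokens(raw, ["The", "capital", "is", "Paris"])
probs = softmax(boosted)
```

In this toy case the bonus flips the preferred token from "London" to "Paris" without retraining the model, which is the sense in which such a method operates purely at decoding time.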
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
This paper presents the first systematic analysis of token consumption in agentic coding tasks across eight frontier LLMs. The core method involves analyzing task trajectories to determine where tokens are spent and evaluating models' ability to predict their own token costs. The key contribution is revealing that agen…
Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
This paper introduces a mechanistic framework to understand and control LLM personalization by identifying "Preference Heads"—attention heads encoding user-specific stylistic and topical preferences. The core method, Differential Preference Steering (DPS), uses causal masking to calculate a Preference Contribution Scor…
BLAST: Benchmarking LLMs with ASP-based Structured Testing
This paper introduces **BLAST**, the first benchmarking methodology and dataset specifically designed to evaluate the ability of Large Language Models (LLMs) to generate **Answer Set Programming (ASP)** code. BLAST employs a structured evaluation framework featuring two novel semantic metrics tailored for ASP code correctnes…
FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
This paper introduces the FETS benchmark to evaluate the application of foundation models (FMs) in energy time series forecasting. The core method involves structuring energy forecasting use cases and collecting 54 diverse datasets to systematically benchmark FMs against traditional dataset-specific models. The main co…
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
This paper introduces the NL2VC-60 dataset to facilitate AI-assisted problem-to-code generation with formal verification. The core method involves a tiered prompting strategy (contextless, signature, and self-healing) that uses feedback from the Dafny verifier to guide Large Language Models (LLMs) in synthesizing code …
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
This paper introduces **OneManCompany (OMC)**, a framework that moves beyond fixed multi-agent structures by introducing an organizational layer. OMC encapsulates agent capabilities as portable **Talents** orchestrated via typed interfaces, enabling dynamic reconfiguration through a **Talent Market** for on-demand recr…
SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
This paper introduces **SSG (Logit-Balanced Vocabulary Partitioning)** to enhance the KGW watermarking scheme, particularly in low-entropy scenarios like code generation where KGW struggles. SSG addresses this by analyzing the "watermark strength" inherent in the next-token probability distribution. The core contributi…
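For background, the baseline KGW scheme biases a pseudo-random "green" subset of the vocabulary at each step. The sketch below covers only that baseline, not SSG's partitioning. The seeding rule, `key`, `gamma`, and `delta` are simplified assumptions for illustration.

```python
import random


def green_list(prev_token_id, vocab_size, gamma=0.5, key=42):
    """Pick a fraction `gamma` of token ids, seeded by the previous token.

    The integer mixing of `key` and `prev_token_id` is a simplified stand-in
    for the keyed hash a real implementation would use.
    """
    rng = random.Random(key * 1_000_003 + prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def watermark_logits(logits, prev_token_id, delta=2.0, gamma=0.5, key=42):
    """Add `delta` to the logits of green-list tokens and leave the rest alone."""
    green = green_list(prev_token_id, len(logits), gamma, key)
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Low-entropy step: token 0 dominates, so the bias cannot change the choice.
peaked = [10.0, 0.0, 0.0, 0.0]
choices = [argmax(watermark_logits(peaked, prev)) for prev in range(20)]
```

Every entry of `choices` is 0 regardless of which tokens land in the green list, because a gap of 10 cannot be closed by a bias of 2. This is the low-entropy regime, common in code generation, where a fixed bias leaves little detectable signal.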