№01
cs.AI arxiv:2604.22748v1

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.

This paper introduces the "Agentic World Modeling" framework, a taxonomy organized by capability levels (Predictor, Simulator, Evolver) and governing law regimes (physical, digital, social, scientific). The core contribution is providing a structured way to understand and evaluate the necessary predictive environment m…

9
№02
cs.AI arxiv:2604.22436v1

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu, Arastun Mammadli, Xiaoyu Zhang et al.

AgentSearchBench is a large-scale benchmark designed to evaluate AI agent search methods in realistic, "in the wild" scenarios, addressing the limitations of existing benchmarks that assume well-specified agents. It formalizes agent search as retrieval and reranking tasks using nearly 10,000 real-world agents, evaluati…

9
№03
cs.AI arxiv:2604.22411v1

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Alberto Messina, Stefano Scotta

This paper introduces the concept of **background temperature ($T_{\mathrm{bg}}$)** to quantify the inherent, implementation-dependent randomness observed in Large Language Models (LLMs) even when the nominal decoding temperature is set to zero. $T_{\mathrm{bg}}$ formalizes the effective temperature induced by environm…

9
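
To make the idea of a nonzero effective temperature concrete, the sketch below uses the standard temperature-scaled softmax to show how nominally greedy decoding becomes a stochastic choice when two logits are nearly tied. The logits and the value of `t_bg` are made-up illustration values, not numbers from the paper, and the paper's own procedure for estimating $T_{\mathrm{bg}}$ is not reproduced here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature), computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens, with a near-tie at the top.
logits = [2.00, 1.98, 0.50]

# Ideal greedy decoding (nominal T = 0) always picks the largest logit.
greedy_choice = max(range(len(logits)), key=lambda i: logits[i])

# Assumed small effective temperature: mass spreads over the near-tied tokens.
t_bg = 0.05
probs = softmax_with_temperature(logits, t_bg)
flip_probability = 1.0 - probs[greedy_choice]
```

With these illustrative values the greedy token is replaced about 40% of the time, while an effective temperature of 0.001 makes a flip practically impossible.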
№04
cs.AI arxiv:2604.22565v1

Learning Evidence Highlighting for Frozen LLMs

Shaoang Li, Yanhang Shi, Yufei Li et al.

This paper introduces **HiLight**, a framework that trains a lightweight **Emphasis Actor** to insert minimal highlight tags around crucial evidence within the original, unaltered context. This approach decouples evidence selection from reasoning, allowing a **frozen LLM Solver** to utilize the emphasized input for imp…

9
№05
cs.AI arxiv:2604.22577v1

QuantClaw: Precision Where It Matters for OpenClaw

Manyi Zhang, Ji-Fu Li, Zhongao Sun et al.

QuantClaw addresses the high cost of large autonomous agents like OpenClaw by dynamically adjusting numerical precision based on task requirements. It analyzes quantization sensitivity across workflows and proposes a plug-and-play routing plugin that assigns lower precision to lightweight tasks and preserves higher pre…

9
№06
cs.AI arxiv:2604.22597v1

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

Erez Yosef, Oron Anschel, Shunit Haviv Hakimi et al.

This paper introduces a robust LLM-as-a-Judge framework to evaluate mathematical reasoning, moving beyond the limitations of rigid symbolic comparison. The core method uses a large language model to assess the correctness of generated answers, accommodating diverse mathematical representations and solution formats. Thi…

9
№07
cs.AI arxiv:2604.22558v1

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

Jichao Wang, Liuyang Bian, Yufeng Zhou et al.

SOLAR-RL addresses the challenge of training GUI agents using MLLMs by bridging the gap between static Offline RL and costly Online RL. The core method integrates global trajectory semantics into offline learning by reconstructing rollouts, identifying the first failure point, and retroactively assigning dense, long-ho…

9
№08
cs.AI arxiv:2604.22452v1

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

Xirui Li, Ming Li, Yunze Xiao et al.

This paper introduces the **Superminds Test**, a hierarchical framework using controlled **Probing Agents** to empirically evaluate the emergence of collective intelligence in large-scale agent societies, specifically using the MoltBook platform. The core contribution is demonstrating a **stark absence of collective in…

9
№09
cs.AI arxiv:2604.22273v1

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

Aofan Liu, Jingxiang Meng

This paper models LLM self-correction as a control-theoretic feedback loop using a two-state Markov process to diagnose when iteration is beneficial. The core contribution is identifying a critical threshold (near-zero Error Introduction Rate, EIR $\le 0.5\%$) that separates helpful from harmful self-correction across …

9
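
A two-state Markov view of self-correction can be sketched as follows. The summary names the Error Introduction Rate (EIR); the complementary correction rate `ecr` (the probability that a wrong answer becomes right in one round), the function names, and all numeric values are assumptions for illustration, and the paper's exact parameterization may differ.

```python
def accuracy_after_rounds(initial_accuracy, eir, ecr, rounds):
    """Propagate P(correct) through a two-state Markov chain.

    eir: P(correct -> incorrect) per round (Error Introduction Rate).
    ecr: P(incorrect -> correct) per round (hypothetical name).
    """
    acc = initial_accuracy
    for _ in range(rounds):
        acc = acc * (1.0 - eir) + (1.0 - acc) * ecr
    return acc

def stationary_accuracy(eir, ecr):
    """Long-run accuracy of the chain; iterating helps only when starting below it."""
    if eir + ecr == 0:
        raise ValueError("eir and ecr cannot both be zero")
    return ecr / (ecr + eir)

# Illustrative values only: same starting accuracy, two different EIRs.
start = 0.90
low_eir = accuracy_after_rounds(start, eir=0.005, ecr=0.10, rounds=3)
high_eir = accuracy_after_rounds(start, eir=0.05, ecr=0.10, rounds=3)
```

In this toy setting, accuracy drifts toward `ecr / (ecr + eir)`, so a strong model gains from iteration only when EIR is small enough to keep that fixed point above its starting accuracy.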
№10
cs.LG arxiv:2604.22271v1

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

Dharshan Kumaran, Viorica Patraucean, Simon Osindero et al.

This paper investigates how LLMs detect and correct their own errors by examining the role of internal confidence signals, specifically the "post-answer newline" (PANL) token representation. Drawing on second-order decision models, the authors hypothesize that this PANL signal, which is partially independent of the pri…

9
№11
cs.LG arxiv:2604.22575v1

SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

Yuqi Pan, Jinghao Zhuang, Yupeng Feng et al.

SpikingBrain2.0 introduces a novel foundation model architecture, SpB2.0, designed for efficient long-context inference. Its core method is the Dual-Space Sparse Attention (DSSA) mechanism, which hybridizes sparse attention types for a better performance-efficiency trade-off. The contribution lies in achieving high performan…

9
№12
cs.CL arxiv:2604.22661v1

Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

Negar Arabzadeh, Andrew Drozdov, Michael Bendersky et al.

This paper investigates using Query Performance Prediction (QPP) to select the optimal query variant within Retrieval-Augmented Generation (RAG) pipelines, avoiding costly execution of all reformulations. The core method focuses on **intra-topic discrimination**, where QPP predicts the best variant among semantically e…

9
№13
cs.CL arxiv:2604.22335v1

Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding

Weixu Zhang, Fanghua Ye, Qiang Gao et al.

This paper introduces Context-Fidelity Boosting (CFB), a lightweight, decoding-time framework designed to reduce faithfulness hallucinations in LLMs by prioritizing context-supported tokens. Inspired by watermarking, CFB applies additive logit adjustments based on a token's support from the input context, utilizing sta…

9
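
The additive logit adjustment can be illustrated with a deliberately simplified sketch: tokens that literally occur in the input context receive a constant boost `delta` before the softmax. The binary support test, the `delta` value, the function names, and the toy vocabulary are all assumptions; the summary indicates that CFB derives token support from richer signals, which are not reproduced here.

```python
import math

def boost_context_tokens(logits, vocab, context_tokens, delta):
    """Add `delta` to the logit of every vocabulary token found in the context."""
    support = set(context_tokens)
    return [z + delta if tok in support else z for z, tok in zip(logits, vocab)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the model's prior slightly prefers "Lyon",
# but only "Paris" is supported by the input context.
vocab = ["Paris", "Lyon", "Rome"]
logits = [1.0, 1.2, 0.3]
context = ["the", "capital", "is", "Paris"]

before = softmax(logits)
after = softmax(boost_context_tokens(logits, vocab, context, delta=2.0))
```

Because the adjustment is applied at decoding time only, a sketch like this leaves the model weights untouched; the strength of the boost is the main knob.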
№14
cs.CL arxiv:2604.22750v1

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Longju Bai, Zhemin Huang, Xingyao Wang et al.

This paper presents the first systematic analysis of token consumption in agentic coding tasks across eight frontier LLMs. The core method involves analyzing task trajectories to determine where tokens are spent and evaluating models' ability to predict their own token costs. The key contribution is revealing that agen…

9
№15
cs.CL arxiv:2604.22345v1

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

Weixu Zhang, Ye Yuan, Changjiang Han et al.

This paper introduces a mechanistic framework to understand and control LLM personalization by identifying "Preference Heads"—attention heads encoding user-specific stylistic and topical preferences. The core method, Differential Preference Steering (DPS), uses causal masking to calculate a Preference Contribution Scor…

9
№16
cs.AI arxiv:2604.22306v1

BLAST: Benchmarking LLMs with ASP-based Structured Testing

Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri et al.

This paper introduces **BLAST**, the first benchmarking methodology and dataset specifically designed to evaluate Large Language Models' (LLMs) ability to generate **Answer Set Programming (ASP)** code. BLAST employs a structured evaluation framework featuring two novel semantic metrics tailored for ASP code correctnes…

8
№17
cs.AI arxiv:2604.22328v1

FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

Marco Obermeier, Marco Pruckner, Florian Haselbeck et al.

This paper introduces the FETS benchmark to evaluate the application of foundation models (FMs) in energy time series forecasting. The core method involves structuring energy forecasting use cases and collecting 54 diverse datasets to systematically benchmark FMs against traditional dataset-specific models. The main co…

8
№18
cs.AI arxiv:2604.22601v1

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan et al.

This paper introduces the NL2VC-60 dataset to facilitate AI-assisted problem-to-code generation with formal verification. The core method involves a tiered prompting strategy (contextless, signature, and self-healing) that uses feedback from the Dafny verifier to guide Large Language Models (LLMs) in synthesizing code …

8
№19
cs.AI arxiv:2604.22446v1

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Zhengxu Yu, Yu Fu, Zhiyuan He et al.

This paper introduces **OneManCompany (OMC)**, a framework that moves beyond fixed multi-agent structures by introducing an organizational layer. OMC encapsulates agent capabilities as portable **Talents** orchestrated via typed interfaces, enabling dynamic reconfiguration through a **Talent Market** for on-demand recr…

8
№20
cs.AI arxiv:2604.22438v1

SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking

Chenxi Gu, Xiaoning Du, John Grundy

This paper introduces **SSG (Logit-Balanced Vocabulary Partitioning)** to enhance the KGW watermarking scheme, particularly in low-entropy scenarios like code generation where KGW struggles. SSG addresses this by analyzing the "watermark strength" inherent in the next-token probability distribution. The core contributi…

8
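
The low-entropy weakness of KGW-style watermarking mentioned above can be seen in a small sketch. It implements a simplified green/red vocabulary partition with a logit bonus `delta` for green tokens, not SSG itself; seeding Python's `random.Random` directly with the previous token id stands in for the keyed hash a real scheme would use, and the vocabulary size, `gamma`, `delta`, and logits are illustrative assumptions.

```python
import math
import random

def green_list(prev_token_id, vocab_size, gamma=0.5):
    """Pseudo-randomly pick a gamma fraction of the vocabulary, seeded by the previous token."""
    rng = random.Random(prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def green_mass(logits, green, delta):
    """Probability mass on green tokens after adding delta to their logits."""
    boosted = [z + delta if i in green else z for i, z in enumerate(logits)]
    m = max(boosted)
    exps = [math.exp(z - m) for z in boosted]
    total = sum(exps)
    return sum(e for i, e in enumerate(exps) if i in green) / total

vocab_size = 8
green = green_list(prev_token_id=42, vocab_size=vocab_size)
red_token = next(i for i in range(vocab_size) if i not in green)

flat_logits = [0.0] * vocab_size        # high entropy: every token equally likely
peaked_logits = [0.0] * vocab_size      # low entropy: a single red token dominates
peaked_logits[red_token] = 10.0

high_entropy_mass = green_mass(flat_logits, green, delta=2.0)
low_entropy_mass = green_mass(peaked_logits, green, delta=2.0)
```

With a flat distribution the bonus shifts most of the probability mass onto green tokens, whereas with a sharply peaked distribution (as is common in code generation) the same bonus barely changes what gets sampled, so little watermark signal is embedded.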