From the arXiv
Monday, 4 May 2026 · 20 papers
LLM-Oriented Information Retrieval: A Denoising-First Perspective
This paper argues that the shift to LLM-centric information retrieval (IR) makes noise a critical bottleneck, causing hallucinations and reasoning failures due to limited LLM attention. The core contribution is conceptualizing this paradigm shift through a four-stage framework of IR challenges (inaccessible to unverifi…
Position: agentic AI orchestration should be Bayes-consistent
This paper argues that while making Large Language Models (LLMs) themselves explicitly Bayesian is difficult, the **orchestration layer** of agentic AI systems should adopt **Bayesian Decision Theory (BDT)**. This provides a principled framework for managing uncertainty, updating beliefs based on interactions, and maki…
SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
SAGA addresses the inefficiency of scheduling independent LLM calls for AI agent workflows on GPU clusters by shifting to **program-level scheduling**. It treats the entire agent workflow as the first-class schedulable unit, using Agent Execution Graphs to predict and reuse intermediate states (like KV caches) across t…
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
This paper systematically analyzes the performance and efficiency trade-offs for running large LLMs (70B+ parameters) on consumer hardware, comparing Nvidia and Apple Silicon. It identifies a "Backend Dichotomy" on Nvidia, where the new NVFP4 format boosts throughput significantly but imposes runtime latency constraint…
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
This paper introduces a principled framework, inspired by decision-making theory, to assess and optimize when Large Language Models (LLMs) should use external tools, focusing specifically on web search. The framework evaluates tool-use decisions based on necessity, utility, and affordability, using both normative (opti…
Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
This paper introduces the Obfuscated Natural Number Game to evaluate LLMs' **Architectural Reasoning**, defined as synthesizing proofs using only local axioms in an unfamiliar domain. By renaming identifiers in the Lean 4 Natural Number Game, they created a zero-knowledge benchmark. The study found that while obfuscati…
RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution
RunAgent is a multi-agent platform designed to reliably execute natural-language plans by enforcing stepwise execution through constraints and rubrics. It translates flexible natural language into a deterministic, agentic language with explicit control flow constructs. The core contribution is its ability to autonomous…
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
This paper introduces **Stable-GFlowNet (S-GFN)** to improve the stability and diversity of LLM red-teaming using Generative Flow Networks (GFNs). S-GFN achieves stability by eliminating the need for partition function ($Z$) estimation via pairwise comparisons and using robust masking against noisy rewards. This result…
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ introduces a novel quantization scheme for memory-efficient LLM training by employing layer-aware quantization for near 4-bit activations and precision-preserving 8-bit quantization for gradients. This method effectively reduces GPU memory usage by up to 52% and accelerates training speed by up to 1.34$\times$ com…
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
This paper introduces **MathArena** as a continuously maintained evaluation platform designed to overcome the limitations of static benchmarks for assessing LLM mathematical reasoning. It significantly broadens the original scope to include diverse tasks like proof generation, research-level problems, and competition m…
FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
FinSafetyBench is a bilingual (English-Chinese) red-teaming benchmark designed to systematically evaluate the safety and compliance refusal capabilities of Large Language Models (LLMs) in real-world financial scenarios. Grounded in actual financial crime cases, it comprises 14 subcategories testing violations across fi…
ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models
This paper introduces **ML-Bench**, a novel multilingual safety benchmark grounded directly in regional regulations across 14 languages, moving beyond general risk taxonomies. This policy-grounded approach allows for culturally and legally aligned safety evaluation. Based on this benchmark, the authors also develop **M…
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
ReLay introduces a novel dataset of participant-summary pairs to study the effectiveness of personalized Plain Language Summaries (PLS) generated by Large Language Models (LLMs). The core method involves comparing static, expert-written summaries against LLM-personalized summaries across various user characteristics an…
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
This paper introduces a diagnostic benchmark to evaluate whether Large Language Models (LLMs) faithfully execute multi-step arithmetic procedures provided in prompts, moving beyond just final answer accuracy. The study reveals that as procedure length increases, model accuracy significantly degrades, showing failures l…
Can Coding Agents Reproduce Findings in Computational Materials Science?
This paper introduces **AutoMat**, a new benchmark designed to evaluate the capability of LLM-based coding agents to reproduce findings in computational materials science. AutoMat tests agents on three core challenges: recovering underspecified procedures, navigating specialized toolchains, and validating scientific cl…
Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment
This paper addresses the challenge of applying Graph Foundation Models to multi-domain heterogeneous graphs by proposing Decoupled Relation Subspace Alignment (DRSA). DRSA shifts the paradigm from blind global feature alignment to a relation-driven approach that explicitly decouples feature semantics from relation stru…
Jailbreaking Vision-Language Models Through the Visual Modality
This paper introduces four novel jailbreaking attacks that specifically exploit the visual modality of Vision-Language Models (VLMs) to bypass safety alignment. The core contribution is demonstrating a significant cross-modality alignment gap, showing that text-based safety training fails to generalize when harmful int…
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
This paper introduces GUI-SD, the first On-Policy Self-Distillation (OPSD) framework specifically designed for GUI grounding. It addresses the limitations of traditional reinforcement learning by generating dense, token-level supervision from a single agent rollout. The core method uses a visually enriched context for …
Make Your LVLM KV Cache More Lightweight
LightKV addresses the significant GPU memory overhead of KV caches in LVLMs caused by numerous vision tokens during prefill. The core method uses prompt-aware, cross-modality message passing to aggregate and progressively compress redundant vision-token embeddings. This results in halving the vision-token KV cache size…
Space Network of Experts: Architecture and Expert Placement
This paper introduces the **Space Network of Experts (Space-XNet)** framework to efficiently deploy large language models (LLMs) on resource-constrained satellite networks for space-based AI. The core method involves a **two-level expert placement strategy** that partitions and maps Mixture-of-Experts (MoE) model compo…