№01
cs.AI arxiv:2605.00505v1

LLM-Oriented Information Retrieval: A Denoising-First Perspective

Lu Dai, Liang Sun, Fanpu Cao et al.

This paper argues that the shift to LLM-centric information retrieval (IR) makes noise a critical bottleneck, causing hallucinations and reasoning failures due to limited LLM attention. The core contribution is conceptualizing this paradigm shift through a four-stage framework of IR challenges (inaccessible to unverifi…

9
№02
cs.AI arxiv:2605.00742v1

Position: agentic AI orchestration should be Bayes-consistent

Theodore Papamarkou, Pierre Alquier, Matthias Bauer et al.

This paper argues that while making Large Language Models (LLMs) themselves explicitly Bayesian is difficult, the **orchestration layer** of agentic AI systems should adopt **Bayesian Decision Theory (BDT)**. This provides a principled framework for managing uncertainty, updating beliefs based on interactions, and maki…

9
№03
cs.AI arxiv:2605.00528v1

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Dongxin Guo, Jikun Wu, Siu Ming Yiu

SAGA addresses the inefficiency of scheduling independent LLM calls for AI agent workflows on GPU clusters by shifting to **program-level scheduling**. It treats the entire agent workflow as the first-class schedulable unit, using Agent Execution Graphs to predict and reuse intermediate states (like KV caches) across t…

9
№04
cs.AI arxiv:2605.00519v1

Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

Allan Kazakov, Abdurrahman Javat

This paper systematically analyzes the performance and efficiency trade-offs for running large LLMs (70B+ parameters) on consumer hardware, comparing Nvidia and Apple Silicon. It identifies a "Backend Dichotomy" on Nvidia, where the new NVFP4 format boosts throughput significantly but imposes runtime latency constraint…

9
№05
cs.AI arxiv:2605.00737v1

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Qinyuan Wu, Soumi Das, Mahsa Amani et al.

This paper introduces a principled framework, inspired by decision-making theory, to assess and optimize when Large Language Models (LLMs) should use external tools, focusing specifically on web search. The framework evaluates tool-use decisions based on necessity, utility, and affordability, using both normative (opti…

9
№06
cs.LG arxiv:2605.00677v1

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Lixing Li

This paper introduces the Obfuscated Natural Number Game to evaluate LLMs' **Architectural Reasoning**, defined as synthesizing proofs using only local axioms in an unfamiliar domain. By renaming identifiers in the Lean 4 Natural Number Game, the author creates a zero-knowledge benchmark. The study finds that while obfuscati…

9
№07
cs.LG arxiv:2605.00798v1

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Arunabh Srivastava, Mohammad A. Khojastepour et al.

RunAgent is a multi-agent platform designed to reliably execute natural-language plans by enforcing stepwise execution through constraints and rubrics. It translates flexible natural language into a deterministic, agentic language with explicit control flow constructs. The core contribution is its ability to autonomous…

9
№08
cs.LG arxiv:2605.00553v1

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Minchan Kwon, Sunghyun Baek, Minseo Kim et al.

This paper introduces **Stable-GFlowNet (S-GFN)** to improve the stability and diversity of LLM red-teaming using Generative Flow Networks (GFNs). S-GFN achieves stability by eliminating the need for partition function ($Z$) estimation via pairwise comparisons and using robust masking against noisy rewards. This result…

9
№09
cs.CL arxiv:2605.00539v1

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Wenxiang Lin, Juntao Huang, Luhan Zhang et al.

AGoQ introduces a novel quantization scheme for memory-efficient LLM training that applies layer-aware quantization to bring activations to near 4 bits and precision-preserving 8-bit quantization to gradients. The method reduces GPU memory usage by up to 52% and speeds up training by up to 1.34$\times$ com…

9
№10
cs.CL arxiv:2605.00674v1

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger et al.

This paper introduces **MathArena** as a continuously maintained evaluation platform designed to overcome the limitations of static benchmarks for assessing LLM mathematical reasoning. It significantly broadens the original scope to include diverse tasks like proof generation, research-level problems, and competition m…

9
№11
cs.CL arxiv:2605.00706v1

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Yutao Hou, Yihan Jiang, Yuhan Xie et al.

FinSafetyBench is a bilingual (English-Chinese) red-teaming benchmark designed to systematically evaluate the safety and compliance refusal capabilities of Large Language Models (LLMs) in real-world financial scenarios. Grounded in actual financial crime cases, it comprises 14 subcategories testing violations across fi…

9
№12
cs.CL arxiv:2605.00689v1

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma et al.

This paper introduces **ML-Bench**, a novel multilingual safety benchmark grounded directly in regional regulations across 14 languages, moving beyond general risk taxonomies. This policy-grounded approach allows for culturally and legally aligned safety evaluation. Based on this benchmark, the authors also develop **M…

9
№13
cs.CL arxiv:2605.00468v1

ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

Joey Chan, Yikun Han, Jingyuan Chen et al.

ReLay introduces a novel dataset of participant-summary pairs to study the effectiveness of personalized Plain Language Summaries (PLS) generated by Large Language Models (LLMs). The core method involves comparing static, expert-written summaries against LLM-personalized summaries across various user characteristics an…

9
№14
cs.CL arxiv:2605.00817v1

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.

This paper introduces a diagnostic benchmark to evaluate whether Large Language Models (LLMs) faithfully execute multi-step arithmetic procedures provided in prompts, moving beyond just final answer accuracy. The study reveals that as procedure length increases, model accuracy significantly degrades, showing failures l…

9
№15
cs.AI arxiv:2605.00803v1

Can Coding Agents Reproduce Findings in Computational Materials Science?

Ziyang Huang, Yi Cao, Ali K. Shargh et al.

This paper introduces **AutoMat**, a new benchmark designed to evaluate the capability of LLM-based coding agents to reproduce findings in computational materials science. AutoMat tests agents on three core challenges: recovering underspecified procedures, navigating specialized toolchains, and validating scientific cl…

8
№16
cs.AI arxiv:2605.00731v1

Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment

Ziyu Zheng, Yaming Yang, Zhe Wang et al.

This paper addresses the challenge of applying Graph Foundation Models to multi-domain heterogeneous graphs by proposing Decoupled Relation Subspace Alignment (DRSA). DRSA shifts the paradigm from blind global feature alignment to a relation-driven approach that explicitly decouples feature semantics from relation stru…

8
№17
cs.AI arxiv:2605.00583v1

Jailbreaking Vision-Language Models Through the Visual Modality

Aharon Azulay, Jan Dubiński, Zhuoyun Li et al.

This paper introduces four novel jailbreaking attacks that specifically exploit the visual modality of Vision-Language Models (VLMs) to bypass safety alignment. The core contribution is demonstrating a significant cross-modality alignment gap, showing that text-based safety training fails to generalize when harmful int…

8
№18
cs.AI arxiv:2605.00642v1

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen et al.

This paper introduces GUI-SD, the first On-Policy Self-Distillation (OPSD) framework specifically designed for GUI grounding. It addresses the limitations of traditional reinforcement learning by generating dense, token-level supervision from a single agent rollout. The core method uses a visually enriched context for …

8
№19
cs.AI arxiv:2605.00789v1

Make Your LVLM KV Cache More Lightweight

Xihao Chen, Yangyang Guo, Roger Zimmermann

LightKV addresses the significant GPU memory overhead of KV caches in LVLMs caused by numerous vision tokens during prefill. The core method uses prompt-aware, cross-modality message passing to aggregate and progressively compress redundant vision-token embeddings. This halves the vision-token KV cache size…

8
№20
cs.AI arxiv:2605.00515v1

Space Network of Experts: Architecture and Expert Placement

Zhanwei Wang, Huiling Yang, Min Sheng et al.

This paper introduces the **Space Network of Experts (Space-XNet)** framework to efficiently deploy large language models (LLMs) on resource-constrained satellite networks for space-based AI. The core method involves a **two-level expert placement strategy** that partitions and maps Mixture-of-Experts (MoE) model compo…

8