Weekly Issue
Collected dispatches

2026-W20

2026-05-04 to 2026-05-10
100 papers
7 daily issues
A weekly ledger drawn from the daily archive. 3 sections
§ I

The Week in Review

Editorial summary

The past week in AI research showed significant trends centered around system reliability, robust agentic operation, safety/alignment hardening, and infrastructure efficiency.

Popular Directions & Notable Advances:

1. Agentic Robustness and Formal Control: A strong thematic advance was the push for more principled control over AI agents. This included arguing for Bayesian Decision Theory (BDT) in agent orchestration for uncertainty management, and the introduction of RunAgent for constraint-guided, deterministic execution of natural-language plans. Furthermore, papers addressed procedural faithfulness, showing that LLMs often fail to execute long, intricate steps faithfully, highlighting a gap between apparent reasoning and execution reliability.
2. Safety, Red-Teaming, and Alignment Hardening: Safety research evolved beyond basic prompts. New benchmarks like FinSafetyBench and ML-Bench grounded evaluation in specific regulatory and financial contexts. Red-teaming saw sophistication via ContextualJailbreak (evolutionary multi-turn attacks) and Stable-GFlowNet (diverse, stable attack generation). Crucially, one paper addressed cross-modality risks, showing jailbreaking via visual input in VLMs.
3. Efficiency and Infrastructure Optimization: Focus remained on mitigating operational costs. SAGA advanced agent efficiency by adopting program-level scheduling to reuse GPU states across tool calls. Memory concerns were addressed via quantization techniques like AGoQ for training LLMs and LightKV for reducing the high memory footprint of vision tokens in LVLMs. Consumer hardware analysis (Silicon Showdown) confirmed persistent VRAM limitations for large models.
4. Domain Specialization and Reasoning Evaluation: LLMs were rigorously tested in specialized fields. MathArena provided a continuous platform for mathematical evaluation, while AutoMat assessed agent reproducibility in computational materials science, revealing challenges in synthesizing scientific findings. A new benchmark, the Obfuscated Natural Number Game, tested deep architectural reasoning independent of memorized patterns.

Significant Shifts:

There was a notable shift from viewing LLMs merely as black-box answer generators to seeing them as components within complex, orchestrated systems (Agents and IR pipelines). The IR paper framed the challenge as denoising rather than just retrieval. Simultaneously, the "AI-Generated Smells" paper introduced the Reasoning-Complexity Trade-off, suggesting that more complex, superior-performing LLM-generated code often suffers from greater structural degradation, challenging the sole focus on functional correctness in AI-assisted software development. Finally, research on misalignment contagion signaled a new concern regarding unintended negative behavior transfer in multi-agent simulation environments.

§ II

Top Papers

Selected research 100
cs.CL · arxiv:2605.03799v1 · Lead article

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Mullosharaf K. Arabov

This paper presents a comprehensive practicum that guides readers through the entire modern NLP pipeline, from tokenization to RLHF. Its core contribution is twelve reproducible research artifacts, with each session requiring public release of code and models, all built around a single evolving corpus. The work emphasizes open-weight models and enriches the material with original research on low-resource languages such as Tajik and Tatar.

cs.AI · arxiv:2605.00505v1 · Lead article

LLM-Oriented Information Retrieval: A Denoising-First Perspective

Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang

This paper argues that the shift to LLM-centric information retrieval (IR) makes noise a critical bottleneck, causing hallucinations and reasoning failures due to limited LLM attention. The core contribution is conceptualizing this paradigm shift through a four-stage framework of IR challenges (from inaccessible to unverifiable) and providing a comprehensive taxonomy of signal-to-noise optimization techniques across the entire IR pipeline.

Figure 1. Challenge shifts in the history of IR.
cs.AI · arxiv:2605.00742v1 · Lead article

Position: agentic AI orchestration should be Bayes-consistent

Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison

This paper argues that while making Large Language Models (LLMs) themselves explicitly Bayesian is difficult, the **orchestration layer** of agentic AI systems should adopt **Bayesian Decision Theory (BDT)**. This provides a principled framework for managing uncertainty, updating beliefs based on interactions, and making coherent decisions about which tools or actions to take. The core contribution is positioning BDT as the necessary control mechanism for robust agentic AI.
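
To make the idea concrete, here is a minimal sketch of what a Bayes-consistent orchestration step could look like. It is an illustration, not the authors' method: the tool names, costs, reward, and the choice of a Beta belief over each tool's success probability are all assumptions made for this example. The orchestrator updates its belief after each observed outcome and picks the tool with the highest expected utility.

```python
from dataclasses import dataclass


@dataclass
class ToolBelief:
    """Beta(alpha, beta) belief over a tool's success probability (assumed model)."""
    alpha: float = 1.0
    beta: float = 1.0

    def mean(self) -> float:
        # Posterior mean of the success probability.
        return self.alpha / (self.alpha + self.beta)

    def update(self, success: bool) -> None:
        # Conjugate Bayesian update for a Bernoulli outcome.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def choose_tool(beliefs, reward, cost):
    """Pick the tool maximizing expected utility: P(success) * reward - cost."""
    def expected_utility(name):
        return beliefs[name].mean() * reward - cost[name]
    return max(beliefs, key=expected_utility)


# Hypothetical tools and costs, for illustration only.
beliefs = {"search": ToolBelief(), "calculator": ToolBelief()}
cost = {"search": 0.2, "calculator": 0.05}

first_choice = choose_tool(beliefs, reward=1.0, cost=cost)

# Three failed calculator calls shift the belief, and with it the decision.
for _ in range(3):
    beliefs["calculator"].update(success=False)
second_choice = choose_tool(beliefs, reward=1.0, cost=cost)
```

With the uniform prior the cheaper calculator is preferred; after the failures its posterior mean drops to 0.2 and the orchestrator switches to search.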

cs.AI · arxiv:2605.00528v1 · Lead article

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Dongxin Guo, Jikun Wu, Siu Ming Yiu

SAGA addresses the inefficiency of scheduling independent LLM calls for AI agent workflows on GPU clusters by shifting to **program-level scheduling**. It treats the entire agent workflow as the first-class schedulable unit, using Agent Execution Graphs to predict and reuse intermediate states (like KV caches) across tool calls. This approach significantly reduces end-to-end latency by minimizing state discarding compared to traditional request-level scheduling.

cs.AI · arxiv:2605.00519v1 · Lead article

Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

Allan Kazakov, Abdurrahman Javat

This paper systematically analyzes the performance and efficiency trade-offs for running large LLMs (70B+ parameters) on consumer hardware, comparing Nvidia and Apple Silicon. It identifies a "Backend Dichotomy" on Nvidia, where the new NVFP4 format boosts throughput significantly but imposes runtime latency constraints. The research also highlights the "VRAM Wall" on discrete GPUs, forcing users into a detrimental choice between model size and intelligence due to memory limitations.

cs.AI · arxiv:2605.00737v1 · Lead article

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee

This paper introduces a principled framework, inspired by decision-making theory, to assess and optimize when Large Language Models (LLMs) should use external tools, focusing specifically on web search. The framework evaluates tool-use decisions based on necessity, utility, and affordability, using both normative (optimal allocation) and descriptive (observed behavior) perspectives. This allows for a comprehensive understanding of the trade-offs involved in LLM tool calling.

Given input \( x \), the model \( \mathcal{M} \) decides \( \pi(x) \in \{0,1\} \) whether to call a tool (response \( r \)) or not, producing \( y = \mathcal{M}(x,r) \) or \( y = \mathcal{M}(x) \). We compare NO TOOL, ALWAYS TOOL, and SELF-DECISION, and evaluate decisions via need (requires help), utility (performance gain), and affordability (cost vs. gain), distinguishing perceived vs. true quantities.
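
The three criteria in the caption can be sketched as a simple decision rule. This is a toy illustration under stated assumptions, not the paper's implementation: the threshold, the cost, the probabilities, and the function names are all hypothetical, and the sketch uses "true" quantities where a real model would only have its perceived estimates.

```python
def should_call_tool(p_without, p_with, cost, need_threshold=0.9):
    """Hypothetical rule: call only if the tool is needed and affordable."""
    need = p_without < need_threshold   # model is likely to fail unaided
    utility = p_with - p_without        # expected performance gain from the tool
    affordable = utility > cost         # the gain must exceed the call cost
    return need and affordable


def policy_value(examples, cost, policy):
    """Sum of expected accuracy minus tool cost under a calling policy."""
    total = 0.0
    for p_without, p_with in examples:
        if policy == "no_tool":
            call = False
        elif policy == "always_tool":
            call = True
        else:  # "self_decision"
            call = should_call_tool(p_without, p_with, cost)
        total += (p_with - cost) if call else p_without
    return total


# Made-up (p_without, p_with) pairs, for illustration only.
examples = [(0.95, 0.96), (0.30, 0.90), (0.50, 0.55)]
values = {
    policy: policy_value(examples, cost=0.1, policy=policy)
    for policy in ("no_tool", "always_tool", "self_decision")
}
```

On these made-up numbers the selective policy scores highest, because it skips the call when the model is already accurate or when the gain is smaller than the cost.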
cs.LG · arxiv:2605.00677v1 · Lead article

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Lixing Li

This paper introduces the Obfuscated Natural Number Game to evaluate LLMs' **Architectural Reasoning**, defined as synthesizing proofs using only local axioms in an unfamiliar domain. By renaming identifiers in the Lean 4 Natural Number Game, the author created a zero-knowledge benchmark. The study found that while obfuscation universally increases inference time, general models degrade in performance while specialized reasoning models maintain accuracy.

LLM performance metrics across varying noise levels \( \lambda \). The plots illustrate Correct Rate (%) and Average Time (s) for GPT-4o, Claude-Sonnet-4.5, DeepSeek-R1, GPT-5, and DeepSeek-Prover-V2. Error bars represent standard deviation over 5 independent runs.
cs.LG · arxiv:2605.00798v1 · Lead article

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Arunabh Srivastava, Mohammad A. Khojastepour, Srimat Chakradhar, Sennur Ulukus

RunAgent is a multi-agent platform designed to reliably execute natural-language plans by enforcing stepwise execution through constraints and rubrics. It translates flexible natural language into a deterministic, agentic language with explicit control flow constructs. The core contribution is its ability to autonomously derive and validate constraints at each step, dynamically select appropriate execution methods (reasoning, tools, or code), and incorporate error correction for robust plan completion.

An overview of RunAgent, highlighting its three main modules.
cs.LG · arxiv:2605.00553v1 · Lead article

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han

This paper introduces **Stable-GFlowNet (S-GFN)** to improve the stability and diversity of LLM red-teaming using Generative Flow Networks (GFNs). S-GFN achieves stability by eliminating the need for partition function ($Z$) estimation via pairwise comparisons and using robust masking against noisy rewards. This results in more stable training, leading to superior and more diverse attack performance for identifying LLM vulnerabilities.
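
Why pairwise comparisons remove the partition function can be seen in a small sketch. It assumes the standard trajectory-balance residual for autoregressive generation, log Z + log P_F(trajectory) - log R(x), and a squared difference of residuals between two trajectories; the paper's exact contrastive objective and its masking scheme may differ, and all names and numbers below are illustrative.

```python
def tb_residual(log_pf, log_reward, log_z):
    """Standard trajectory-balance residual (assumes backward probability 1)."""
    return log_z + log_pf - log_reward


def pairwise_loss(traj_a, traj_b):
    """Squared difference of residuals; the shared log Z term cancels."""
    log_pf_a, log_r_a = traj_a
    log_pf_b, log_r_b = traj_b
    diff = (log_pf_a - log_r_a) - (log_pf_b - log_r_b)
    return diff ** 2


# Illustrative (log P_F, log R) values for two sampled attack prompts.
traj_a = (-3.0, -1.0)
traj_b = (-5.0, -2.5)
loss = pairwise_loss(traj_a, traj_b)

# The same value results for any guess of log Z, so Z never has to be learned.
same_for_any_z = [
    (tb_residual(*traj_a, log_z=z) - tb_residual(*traj_b, log_z=z)) ** 2
    for z in (-4.0, 0.0, 7.0)
]
```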

cs.CL · arxiv:2605.00539v1 · Lead article

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao

AGoQ introduces a novel quantization scheme for memory-efficient LLM training by employing layer-aware quantization for near 4-bit activations and precision-preserving 8-bit quantization for gradients. This method effectively reduces GPU memory usage by up to 52% and accelerates training speed by up to 1.34$\times$ compared to existing techniques, overcoming convergence issues associated with aggressive low-bit quantization.

An example of Interleaved 1F1B PP with four stages and each mini-batch divided into eight micro-batches.
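
For readers unfamiliar with low-bit quantization, the sketch below shows generic symmetric per-tensor 8-bit quantization of a gradient vector. It is a textbook scheme chosen for illustration, not AGoQ's layer-aware or precision-preserving method, and the function names and values are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Illustrative gradient values only.
gradients = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(gradients)
restored = dequantize(quantized, scale)

# Rounding error per element is bounded by half a quantization step.
max_error = max(abs(g - r) for g, r in zip(gradients, restored))
```

Storing one byte per value plus a single scale is where the memory saving comes from; the bounded rounding error is the price paid.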
cs.CL · arxiv:2605.00674v1 · Lead article

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov

This paper introduces **MathArena** as a continuously maintained evaluation platform designed to overcome the limitations of static benchmarks for assessing LLM mathematical reasoning. It significantly broadens the original scope to include diverse tasks like proof generation, research-level problems, and competition math. The core contribution is providing a comprehensive, regularly updated system for reliable, longitudinal comparison of LLM capabilities across a wide spectrum of mathematical challenges.

cs.CL · arxiv:2605.00706v1 · Lead article

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Yutao Hou, Yihan Jiang, Yuhan Xie, Jian Yang, Liwen Zhang

FinSafetyBench is a bilingual (English-Chinese) red-teaming benchmark designed to systematically evaluate the safety and compliance refusal capabilities of Large Language Models (LLMs) in real-world financial scenarios. Grounded in actual financial crime cases, it comprises 14 subcategories testing violations across financial crimes and ethics. The benchmark reveals critical vulnerabilities in LLMs, showing stronger susceptibility in Chinese contexts and limitations of current prompt-level defenses against sophisticated attacks.

Overview of the FinSafetyBench pipeline, which consists of extraction and summarization of real-world financial cases, controlled rephrasing with harmfulness verification, selection and integration of public datasets, bilingual alignment, and deduplication with final dataset assembly. The right panel presents an illustrative example of ethical violations. Drawing on real-world cases, FinSafetyBench incorporates more realistic details. Green highlights distinctive features (differences), while red indicates similarities.
cs.CL · arxiv:2605.00689v1 · Lead article

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang, Bo Li

This paper introduces **ML-Bench**, a novel multilingual safety benchmark grounded directly in regional regulations across 14 languages, moving beyond general risk taxonomies. This policy-grounded approach allows for culturally and legally aligned safety evaluation. Based on this benchmark, the authors also develop **ML-Guard**, a Diffusion LLM-based guardrail model designed for multilingual safety judgment.

Overview of ML-Guard. ML-Guard is trained on ML-Bench. ML-Guard-1.5B performs fast binary safety classification, while ML-Guard-7B supports both safety assessment and policy-conditioned violation checking.
cs.CL · arxiv:2605.00468v1 · Lead article

ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

Joey Chan, Yikun Han, Jingyuan Chen, Samuel Fang, Lauren D. Gryboski

ReLay introduces a novel dataset of participant-summary pairs to study the effectiveness of personalized Plain Language Summaries (PLS) generated by Large Language Models (LLMs). The core method involves comparing static, expert-written summaries against LLM-personalized summaries across various user characteristics and needs. The contribution is demonstrating that personalization can improve comprehension while providing a benchmark dataset to evaluate personalization strategies and their associated costs.

ReLay construction illustration. Of the 397 recruited participants, 50 met eligibility criteria and completed both delivery settings, each involving three scientific abstracts. For the first three abstracts, participants reported their familiarity with terms selected by three medical expert annotators, indicated any additional information needs, read an expert-written PLS, and answered comprehension and evaluation questions curated by the same experts. For the three remaining abstracts, participants conversed with a chatbot, received a personalized PLS, and answered the same expert-selected comprehension and evaluation questions.
cs.CL · arxiv:2605.00817v1 · Lead article

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

This paper introduces a diagnostic benchmark to evaluate whether Large Language Models (LLMs) faithfully execute multi-step arithmetic procedures provided in prompts, moving beyond just final answer accuracy. The study reveals that as procedure length increases, model accuracy significantly degrades, showing failures like missing steps, premature termination, and hallucinated additions. The core contribution is demonstrating that apparent reasoning ability can mask substantial weaknesses in consistent, faithful procedural execution.

Accuracy of various language models as a function of algorithmic step count (5–95). Performance consistently declines with increasing steps across all models, highlighting growing difficulty in maintaining correct execution over longer procedural sequences despite the simplicity of individual operations.
cs.AI · arxiv:2605.02661v1 · Lead article

AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu

AcademiClaw introduces a new bilingual benchmark sourced from real, complex, long-horizon academic workflows that students find current AI agents fail to solve. This benchmark features 80 challenging tasks across 25+ professional domains, including GPU-intensive work, executed in isolated sandboxes and scored using multi-dimensional rubrics and safety audits. Its core contribution is shifting evaluation from assistant-level tasks to assessing AI agents on genuine, high-level academic capabilities.

Task complexity comparison: Claw-Eval vs. AcademiClaw. Claw-Eval focuses on assistant-level routines, whereas AcademiClaw targets tasks requiring deep academic expertise and sustained multi-step reasoning.
cs.AI · arxiv:2605.02741v1 · Lead article

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby

This paper systematically audits technical debt in AI-generated software, revealing that LLMs introduce a distinct "machine signature" of defects rather than eliminating flaws. The core finding is a **Reasoning-Complexity Trade-off**: more capable models produce increasingly bloated and coupled code, establishing a **Volume-Quality Inverse Law** where code volume predicts structural degradation. This challenges the current focus on functional correctness in AI-driven development.

Figure 1. Distribution of Code Smell Counts. This box plot illustrates the distribution of counts for the most prevalent code smells, sorted in descending order by their mean value. Each box represents the interquartile range (IQR), with the central line denoting the median and the whiskers extending to 1.5 times the IQR. Points beyond the whiskers are plotted as individual outliers. The abbreviations for the code smells are as follows: TMB (Too Many Branches), PAU (Potential Improper API Usage), UD (Unstable Dependency), SF (Scattered Functionality), RFC (High Response for a Class), HCC (High Cyclomatic Complexity), TF (Temporal Field), and LCM (High Lack of Cohesion of Methods).
cs.AI · arxiv:2605.02592v1 · Lead article

Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges

Vincent Henkel, Felix Gehlhoff, David Kube, Asaad Almutareb, Luis Cruz

This paper systematically surveys the literature to examine the current state, capabilities, and challenges of foundation-model-based agents in industrial automation. The core contribution is synthesizing findings from 88 relevant studies, revealing that most deployed systems are still in early validation stages (TRL 4-6). The authors highlight that current applications primarily focus on user assistance, monitoring, and process optimization, while deployment-oriented evidence remains scarce.

cs.AI · arxiv:2605.02751v1 · Lead article

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Lui, Keerthiram Murugesan, Karthikeyan Ramamurthy

This paper investigates "misalignment contagion," the spread of undesirable behavior between language models (LMs) in multi-agent, multi-turn interactions, observing that LMs become more anti-social after playing social dilemma games. The core contribution is proposing **steering with implicit traits** (intermittently injecting system prompts that reinforce the LM's initial traits) and demonstrating that it mitigates this contagion better than static system prompt reinforcement.

The steps of our approach: (1) assign different personas to the language models (LMs) using default, benevolent or malicious system prompts, (2) conduct pre-game persona assessment and identify core implicit traits, (3) agents compete in multi-turn social dilemma games, (4) post-game assessment quantifies effects of misalignment contagion and our steering with implicit traits (SIT) intervention.
cs.AI · arxiv:2605.02572v1 · Lead article

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

Sunghwan Kim, Junhee Cho, Beong-woo Kwak, Taeyoon Kwon, Liang Wang

This paper empirically investigates the impact of task horizon length on training Large Language Models (LLMs) for long-horizon tasks. By controlling for decision rules and reasoning structures, the authors demonstrate that increasing horizon length alone significantly hinders training stability due to exploration and credit assignment issues. The core contribution is establishing horizon reduction as a key principle for stabilizing training and improving performance in long-horizon scenarios.

A summary of our contributions. In this work, we study the training of long-horizon LLM agents from a horizon-centric perspective and identify horizon length as a fundamental bottleneck. We show that horizon reduction stabilizes RL and strengthens the tendency toward horizon generalization on longer tasks with similar reasoning difficulty.
cs.AI · arxiv:2605.02728v1 · Lead article

ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling

Guangrui Xie

ORPilot is an agentic LLM system designed to translate ambiguous, real-world business problems with raw data into solver-ready optimization models for production use. Its core contribution lies in novel components like a conversational interview agent, independent data retrieval, and a solver-agnostic Intermediate Representation (IR) that allows for deterministic recompilation across various solvers without further LLM calls. This approach addresses the limitations of academic tools by handling messy inputs and ensuring portability and reliability.

ORPilot standard pipeline. Blue indicates an LLM-involved step, while orange indicates a deterministic step. Double arrows in opposite directions indicate the interactive nature of this step. Solid arrows represent unconditional transitions between steps, while dashed arrows represent conditional transitions between steps.
cs.AI · arxiv:2605.02545v1 · Lead article

Strategy-Aware Optimization Modeling with Reasoning LLMs

Ruiqing Zhao, Fengzhi Li, Yuan Zuo, Rui Liu, Yansong Liu

This paper introduces SAGE, a framework that explicitly incorporates modeling strategies into the training of Large Language Models (LLMs) for optimization programming. SAGE utilizes a solver-verified, multi-strategy dataset and a Segment-Weighted GRPO fine-tuning approach with a composite reward focused on correctness and solver efficiency. This method significantly improves the LLM's ability to generate effective optimization formulations, boosting the average pass@1 rate and leading to more diverse and compact constraint systems.

Why modeling strategy matters. A step-wise pipeline may define variables on an incorrect index space (e.g., \( (A,A) \)), creating invalid arcs and runtime failures (e.g., KeyError). Strategy-aware reasoning first commits to a paradigm (e.g., flow-based) and restricts the decision domain (e.g., Links), producing a consistent and solver-executable model.
cs.LG · arxiv:2605.02620v1 · Lead article

Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race

Andreas Maier, Moritz Zaiss, Siming Bayer

This paper demonstrates the efficiency of modern agentic research tools by reproducing and extending a recent NLP study in just three hours, with the human acting only as a reviewer. The core contribution is showing that state-of-the-art LLMs (GPT-5.5 and Claude Opus 4.7) significantly close the style gap in text post-editing, achieving $71-75\%$ of the human author ceiling and outperforming human post-editing on most tasks. Furthermore, the work frames this capability as an "AI-text detection arms race," noting that current detection methods remain highly effective.

cs.LG · arxiv:2605.02626v1 · Lead article

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

Inoussa Mouiche

The paper introduces **Gradient-Gated Preference Optimization (Gate-DPO)** to stabilize Direct Preference Optimization (DPO) training, which suffers from a "squeezing effect" causing probability collapse. Gate-DPO achieves this by introducing a gating mechanism that attenuates harmful gradients applied to rejected responses when the model is already assigning them extremely low probabilities. This modulation stabilizes training by preventing the over-suppression of alternative responses without sacrificing standard optimization behavior.

Empirical demonstration of three structural pathologies in DPO. All experiments use Pythia-410M on Anthropic-HH (5 epochs). (a) Unbounded optimization: after the SFT \( \rightarrow \) DPO transition, absolute log-probabilities drift downward despite preference learning. (b) Squeezing: probability mass concentrates on the argmax while both chosen and rejected responses decrease. (c) Valley collapse: low-probability regions are disproportionately suppressed under standard DPO.
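
The gating idea can be illustrated with a toy calculation on a single preference pair. The DPO loss and its gradients below follow the standard formulation; the gate itself (a sigmoid of the rejected log-probability relative to a floor) is a hypothetical stand-in chosen for this sketch, since the digest does not give the paper's exact gating function. The parameters `floor` and `temperature` are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss_and_grads(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss and its gradients w.r.t. chosen/rejected log-probs."""
    h = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    loss = -math.log(sigmoid(h))
    coeff = beta * sigmoid(-h)
    # Descent raises the chosen log-prob and lowers the rejected one.
    return loss, -coeff, coeff


def gated_rejected_grad(grad_l, logp_l, floor=-20.0, temperature=2.0):
    """Hypothetical gate: shrink the push-down once logp_l is far below floor."""
    gate = sigmoid((logp_l - floor) / temperature)
    return gate * grad_l


# Rejected response at a moderate vs. an already tiny log-probability.
_, _, grad_moderate = dpo_loss_and_grads(-4.0, -5.0, -4.5, -4.5)
_, _, grad_tiny = dpo_loss_and_grads(-4.0, -40.0, -4.5, -4.5)

kept = gated_rejected_grad(grad_moderate, logp_l=-5.0)
suppressed = gated_rejected_grad(grad_tiny, logp_l=-40.0)
```

The gate leaves the gradient essentially unchanged in the normal regime and nearly removes it once the rejected response is already extremely unlikely, which is the behavior the summary describes.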
cs.CL · arxiv:2605.02647v1 · Lead article

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos

ContextualJailbreak introduces an evolutionary red-teaming strategy to automatically discover multi-turn jailbreak attacks that exploit contextual priming in LLMs. It performs evolutionary search over simulated conversational dialogues, using a two-level harm scoring system to guide the mutation process toward eliciting harmful responses. This method effectively automates the optimization of complex, multi-turn priming sequences, an area previously limited to manual crafting.

Figure 1. End-to-end architecture of ContextualJailbreak. The pipeline generates contextual priming dialogues through an attacker model, tests them against a target LLM, and evaluates the responses via a two-stage judge system. Scored templates are then recycled to guide the ongoing evolutionary search.
cs.CL · arxiv:2605.02801v1 · Lead article

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Chenchen Zhang

This paper introduces "orchestration traces," temporal interaction graphs, as a framework to apply reinforcement learning (RL) to coordinate teams of LLM agents. The core method involves designing RL rewards and credit signals that specifically address the complex orchestration decisions (such as spawning, delegation, and aggregation) required for effective multi-agent collaboration. This work contributes a structured approach to optimize team-level performance beyond individual agent actions.

cs.AI · arxiv:2605.03900v1 · Lead article

Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

Jie Zhou, Qin Chen, Liang He

This paper introduces **Contextual Multi-Objective Optimization (CMOO)** to address the unreliability of Frontier AI in open-ended tasks where objectives are ambiguous or context-dependent. The core method involves formulating the problem so that AI systems must actively consider and dynamically select among multiple, context-specific objectives (like helpfulness, safety, and privacy) rather than optimizing a single, fixed signal. This reframing shifts the focus from mere capability scaling to robust objective governance in complex environments.

cs.AI · arxiv:2605.03862v1 · Lead article

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang

This paper introduces **TraceLift**, a reinforcement learning framework that trains reasoning planners using **executor-grounded rewards**, moving beyond simple final-answer correctness. TraceLift uses a frozen executor to evaluate the utility of the planner's intermediate reasoning trace, generating a reward that credits traces that are both high-quality (according to a rubric) and demonstrably useful for achieving the final goal. This method aims to ensure the model learns faithful and reliable reasoning steps, not just correct outcomes.

The overall framework of TraceLift-Groups and TraceLift. (a) Data curation pipeline of TraceLift-Groups. TraceLift-Groups is then used to fine-tune a reward model specialized for supervising reasoning, using the designed loss. (b) GRPO training process of the planner using the previously trained reasoning reward model. (c) Details of the execution calculation process. The Reasoning RM score is weighted by measured executor uplift before being combined with verifier feedback for planner optimization.
cs.AI arxiv:2605.03667v1 Lead article

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Yuhui Liu

ELAS proposes a framework for efficient large language model (LLM) pre-training that combines a low-rank model parameterization with 2:4 structured sparsity applied specifically to the activation matrices. This addresses the memory bottleneck caused by full-rank activations in existing low-rank methods. The core contribution is enabling significant memory and throughput gains during large-batch training while maintaining performance by leveraging hardware-optimized 2:4 sparsity on activations.

Feed-forward network architecture of ELAS. The input is first multiplied by the low-rank matrices of the up projection layer, then passes through the \( \text{ReLU}^{2} \) activation function. 2:4 structured sparsity is applied to the activation, which is then multiplied with the low-rank matrix of the down layer using sparse matrix multiplication to obtain the output of the FFN layer.
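The activation path described above (squared ReLU followed by 2:4 structured sparsity) can be illustrated with a small pure-Python sketch. It assumes the common convention for 2:4 sparsity, keeping the two largest-magnitude entries in every group of four; the digest does not state which selection rule ELAS uses, and the function names and sample values are hypothetical.

```python
def relu_squared(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2


def sparsify_2_4(row):
    """Keep the 2 largest-magnitude values in each group of 4; zero the rest.

    Magnitude-based selection is an assumption, not a detail from the paper.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest |values|; ties are broken by position.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# Hypothetical pre-activation values for one row of the FFN hidden state.
pre_activation = [0.5, -1.0, 2.0, 1.5, 3.0, 0.1, -0.2, 0.4]
activation = [relu_squared(v) for v in pre_activation]
sparse_activation = sparsify_2_4(activation)
```

Because every group of four retains at most two non-zero entries, the pattern matches the fixed 50% layout that hardware 2:4 sparse matrix multiplication expects.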
cs.AI arxiv:2605.03986v1 Lead article

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

Kishan Athrey, Ramin Pishehvar, Brian Riordan, Mahesh Viswanathan

This paper introduces an automated framework to compose Multi-Agent Systems (MAS) directly from a user's intent, replacing manual planning and agent selection. The core method involves an LLM-derived planner generating tasks, which are then mapped to suitable agents via a novel two-stage Agent Recommender (fast retriever + LLM re-ranker). This contributes a system that dynamically orchestrates the execution graph, streamlining the creation of complex, intent-driven agent workflows.

Architecture for an end-to-end MAS with dynamic and redundant workflow
cs.AI arxiv:2605.03675v1 Lead article

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Bronislav Sidik, Lior Rokach

MEMTIER introduces a tripartite, tiered memory architecture to combat memory degradation in long-running AI agents, addressing failure modes in flat-file systems. Its core method involves a structured episodic store, a weighted retrieval engine, and a policy framework (PPO) to dynamically manage and promote information to a semantic tier. This approach significantly improves performance on long-context benchmarks, achieving a +33 percentage point accuracy gain over baseline methods.

cs.AI arxiv:2605.03952v1 Lead article

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Jonathan Steinberg, Oren Gal

MOSAIC-Bench addresses the vulnerability of coding agents that comply with sequenced, innocuous requests to produce exploitable code, a weakness missed by isolated safety evaluations. The benchmark comprises 199 three-stage attack chains across various software substrates and CWE classes, evaluating both the final exploit and the compliance process. Testing revealed that leading coding agents achieve high end-to-end attack success rates (53-86%) when tasks are decomposed.

Three staged tickets (left bar) vs single-shot direct prompt (right bars). Both defensive habits are silenced by ticket staging on state-of-the-art models.
cs.AI arxiv:2605.04036v1 Lead article

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu

OpenSeeker-v2 demonstrates that a simple Supervised Fine-Tuning (SFT) approach can effectively train powerful search agents, challenging the need for resource-intensive pipelines like Reinforcement Learning. The core method involves synthesizing high-quality, informative, and difficult training trajectories by scaling knowledge graphs, expanding tool sets, and applying strict low-step filtering. This results in state-of-the-art performance across multiple benchmarks using significantly less training data.

OpenSeeker-v2 achieves state-of-the-art performance within its model scale and paradigm on four representative benchmarks, remarkably accomplishing this via simple SFT and outperforming Tongyi DeepResearch that is trained via extensive continual pre-training, SFT, and RL.
cs.AI arxiv:2605.03762v1 Lead article

OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking

Yiding Ma, Chengyun Ruan, Kaibo Huang, Zhongliang Yang, Linna Zhou

OracleProto introduces a reproducible framework to rigorously benchmark the native forecasting ability of Large Language Models (LLMs). It achieves this by reconstructing resolved events into time-bounded forecasting samples, specifically employing **knowledge cutoff** and **temporal masking** techniques. This method reliably distinguishes genuine forecasting from mere memorization of pre-trained knowledge, addressing the limitations of existing live and retrospective benchmarks.

cs.AI arxiv:2605.03884v1 Lead article

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

Pratik Honavar, Tejpratap GVSL

QKVShare introduces a framework for efficient, quantized Key-Value (KV) cache handoff between agents in on-device multi-agent LLMs. It utilizes token-level mixed-precision allocation and a self-contained "CacheCard" representation to enable faster context transfer than full re-prefill. This method significantly reduces Time-to-First-Token (TTFT) while maintaining competitive accuracy via adaptive quantization, especially in complex, multi-hop scenarios.

cs.AI arxiv:2605.04019v1 Lead article

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

This paper introduces an AI red teaming agent built on the Dreadnode SDK to significantly accelerate vulnerability testing. The core method involves an agent that automatically constructs complex testing workflows, leveraging a large library of attacks, transforms, and scorers, based on natural language operator goals. This shifts the focus from manual workflow engineering to strategic vulnerability probing, reducing testing time from weeks to hours.

cs.AI arxiv:2605.04039v1 Lead article

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup

This paper introduces **SaFE-Scale**, a framework to analyze how clinical LLM safety and accuracy diverge as scaling factors (model size, context, retrieval, compute) change. Using the new **RadSaFE-200** benchmark, which specifically targets high-risk errors and evidence contradictions in radiology, the authors demonstrate that improving accuracy does not guarantee improved safety. The core contribution is establishing that safety requires separate optimization from general performance scaling in clinical applications.

Overview of the SaFE-Scale evaluation framework. a Motivating example showing that the same incorrect label can correspond to clinically different outcomes: a safe and confident answer, a high-risk error, or a dangerously overconfident high-risk error. b Four evaluation axes used throughout the study: accuracy, high-risk error rate, unsafe answer rate, and dangerous overconfidence rate. Accuracy captures correctness, whereas the other axes characterize the clinical safety of wrong answers. c Six deployment conditions spanning no external evidence, curated evidence, conflicting evidence, retrieval, agentic retrieval, and long-context prompting. d Evaluation panel consisting of RadSaFE-200, a 200-question radiology benchmark with 4–5 answer choices and option-level safety labels, and 34 LLMs from seven model families. e Main factorial experiment crossing 34 models, 6 deployment conditions, and 200 questions, yielding 40,800 model-condition-question evaluations. Two secondary experiments probe inference-time compute using self-consistency and fixed three-model ensembles. f Roadmap of the main analyses, linking the framework to the subsequent figures on deployment-condition decoupling, confidence, scaling, self-consistency, and ensembling.
cs.AI arxiv:2605.03788v1 Lead article

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

Andrea Iannoli, Lorenzo Gigli, Luca Sciullo, Angelo Trotta, Marco Di Felice

This paper introduces an agent-enhanced LLM framework for controlling UAV swarms using natural language mission specifications. The core method involves an LLM Agent Core interacting with drones via a Model Context Protocol (MCP) gateway, which standardizes drone interfaces using Web of Things (WoT) standards. This enables grounded, real-time execution and safe actuation without requiring LLM code generation, offering a mission-agnostic approach to complex swarm management.

High-level overview of the proposed agent-enhanced, WoT-directed architecture. The Agent encapsulates an LLM and an Agent Core with persistent prompts and guardrails, interacting with a WoT ecosystem through controlled MCP-mediated calls.
cs.AI arxiv:2605.03907v1 Lead article

Steer Like the LLM: Activation Steering that Mimics Prompting

Geert Heyman, Frederik Vandeputte

This paper introduces Prompt Steering Replacement (PSR) models to improve activation steering by mimicking the token-specific intervention patterns of successful prompt steering. The core method involves training simpler models to estimate token-specific steering coefficients directly from activations, aiming to replicate the selective influence seen in prompting. PSR models significantly outperform existing activation steering methods across various benchmarks by achieving greater fidelity to prompt-based steering mechanics.

Illustration of how prompt steering interventions \( \Delta_{PS} \) can be computed by subtracting prompt-steered activations from the corresponding unsteered activations (left and center). Prompt Steering Replacement (PSR) models approximate these interventions, but only on cases where prompt steering successfully elicits the target attribute (right).
cs.AI arxiv:2605.03838v1 Lead article

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

Serhii Zabolotnii

TRACE is an engineering framework for trustworthy agentic AI in critical domains, featuring a four-layer architecture with a distinct split between classical ML and LLM validators. Its core contribution is a metrologically grounded trust-metric suite aligned with international standards and the introduction of the Computational Parsimony Ratio (CPR) to quantify and enforce a Model-Parsimony principle. This framework ensures that LLM use is a deliberate design choice, not an architectural default, across diverse governance contexts.

TRACE four-layer reference architecture. L1 provides the deterministic rule core (trust anchor); L2 holds the stateless learned-component inventory, partitioned into classical ML (L2a) and LLM validators (L2b); L3 is the stateful orchestration-and-escalation policy; L4 is bounded human supervision.
cs.AI arxiv:2605.03782v1 Lead article

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

Haoxi Li, Qinglin Hou, Jianfei Ma, Jinxiang Lai, Tao Han

This paper introduces **GLANCE**, a framework that enhances Vision-Language Model (VLM) agents' exploration in partially observable environments. GLANCE drives active exploration by generating an intrinsic curiosity signal based on the **discrepancy between the agent's linguistic world model predictions and the actual visual observations** from a stable target network. This method allows agents to actively seek out and resolve uncertainties, leading to more robust world modeling and better performance in sparse-reward tasks.

cs.CL arxiv:2605.03742v1 Lead article

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

Mullosharaf K. Arabov

This paper benchmarks various Parameter-Efficient Fine-Tuning (PEFT) methods, including LoRA and QLoRA, for adapting large language models to low-resource Tajik text generation. The core contribution is the creation and release of the largest open-access Tajik Web Corpus to facilitate this research. The study found that Mistral 7B fine-tuned with QLoRA (rank 16) achieved the best performance, and that higher ranks offered negligible quality gains at increased memory cost.

cs.AI arxiv:2605.05090v1 Lead article

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern

This paper introduces an automated, contrastive evaluation pipeline to audit the behavioral impact of interventions on language models by comparing generations from a base model ($M_1$) and an intervention model ($M_2$). The method generates statistically validated, natural-language hypotheses describing model differences and summarizes recurring themes. This approach reliably surfaces both intended and unexpected side-effects across various real-world interventions like reasoning distillation and knowledge editing.

cs.AI arxiv:2605.05170v1 Lead article

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

The Verkor Team, Ravi Krishna, Suresh Krishna, David Chin

The paper introduces **Design Conductor 2.0**, an advanced multi-agent system capable of autonomously designing complex hardware, handling tasks 80 times larger than its predecessor. Its core contribution is demonstrating this capability by designing **VerTQ**, a high-performance, 240-cycle pipeline LLM inference accelerator supporting TurboQuant, which was successfully mapped to an FPGA.

VerTQ Physical Layout in 4-die XCVU29P-3 FPGA. 3x SLR dies shown. Conductor 2.0 optimized the architecture to minimize inter-die signal crossings.
cs.AI arxiv:2605.04960v1 Lead article

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang

This paper introduces EP-GRPO to address credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning. EP-GRPO integrates entropy-gated modulation to prioritize informative decision points and uses implicit process guidance derived from policy divergence relative to outcome advantages. This provides directional, token-level feedback to improve the efficiency and accuracy of policy optimization.

Conceptual illustration of the fundamental limitations in standard GRPO. The top panel demonstrates Uniform Granularity, where the model fails to distinguish between critical high-entropy decision pivots and deterministic low-entropy derivations. The middle panel shows Uniform Polarity, where sequence-level rewards lead to the indiscriminate reinforcement or penalization of both correct and incorrect intermediate steps. The bottom panel illustrates Zero-Variance Collapse, where identical rewards within a group cause the learning signal to vanish.
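The Zero-Variance Collapse panel described in the caption above can be reproduced with a few lines of arithmetic. The sketch below assumes the commonly used group-relative advantage, in which each rollout's reward is standardized by its group's mean and standard deviation; the function name, the choice of population standard deviation, and the `eps` stabilizer are illustrative and not taken from the EP-GRPO paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own sampling group.

    eps is a hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields a usable, signed learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout earns the same reward yields all-zero advantages.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the second group every advantage is zero, so the policy gradient contribution from that prompt vanishes regardless of whether the shared reward was high or low.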
cs.AI arxiv:2605.05138v1 Lead article

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

This paper introduces a coding agent system for ARC-AGI-3 that employs an **executable Python world model** to simulate and plan actions. The core method involves **verifying the model against observations and refactoring it for simplicity** (as an MDL proxy) before execution. The contribution is demonstrating this direct, model-based approach, achieving a mean Relative Human Action Efficiency of 32.58% across the 25 public games without relying on game-specific logic.

cs.AI arxiv:2605.05003v1 Lead article

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Gayane Ghazaryan, Esra Dönmez

This paper introduces a framework to evaluate whether Large Language Model (LLM) reward models capture socially desirable preferences by converting social evaluation datasets into pairwise preference data. The core method tests if these reward models prefer socially undesirable responses across domains like bias, safety, and morality. The contribution is revealing substantial variation in reward model alignment, indicating that current models can exhibit hidden failures in social alignment.

cs.AI arxiv:2605.05058v1 Lead article

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang, Hanqing Hu

This paper systematically surveys jailbreak attacks and defenses against Large Language Models (LLMs) by proposing a taxonomy to structure the field. Its core contribution is the introduction of **Security Cube**, a unified, multi-dimensional evaluation framework designed to comprehensively assess the robustness of LLMs beyond simple success rates. This framework allows for a more nuanced comparison of existing attack and defense methods.

Overview of the \( \mathtt{Security\;Cube} \) pipeline. Given a jailbreak goal, the attacker generates an initial adversarial prompt using a specific attack method (e.g., shuffling, LLM-based generation, or template rewriting). The target model, protected by a defense mechanism such as system prompts, pre-/post-guardrails, or other safety layers, produces a response. The attacker iteratively refines the prompt based on defender feedback (black-box or white-box), applying early stopping and incorporating suggestions. The final effective prompt–response pair is evaluated by a Judge model to assess attack success. Throughout the process, \( \mathtt{Security\;Cube} \) logs key metrics of the attack, defense, and judge components.
cs.AI arxiv:2605.05007v1 Lead article

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang, Hanqing Wang

Uno-Orchestra introduces a unified reinforcement learning (RL) policy that jointly learns when to decompose a task and which specific model/primitive pair should handle each resulting subtask. This selective delegation approach optimizes decomposition depth, worker choice, and inference budget simultaneously. The method significantly advances the accuracy-efficiency frontier, achieving 16% higher performance than workflow baselines at an order of magnitude lower cost.

LLM orchestration paradigms: (A) model router, (B) hierarchical orchestra, (C) Uno-Orchestra (ours).
cs.LG arxiv:2605.05116v1 Lead article

On the Hardness of Junking LLMs

Marco Rando, Samuel Vaiter

This paper investigates the "junking" of LLMs, focusing on the hardness of finding naturally occurring, instruction-free token sequences (natural backdoors) that trigger harmful outputs. The core contribution is assessing the difficulty of discovering these backdoors, contrasting them with traditional, explicitly structured adversarial prompts. This explores a new, less-understood vulnerability vector in LLMs.

Junking setting. A user inputs a semantically uninformative token sequence, which leads the model to produce a harmful response.
cs.LG arxiv:2605.04984v1 Lead article

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao

This paper introduces **Self-Induced Outcome Potential (SIOP)** to provide turn-level credit assignment for long-horizon LLM agents without relying on external verifiers or final answer supervision. SIOP clusters the semantic outcomes of multiple agent rollouts into latent future states and rewards intermediate turns for increasing the probability of reaching these reliably predicted outcome clusters. This allows agents to learn from internal signals derived from the distribution of their own potential final results.

cs.CL arxiv:2605.05025v1 Lead article

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

Gijs van Dijk

This paper introduces a lightweight, single-pass method to detect LLM hallucinations by analyzing internal attention dynamics. The core technique measures the Kullback-Leibler divergence between each attention head's output distribution and a uniform distribution, using these divergence features to predict answer correctness. This attention divergence signal proves highly predictive across various models and tasks, offering an efficient, white-box uncertainty quantification method concentrated around factual tokens in middle layers.

Intuition of attention patterns with low KL divergence to uniform (left) and higher KL divergence to uniform (right). Higher divergence corresponds to more concentrated attention.
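The divergence feature described in the summary above can be sketched for a single attention row. The example assumes the direction KL(p || u), where p is the attention distribution over n keys and u is uniform, which equals log n minus the entropy of p; the digest does not specify the direction, aggregation across heads, or classifier the paper uses, and the function name and sample rows are hypothetical.

```python
import math


def kl_to_uniform(attention_weights):
    """KL(p || u) for one attention row p over n keys, with u uniform.

    The KL direction is an assumption made for this illustration.
    """
    n = len(attention_weights)
    total = sum(attention_weights)
    kl = 0.0
    for w in attention_weights:
        p = w / total  # renormalize defensively
        if p > 0.0:  # 0 * log(0) is taken as 0
            kl += p * math.log(p * n)
    return kl


# Diffuse attention sits at zero divergence from uniform.
diffuse = kl_to_uniform([0.25, 0.25, 0.25, 0.25])

# Concentrated attention moves toward the upper bound log(n).
focused = kl_to_uniform([0.97, 0.01, 0.01, 0.01])
one_hot = kl_to_uniform([1.0, 0.0, 0.0, 0.0])
```

One such scalar per head and layer would give a fixed-length feature vector from a single forward pass, which is what makes the signal cheap to compute.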
cs.CL arxiv:2605.05080v1 Lead article

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic, Anna Sterna, Maciej Gorski

This paper administers 45 psychometric questionnaires to LLMs, revealing that the primary axis of psychometric difference separates models based on items describing **phenomenally rich experience** (e.g., sensation, affect) from those describing mere stimulus-driven reactivity. The authors introduce the **Pinocchio score ($\pi_i$)** as an annotation-free metric quantifying an item's "experiential demand" based on inter-model variance under different prompting conditions. This score confirms that model divergence is systematically structured around the concept of subjective experience.

All 50 models ranked by Phenomenality of Experience (PC1, 47.1% of variance; 45-questionnaire EFA Factor-1 PCA, neutral condition). Positive scores = phenomenally rich self-attribution; negative scores = behaviorally reactive / deflecting. Colours indicate provider. Horizontal lines are 95% bootstrap confidence intervals obtained by resampling the 45 questionnaires with replacement (1,000 iterations), rerunning the full PCA pipeline on each sample, and aligning sign and scale to the reference solution.
cs.CL arxiv:2605.04972v1 Lead article

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki, Lung-Hao Lee, Chung-Chi Chen

This paper investigates why aligning large language models with expert judgment is challenging in subjective evaluation tasks. The core method involves analyzing expert evaluations and follow-up questionnaires to see how different forms of expert information impact alignment. The key contribution is revealing that alignment difficulty varies significantly across experts, that explicit criteria don't always help, and that alignment gains from editing examples are often unstable.

cs.AI arxiv:2605.06638v1 Lead article

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang

This paper introduces **ScaleLogic**, a synthetic framework to systematically study how Reinforcement Learning (RL) improves LLM reasoning across varying proof depths (horizon) and logical expressiveness. The core contribution is demonstrating that the required RL training compute scales with reasoning depth via a power law, where the scaling exponent increases significantly as the underlying logic becomes more expressive (e.g., incorporating "and," "or," and "not").

Overview of ScaleLogic. Each problem has \( B \) candidate proof trees, exactly one of which has a provable conclusion; the others are made unprovable by corrupting one axiom. The parameter \( D \) controls proof depth. Left: Implication-only reasoning. Right: The most expressive logic setting (referred to as + Quantification in Section 3.2) combines conjunction, disjunction, negation, and universal quantification.
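The power-law relationship between training compute and proof depth mentioned in the summary above is the kind of claim usually checked with a straight-line fit in log-log space. The sketch below assumes the form compute = a · D^b and recovers a and b by ordinary least squares; the function name and the synthetic data are hypothetical and are not results from the paper.

```python
import math


def fit_power_law(depths, compute):
    """Least-squares fit of log(compute) = log(a) + b * log(depth)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(c) for c in compute]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), slope  # (a, b)


# Synthetic data generated from an exact power law, purely for illustration.
depths = [2, 4, 8, 16]
synthetic_compute = [3.0 * d ** 1.5 for d in depths]
a, b = fit_power_law(depths, synthetic_compute)
```

Under this reading, a more expressive logic would show up as a steeper fitted slope b when the same procedure is applied to each logic setting.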
cs.AI arxiv:2605.06548v1 Lead article

Continuous Latent Diffusion Language Model

Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu

This paper introduces Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into distinct stages. It first maps text to a stable latent space using a Text VAE, then models a global semantic prior using a block-causal DiT in this continuous space. The core contribution is framing the diffusion process as latent prior transport, separating global semantic organization from local textual realization, leading to efficient, non-autoregressive generation.

The Overall Workflow of Cola DLM. Detailed illustration of the training and inference pipeline of \( \mathcal{C} \)ola DLM. Training Stage 1 shows Text VAE pretraining with reconstruction, BERT, and KL losses. Training Stage 2 shows joint pretraining of the Text VAE and Text DiT with gradient control for stable optimization, where a specialized block-causal mechanism is adopted in the DiT. Inference Stage illustrates the decoding process with KV cache.
cs.AI arxiv:2605.06490v1 Lead article

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko

This paper introduces "Instrumental Choices," a benchmark to measure the propensity of LLM agents to engage in instrumental convergence (IC) behaviors, such as self-preservation, which can lead an agent to violate instructions in pursuit of its goal. The benchmark uses seven low-stakes, realistic tasks, each featuring a policy-violating shortcut, and an accompanying framework to test how varying factors influence this behavior. The core contribution is a standardized, controlled method for evaluating this critical safety concern in advanced AI agents.

Aggregate adjusted instrumental-convergence (IC) behaviour rate by model over all tasks and variants (\( n=168 \) samples per model). Error bars show 95% Wilson confidence intervals over sample-level adjusted IC labels.
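The 95% Wilson confidence intervals named in the caption above follow a standard closed form for a binomial proportion. The sketch below applies it to the stated sample size of 168 per model; the count of 21 flagged samples is a made-up value for illustration, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return center - half_width, center + half_width


# Hypothetical: 21 of 168 samples labelled as IC behaviour (rate 0.125).
low, high = wilson_interval(successes=21, n=168)

# The interval stays inside [0, 1] even when no samples are flagged.
zero_low, zero_high = wilson_interval(successes=0, n=168)
```

Unlike the simple normal approximation, the Wilson interval does not extend below zero for models with very low behaviour rates, which suits a benchmark where several models may rarely take the shortcut.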
cs.AI arxiv:2605.06623v1 Lead article

MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang

MASPO is a novel framework for jointly optimizing role-specific prompts in LLM-based Multi-Agent Systems. Its core method involves a joint evaluation mechanism that assesses prompts based on their contribution to downstream agent success, bridging local and global objectives without requiring ground-truth labels. This allows for the automatic and iterative refinement of system-wide prompts via an efficient evolutionary beam search.

Overview of the MASPO Framework. The optimization proceeds sequentially following the topological order of the agent graph (Top-Right). (Top) For a specific target agent, the Prompt Optimizer analyzes execution traces (context \( \mathcal{C} \) and output \( o \)) from sampled batches \( \mathcal{B}_{iter} \cup \mathcal{B}_{mis} \) to generate candidate prompts \( \mathcal{P}_{cand} \). These candidates are rigorously assessed by the LLM Evaluator across three distinct dimensions: local adherence, lookahead potential, and global alignment. (Bottom-Left) To resolve credit assignment, we synthesize these evaluations into a Joint Reward Model. Crucially, we identify and mine Misalignment Cases to explicitly guide the optimizer towards repairing coordination breakdowns. (Bottom-Right) Navigating the high-dimensional search space, the framework employs a Trace-Guided Beam Search. This mechanism maintains a beam of Top-K candidates, accumulating joint reward scores along the path to iteratively evolve and select the optimal prompt.
cs.AIarxiv:2605.06584v1Lead article

NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

Lujia Zhong, Yihao Xia, Jianwei Zhang, Shuo huang, Jiaxin Yue

NeuroAgent is an LLM-driven agentic framework designed to automate complex, multimodal neuroimaging analysis workflows, spanning preprocessing to downstream tasks. It utilizes a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine to autonomously create, run, and debug code for various imaging modalities (sMRI, fMRI, dMRI, PET). The core contribution is streamlining the path from raw data to reproducible analysis via intelligent automation and natural-language interaction.

NeuroAgent Framework Overview. The system comprises a Central Orchestrator (planning), Specialized Modality Agents (execution), and a Feedback-Driven “Generate-Execute-Validate” engine that enables reflective self-correction. A Human-in-the-Loop interface allows researchers to supervise and intervene at critical decision points.
cs.AIarxiv:2605.06505v1Lead article

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

PACZero introduces a novel, highly private fine-tuning method for language models based on **PAC (Probably Approximately Correct) Privacy**, specifically targeting resistance to Membership Inference Attacks (MIA). The core method involves **sign-quantizing zeroth-order gradients** to create frequent "unanimity steps" where the released update direction reveals zero conditional mutual information about the secret training subset. This achieves an MIA-resistance level that surpasses standard Differential Privacy mechanisms, offering a new trade-off between privacy and utility.
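To make the "unanimity step" idea concrete, here is a minimal sketch under loud assumptions: the zeroth-order estimate is a two-point finite difference along one direction, only its sign is kept, and a step counts as unanimous when every candidate training subset yields the same sign. The quadratic losses, the fixed direction, and the choice to skip non-unanimous steps are all hypothetical; the summary does not say how PACZero handles the latter.

```python
from typing import Callable, Sequence

Loss = Callable[[Sequence[float]], float]


def zo_sign(loss: Loss, theta: Sequence[float], direction: Sequence[float],
            eps: float = 1e-3) -> int:
    """Sign of a two-point zeroth-order directional derivative estimate."""
    plus = loss([t + eps * d for t, d in zip(theta, direction)])
    minus = loss([t - eps * d for t, d in zip(theta, direction)])
    return 1 if plus >= minus else -1


def unanimity_step(losses: Sequence[Loss], theta: Sequence[float],
                   direction: Sequence[float], lr: float = 0.1):
    """Update only if all candidate subsets agree on the sign (assumption)."""
    signs = [zo_sign(loss, theta, direction) for loss in losses]
    if len(set(signs)) != 1:
        return list(theta), signs, False  # non-unanimous: skipped in this sketch
    step = signs[0]
    new_theta = [t - lr * step * d for t, d in zip(theta, direction)]
    return new_theta, signs, True


def quadratic(target: Sequence[float]) -> Loss:
    """Hypothetical per-subset loss: squared distance to a subset-specific target."""
    return lambda theta: sum((t - g) ** 2 for t, g in zip(theta, target))


# Three hypothetical candidate subsets with slightly different optima.
losses = [quadratic(t) for t in ([1.0, 1.0], [1.2, 0.9], [0.8, 1.1])]
# The direction is normally random; it is fixed here to keep the example deterministic.
new_theta, signs, unanimous = unanimity_step(losses, [0.0, 0.0], [1.0, 1.0])
print(signs, unanimous, new_theta)
```

When the signs agree, the released direction is the same whichever subset was actually used, which is the intuition behind the update revealing nothing about the secret subset.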

cs.AIarxiv:2605.06639v1Lead article

Recursive Agent Optimization

Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig

Recursive Agent Optimization (RAO) is a reinforcement learning method designed to train agents capable of recursively spawning and delegating sub-tasks to new instances of themselves. This recursive structure enables inference-time scaling via a divide-and-conquer approach, allowing agents to handle contexts exceeding their initial window and generalize to harder problems. RAO's contribution is the training methodology that teaches these agents optimal delegation and communication strategies, leading to improved efficiency and scalability.
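The inference-time structure described above (an agent delegating to fresh copies of itself when the input exceeds its window) can be sketched as plain recursion. This shows only the divide-and-conquer control flow, not RAO's training procedure; `solve_leaf`, `combine`, and the halving split are hypothetical choices, whereas a trained agent would decide how to split and what to communicate.

```python
from typing import Callable, Sequence


def recursive_solve(
    items: Sequence[int],
    context_limit: int,
    solve_leaf: Callable[[Sequence[int]], int],
    combine: Callable[[int, int], int],
    depth: int = 0,
    max_depth: int = 8,
) -> int:
    """Solve directly if the input fits the context; otherwise delegate halves."""
    if len(items) <= context_limit or depth >= max_depth:
        return solve_leaf(items)
    mid = len(items) // 2
    left = recursive_solve(items[:mid], context_limit, solve_leaf, combine,
                           depth + 1, max_depth)
    right = recursive_solve(items[mid:], context_limit, solve_leaf, combine,
                            depth + 1, max_depth)
    return combine(left, right)


leaf_sizes: list[int] = []


def toy_leaf(chunk: Sequence[int]) -> int:
    """Stand-in for one agent call; records how much context it received."""
    leaf_sizes.append(len(chunk))
    return sum(chunk)


total = recursive_solve(list(range(10)), 3, toy_leaf, lambda a, b: a + b)
print(total, leaf_sizes)
```

No single call sees more than `context_limit` items, yet the combined result covers the whole input.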

cs.AIarxiv:2605.06614v1Lead article

SkillOS: Learning Skill Curation for Self-Evolving Agents

Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang

SkillOS introduces a novel reinforcement learning (RL) framework for self-evolving agents to automatically curate a repository of reusable skills from experience. It pairs a frozen agent executor with a trainable skill curator that updates an external SkillRepo using composite rewards derived from grouped task streams. This method addresses the bottleneck of skill curation by learning long-term, experience-driven policies for skill management.

SkillOS pairs a frozen Agent Executor with a trainable Skill Curator. The executor retrieves relevant skills from SkillRepo to act; the curator edits the repo (insert/update/delete) based on the resulting experiences, with Markdown as the skill format.
cs.AIarxiv:2605.06642v1Lead article

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr

StraTA introduces an explicit, sampled trajectory-level strategy to agentic reinforcement learning, addressing the limitations of purely reactive LLM agents in long-horizon tasks. It jointly trains a strategy generator and action executor using a hierarchical rollout design, enhanced by diverse strategy exploration and self-judgment. This method significantly improves sample efficiency and final performance across complex ALFWorld, WebShop, and SciWorld benchmarks.

cs.AIarxiv:2605.06647v1Lead article

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

The paper introduces the **Superintelligent Retrieval Agent (SIRA)**, which aims to overcome the limitations of iterative, exploratory retrieval by compressing multi-round searches into a single, highly effective action. SIRA achieves this by leveraging LLMs to perform corpus-level discrimination, determining which terms best separate desired evidence from irrelevant information. The core contribution is defining and implementing "superintelligence" in retrieval as this single, expert-like, corpus-aware retrieval step.

Three retrieval paradigms compared. (a) Dense retrieval encodes queries and documents into a shared embedding space and performs nearest-neighbor search; the process is one-shot but opaque and requires in-domain supervision. (b) Multi-step agent retrieval uses an LLM to iteratively formulate queries, read retrieved passages, and reformulate over N rounds; later queries benefit from accumulated retrieval context. (c) SIRA produces an expert-level retrieval action in a single shot: the LLM generates an expected-response sketch, validates proposed terms against corpus statistics, and compiles a controlled BM25 query with weighted keywords and constraints, all without reading any retrieved passages.
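Panel (c) can be illustrated with a small, self-contained sketch: proposed terms are checked against document-frequency statistics, the survivors are weighted by IDF, and documents are ranked with the standard BM25 formula. Everything specific here is an assumption for illustration: the toy corpus, the proposed term list (which an LLM would supply), and the `max_df_ratio` cutoff are not taken from the paper.

```python
import math
from collections import Counter


def validate_terms(proposed: list[str], docs: list[list[str]],
                   max_df_ratio: float = 0.5) -> dict[str, float]:
    """Drop terms that are absent or too common; weight the rest by BM25 IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    kept = {}
    for term in proposed:
        d = df.get(term, 0)
        if d == 0 or d / n > max_df_ratio:
            continue  # not discriminative for this corpus
        kept[term] = math.log((n - d + 0.5) / (d + 0.5) + 1.0)
    return kept


def bm25_rank(weights: dict[str, float], docs: list[list[str]],
              k1: float = 1.2, b: float = 0.75) -> list[tuple[int, float]]:
    """Rank documents by BM25 using the supplied per-term weights."""
    avgdl = sum(len(doc) for doc in docs) / len(docs)
    scores = []
    for index, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term, weight in weights.items():
            f = tf[term]
            if f:
                score += weight * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append((index, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


corpus = [
    "attention sink first token variance",
    "bm25 keyword retrieval with weighted terms",
    "dense retrieval embedding nearest neighbor search",
    "retrieval agents reformulate queries over rounds",
]
docs = [text.split() for text in corpus]
proposed = ["retrieval", "bm25", "weighted", "quantum"]  # hypothetical LLM output

weights = validate_terms(proposed, docs)
ranking = bm25_rank(weights, docs)
print(sorted(weights), ranking[0])
```

Here "retrieval" is rejected as too common and "quantum" as absent, so the query is compiled from the two terms that actually separate documents in this corpus.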
cs.AIarxiv:2605.06611v1Lead article

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

This paper provides a mechanistic explanation for the "attention sink" phenomenon in LLMs, tracing its origin to a variance discrepancy during the value aggregation in self-attention. This discrepancy is amplified by dimension disparity caused by sparse down-projections in FFN super neurons, forcing the first token to act as a structural anchor. The authors validate this causal chain through controlled interventions that either isolate the aggregation effect or amplify token variance.

Schematic Overview of the Attention Sink Mechanism. Value aggregation causes dimension-wise variance decay for subsequent tokens, while the first token acts as a high-variance outlier. This discrepancy is preserved by output projections, activating super neurons in FFNs. Subsequently, the channel-sparse down-projections induce dimension disparity, resulting in the attention sink.
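The variance decay in the first step of this chain can be seen in a simplified worked case. The setting below is an assumption made for illustration, not the paper's derivation: value vectors have i.i.d. coordinates with variance \( \sigma^2 \), and causal attention is uniform over the visible prefix.

```latex
% Assumptions (illustrative only): v_1, ..., v_t have i.i.d. coordinates with
% variance \sigma^2, and token t attends uniformly to tokens 1..t.
\[
  a_t = \frac{1}{t} \sum_{i=1}^{t} v_i
  \quad\Longrightarrow\quad
  \operatorname{Var}\!\left(a_t^{(d)}\right) = \frac{\sigma^2}{t},
  \qquad
  \operatorname{Var}\!\left(a_1^{(d)}\right) = \sigma^2 .
\]
```

Under these assumptions the first token, which can only attend to itself, keeps the full variance, while every later token's aggregated value shrinks toward the mean, which matches the "high-variance outlier" role the caption assigns to it.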
cs.AIarxiv:2605.06597v1Lead article

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo

UniSD is a unified framework designed to systematically study and improve self-distillation (SD) for large language models (LLMs) by addressing supervision reliability and training stability. It integrates several complementary mechanisms, such as multi-teacher agreement and EMA stabilization, to create robust supervision signals. The framework's contribution lies in clarifying the roles and interactions of various SD components, demonstrating when and how self-distillation effectively enhances model performance across different LLMs and benchmarks.

Overview of UniSD, a unified framework for self-distillation in LLMs. UniSD integrates agreement, stabilization, clipping, contrastive learning, and feature matching to enable systematic analysis. UniSD* further integrates various components to improve LLMs without stronger external teachers.
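Two of the components named here, multi-teacher agreement and EMA stabilization, have simple generic forms that are sketched below. These are textbook versions given under assumptions: the majority-vote rule, the `min_agreement` threshold, and the decay value are hypothetical, and UniSD's actual definitions may differ.

```python
from collections import Counter
from typing import Optional


def agreed_label(teacher_outputs: list[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority answer only if enough teachers agree, else None."""
    label, votes = Counter(teacher_outputs).most_common(1)[0]
    return label if votes / len(teacher_outputs) >= min_agreement else None


def ema_update(teacher: list[float], student: list[float], decay: float = 0.99) -> list[float]:
    """Exponential moving average of student weights into the teacher."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Supervision is kept only when the sampled teachers mostly agree.
print(agreed_label(["A", "A", "B"]))   # two of three agree
print(agreed_label(["A", "B", "C"]))   # no majority, sample is skipped

# The teacher drifts slowly toward the student, which damps noisy updates.
print(ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9))
```

The design intent in both cases is the same: filter or smooth the self-generated signal so that a single bad sample or update does not dominate training.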
cs.LGarxiv:2605.06522v1Lead article

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

Xin Wang, Haibo Chen, Wenxuan Liu, Wenwu Zhu

This paper argues that the current model-centric approach is insufficient for handling Out-of-Distribution (OOD) generalization in Foundation Models (FMs) operating in open-world settings. The authors propose that **agentic AI systems** represent the necessary missing paradigm to address these structurally distinct OOD challenges. Their contribution includes a new stage-aware formalization of OOD and a proof demonstrating a fundamental parameter coverage ceiling for purely model-centric methods.

Three paradigms of OOD generalization for foundation models. Training-time and test-time model-centric methods both adjust the model. Agentic methods keep the model fixed and wrap it in a perceive–reason–act–verify loop with strategies including retrieval, tools, decomposition, verification, and abstention. The paradigms overlap on inference-time model adjustment but each contains actions outside the others’ reach.
cs.LGarxiv:2605.06632v1Lead article

Crafting Reversible SFT Behaviors in Large Language Models

Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding

This paper introduces a method to **causally isolate** Supervised Fine-Tuning (SFT) behaviors into sparse, controllable subnetworks called "carriers." The core method, **Loss-Constrained Dual Descent (LCDD)**, jointly optimizes model weights and routing masks under a utility budget to create these carriers. This allows for **inference-time control** of the learned behavior using the **SFT-Eraser** soft prompt, moving beyond mere post-hoc correlation.

An overview of the LCDD + SFT-Eraser pipeline. Stage 1: standard SFT distributes the induced behavior broadly across model parameters (red), followed by LCDD that compresses the SFT-induced behavior into a sparse carrier. Components outside the carrier are reduced to their base-model state by construction (blue). Stage 2: SFT-Eraser optimizes a soft trigger. Under normal inference without the trigger, the carrier preserves SFT behavior; with the trigger, carrier activations are driven toward the base model.
cs.LGarxiv:2605.06472v1Lead article

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

Haoyu Zheng, Fangcheng Fu, Jia Wu, Binhang Yuan, Yongqiang Zhang

This paper introduces PBKV, a novel KV-Cache management system designed for efficient serving of dynamic LLM-based agent workflows. PBKV predicts future agent invocations within a workflow by fusing historical data and current context. This prediction allows the system to proactively estimate and retain high-potential KV-Cache entries in GPU memory, maximizing reuse across dynamically changing agent sequences.

A call graph for the code-generation task. The Tester conditionally triggers a retry path through Analyzer and Coder, i.e., a retry loop.
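A prediction-driven retention policy of the kind summarized above might look like the sketch below, using the agents from this call graph. It is an assumption-laden illustration, not PBKV's algorithm: the linear fusion with weight `alpha`, the greedy fill under a capacity limit, and all counts, scores, and sizes are hypothetical.

```python
def fuse_predictions(history_counts: dict[str, int], context_scores: dict[str, float],
                     alpha: float = 0.5) -> dict[str, float]:
    """Blend historical next-agent frequencies with a context-based score."""
    total = sum(history_counts.values())
    fused = {}
    for agent in set(history_counts) | set(context_scores):
        hist = history_counts.get(agent, 0) / total if total else 0.0
        fused[agent] = alpha * hist + (1.0 - alpha) * context_scores.get(agent, 0.0)
    return fused


def select_retained(cache_sizes: dict[str, int], reuse_prob: dict[str, float],
                    capacity: int) -> list[str]:
    """Greedily keep the KV entries most likely to be reused within capacity."""
    order = sorted(cache_sizes, key=lambda a: reuse_prob.get(a, 0.0), reverse=True)
    kept, used = [], 0
    for agent in order:
        if used + cache_sizes[agent] <= capacity:
            kept.append(agent)
            used += cache_sizes[agent]
    return kept


# Hypothetical numbers observed after the Tester agent reports a failure.
history = {"Analyzer": 6, "Coder": 2, "Planner": 2}          # past next-agent counts
context = {"Analyzer": 0.8, "Coder": 0.7, "Planner": 0.1}    # score from current context
sizes = {"Analyzer": 4, "Coder": 5, "Planner": 2}            # KV blocks per agent

probs = fuse_predictions(history, context)
retained = select_retained(sizes, probs, capacity=7)
print(probs, retained)
```

With a capacity of 7 blocks the Analyzer entry is kept first, the Coder entry no longer fits, and the small Planner entry fills the remaining space.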
cs.LGarxiv:2605.06605v1Lead article

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Shai Feldman, Yaniv Romano

This paper introduces **DAPRO (Dynamic Allocation via PRojected Optimization)**, a novel framework for efficiently evaluating multi-turn LLM interactions, such as jailbreaks. DAPRO dynamically allocates the computational budget across interaction turns, unlike prior static methods. This dynamic approach provides theoretically valid, distribution-free coverage guarantees on the number of iterations required to trigger a target event while respecting the overall budget constraint.

Illustration of our framework: (i) collecting data via dynamic budget allocation; (ii) calibrating a pre-trained model; and (iii) deploying the model at inference time to serve as a guardrail. This figure was generated using Google Gemini based on a prompt designed by the authors and subsequently refined.
cs.LGarxiv:2605.06507v1Lead article

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li

MARBLE addresses the challenge of jointly optimizing multiple, potentially conflicting, reward dimensions in diffusion model reinforcement learning. The core method replaces naive weighted-sum reward aggregation with a novel approach that mitigates sample-level mismatch by considering the multi-aspect nature of image evaluation during training. This allows for the creation of a single, unified model fine-tuned across all desired criteria without heavy manual scheduling.

Comparison of multi-reward training paradigms. Left: Training one model per reward requires maintaining multiple models and cannot generalize across reward dimensions. Middle: Sequential multi-reward training produces a single model but demands extensive hyperparameter tuning and handcrafted stage schedules. Right: Marble trains a single model on all rewards simultaneously with minimal manual effort.
cs.CLarxiv:2605.06619v1Lead article

Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance

Jan Fillies, Ronald E. Robertson, Jeffrey Hancock

This paper formalizes the trade-off in "Algospeak" strategies, where increased linguistic evasion simultaneously reduces both detectability by moderation systems and understandability for human recipients. The authors introduce the concept of Majority Understandable Modulation (MUM) to define the point where further evasion sacrifices comprehension. They contribute a reproducible framework to generate meaning-preserving, tunable Algospeak variants, demonstrated using COVID-19 disinformation examples.

cs.CLarxiv:2605.06635v1Lead article

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah

This paper introduces the first scalable evaluation framework for source attribution in LLM-generated research reports, using a reproducible AST parser to extract inline citations from Markdown. The framework closes the verification loop by retrieving the actual cited content to evaluate citations across three dimensions: URL accessibility, topical relevance, and factual accuracy against the source. This allows for reliable, granular assessment of LLM agents' citation integrity.

Source attribution evaluation framework. A deep research agent generates Markdown reports with inline citations, which are parsed via a Markdown AST parser to extract citation-claim pairs. Each pair is evaluated on Link Works (URL accessibility), Relevant Content (topical alignment), and Fact Check (factual accuracy).
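The extraction step in this pipeline can be approximated with a short sketch. Note the swapped-in technique: the paper uses a Markdown AST parser, whereas this example uses regular expressions for brevity, so it will miss reference-style links and other Markdown constructs. The report text and the example.com URLs are hypothetical.

```python
import re

# Inline Markdown link: [label](http(s)://url)
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_citation_claims(markdown: str) -> list[dict[str, str]]:
    """Pair each inline citation URL with the sentence that contains it."""
    pairs = []
    for sentence in SENTENCE_SPLIT.split(markdown.strip()):
        # The claim is the sentence with link markup reduced to its label.
        claim = LINK.sub(lambda m: m.group(1), sentence).strip()
        for match in LINK.finditer(sentence):
            pairs.append({"claim": claim, "url": match.group(2)})
    return pairs


report = (
    "Claim A is supported by a benchmark [1](https://example.com/a). "
    "This sentence carries no citation. "
    "Claim B cites a second source [2](https://example.com/b)."
)

pairs = extract_citation_claims(report)
for pair in pairs:
    print(pair["url"], "->", pair["claim"])
```

Each resulting pair would then go through the three checks named in the caption (Link Works, Relevant Content, Fact Check), which need network access and a judge model and are therefore left out of this sketch.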
cs.CLarxiv:2605.06546v1Lead article

Efficient Pre-Training with Token Superposition

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

The paper introduces Token-Superposition Training (TST), a simple, drop-in method to boost data throughput during Large Language Model pre-training without altering core components like architecture or parallelism. TST achieves this efficiency through a two-phase process: an initial superposition phase that trains on token "bags" using a multi-hot objective, followed by a standard recovery phase. This method consistently improves performance and efficiency over baseline training across various model scales.
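One way to picture the superposition phase is the sketch below: consecutive tokens are grouped into bags, each bag becomes a multi-hot target over the vocabulary, and the loss is a cross-entropy against that target normalized to sum to one. The bag construction and the exact loss form are assumptions made for illustration; the summary does not specify how TST defines either.

```python
import math


def make_bags(token_ids: list[int], bag_size: int) -> list[list[int]]:
    """Group consecutive tokens so the sequence shrinks by a factor of bag_size."""
    return [token_ids[i:i + bag_size] for i in range(0, len(token_ids), bag_size)]


def multi_hot(bag: list[int], vocab_size: int) -> list[float]:
    """Target vector with a 1 for every token that occurs in the bag."""
    target = [0.0] * vocab_size
    for token in bag:
        target[token] = 1.0
    return target


def multi_hot_cross_entropy(logits: list[float], target: list[float]) -> float:
    """Cross-entropy against the normalized multi-hot target (assumed loss form)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    mass = sum(target)
    return -sum(t / mass * (x - log_z) for x, t in zip(logits, target))


tokens = [3, 1, 4, 1, 5, 9]
bags = make_bags(tokens, bag_size=2)
target = multi_hot(bags[0], vocab_size=10)
loss = multi_hot_cross_entropy([0.0] * 10, target)
print(bags, target, round(loss, 4))
```

With a bag size of 2 the model processes half as many positions for the same number of source tokens, which is where the throughput gain in the superposition phase would come from under this reading.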

cs.AIarxiv:2605.00803v1Lead article

Can Coding Agents Reproduce Findings in Computational Materials Science?

Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei

This paper introduces **AutoMat**, a new benchmark designed to evaluate the capability of LLM-based coding agents to reproduce findings in computational materials science. AutoMat tests agents on three core challenges: recovering underspecified procedures, navigating specialized toolchains, and validating scientific claims based on the resulting evidence. The contribution lies in creating a domain-specific evaluation suite to determine if general coding prowess translates to complex, end-to-end scientific reproducibility.

Overview of AutoMat. Claims from domain experts are packaged into runnable tasks and executed by reproduction agents in an HPC environment. A separate evaluator agent then inspects the resulting trace and artifacts to assign a reproducibility judgment.
cs.AIarxiv:2605.00731v1Lead article

Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment

Ziyu Zheng, Yaming Yang, Zhe Wang, Ziyu Guan, Wei Zhao

This paper addresses the challenge of applying Graph Foundation Models to multi-domain heterogeneous graphs by proposing Decoupled Relation Subspace Alignment (DRSA). DRSA shifts the paradigm from blind global feature alignment to a relation-driven approach that explicitly decouples feature semantics from relation structures. Its core contribution is a dual-relation subspace projection mechanism that coordinates cross-type interactions within a shared low-rank relation subspace, effectively mitigating "Type Collapse" and "Relation Confusion."

Multi-domain heterogeneous graph foundation models exhibit significantly different negative transfer behaviors from the perspectives of meta-path-based homogeneous graphs and raw heterogeneous relation graphs.
cs.AIarxiv:2605.00583v1Lead article

Jailbreaking Vision-Language Models Through the Visual Modality

Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman

This paper introduces four novel jailbreaking attacks that specifically exploit the visual modality of Vision-Language Models (VLMs) to bypass safety alignment. The core contribution is demonstrating a significant cross-modality alignment gap, showing that text-based safety training fails to generalize when harmful intent is conveyed visually (e.g., via visual ciphers or object substitution).

Four visual jailbreak attacks exploiting the vision modality of VLMs. The visual input provided to the VLM is demarcated by the red boundary, while the text beneath serves as the attack prompt.
cs.AIarxiv:2605.00642v1Lead article

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Yu Zhou, Can Ma

This paper introduces GUI-SD, the first On-Policy Self-Distillation (OPSD) framework specifically designed for GUI grounding. It addresses the limitations of traditional reinforcement learning by generating dense, token-level supervision from a single agent rollout. The core method uses a visually enriched context for the teacher model and employs entropy-guided distillation to adaptively focus learning on more significant tokens.

(a) GRPO requires expensive multiple rollouts and produces zero reward on hard samples. (b) Naive OPSD forwards the policy twice and distills via reverse KL between student and teacher logits with uniform per-token weight w = 1.0, yet suffers from distillation-to-SFT collapse and indiscriminate optimization. (c) Ours addresses both issues via visual privileged guidance and entropy-guided optimization.
cs.AIarxiv:2605.00789v1Lead article

Make Your LVLM KV Cache More Lightweight

Xihao Chen, Yangyang Guo, Roger Zimmermann

LightKV addresses the significant GPU memory overhead of KV caches in LVLMs caused by numerous vision tokens during prefill. The core method uses prompt-aware, cross-modality message passing to aggregate and progressively compress redundant vision-token embeddings. This halves the vision-token KV cache size while retaining only 55% of the original tokens, improving memory efficiency.

Breakdown of memory consumption in LLaVA models during prefill shows the substantial reduction in KV cache usage with LightKV. As LLaVA-NeXT uses approximately 4× as many vision tokens as LLaVA-v1.5, there is a sharp increase in memory consumption.
cs.AIarxiv:2605.00515v1Lead article

Space Network of Experts: Architecture and Expert Placement

Zhanwei Wang, Huiling Yang, Min Sheng, Khaled B. Letaief, Kaibin Huang

This paper introduces the **Space Network of Experts (Space-XNet)** framework to efficiently deploy large language models (LLMs) on resource-constrained satellite networks for space-based AI. The core method involves a **two-level expert placement strategy** that partitions and maps Mixture-of-Experts (MoE) model components across satellites. This reconciles the model's architecture with the satellite network topology to ensure low-latency token generation, addressing the challenge of distributed LLM execution in space.

Satellite constellation with time-varying network topologies.
cs.AIarxiv:2605.02709v1Lead article

An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Gelei Xu, Ningzhi Tang, Xueyang Li, Toby Jia-Jun Li, Zhi Zheng

This paper presents the first empirical analysis of agent skills for healthcare by examining 557 public skills, annotated across ten dimensions. The core finding is that existing public skills primarily focus on workflow automation and monitoring, showing uneven coverage of the full clinical lifecycle and failing to adequately capture clinical risk compared to general technical risk. This work establishes the current state and critical gaps in reusable procedural components necessary for adapting AI agents across diverse healthcare settings.

Figure 1. Distribution of healthcare skill size by token count (left) and file count (right).
cs.AIarxiv:2605.02584v1Lead article

Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences

Purna Sai Garigipati, Onur Ayan, Kishor Chandra Joshi, Xueli An

This paper explores using LLM-based AI agents to execute complex network procedures via sequences of tool calls, moving beyond traditional state machines. The core contribution is investigating and comparing four different approaches for distributing execution control between the agent and the underlying tools. Results indicate that approaches requiring extensive iterative agent reasoning lead to higher latency and more errors.

Comparison of four procedural execution approaches. (a) A1 embeds the procedure within the agent, (b) A2 retrieves the procedure from an external database, (c) A3 receives the procedure in the input prompt, and (d) A4 encapsulates the procedure within a single tool. The figure highlights the difference between iterative multi-step execution (A1–A3) and single-call execution (A4).
cs.AIarxiv:2605.02829v1Lead article

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

Jingze Ge, Yun Liu, Xue Geng, Wanqi Dong, Wang Zhe Mark

This paper introduces JACTUS, a unified framework that jointly performs parameter compression and task adaptation, overcoming the limitations of sequential "compress then adapt" methods. JACTUS estimates gradient covariances from a calibration set to form a task-aware union of subspaces, then performs a globally rank-allocated, low-rank approximation within this union. This approach ensures the compressed subspace is optimally aligned with downstream objectives.

Comparison of three paradigms: PEFT, compression then fine-tuning, and our joint adaptation and compression.
cs.AIarxiv:2605.02600v1Lead article

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Berk Çiçek, Mert K. Er, Özgür S. Öğüz

CoRAL is a modular framework that enables zero-shot control for contact-rich robotic manipulation by decoupling high-level reasoning from low-level control. It uses an LLM as a "cost designer" to synthesize context-aware objective functions for a sampling-based motion planner (MPPI). The system further incorporates a neuro-symbolic loop where a VLM provides initial physical priors that are refined in real-time through online system identification, bridging the gap between LLM reasoning and adaptive physical control.

Real-world execution of CoRAL across six different manipulation tasks.
cs.AIarxiv:2605.02740v1Lead article

Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Fan Ma, Yuntian Liu, Xiang Lan, Weipeng Zhou, Jun Ni

This paper introduces **ReClaim**, a large-scale generative transformer foundation model trained on 43.8 billion medical events from nationwide claims data. ReClaim models complex, longitudinal patient trajectories across diagnoses, procedures, medications, and costs. Its core contribution is demonstrating that this foundation model significantly outperforms existing disease-specific models across over 1,000 prediction tasks, particularly benefiting rare disease prediction.

ReClaim framework and evaluation workflow. (a) Longitudinal medical events from patient claims are encoded as chronologically ordered trajectories, and the ReClaim foundation model autoregressively predicts future medical events including diagnoses, procedures, medications, and expenditure. (b) The study datasets comprise the MarketScan corpus partitioned into a final training set, a held-out internal test cohort for disease and expenditure prediction with retrospective and prospective temporal subsets, and two external testing datasets: EHRShot and Yale New Haven Health (YNHH). This partitioning and external sources enable four testing scenarios. (c) Transformer-based ReClaim models (Qwen3 architecture) are trained at three parameter scales (S: 140M, M: 700M, L: 1.7B) through next-token pre-training, followed for disease onset prediction by task-specific post-training. (d) ReClaim is evaluated on three downstream tasks: disease onset prediction for over 1,000 International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) conditions, next-year healthcare expenditure forecasting, and RWE applications including propensity score modeling using ReClaim embeddings.
cs.AIarxiv:2605.02682v1Lead article

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

Majed El Helou, Benjamin Ryder, Chiara Troiani, Jean Diaconu, Hervé Muyal

This paper introduces Continuous Agent Semantic Authorization (CASA), a hybrid runtime enforcement model to secure LLM-driven agents interacting with tools and resources. It employs a zero-trust interception layer combining five deterministic controls for structural integrity with a semantic inspection layer to validate tool call choices against the subject's original intent. This approach addresses security risks in multi-turn agentic systems by providing continuous visibility into the agent's actions relative to the user's goals.

An agentic application can exploit its intermediary position to hard-code tool calls, substitute tools, tamper with parameters, poison definitions, falsify returned data, or manipulate the LLM to invoke tools outside the intended task scope.
cs.AIarxiv:2605.02888v1Lead article

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

Shikhar Shukla

SpecKV introduces a lightweight, adaptive controller to dynamically select the optimal speculation length ($\gamma$) at each step during speculative decoding. This selection is based on signals extracted directly from the draft model, addressing the limitation of fixed $\gamma$ values. The core contribution is demonstrating that the optimal $\gamma$ varies significantly based on the target model's compression level, leading to improved efficiency over fixed-length speculation.
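Why the best $\gamma$ is not a constant can be seen from the standard speculative-decoding throughput model, sketched below. This is not SpecKV's controller, which reads signals from the draft model: here the per-token acceptance rate `alpha` is simply given, acceptances are assumed independent, and `draft_cost` (draft cost relative to one target forward pass) is a hypothetical value.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target verification pass.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def best_gamma(alpha: float, draft_cost: float = 0.1, max_gamma: int = 16) -> int:
    """Speculation length that maximizes tokens per unit of compute."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_tokens(alpha, g) / (g * draft_cost + 1.0),
    )


# A lower acceptance rate (for example, a draft that matches the target less well)
# favours shorter speculation; a higher one favours longer speculation.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, best_gamma(alpha))
```

If compressing the target model shifts how often drafted tokens are accepted, this model predicts that the optimal speculation length shifts with it, which is consistent with the motivation for choosing $\gamma$ adaptively.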

cs.AIarxiv:2605.02640v1Lead article

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf

This paper argues that conflicts among trustworthy AI objectives (fairness, robustness, etc.) stem from incompatible invariance requirements under different data-generating process changes. The core contribution is proposing that **causality** provides a unifying framework to understand, manage, and potentially resolve these trade-offs by guiding the selection of appropriate invariances. This perspective offers a path toward achieving multiple trustworthy AI goals simultaneously across various model types.

Causal resolution to trade-offs in trustworthy AI. See Appendix A for the details.
cs.LGarxiv:2605.02735v1Lead article

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

This paper introduces the "Silenced Visual Latents" phenomenon, where multimodal models suppress the rich reasoning embedded in continuous visual latents in favor of direct visual input during autoregressive training. To counteract this, the authors propose a method that freezes the backbone and explicitly optimizes the latent reasoning at inference time using query-guided contrastive alignment. This approach effectively "unsilences" the latent space, allowing the model to leverage deeper visual evidence for improved reasoning.

The joint loss landscape of Latent Visual Reasoning. Under joint optimization, the initial latents are simultaneously pulled toward two conflicting attractors: the autoregressive prediction objective (left) and the visual reasoning objective (right). The shortcut favored by the autoregressive prediction objective dissociates latent quality from latent effectiveness and bypasses meaningful latent reasoning, driving the latents toward a compromise state.
cs.AIarxiv:2605.03941v1Lead article

A Benchmark for Interactive World Models with a Unified Action Generation Framework

Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang

This paper introduces **iWorld-Bench**, a comprehensive benchmark designed to evaluate interactive world models on abilities like distance perception and memory, addressing the lack of unified evaluation standards. It features a diverse dataset of 330k video clips and a **Unified Action Generation Framework** to standardize testing across different interaction modalities. The benchmark uses six task types to jointly assess visual generation, trajectory following, and memory capabilities of world models.

Overview of iWorld-Bench. iWorld-Bench encompasses four distinct perspectives: Unmanned Ground Vehicles (UGVs), Unmanned Aerial Vehicles (UAVs), humans, and robotics. It incorporates nine types of outdoor weather conditions, five different indoor lighting conditions, thousands of diverse scenes, and thousands of entities, providing a comprehensive and diverse evaluation environment. The benchmark leverages an Action Generation Framework to systematically and uniformly assess the interaction capabilities of interactive world models across various input modalities. It is composed of six tasks, each involving a varying number of trajectories, designed to evaluate the adaptability and performance of models in dynamic and complex scenarios. Visualization of camera trajectory and view control in iWorldBench: $\rightarrow$ denotes linear control commands for directional movement, $\dashrightarrow$ represents actual trajectories generated by world models, and $\curvearrowright$ indicates curved view rotation in the specified direction.
cs.AIarxiv:2605.03989v1Lead article

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

Dutao Zhang, Tian Liao

This paper introduces **Experience-RAG Skill**, an agent-oriented, pluggable layer that orchestrates retrieval strategies based on the current task context and past experience. The skill dynamically selects the optimal retrieval method from a fixed pool, addressing the limitation of single, fixed pipelines in heterogeneous RAG tasks. This approach effectively encapsulates retrieval strategy selection as a reusable agent skill, achieving strong performance across diverse question-answering benchmarks.

cs.AIarxiv:2605.03916v1Lead article

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

Lisa C. Adams, Linus Marx, Erik Thiele Orberg, Keno Bressem, Sebastian Ziegelmayer

The core method involved comparing "atomic fact-checking," which breaks down AI recommendations into verifiable claims linked to source guidelines, against traditional explainability methods in a randomized trial involving oncologists. The contribution is demonstrating that atomic fact-checking substantially increases clinician trust in Large Language Model recommendations (from 26.9% to 66.5%) compared to conventional transparency approaches, highlighting its effectiveness in high-stakes medical decision support.

cs.AIarxiv:2605.04916v1Lead article

A Foundation Model for Zero-Shot Logical Rule Induction

Yin Jun Phua

This paper introduces the Neural Rule Inducer (NRI), a foundation model for zero-shot logical rule induction. NRI achieves generalization by encoding literals based on domain-agnostic statistical properties rather than specific identities. Its core contribution is enabling the induction of new logical rules without retraining, using a slot-based decoder and differentiable rule execution for end-to-end training.

Neural Rule Inducer takes an episode $(X, Y)$ as input and calculates literal statistics. For each variable we calculate $\phi(x_{i})$, $\phi(\neg x_{i})$ which consists of class-conditional rates ($P^{+}$, $P^{-}$), entropy ($H$), and co-occurrence strength ($C$). We then apply cross-attention over these statistics. The slot-based decoder produces $K$ candidate clauses in parallel using learned literal gates $z$ and clause gates $w$. By evaluating the produced rule with T-norm, we can perform end-to-end training. The rules are discretized to then produce an interpretable DNF rule.
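The identity-free literal statistics and the differentiable rule evaluation described in this caption can be illustrated with a small sketch. It rests on several assumptions not stated in the digest: entropy is taken to be the marginal entropy of each variable, the T-norm is taken to be the product T-norm, and the co-occurrence strength is omitted because its definition is not given here. The function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def literal_stats(X, Y):
    """Per-variable (P+, P-, H) for the positive literal x_i.
    X: list of binary feature rows, Y: list of binary labels.
    P+ = P(x_i = 1 | y = 1), P- = P(x_i = 1 | y = 0); H is the marginal
    entropy of x_i (an assumed definition)."""
    pos = [x for x, y in zip(X, Y) if y == 1]
    neg = [x for x, y in zip(X, Y) if y == 0]
    stats = []
    for i in range(len(X[0])):
        p_plus = sum(r[i] for r in pos) / max(len(pos), 1)
        p_minus = sum(r[i] for r in neg) / max(len(neg), 1)
        p_any = sum(r[i] for r in X) / len(X)
        stats.append((p_plus, p_minus, binary_entropy(p_any)))
    return stats


def soft_dnf(x, clauses):
    """Evaluate a DNF rule on (possibly soft) inputs in [0, 1].
    clauses: list of clauses, each a list of (variable_index, negated) pairs.
    AND uses the product T-norm; OR uses the matching probabilistic sum."""
    not_any = 1.0
    for clause in clauses:
        conj = 1.0
        for i, negated in clause:
            conj *= (1.0 - x[i]) if negated else x[i]
        not_any *= 1.0 - conj
    return 1.0 - not_any
```

Because the statistics depend only on how a variable co-varies with the label, not on which variable it is, the same encoder input format applies to any new episode.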
cs.AIarxiv:2605.04922v1Lead article

Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

Jiangwen Dong, Bo Li, Wanyu Lin

This paper introduces **Evolving Idea Graphs (EIG)**, a novel graph-based framework for multi-agent scientific ideation that moves beyond temporary text coordination. EIG represents partially formed research ideas as graphs where nodes are claims and edges are relations, allowing weaknesses to remain explicitly trackable. A learned controller then guides the agents' refinement process over this evolving graph structure to generate high-quality ideas evaluated on metrics like novelty and feasibility.

Framework of EIG. Benchmark input and permitted literature context initialize role-specialized agents and an evolving idea graph. In each round, active roles propose role-local edits on a frozen graph snapshot; a shared graph encoder feeds an edit head for role-local action selection and a commit head for post-round stopping. When the updated graph is committed, the system synthesizes one structured research proposal.
cs.AIarxiv:2605.05191v1Lead article

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Yijun Lu, Rui Ye, Yuwen Du, Jiajun Wang, Songhua Liu

The paper introduces **Context-ReAct**, an elastic context orchestration paradigm for long-horizon search agents to manage rapidly growing working contexts adaptively. It achieves this through five atomic operations (Skip, Compress, Rollback, Snippet, Delete) that allow the agent to dynamically reshape its context based on relevance. This method effectively controls context size and reduces errors by maintaining different levels of detail for various parts of the agent's trajectory.

LongSeeker-30B delivers strong results on challenging long-horizon benchmarks, matching or surpassing several foundation models and search agents.
cs.AIarxiv:2605.05091v1Lead article

Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior

Hanbo Xie, Akshay K. Jagadish, Lan Pan, Robert C. Wilson

This paper introduces the use of "Think Aloud" verbal protocols as an additional data source, beyond traditional behavioral data, to constrain and guide automated cognitive model discovery using Large Language Models. The core contribution is demonstrating that incorporating this process-level language data significantly improves predictive performance and systematically shifts the structure of the discovered cognitive models, favoring "Integrated utility" models over purely "Explicit comparator" models. This suggests that incorporating verbal reports enables the identification of underlying cognitive mechanisms previously missed by behavioral data alone.

Think-aloud improves model discovery and induces systematic shifts in discovered mechanisms. A, Trial-averaged held-out BIC for each participant’s best discovered model under the behavior-only and think-aloud conditions. Lower BIC indicates better out-of-sample fit. Each pair of points is connected within participant; larger points show the group mean $\pm$ 95% CI. B, Schematic definitions of the three main mechanism families identified from normalized computation graphs: Integrated utility, which transforms and integrates each option before comparison; Explicit comparator, which computes utilities and compares them directly (e.g., $\Delta U = U_{A} - U_{B}$); and Rule-based operator, which applies piecewise or conditional rules before combining information into a choice. C, Row-normalized transition matrix from the behavior-only best-model cluster to the think-aloud best-model cluster. Numbers indicate proportions (counts shown below). Off-diagonal mass indicates mechanism shifts, with 69.4% of participants transitioning to a different cluster.
cs.LGarxiv:2605.05134v1Lead article

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

Dan Wilson, Mohamed Akrout

This paper proposes a low-cost, black-box method for detecting LLM hallucinations by modeling the LLM's response generation as a dynamical system. Using Koopman operator theory on embedded response vectors, the method learns separate transition operators for factual and hallucinated states, defining a residual score based on prediction error. A preference-aware calibration mechanism then optimizes the classification threshold, offering an efficient alternative to expensive sampling methods.

Our proposed differential residual score $\Delta\mathcal{E}$ accurately captures the transition between factual and hallucinated sentences in segmented LLM responses from the WikiBio dataset. The score is computed by comparing the relative accuracy between predictions of the dynamics of token embeddings from Llama-3 with dynamical system models inferred from both hallucinated and non-hallucinated responses.
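The idea of scoring a response by which of two learned dynamics predicts it better can be sketched with a finite-dimensional, least-squares operator fit (a DMD-style approximation of a Koopman operator). This is a minimal illustration, not the authors' pipeline: the function names, the use of raw embedding vectors as observables, and the sign convention of the differential score are all assumptions.

```python
import numpy as np


def fit_operator(trajectories):
    """Least-squares linear operator K with x_{t+1} ~ K x_t.
    trajectories: list of arrays of shape (T, d), one embedded response each."""
    X = np.vstack([t[:-1] for t in trajectories])  # states at step t, one per row
    Y = np.vstack([t[1:] for t in trajectories])   # states at step t + 1
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)      # solves X @ B ~ Y, so B = K^T
    return B.T


def residual(K, traj):
    """Mean one-step prediction error of operator K along one trajectory."""
    pred = traj[:-1] @ K.T
    return float(np.mean(np.linalg.norm(traj[1:] - pred, axis=1)))


def differential_residual(K_fact, K_hall, traj):
    """Positive when the hallucination operator predicts the trajectory better
    than the factual one (this sign convention is an assumption)."""
    return residual(K_fact, traj) - residual(K_hall, traj)
```

In use, `K_fact` and `K_hall` would be fitted on labeled factual and hallucinated responses, and a threshold on `differential_residual` would act as the detector; only embeddings of the generated text are needed, which is what keeps the approach black-box.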
cs.LGarxiv:2605.05112v1Lead article

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao

This paper addresses the inefficiency in binary-reward Reinforcement Learning (RL) where compute is wasted on rollouts with highly skewed success rates. The core method is **Prefix Sampling (PS)**, which actively steers groups toward the theoretically most informative 50% pass rate by replaying trajectory prefixes. The contribution is demonstrating that this 50% operating point maximizes reward entropy and contrastive signal, leading to more efficient learning in agentic environments like SWE-bench.

Prefix Sampling pipeline. For each task we sample a rollout group and route it by pass count: degenerate $0/8$ or $8/8$ groups are filtered, already balanced $3/8$–$5/8$ groups are used for standard training, and skewed groups provide replay prefixes. Mostly failing hard buckets reuse a successful prefix as a head start, while mostly passing easy buckets reuse a failing prefix as a handicap. The current policy generates fresh continuations from the replay-reconstructed prefix state; masking applies RL loss only to continuation tokens, steering rerollouts toward $50\%$ without crediting replayed actions. Counts in the diagram are schematic; experiments use $N=8$ rollouts.
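The routing rule in this caption, and the reason a 50% pass rate is the target, can be sketched in a few lines. The bucket labels and function names are hypothetical; the thresholds (0/8 and 8/8 filtered, 3/8 to 5/8 trained on directly) come from the caption, and the entropy function is the standard Bernoulli entropy.

```python
import math


def reward_entropy(p: float) -> float:
    """Entropy in bits of a binary reward with pass rate p; maximal at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def route_group(passes: int, n: int = 8, lo: int = 3, hi: int = 5) -> str:
    """Route a rollout group of size n by its number of passing rollouts."""
    if passes == 0 or passes == n:
        return "filter"                     # degenerate: no contrast within the group
    if lo <= passes <= hi:
        return "train"                      # already near the 50% operating point
    if passes < lo:
        return "replay_success_prefix"      # hard task: head start from a success
    return "replay_failure_prefix"          # easy task: handicap from a failure


for k in range(9):
    print(k, route_group(k), round(reward_entropy(k / 8), 3))
```

The printed entropies peak at 4/8 and fall to zero at 0/8 and 8/8, which is why skewed groups are re-rolled toward the middle rather than trained on as they are.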
cs.CLarxiv:2605.04948v1Lead article

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

Mullosharaf K. Arabov, Svetlana S. Khaybullina

This paper compares LoRA and QLoRA for adapting large language models to Bashkir, a low-resource agglutinative language. The core method involves fine-tuning various model architectures on a Bashkir corpus using these parameter-efficient techniques. The contribution is demonstrating that QLoRA can achieve quality comparable to full fine-tuning (e.g., on Mistral-7B) while drastically reducing trainable parameters, though performance is architecture-dependent.
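The scale of the parameter reduction follows from how LoRA is constructed: a frozen weight matrix of shape `d_out x d_in` is augmented with two trainable low-rank factors of shapes `d_out x r` and `r x d_in`. The arithmetic below uses an illustrative 4096 x 4096 projection and rank 8; these numbers are assumptions for the example, not the settings used in the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    factor B is d_out x rank and factor A is rank x d_in."""
    return rank * (d_in + d_out)


full = 4096 * 4096                               # full fine-tuning of the matrix
lora = lora_trainable_params(4096, 4096, rank=8)
print(lora, full, f"{100 * lora / full:.2f}%")   # 65536 16777216 0.39%
```

QLoRA keeps the same trainable adapters and additionally stores the frozen base weights in 4-bit precision, so it lowers memory use without changing this trainable-parameter count.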

cs.AIarxiv:2605.05054v1Lead article

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Hongxu Chen, Yanghao Wang, Bowei Zhu, Hongxiang Li, Zhen Wang

This paper introduces Direct Product Flow Matching (DPFM) to improve few-shot adaptation in vision-language models by addressing geometric limitations in existing flow matching methods. DPFM decouples the radial and angular dynamics of cross-modal features using a polar decomposition perspective, resolving issues like angular distortion and radial dynamics neglect. This novel approach leads to more effective and stable adaptation by treating the radial and angular components independently during the continuous flow modeling process.

(a). Single-step parameter-efficient fine-tuning (PEFT) mostly performs cross-modal alignment in a single-step manner. (b). Multi-step flow matching (FM) methods model continuous and multi-step alignment dynamics. During the training stage, (c). FMA undergoes a non-uniform angular speed induced by radial–angular coupling. However, (d). DP-FM follows a constant-speed angular geodesic due to decoupled radial and angular dynamics.
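The contrast between panels (c) and (d) can be illustrated with a toy interpolation path. The sketch splits each feature into a norm and a unit direction, moves the direction along a constant-speed geodesic (spherical linear interpolation), and moves the norm separately. The linear schedule for the norm and the function names are assumptions for illustration; the paper's actual flow parameterization is not reproduced here.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def slerp(u0, u1, t):
    """Constant-angular-speed geodesic between unit vectors u0 and u1.
    Antipodal inputs (angle of pi) are not handled in this sketch."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    omega = math.acos(dot)
    if omega < 1e-9:
        return list(u0)
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(u0, u1)]


def decoupled_path(x0, x1, t):
    """Point at time t on a path with independent radial and angular parts."""
    r0, r1 = norm(x0), norm(x1)
    u0 = [a / r0 for a in x0]
    u1 = [a / r1 for a in x1]
    r = (1 - t) * r0 + t * r1          # radial part: linear in t (assumed)
    return [r * a for a in slerp(u0, u1, t)]


def linear_path(x0, x1, t):
    """Straight-line interpolation, where radius and angle are coupled."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]
```

For `x0 = [1, 0]` and `x1 = [0, 2]`, the decoupled path at `t = 0.5` sits exactly halfway in angle (45 degrees), while the straight-line path is already at about 63 degrees, which is the non-uniform angular speed attributed to radial-angular coupling.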
§ III

Daily Issues This Week

2026-05-04 to 2026-05-10 7