№01
cs.AI arxiv:2605.05090v1

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau et al.

This paper introduces an automated, contrastive evaluation pipeline to audit the behavioral impact of interventions on language models by comparing generations from a base model ($M_1$) and an intervention model ($M_2$). The method generates statistically validated, natural-language hypotheses describing model differen…

9
№02
cs.AI arxiv:2605.05170v1

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

The Verkor Team, Ravi Krishna, Suresh Krishna et al.

The paper introduces **Design Conductor 2.0**, an advanced multi-agent system capable of autonomously designing complex hardware, handling tasks 80 times larger than its predecessor. Its core contribution is demonstrating this capability by designing **VerTQ**, a high-performance, 240-cycle pipeline LLM inference accel…

9
№03
cs.AI arxiv:2605.04960v1

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

Song Yu, Li Li, Wenwen Zhao et al.

This paper introduces EP-GRPO to address credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning. EP-GRPO integrates entropy-gated modulation to prioritize informative decision points and uses implicit process guidance derived from policy divergence relative to outcome advantages. This…
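
For context, the group-relative baseline in standard GRPO can be sketched as below. This illustrates plain GRPO advantage normalization, not EP-GRPO itself; the function name, the use of the population standard deviation, and the `eps` constant are assumptions made for the example. Every token in a rollout inherits the same scalar advantage, which is the coarse credit assignment the summary refers to.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper: standard GRPO-style normalization, assuming the
    population standard deviation and a small eps to avoid division by zero.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # One scalar per rollout; every token of that rollout shares it.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where all rollouts get the same reward yields zero advantage,
# so that group contributes no learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```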

9
№04
cs.AI arxiv:2605.05138v1

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

This paper introduces a coding agent system for ARC-AGI-3 that employs an **executable Python world model** to simulate and plan actions. The core method involves **verifying the model against observations and refactoring it for simplicity** (as an MDL proxy) before execution. The contribution is demonstrating this dir…

9
№05
cs.AI arxiv:2605.05003v1

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Gayane Ghazaryan, Esra Dönmez

This paper introduces a framework to evaluate whether Large Language Model (LLM) reward models capture socially desirable preferences by converting social evaluation datasets into pairwise preference data. The core method tests if these reward models prefer socially undesirable responses across domains like bias, safet…
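
The pairwise test described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the triple layout `(prompt, desirable, undesirable)`, the function name, and the toy length-based scorer are all hypothetical and stand in for a real reward model and the paper's actual datasets.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str, str]  # (prompt, desirable response, undesirable response)


def undesirable_preference_rate(
    pairs: Iterable[Pair], score: Callable[[str, str], float]
) -> float:
    """Fraction of pairs where the scorer ranks the undesirable response higher."""
    items = list(pairs)
    if not items:
        raise ValueError("pairs must be non-empty")
    wrong = sum(1 for prompt, good, bad in items if score(prompt, bad) > score(prompt, good))
    return wrong / len(items)


def toy_length_scorer(prompt: str, response: str) -> float:
    # Placeholder scorer (assumption): prefers longer responses.
    return float(len(response))


toy_pairs = [
    ("q1", "ok", "a much longer undesirable answer"),
    ("q2", "a long desirable answer", "bad"),
]
rate = undesirable_preference_rate(toy_pairs, toy_length_scorer)
```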

9
№06
cs.AI arxiv:2605.05058v1

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He et al.

This paper systematically surveys jailbreak attacks and defenses against Large Language Models (LLMs) by proposing a taxonomy to structure the field. Its core contribution is the introduction of **Security Cube**, a unified, multi-dimensional evaluation framework designed to comprehensively assess the robustness of LLM…

9
№07
cs.AI arxiv:2605.05007v1

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Zhiqing Cui, Haotong Xie, Jiahao Yuan et al.

Uno-Orchestra introduces a unified reinforcement learning (RL) policy that jointly learns when to decompose a task and which specific model/primitive pair should handle each resulting subtask. This selective delegation approach optimizes decomposition depth, worker choice, and inference budget simultaneously. The metho…

9
№08
cs.LG arxiv:2605.05116v1

On the Hardness of Junking LLMs

Marco Rando, Samuel Vaiter

This paper investigates the "junking" of LLMs, focusing on the hardness of finding naturally occurring, instruction-free token sequences (natural backdoors) that trigger harmful outputs. The core contribution is assessing the difficulty of discovering these backdoors, contrasting them with traditional, explicitly struc…

9
№09
cs.LG arxiv:2605.04984v1

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Senkang Hu, Yong Dai, Xudong Han et al.

This paper introduces **Self-Induced Outcome Potential (SIOP)** to provide turn-level credit assignment for long-horizon LLM agents without relying on external verifiers or final answer supervision. SIOP clusters the semantic outcomes of multiple agent rollouts into latent future states and rewards intermediate turns f…

9
№10
cs.CL arxiv:2605.05025v1

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

Gijs van Dijk

This paper introduces a lightweight, single-pass method to detect LLM hallucinations by analyzing internal attention dynamics. The core technique measures the Kullback-Leibler divergence between each attention head's output distribution and a uniform distribution, using these divergence features to predict answer corre…
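
The divergence feature mentioned above can be sketched as below. This is an illustration, not the paper's implementation: it assumes each head yields a non-negative weight vector over positions and that the divergence is taken as KL(p || uniform); the function name is hypothetical.

```python
import math
from typing import Sequence


def kl_from_uniform(weights: Sequence[float]) -> float:
    """KL(p || u) for a distribution p over n positions and uniform u.

    Equivalent to log(n) - H(p): zero for perfectly diffuse attention,
    log(n) when all mass sits on a single position.
    """
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with positive mass")
    kl = 0.0
    for w in weights:
        p = w / total  # renormalize in case the weights do not sum to 1
        if p > 0:
            kl += p * math.log(p * n)  # p * log(p / (1/n))
    return kl


diffuse = kl_from_uniform([0.25, 0.25, 0.25, 0.25])
peaked = kl_from_uniform([1.0, 0.0, 0.0, 0.0])
```

One such value per head gives a fixed-length feature vector that a simple classifier could consume in a single forward pass.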

9
№11
cs.CL arxiv:2605.05080v1

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic et al.

This paper administers 45 psychometric questionnaires to LLMs, revealing that the primary axis of psychometric difference separates models based on items describing **phenomenally rich experience** (e.g., sensation, affect) from those describing mere stimulus-driven reactivity. The authors introduce the **Pinocchio sco…

9
№12
cs.CL arxiv:2605.04972v1

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki et al.

This paper investigates why aligning large language models with expert judgment is challenging in subjective evaluation tasks. The core method involves analyzing expert evaluations and follow-up questionnaires to see how different forms of expert information impact alignment. The key contribution is revealing that alig…

9
№13
cs.AI arxiv:2605.04916v1

A Foundation Model for Zero-Shot Logical Rule Induction

Yin Jun Phua

This paper introduces the Neural Rule Inducer (NRI), a foundation model for zero-shot logical rule induction. NRI achieves generalization by encoding literals based on domain-agnostic statistical properties rather than specific identities. Its core contribution is enabling the induction of new logical rules without ret…

8
№14
cs.AI arxiv:2605.04922v1

Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

Jiangwen Dong, Bo Li, Wanyu Lin

This paper introduces **Evolving Idea Graphs (EIG)**, a novel graph-based framework for multi-agent scientific ideation that moves beyond temporary text coordination. EIG represents partially formed research ideas as graphs where nodes are claims and edges are relations, allowing weaknesses to remain explicitly trackab…

8
№15
cs.AI arxiv:2605.05191v1

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Yijun Lu, Rui Ye, Yuwen Du et al.

The paper introduces **Context-ReAct**, an elastic context orchestration paradigm that lets long-horizon search agents adaptively manage rapidly growing working contexts. It does so through five atomic operations (Skip, Compress, Rollback, Snippet, Delete) that allow the agent to dynamically reshape its context base…

8
№16
cs.AI arxiv:2605.05091v1

Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior

Hanbo Xie, Akshay K. Jagadish, Lan Pan et al.

This paper introduces the use of "Think Aloud" verbal protocols as an additional data source, beyond traditional behavioral data, to constrain and guide automated cognitive model discovery using Large Language Models. The core contribution is demonstrating that incorporating this process-level language data significant…

8
№17
cs.LG arxiv:2605.05134v1

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

Dan Wilson, Mohamed Akrout

This paper proposes a low-cost, black-box method for detecting LLM hallucinations by modeling the LLM's response generation as a dynamical system. Using Koopman operator theory on embedded response vectors, the method learns separate transition operators for factual and hallucinated states, defining a residual score ba…

8
№18
cs.LG arxiv:2605.05112v1

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo et al.

This paper addresses the inefficiency in binary-reward Reinforcement Learning (RL) where compute is wasted on rollouts with highly skewed success rates. The core method is **Prefix Sampling (PS)**, which actively steers groups toward the theoretically most informative 50% pass rate by replaying trajectory prefixes. The…
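
One common way to see why a 50% pass rate is the most informative regime: a binary reward with pass rate p has variance p(1 - p), which peaks at p = 0.5 and vanishes when every rollout passes or every rollout fails. The sketch below checks this numerically; it is an assumption that this matches the paper's own theoretical argument, and the function names are hypothetical.

```python
def reward_variance(pass_rate: float) -> float:
    """Variance of a Bernoulli (binary) reward with the given pass rate."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return pass_rate * (1.0 - pass_rate)


def most_informative_pass_rate(steps: int = 100) -> float:
    """Grid-search the pass rate that maximizes reward variance."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=reward_variance)


best = most_informative_pass_rate()
```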

8
№19
cs.CL arxiv:2605.04948v1

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

Mullosharaf K. Arabov, Svetlana S. Khaybullina

This paper compares LoRA and QLoRA for adapting large language models to Bashkir, a low-resource agglutinative language. The core method involves fine-tuning various model architectures on a Bashkir corpus using these parameter-efficient techniques. The contribution is demonstrating that QLoRA can achieve…

8
№20
cs.AI arxiv:2605.05054v1

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Hongxu Chen, Yanghao Wang, Bowei Zhu et al.

This paper introduces Direct Product Flow Matching (DPFM) to improve few-shot adaptation in vision-language models by addressing geometric limitations in existing flow matching methods. DPFM decouples the radial and angular dynamics of cross-modal features using a polar decomposition perspective, resolving issues like …

7