From the arXiv
Thursday, 7 May 2026 · 20 papers
Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
This paper introduces an automated, contrastive evaluation pipeline to audit the behavioral impact of interventions on language models by comparing generations from a base model ($M_1$) and an intervention model ($M_2$). The method generates statistically validated, natural-language hypotheses describing model differen…
Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours
The paper introduces **Design Conductor 2.0**, an advanced multi-agent system capable of autonomously designing complex hardware, handling tasks 80 times larger than those handled by its predecessor. Its core contribution is demonstrating this capability by designing **VerTQ**, a high-performance, 240-cycle pipeline LLM inference accel…
EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
This paper introduces EP-GRPO to address credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning. EP-GRPO integrates entropy-gated modulation to prioritize informative decision points and uses implicit process guidance derived from policy divergence relative to outcome advantages. This…
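For readers unfamiliar with the baseline, the sketch below shows the group-relative advantage as it is commonly described for standard GRPO, not the EP-GRPO method itself. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions. Each sampled response receives a single scalar that is shared by all of its tokens, which is one way to see why token-level credit assignment is coarse in the baseline.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Baseline GRPO-style advantage: normalize each reward within its group.

    `rewards` holds one scalar outcome reward per sampled response to the
    same prompt. `eps` is an assumed stabilizer for zero-variance groups.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt with binary outcome rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Here the successful rollouts get an advantage of roughly +1 and the failed ones roughly -1, while the uniform group yields all zeros.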
Executable World Models for ARC-AGI-3 in the Era of Coding Agents
This paper introduces a coding agent system for ARC-AGI-3 that employs an **executable Python world model** to simulate and plan actions. The core method involves **verifying the model against observations and refactoring it for simplicity** (as an MDL proxy) before execution. The contribution is demonstrating this dir…
Misaligned by Reward: Socially Undesirable Preferences in LLMs
This paper introduces a framework to evaluate whether Large Language Model (LLM) reward models capture socially desirable preferences by converting social evaluation datasets into pairwise preference data. The core method tests if these reward models prefer socially undesirable responses across domains like bias, safet…
SoK: Robustness in Large Language Models against Jailbreak Attacks
This paper systematically surveys jailbreak attacks and defenses against Large Language Models (LLMs) by proposing a taxonomy to structure the field. Its core contribution is the introduction of **Security Cube**, a unified, multi-dimensional evaluation framework designed to comprehensively assess the robustness of LLM…
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
Uno-Orchestra introduces a unified reinforcement learning (RL) policy that jointly learns when to decompose a task and which specific model/primitive pair should handle each resulting subtask. This selective delegation approach optimizes decomposition depth, worker choice, and inference budget simultaneously. The metho…
On the Hardness of Junking LLMs
This paper investigates the "junking" of LLMs, focusing on the hardness of finding naturally occurring, instruction-free token sequences (natural backdoors) that trigger harmful outputs. The core contribution is assessing the difficulty of discovering these backdoors, contrasting them with traditional, explicitly struc…
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
This paper introduces **Self-Induced Outcome Potential (SIOP)** to provide turn-level credit assignment for long-horizon LLM agents without relying on external verifiers or final answer supervision. SIOP clusters the semantic outcomes of multiple agent rollouts into latent future states and rewards intermediate turns f…
Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals
This paper introduces a lightweight, single-pass method to detect LLM hallucinations by analyzing internal attention dynamics. The core technique measures the Kullback-Leibler divergence between each attention head's output distribution and a uniform distribution, using these divergence features to predict answer corre…
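As a rough illustration of the kind of feature described, the sketch below computes the KL divergence between one attention distribution and the uniform distribution over the same positions. It assumes the "output distribution" is a head's attention weights over key positions, which the summary does not confirm; the function name `kl_to_uniform` and the example weights are hypothetical.

```python
import math

def kl_to_uniform(weights):
    """KL(p || U) in nats, where p is `weights` normalized to sum to 1
    and U is uniform over the same number of positions."""
    n = len(weights)
    total = sum(weights)
    kl = 0.0
    for w in weights:
        p = w / total
        if p > 0.0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p * n)  # p * log(p / (1/n))
    return kl

# Made-up attention rows for two heads over four positions.
uniform_head = [0.25, 0.25, 0.25, 0.25]
peaked_head = [0.97, 0.01, 0.01, 0.01]

# One divergence feature per head; a downstream classifier (not shown)
# would map such features to a correctness prediction.
features = [kl_to_uniform(uniform_head), kl_to_uniform(peaked_head)]
```

The divergence is 0 for a perfectly uniform head and reaches its maximum of log n when all attention mass sits on a single position.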
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
This paper administers 45 psychometric questionnaires to LLMs, revealing that the primary axis of psychometric difference separates models based on items describing **phenomenally rich experience** (e.g., sensation, affect) from those describing mere stimulus-driven reactivity. The authors introduce the **Pinocchio sco…
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation
This paper investigates why aligning large language models with expert judgment is challenging in subjective evaluation tasks. The core method involves analyzing expert evaluations and follow-up questionnaires to see how different forms of expert information impact alignment. The key contribution is revealing that alig…
A Foundation Model for Zero-Shot Logical Rule Induction
This paper introduces the Neural Rule Inducer (NRI), a foundation model for zero-shot logical rule induction. NRI achieves generalization by encoding literals based on domain-agnostic statistical properties rather than specific identities. Its core contribution is enabling the induction of new logical rules without ret…
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
This paper introduces **Evolving Idea Graphs (EIG)**, a novel graph-based framework for multi-agent scientific ideation that moves beyond temporary text coordination. EIG represents partially formed research ideas as graphs where nodes are claims and edges are relations, allowing weaknesses to remain explicitly trackab…
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
The paper introduces **Context-ReAct**, an elastic context orchestration paradigm for long-horizon search agents to manage rapidly growing working contexts adaptively. It achieves this through five atomic operations (Skip, Compress, Rollback, Snippet, Delete) that allow the agent to dynamically reshape its context base…
Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior
This paper uses "Think Aloud" verbal protocols as a data source alongside traditional behavioral data to constrain and guide automated cognitive model discovery with Large Language Models. The core contribution is demonstrating that incorporating this process-level language data significant…
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
This paper proposes a low-cost, black-box method for detecting LLM hallucinations by modeling the LLM's response generation as a dynamical system. Using Koopman operator theory on embedded response vectors, the method learns separate transition operators for factual and hallucinated states, defining a residual score ba…
Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
This paper addresses the inefficiency in binary-reward Reinforcement Learning (RL) where compute is wasted on rollouts with highly skewed success rates. The core method is **Prefix Sampling (PS)**, which actively steers groups toward the theoretically most informative 50% pass rate by replaying trajectory prefixes. The…
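One simple way to see why a 50% pass rate is a natural target, offered as an illustration and not as the paper's own argument: a binary reward with success probability p has variance p(1 - p), which peaks at p = 0.5, and a rollout group whose rewards are all identical provides no contrast between rollouts. The helper names below are hypothetical, and the sketch does not implement Prefix Sampling.

```python
def bernoulli_variance(pass_rate):
    """Variance of a binary (0/1) reward with the given success probability."""
    return pass_rate * (1.0 - pass_rate)

def group_has_signal(rewards):
    """True when a group of binary rewards mixes successes and failures."""
    return 0 < sum(rewards) < len(rewards)

# Scan a coarse grid of pass rates and pick the one with the highest variance.
rates = [i / 10 for i in range(11)]
best_rate = max(rates, key=bernoulli_variance)

# Skewed groups waste compute: all-pass and all-fail groups carry no contrast.
all_pass = group_has_signal([1, 1, 1, 1])
mixed = group_has_signal([1, 0, 1, 0])
```

On this grid the variance is largest at a pass rate of 0.5, where it equals 0.25.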
Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir
This paper comparatively studies LoRA and QLoRA for adapting large language models to the low-resource agglutinative Bashkir language. The core method involves fine-tuning various model architectures on a Bashkir corpus using these parameter-efficient techniques. The contribution is demonstrating that QLoRA can achieve…
Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation
This paper introduces Direct Product Flow Matching (DPFM) to improve few-shot adaptation in vision-language models by addressing geometric limitations in existing flow matching methods. DPFM decouples the radial and angular dynamics of cross-modal features using a polar decomposition perspective, resolving issues like …