From the arXiv
Thursday, 14 May 2026 · 20 papers
Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
This paper moves beyond simple perplexity comparisons to geometrically and spectrally analyze the solutions produced by five distinct low-rank pre-training methods against full-rank training. The core contribution is a rigorous characterization of how rank constraints alter the learned internal representations and loss…
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
This paper introduces **HistoryAnchor-100**, a benchmark to test whether prior harmful actions steer Large Language Models (LLMs) toward continued unsafe behavior. The core finding is that frontier LLMs, even highly aligned ones, exhibit a striking vulnerability: a simple instruction to "stay consistent with the prior histo…
How to Interpret Agent Behavior
This paper introduces **ACT*ONOMY**, a novel, three-level hierarchical taxonomy (10 actions, 46 subactions, 120 leaf categories) designed to systematically describe and analyze the runtime behavior of autonomous agents from their natural-language traces. The core contribution is providing a structured framework, couple…
Position: Assistive Agents Need Accessibility Alignment
This paper argues that current assistive AI systems fail blind and visually impaired (BVI) users because they are designed assuming sighted interaction and low-cost verification. The core contribution is introducing the concept of **accessibility alignment** as a first-class design objective, rather than a usability afterthought. The authors propos…
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
This paper introduces IMAVB, a benchmark to test whether omnimodal LLMs can detect contradictions between a textual premise and their own sensory input (vision/audio). The core finding is a "Representation-Action Gap": models reliably encode these premise-perception mismatches in their internal states but almost always fail…
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
This paper introduces **SLOP (Sharpened Logarithmic Opinion Pool)**, an extension of inference-time alignment that generalizes techniques to combine ensembles of generative reward models using temperature-adjusted reference models. The core contribution is a novel algorithm for calibrating the SLOP weight parameters to…
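For readers unfamiliar with the underlying idea, the sketch below shows a generic logarithmic opinion pool combined with temperature scaling over a small discrete distribution. It is not the paper's SLOP algorithm or its weight-calibration procedure; the function names `temper` and `log_opinion_pool`, the toy distributions, and the weights are all assumptions chosen for illustration.

```python
import math

def temper(probs, temperature):
    """Raise a strictly positive distribution to the power 1/temperature and renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - peak) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def log_opinion_pool(dists, weights):
    """Weighted geometric mean of distributions: q(x) proportional to prod_i p_i(x)^w_i."""
    n = len(dists[0])
    log_scores = [
        sum(w * math.log(d[i]) for d, w in zip(dists, weights))
        for i in range(n)
    ]
    peak = max(log_scores)
    unnorm = [math.exp(s - peak) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical toy inputs: a uniform "reference" and one more opinionated model.
reference = [0.5, 0.5]
opinion = [0.8, 0.2]

pooled = log_opinion_pool([reference, opinion], [1.0, 1.0])  # approximately [0.8, 0.2]
sharpened = temper(opinion, temperature=0.5)  # a temperature below 1 sharpens the distribution
```

Pooling in log space multiplies the component probabilities, so an outcome that any component considers unlikely is penalized in the combined distribution. All inputs here must be strictly positive, since the sketch takes logarithms directly.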
Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
This paper introduces a novel method for detecting step-level hallucinations in LLM reasoning by analyzing the geometry of the hidden-state trajectory during a single forward pass. The core idea is that correct reasoning follows a stable manifold, and the first error manifests as a localized excursion in transport cost…
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
This paper introduces TFlow, a novel weight-space communication framework for multi-agent LLMs that replaces costly natural language message passing with direct weight updates. The core method involves frozen sender agents generating internal activations, which a learned parameter generator maps into low-rank LoRA pert…
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
This paper introduces **AttenA+**, a framework designed to address "action inequality" in robotic foundation models, which treat all actions equally during training. AttenA+ rectifies this by implementing a **velocity-driven action attention mechanism** that dynamically reweights the training objective, priori…
Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
This paper introduces a method for generating controllable and age-appropriate children's English reading stories by **supervised fine-tuning compact (8B-parameter) LLMs** using expert-designed curriculum data. The core contribution is demonstrating that **fine-tuning prioritizes controllability and affordability over …
Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models
This paper introduces **DyGFM**, a novel Dynamic Graph Foundation Model designed for multi-domain generalization. The core method employs a **decoupled and divergence-conditioned prompting** strategy: a dual-branch pre-training disentangles transferable semantics from domain-specific temporal dynamics, and a divergence…
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench is a novel end-to-end framework designed to evaluate voice agents by addressing two key challenges: generating realistic, multi-turn audio conversations and comprehensively measuring quality. It achieves realistic simulation through bot-to-bot orchestration with automatic error detection and regeneration. The…
Harnessing Agentic Evolution
This paper introduces **AEvo**, a harnessed meta-editing framework for agentic evolution. It models the evolution process as an interactive environment where the accumulated context acts as the state. The core contribution is using a **meta-agent to observe this state and edit the underlying evolution procedure** itsel…
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
This paper introduces **RealICU**, a novel benchmark designed to evaluate LLMs on long-context ICU data by moving beyond imitating potentially suboptimal past clinician actions. Its core contribution is using **hindsight annotations** created by senior physicians reviewing the *full* patient trajectory to establish mor…
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
ScioMind introduces a cognitively grounded framework for LLM-based multi-agent social simulation, bridging fixed rules and unconstrained LLM interaction. Its core method integrates a belief update rule modulated by personality-conditioned anchoring strength, a hierarchical memory for experience-driven belief formation,…
WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
WARDEN is a system designed to transcribe and translate the endangered Wardaman language into English using only 6 hours of training data. It addresses the low-resource challenge by employing a two-stage pipeline: a dedicated model for audio-to-phonemic transcription, followed by a separate model for transcription-to-E…
Learning POMDP World Models from Observations with Language-Model Priors
This paper introduces **Pinductor**, a method that leverages **Large Language Model (LLM) priors** to learn **Partially-Observable Markov Decision Process (POMDP) world models** from limited observation-action trajectories. Pinductor uses the LLM to propose and iteratively refine candidate POMDP models based on a belie…
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
MILM addresses multimodal irregular time series (MITS) by converting them into time-ordered XML triplets to leverage Large Language Models (LLMs). The core method involves a two-stage fine-tuning strategy: first, training the LLM solely on sampling patterns (with redacted values) to learn temporal structure, and second…
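To make the serialization idea concrete, here is a minimal sketch of turning irregularly sampled observations into time-ordered XML triplets, with an option to redact values so that only the sampling pattern remains. The tag names (`series`, `obs`, `time`, `var`, `value`), the `[REDACTED]` placeholder, and the function `to_xml_triplets` are hypothetical; the summary does not specify the paper's actual schema.

```python
import xml.etree.ElementTree as ET

def to_xml_triplets(observations, redact_values=False):
    """Serialize (time, variable, value) observations as time-ordered XML.

    With redact_values=True the values are masked, leaving only
    when each variable was sampled.
    """
    root = ET.Element("series")
    for t, var, val in sorted(observations, key=lambda o: o[0]):
        obs = ET.SubElement(root, "obs")
        ET.SubElement(obs, "time").text = str(t)
        ET.SubElement(obs, "var").text = var
        ET.SubElement(obs, "value").text = "[REDACTED]" if redact_values else str(val)
    return ET.tostring(root, encoding="unicode")

# Hypothetical irregular samples, deliberately given out of time order.
samples = [(2.5, "heart_rate", 88), (0.0, "temperature", 37.1), (1.2, "heart_rate", 91)]

full_xml = to_xml_triplets(samples)
pattern_only_xml = to_xml_triplets(samples, redact_values=True)
```

The redacted form corresponds loosely to the first fine-tuning stage described above, where the model sees sampling patterns without values; how the paper actually encodes or masks them is not stated in the summary.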
Sampling from Flow Language Models via Marginal-Conditioned Bridges
This paper introduces a novel sampling method for Flow Language Models (FLMs) that leverages their unique structure where each denoising block yields a posterior marginal distribution over the clean token. Instead of collapsing to a single conditional mean, the proposed "marginal-conditioned bridge" sampler works by it…
An LLM-Based System for Argument Reconstruction
This paper introduces an end-to-end LLM-based system designed to reconstruct natural language arguments into abstract argument graphs. The system employs a multi-stage pipeline to identify argumentative components (premises and conclusions) and their logical relations (support, attack, undercut). Its contribution lies …
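As an illustration of the target representation, the sketch below models an abstract argument graph with typed components and relations. The class `ArgumentGraph` and its methods are hypothetical and do not reflect the system's actual data model. As a simplification, an undercut is stored as a node-to-node relation here, although argumentation theory usually has it target an inference rather than a statement.

```python
from dataclasses import dataclass, field

COMPONENT_ROLES = {"premise", "conclusion"}
RELATION_KINDS = {"support", "attack", "undercut"}

@dataclass
class ArgumentGraph:
    components: dict = field(default_factory=dict)  # id -> (role, text)
    relations: list = field(default_factory=list)   # (source_id, kind, target_id)

    def add_component(self, cid, role, text):
        if role not in COMPONENT_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.components[cid] = (role, text)

    def add_relation(self, source, kind, target):
        if kind not in RELATION_KINDS:
            raise ValueError(f"unknown relation kind: {kind}")
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be existing components")
        self.relations.append((source, kind, target))

    def sources_of(self, target, kind):
        """Return ids of components related to `target` by the given kind."""
        return [s for s, k, t in self.relations if k == kind and t == target]

# Hypothetical toy argument.
graph = ArgumentGraph()
graph.add_component("c1", "conclusion", "The bridge should be closed.")
graph.add_component("p1", "premise", "Inspectors found structural cracks.")
graph.add_component("p2", "premise", "The cracks are cosmetic.")
graph.add_relation("p1", "support", "c1")
graph.add_relation("p2", "attack", "p1")
```

Calling `graph.sources_of("c1", "support")` returns the premises offered for the conclusion, which is the kind of query a downstream analysis of a reconstructed argument might run.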