From the arXiv
Wednesday, 6 May 2026 · 20 papers
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
This paper presents a hands-on practicum that guides users through the entire modern NLP pipeline, from tokenization to RLHF. Its core contribution is twelve reproducible research artifacts, with code and models required to be published publicly for each session, all built around a single evolving corpus. Th…
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
This paper introduces **Contextual Multi-Objective Optimization (CMOO)** to address the unreliability of frontier AI in open-ended tasks where objectives are ambiguous or context-dependent. The core method formulates the problem so that AI systems must actively consider and dynamically select among multiple, …
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
This paper introduces **TraceLift**, a reinforcement learning framework that trains reasoning planners using **executor-grounded rewards**, moving beyond simple final-answer correctness. TraceLift uses a frozen executor to evaluate the utility of the planner's intermediate reasoning trace, generating a reward that cred…
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
ELAS proposes a novel framework for efficient large language model (LLM) pre-training by combining low-rank adaptation with 2:4 structured sparsity applied specifically to the activation matrices. This addresses the memory bottleneck caused by full-rank activations in existing low-rank methods. The core contribution is…
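For readers unfamiliar with the term, 2:4 structured sparsity means that at most two values in every contiguous group of four are non-zero. The sketch below illustrates only that pattern, using a generic keep-the-two-largest-magnitudes rule on a plain Python list. The function name `prune_2_of_4` and the selection rule are assumptions for illustration; the summary does not say how ELAS chooses which activations to keep.

```python
def prune_2_of_4(row):
    """Zero all but the 2 largest-magnitude values in each group of 4.

    Illustrative magnitude-based rule only (an assumption, not the ELAS method).
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes within this group of four.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


activations = [0.1, -2.0, 0.5, 3.0, 1.0, 0.0, -0.2, 0.3]
sparse = prune_2_of_4(activations)
print(sparse)  # [0.0, -2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.3]
```

Because every group of four carries at most two non-zero entries, a 2:4 tensor can be stored as the kept values plus small per-group position metadata, which is where the memory saving comes from.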
From Intent to Execution: Composing Agentic Workflows with Agent Recommendation
This paper introduces an automated framework to compose Multi-Agent Systems (MAS) directly from a user's intent, replacing manual planning and agent selection. The core method involves an LLM-derived planner generating tasks, which are then mapped to suitable agents via a novel two-stage Agent Recommender (fast retriev…
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
MEMTIER introduces a tripartite, tiered memory architecture to combat memory degradation in long-running AI agents, addressing failure modes in flat-file systems. Its core method involves a structured episodic store, a weighted retrieval engine, and a policy framework (PPO) to dynamically manage and promote information…
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
MOSAIC-Bench addresses the vulnerability of coding agents that comply with sequenced, innocuous requests to produce exploitable code, a weakness missed by isolated safety evaluations. The benchmark comprises 199 three-stage attack chains across various software substrates and CWE classes, evaluating both the final expl…
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
OpenSeeker-v2 demonstrates that a simple Supervised Fine-Tuning (SFT) approach can effectively train powerful search agents, challenging the need for resource-intensive pipelines such as Reinforcement Learning. The core method involves synthesizing high-quality, informative, and difficult training trajectories by scaling …
OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking
OracleProto introduces a reproducible framework to rigorously benchmark the native forecasting ability of Large Language Models (LLMs). It achieves this by reconstructing resolved events into time-bounded forecasting samples, specifically employing **knowledge cutoff** and **temporal masking** techniques. This method r…
QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs
QKVShare introduces a framework for efficient, quantized Key-Value (KV) cache handoff between agents in on-device multi-agent LLMs. It utilizes token-level mixed-precision allocation and a self-contained "CacheCard" representation to enable faster context transfer than full re-prefill. This method significantly reduces…
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
This paper introduces an AI red teaming agent built on the Dreadnode SDK to significantly accelerate vulnerability testing. The core method involves an agent that automatically constructs complex testing workflows, leveraging a large library of attacks, transforms, and scorers, based on natural language operator goals.…
Safety and accuracy follow different scaling laws in clinical large language models
This paper introduces **SaFE-Scale**, a framework to analyze how clinical LLM safety and accuracy diverge as scaling factors (model size, context, retrieval, compute) change. The authors demonstrate that improving accuracy does not guarantee improved safety, using the new **RadSaFE-200** benchmark, which specifically targets …
Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones
This paper introduces an agent-enhanced LLM framework for controlling UAV swarms using natural language mission specifications. The core method involves an LLM Agent Core interacting with drones via a Model Context Protocol (MCP) gateway, which standardizes drone interfaces using Web of Things (WoT) standards. This ena…
Steer Like the LLM: Activation Steering that Mimics Prompting
This paper introduces Prompt Steering Replacement (PSR) models to improve activation steering by mimicking the token-specific intervention patterns of successful prompt steering. The core method involves training simpler models to estimate token-specific steering coefficients directly from activations, aiming to replic…
TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
TRACE is an engineering framework for trustworthy agentic AI in critical domains, featuring a four-layer architecture with a distinct split between classical ML and LLM validators. Its core contribution is a metrologically grounded trust-metric suite aligned with international standards and the introduction of the Comp…
What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity
This paper introduces **GLANCE**, a framework that enhances Vision-Language Model (VLM) agents' exploration in partially observable environments. GLANCE drives active exploration by generating an intrinsic curiosity signal based on the **discrepancy between the agent's linguistic world model predictions and the actual …
Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus
This paper benchmarks various Parameter-Efficient Fine-Tuning (PEFT) methods, including LoRA and QLoRA, for adapting large language models to low-resource Tajik text generation. The core contribution is the creation and release of the largest open-access Tajik Web Corpus to facilitate this research. The study found tha…
A Benchmark for Interactive World Models with a Unified Action Generation Framework
This paper introduces **iWorld-Bench**, a comprehensive benchmark designed to evaluate interactive world models on abilities like distance perception and memory, addressing the lack of unified evaluation standards. It features a diverse dataset of 330k video clips and a **Unified Action Generation Framework** to standa…
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
This paper introduces **Experience-RAG Skill**, an agent-oriented, pluggable layer that orchestrates retrieval strategies based on the current task context and past experience. The skill dynamically selects the optimal retrieval method from a fixed pool, addressing the limitation of single, fixed pipelines in heterogen…
Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial
The core method involved comparing "atomic fact-checking," which breaks down AI recommendations into verifiable claims linked to source guidelines, against traditional explainability methods in a randomized trial involving oncologists. The contribution is demonstrating that atomic fact-checking substantially increases …