№01
cs.CL arxiv:2605.03799v1

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Mullosharaf K. Arabov

This paper presents a hands-on practicum that walks users through the entire modern NLP pipeline, from tokenization to RLHF. Its core contribution is a set of twelve reproducible research artifacts, all built around a single evolving corpus, with public release of code and models required for each session. Th…

10
№02
cs.AI arxiv:2605.03900v1

Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

Jie Zhou, Qin Chen, Liang He

This paper introduces **Contextual Multi-Objective Optimization (CMOO)** to address the unreliability of frontier AI systems in open-ended tasks where objectives are ambiguous or context-dependent. The core method involves formulating the problem so that AI systems must actively consider and dynamically select among multiple, …

9
№03
cs.AI arxiv:2605.03862v1

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu et al.

This paper introduces **TraceLift**, a reinforcement learning framework that trains reasoning planners using **executor-grounded rewards**, moving beyond simple final-answer correctness. TraceLift uses a frozen executor to evaluate the utility of the planner's intermediate reasoning trace, generating a reward that cred…

9
№04
cs.AI arxiv:2605.03667v1

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

Jiaxi Li, Lu Yin, Li Shen et al.

ELAS proposes a novel framework for efficient large language model (LLM) pre-training by combining low-rank adaptation with 2:4 structured sparsity applied specifically to the activation matrices. This addresses the memory bottleneck caused by full-rank activations in existing low-rank methods. The core contribution is…
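To make the 2:4 pattern concrete, here is a minimal sketch of what "2:4 structured sparsity" means for one row of activations: in every contiguous group of four values, at most two stay non-zero. Selecting the two largest-magnitude entries is the common convention and is an assumption here. The function name `prune_2_of_4` is hypothetical, and this is not the ELAS training procedure itself.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude entries in each contiguous group of four.

    Illustrative only: magnitude-based selection is assumed, not taken from the paper.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


activations = [0.1, -2.0, 0.5, 3.0, 1.0, 0.0, -0.2, 0.3]
sparse = prune_2_of_4(activations)
print(sparse)  # [0.0, -2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.3]
```

The fixed two-of-four layout is what lets hardware with structured-sparsity support skip the zeroed entries, which unstructured pruning does not guarantee.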

9
№05
cs.AI arxiv:2605.03986v1

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

Kishan Athrey, Ramin Pishehvar, Brian Riordan et al.

This paper introduces an automated framework to compose Multi-Agent Systems (MAS) directly from a user's intent, replacing manual planning and agent selection. The core method involves an LLM-derived planner generating tasks, which are then mapped to suitable agents via a novel two-stage Agent Recommender (fast retriev…

9
№06
cs.AI arxiv:2605.03675v1

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Bronislav Sidik, Lior Rokach

MEMTIER introduces a tripartite, tiered memory architecture to combat memory degradation in long-running AI agents, addressing failure modes in flat-file systems. Its core method involves a structured episodic store, a weighted retrieval engine, and a policy framework (PPO) to dynamically manage and promote information…

9
№07
cs.AI arxiv:2605.03952v1

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Jonathan Steinberg, Oren Gal

MOSAIC-Bench addresses the vulnerability of coding agents that comply with sequenced, innocuous requests to produce exploitable code, a weakness missed by isolated safety evaluations. The benchmark comprises 199 three-stage attack chains across various software substrates and CWE classes, evaluating both the final expl…

9
№08
cs.AI arxiv:2605.04036v1

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Yuwen Du, Rui Ye, Shuo Tang et al.

OpenSeeker-v2 demonstrates that a simple Supervised Fine-Tuning (SFT) approach can effectively train powerful search agents, challenging the need for resource-intensive pipelines like Reinforcement Learning. The core method involves synthesizing high-quality, informative, and difficult training trajectories by scaling …

9
№09
cs.AI arxiv:2605.03762v1

OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking

Yiding Ma, Chengyun Ruan, Kaibo Huang et al.

OracleProto introduces a reproducible framework to rigorously benchmark the native forecasting ability of Large Language Models (LLMs). It achieves this by reconstructing resolved events into time-bounded forecasting samples, specifically employing **knowledge cutoff** and **temporal masking** techniques. This method r…
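As a rough illustration of how a resolved event might be turned into a time-bounded forecasting sample, the sketch below applies two filters: a knowledge-cutoff check (skip events that resolved before the model's training cutoff) and temporal masking (drop evidence published after the forecast date). All class names, field names, and the exact filtering rules are hypothetical assumptions for illustration and are not taken from OracleProto.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Evidence:
    published: date
    text: str


@dataclass
class ResolvedEvent:
    question: str
    resolved_on: date
    outcome: str
    evidence: List[Evidence]


@dataclass
class ForecastSample:
    question: str
    as_of: date
    context: List[str]
    label: str


def build_sample(event: ResolvedEvent, model_cutoff: date, as_of: date) -> Optional[ForecastSample]:
    """Return a leakage-free sample, or None if the event cannot be used (assumed rules)."""
    # Knowledge cutoff: an event resolved on or before the cutoff may be memorised.
    if event.resolved_on <= model_cutoff:
        return None
    # The forecast date must precede the resolution date to be a real forecast.
    if as_of >= event.resolved_on:
        return None
    # Temporal masking: keep only evidence available on the forecast date.
    context = [e.text for e in event.evidence if e.published <= as_of]
    return ForecastSample(event.question, as_of, context, event.outcome)


event = ResolvedEvent(
    question="Will the bill pass?",
    resolved_on=date(2025, 6, 1),
    outcome="yes",
    evidence=[
        Evidence(date(2025, 3, 1), "Committee hearing scheduled."),
        Evidence(date(2025, 6, 2), "Bill passed yesterday."),
    ],
)
sample = build_sample(event, model_cutoff=date(2025, 1, 1), as_of=date(2025, 4, 1))
leaked = build_sample(event, model_cutoff=date(2025, 7, 1), as_of=date(2025, 4, 1))
```

Here `sample` keeps only the March evidence, while `leaked` is `None` because the event resolved before the assumed model cutoff.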

9
№10
cs.AI arxiv:2605.03884v1

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

Pratik Honavar, Tejpratap GVSL

QKVShare introduces a framework for efficient, quantized Key-Value (KV) cache handoff between agents in on-device multi-agent LLMs. It utilizes token-level mixed-precision allocation and a self-contained "CacheCard" representation to enable faster context transfer than full re-prefill. This method significantly reduces…

9
№11
cs.AI arxiv:2605.04019v1

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

This paper introduces an AI red teaming agent built on the Dreadnode SDK to significantly accelerate vulnerability testing. The core method involves an agent that automatically constructs complex testing workflows, leveraging a large library of attacks, transforms, and scorers, based on natural language operator goals.…

9
№12
cs.AI arxiv:2605.04039v1

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa et al.

This paper introduces **SaFE-Scale**, a framework to analyze how clinical LLM safety and accuracy diverge as scaling factors (model size, context, retrieval, compute) change. The authors demonstrate that improving accuracy does not guarantee improved safety, using the new **RadSaFE-200** benchmark, which specifically targets …

9
№13
cs.AI arxiv:2605.03788v1

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

Andrea Iannoli, Lorenzo Gigli, Luca Sciullo et al.

This paper introduces an agent-enhanced LLM framework for controlling UAV swarms using natural language mission specifications. The core method involves an LLM Agent Core interacting with drones via a Model Context Protocol (MCP) gateway, which standardizes drone interfaces using Web of Things (WoT) standards. This ena…

9
№14
cs.AI arxiv:2605.03907v1

Steer Like the LLM: Activation Steering that Mimics Prompting

Geert Heyman, Frederik Vandeputte

This paper introduces Prompt Steering Replacement (PSR) models to improve activation steering by mimicking the token-specific intervention patterns of successful prompt steering. The core method involves training simpler models to estimate token-specific steering coefficients directly from activations, aiming to replic…
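For readers unfamiliar with activation steering, the sketch below shows only the generic intervention step with token-specific coefficients: each token's hidden state is shifted along a steering direction, scaled by that token's own coefficient. The coefficients are supplied by hand here; how PSR models estimate them from activations is not shown, and the function name `steer` is hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def steer(hidden: Sequence[Vector], direction: Vector, coeffs: Sequence[float]) -> List[Vector]:
    """Apply h_t' = h_t + alpha_t * v for every token position t.

    hidden:    one hidden-state vector per token
    direction: the steering vector v (same width as each hidden state)
    coeffs:    one coefficient alpha_t per token (assumed given)
    """
    if len(hidden) != len(coeffs):
        raise ValueError("need exactly one coefficient per token")
    return [
        [h_i + alpha * d_i for h_i, d_i in zip(h, direction)]
        for h, alpha in zip(hidden, coeffs)
    ]


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
direction = [0.5, -0.5]
coeffs = [0.0, 2.0, 1.0]  # token 0 is left untouched, token 1 is steered hardest
steered = steer(hidden, direction, coeffs)
print(steered)  # [[1.0, 0.0], [1.0, 0.0], [1.5, 0.5]]
```

A single global coefficient is the special case where every entry of `coeffs` is equal; token-specific values allow the intervention to be strong at some positions and absent at others.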

9
№15
cs.AI arxiv:2605.03838v1

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

Serhii Zabolotnii

TRACE is an engineering framework for trustworthy agentic AI in critical domains, featuring a four-layer architecture with a distinct split between classical ML and LLM validators. Its core contribution is a metrologically grounded trust-metric suite aligned with international standards and the introduction of the Comp…

9
№16
cs.AI arxiv:2605.03782v1

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

Haoxi Li, Qinglin Hou, Jianfei Ma et al.

This paper introduces **GLANCE**, a framework that enhances Vision-Language Model (VLM) agents' exploration in partially observable environments. GLANCE drives active exploration by generating an intrinsic curiosity signal based on the **discrepancy between the agent's linguistic world model predictions and the actual …

9
№17
cs.CL arxiv:2605.03742v1

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

Mullosharaf K. Arabov

This paper benchmarks various Parameter-Efficient Fine-Tuning (PEFT) methods, including LoRA and QLoRA, for adapting large language models to low-resource Tajik text generation. The core contribution is the creation and release of the largest open-access Tajik Web Corpus to facilitate this research. The study found tha…

9
№18
cs.AI arxiv:2605.03941v1

A Benchmark for Interactive World Models with a Unified Action Generation Framework

Jianjie Fang, Yingshan Lei, Qin Wan et al.

This paper introduces **iWorld-Bench**, a comprehensive benchmark designed to evaluate interactive world models on abilities like distance perception and memory, addressing the lack of unified evaluation standards. It features a diverse dataset of 330k video clips and a **Unified Action Generation Framework** to standa…

8
№19
cs.AI arxiv:2605.03989v1

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

Dutao Zhang, Tian Liao

This paper introduces **Experience-RAG Skill**, an agent-oriented, pluggable layer that orchestrates retrieval strategies based on the current task context and past experience. The skill dynamically selects the optimal retrieval method from a fixed pool, addressing the limitation of single, fixed pipelines in heterogen…

8
№20
cs.AI arxiv:2605.03916v1

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

Lisa C. Adams, Linus Marx, Erik Thiele Orberg et al.

This paper compares "atomic fact-checking," which breaks AI recommendations down into verifiable claims linked to source guidelines, against traditional explainability methods in a randomized trial with oncologists. The contribution is demonstrating that atomic fact-checking substantially increases …

8