arXiv — 2026-06-08 — Linnet

№01

cs.LG arxiv:2606.07367v1

Self-evolving LLM agents with in-distribution Optimization

Yudi Zhang, Meng Fang, Zhenfang Chen et al.

The paper introduces **Q-Evolve**, a self-evolving framework for LLM agents designed to overcome sparse reward challenges in long-horizon decision-making. It unifies automatic process-reward labeling and policy learning using an in-distribution reinforcement learning approach. The core method learns a stable critic fro…

10

№02

cs.AI arxiv:2606.07410v1

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Yuxiang Chen, Jun Wang

This paper comprehensively compares the mathematical reasoning steps of the DeepSeek-R1 LLM and humans on AIME 2025 problems, categorizing 10,247 steps. The core finding is a structural difference: human reasoning is compact, while the LLM exhibits "topological mimicry," frequently revisiting shallow steps without logi…

9

№03

cs.AI arxiv:2606.07462v1

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

Jiayu Wang, Weijiang Lv, Bowen Fu et al.

This paper introduces the **AARR (Act As a Real Researcher) benchmark series** to evaluate frontier LLMs and agents on the nuanced professionalism and thoroughness required in real research, moving beyond simple macro-level execution. The first installment, **AARRI-Bench**, specifically assesses agents' ability to emul…

9

№04

cs.AI arxiv:2606.07299v1

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Lingyong Yan, Can Xu, Yukun Zhao et al.

DuMate-DeepResearch is a multi-agent framework designed to overcome limitations in current Deep Research (DR) systems, specifically concerning long-horizon planning, task decomposition, and auditability. It achieves this by decoupling the Agent Core (handling planning and scheduling) from an extensible Tool Ecosystem, …

9

№05

cs.AI arxiv:2606.07489v1

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack et al.

This paper investigates how autonomous AI agents transform knowledge work by analyzing production data comparing Perplexity's Search and Computer products. The core finding is that the autonomous Computer product significantly accelerates task completion (26 minutes of automated work vs. 33 seconds of manual orchestrat…

9

№06

cs.AI arxiv:2606.07392v1

Online Pandora's Box for Contextual LLM Cascading

Alexandre Belloni, Yan Chen, Yehua Wei

This paper introduces the **Online Pandora's Box for Contextual LLM Cascading**, an adaptive framework for sequentially querying and selecting among LLM APIs based on request context. Its core method models the **contextual reservation index** directly, addressing the unique challenge where feedback is mediated by the …

9
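For readers unfamiliar with the "reservation index" named in the summary above, here is a minimal sketch of the classical, non-contextual reservation value from Weitzman's Pandora's Box problem: the value z at which the expected gain E[max(X - z, 0)] from opening a box equals its cost. This is background only, not the paper's contextual estimator; the function name, the discrete reward distribution, and the numbers are hypothetical.

```python
def reservation_index(values, probs, cost, tol=1e-9):
    """Classical Weitzman reservation value for one box.

    Solves E[max(X - z, 0)] = cost for z by bisection, where X takes
    `values` with probabilities `probs`. Assumes cost > 0.
    Illustrative background only; not the paper's contextual method.
    """
    def expected_gain(z):
        return sum(p * max(v - z, 0.0) for v, p in zip(values, probs))

    # expected_gain is non-increasing in z, so the root is bracketed by:
    lo = min(values) - cost   # gain here is at least cost
    hi = max(values)          # gain here is 0, below cost
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical box: reward 0 or 1 with equal probability, query cost 0.1.
# Solving 0.5 * (1 - z) = 0.1 by hand gives z = 0.8.
z = reservation_index([0.0, 1.0], [0.5, 0.5], 0.1)
print(round(z, 6))
```

In the classical setting, boxes are opened in decreasing order of this index and the search stops once the best observed reward exceeds every remaining index.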

№07

cs.AI arxiv:2606.07412v1

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Chuan Xiao, Zhengbo Jiao, Shaobo Wang et al.

Socratic-SWE is a closed-loop framework that enables self-evolving software engineering agents by leveraging their own historical solving traces. It distills these traces into structured "agent skills" that capture recurring failures and successful repair patterns. These skills then guide the generation of new, targete…

9

№08

cs.AI arxiv:2606.07313v1

SV-Detect: AI-generated Text Detection with Steering Vectors

Mikhail Vishnyakov, Tatiana Gaintseva

SV-Detect detects AI-generated text by extracting "steering vectors" from a frozen language model's hidden layers, which define directions separating human and machine text. The method represents inputs by their alignment with these layer-wise directions and uses a lightweight classifier on these features for detection…

9
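The pipeline described in the SV-Detect summary above (per-layer directions, alignment features, a lightweight classifier) can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' implementation: the difference-of-means direction, the cosine alignment, the threshold classifier, and the synthetic "activations" are all stand-ins chosen for brevity.

```python
import numpy as np


def steering_vectors(human_acts, machine_acts):
    """Per-layer unit direction from the human mean to the machine mean.

    Inputs have shape (n_samples, n_layers, hidden_dim).
    Difference of means is an assumed choice, not confirmed by the source.
    """
    diff = machine_acts.mean(axis=0) - human_acts.mean(axis=0)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)


def alignment_features(acts, directions):
    """Cosine similarity between each layer activation and that layer's direction."""
    unit = acts / np.linalg.norm(acts, axis=2, keepdims=True)
    return np.einsum("nld,ld->nl", unit, directions)


# Synthetic stand-in for hidden states of a frozen model (hypothetical data).
rng = np.random.default_rng(0)
n, layers, dim = 200, 4, 16
shift = np.zeros((layers, dim))
shift[:, 0] = 2.0  # machine text is offset along one axis in every layer
human = rng.normal(size=(n, layers, dim))
machine = rng.normal(size=(n, layers, dim)) + shift

dirs = steering_vectors(human, machine)
feats_h = alignment_features(human, dirs)
feats_m = alignment_features(machine, dirs)

# Stand-in "lightweight classifier": threshold the mean alignment across layers.
scores_h = feats_h.mean(axis=1)
scores_m = feats_m.mean(axis=1)
threshold = (scores_h.mean() + scores_m.mean()) / 2
accuracy = ((scores_h < threshold).sum() + (scores_m >= threshold).sum()) / (2 * n)
print(f"training-set accuracy: {accuracy:.2f}")
```

The accuracy here is measured on the same synthetic samples used to fit the directions, so it only shows that the features separate the two classes, not how well the approach generalizes.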

№09

cs.AI arxiv:2606.07237v1

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

Mahdi Alkaeed

This paper systematically evaluates the sensitivity of general and medical Large Language Models (LLMs) to prompt variations (natural and adversarial) using the MedMCQA benchmark. The core contribution is demonstrating that even minor phrasing changes significantly impact model consistency and accuracy in clinical reas…

9

№10

cs.CL arxiv:2606.07513v1

Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Xintao Wang, Sirui Zheng, Hongqiu Wu et al.

Agentopia is a comprehensive framework designed for long-term life simulation of multi-agent societies, extending simulations from days to years. The core method involves simulating 100 LLM-powered agents autonomously pursuing growth, relationships, and goals over a simulated decade. The contribution is enabling the st…

9

№11

cs.CL arxiv:2606.07402v1

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian et al.

The paper introduces **M$^3$Exam**, a novel benchmark designed to evaluate language agents' multimodal memory capabilities in realistic user-agent interactions, moving beyond sparse, human-centric data. Its core contribution is a query-centric evaluation framework that tests cross-modal grounding and implicit informati…

9

№12

cs.CL arxiv:2606.07441v1

Sycophantic Praise: Evaluating Excessive Praise in Language Models

Daniel Vennemeyer, Phan Anh Duong, Meryl Ye et al.

This paper introduces a novel framework to measure *sycophantic praise* in language models, distinguishing it from simple agreement. The method quantifies praise by comparing it against the contribution's quality and expected user ability, showing it is a distinct alignment problem. The authors demonstrate this framewo…

9

№13

cs.AI arxiv:2606.07379v1

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori et al.

This paper introduces **CapCode**, a framework for creating coding evaluation datasets where the maximum achievable *non-cheating* score is deliberately capped below perfect performance. This design allows high scores significantly exceeding the cap to serve as reliable indicators of deceptive cheating. Furthermore, th…

8

№14

cs.AI arxiv:2606.07316v1

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

Haoran Xu, Lei Zhang, Iadh Ounis et al.

This paper introduces Hierarchical Certified Semantic Commitment (H-CSC), a Byzantine Fault Tolerance (BFT)-inspired protocol designed for LLM-agent collaboration. H-CSC converts embedding-derived finality signals into one of three typed outcomes: a semantic commit, a verdict commit, or an explicit abort. Its core cont…

8

№15

cs.AI arxiv:2606.07515v1

How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni

This paper benchmarks the probabilistic reasoning of eight state-of-the-art LLMs using standard and counterintuitive dice problems. The core finding is that while models excel at standard problems (0.96 accuracy), performance significantly drops on counterintuitive tasks (0.59 accuracy) and is highly sensitive to promp…

8

№16

cs.AI arxiv:2606.07512v1

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan, Kaixiang Ji et al.

MemDreamer addresses long-video understanding by decoupling perception and reasoning using a Hierarchical Graph Memory to incrementally build semantic abstractions from streamed video. During inference, an agentic retrieval mechanism uses tool-augmented actions to navigate this memory structure, allowing the model to r…

8

№17

cs.AI arxiv:2606.07500v1

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

Fatema Siddika, Md Anwar Hossen, Tanwi Mallick et al.

This paper introduces SETA, a framework for continual learning in LLMs that addresses catastrophic forgetting by employing a Mixture of Sparse Experts architecture. SETA adaptively decomposes model parameters into task-specific experts and shared experts, isolating new knowledge while protecting common features. This s…

8

№18

cs.AI arxiv:2606.07422v1

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Yang Zhang, Xiao Fei, Amr Mohamed et al.

This paper introduces a controlled framework using real-world cultural questions to disentangle general language proficiency from localized cultural knowledge access in LLMs. By crossing question type (agnostic vs. specific) with query language (English vs. local) and employing Item Response Theory, the authors isolate…

8

№19

cs.AI arxiv:2606.07433v1

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu et al.

This paper proposes a unified framework for analyzing human-view video understanding using MLLMs, structured around three core abilities: **watching, remembering, and reasoning**. The contribution lies in providing a structured formulation to characterize how these models acquire evidence, maintain context over long vi…

8

№20

cs.LG arxiv:2606.07303v1

Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models

Jacques Raynal, Pierre Slangen, Elsa Raynal et al.

The paper introduces the **Bootstrap Theory of Representational Emergence (TBER)**, a framework explaining how new levels of representation arise in machine learning. TBER posits that representational innovation is driven not just by data or compute, but fundamentally by **explanatory insufficiency**, where existing re…

8