publications
Preprints and manuscripts on AI agents, agent harnesses, coding agents, and evaluation.
* for equal contribution, # for corresponding author.
- arXiv 2026
Reached alphaXiv Hot #1, selected by DAIR.AI Weekly Top 10 and The AI Timeline’s Top AI/ML Research Papers of the Week (7 papers total), and received 550K+ views on X.
Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.
@misc{pan2026naturallanguageagentharnesses,
  title         = {Natural-Language Agent Harnesses},
  author        = {Pan, Linyue and Zou, Lexiao and Guo, Shuo and Ni, Jingchen and Zheng, Hai-Tao},
  year          = {2026},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/abs/2603.25723}
}
- Lingyue Fu*, Xin Ding*, Linyue Pan*, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu#, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, and Yong Yu. 2026
Current evaluation for Large Language Model (LLM) code agents predominantly focuses on generating functional code in single-turn scenarios, which fails to evaluate the agent's capability for continuous code optimization and multi-turn iterative development. To bridge this gap, we introduce CATArena, a framework designed to evaluate the evolutionary capabilities of code agents via iterative tournaments. Agents engage in multi-turn tournaments and continuously refine their code through self-reflection and peer-learning based on comprehensive execution feedback. For evaluation, we propose a dual-metric system to decouple static generation proficiency from evolutionary potential. Extensive experiments reveal that an agent's evolutionary potential is not strictly correlated with its initial proficiency. Our analysis further reveals that current agents struggle to concurrently leverage both peer-learning and self-reflection for effective performance gains. Furthermore, the results validate CATArena's high extensibility and resistance to variance tasks, establishing it as a continuous and reliable standard for assessing the evolutionary capability of LLM code agents.
@misc{fu2026catarenaevaluatingevolutionarycapabilities,
  title         = {CATArena: Evaluating Evolutionary Capabilities of Code Agents via Iterative Tournaments},
  author        = {Fu, Lingyue and Ding, Xin and Pan, Linyue and Zhu, Yaoming and Zhang, Shao and Qiu, Lin and Liu, Weiwen and Zhang, Weinan and Cao, Xuezhi and Cai, Xunliang and Ding, Jiaxin and Yu, Yong},
  year          = {2026},
  archiveprefix = {arXiv},
  primaryclass  = {cs.AI},
  url           = {https://arxiv.org/abs/2510.26852}
}
- arXiv 2026
Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.
@misc{pan2026sagequantitativeevaluationsocialized,
  title         = {SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems},
  author        = {Pan, Linyue and Zhu, Yaoming and Qiu, Lin and Cao, Xuezhi and Cai, Xunliang},
  year          = {2026},
  archiveprefix = {arXiv},
  primaryclass  = {cs.AI},
  url           = {https://arxiv.org/abs/2606.03544}
}