publications
Preprints and manuscripts on AI agents, agent harnesses, coding agents, and evaluation.
2026
Natural-Language Agent Harnesses
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng
arXiv · 2026 · Under review
Reached #1 on alphaXiv Hot, selected for DAIR.AI's Weekly Top 10, and received 550K+ views on X.
Agent performance increasingly depends on \emph{harness engineering}, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce \textbf{Natural-Language Agent Harnesses} (NLAHs), which express harness behavior in editable natural language, and \textbf{Intelligent Harness Runtime} (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.
@misc{pan2026naturallanguageagentharnesses,
  title         = {Natural-Language Agent Harnesses},
  author        = {Pan, Linyue and Zou, Lexiao and Guo, Shuo and Ni, Jingchen and Zheng, Hai-Tao},
  year          = {2026},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/abs/2603.25723}
}
CATArena: Evaluating Evolutionary Capabilities of Code Agents via Iterative Tournaments
Lingyue Fu, Xin Ding, Linyue Pan, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, and Yong Yu
arXiv · 2026 · Under review
Current evaluation of Large Language Model (LLM) code agents predominantly focuses on generating functional code in single-turn scenarios, which fails to assess an agent's capability for continuous code optimization and multi-turn iterative development. To bridge this gap, we introduce CATArena, a framework designed to evaluate the evolutionary capabilities of code agents via iterative tournaments. Agents engage in multi-turn tournaments and continuously refine their code through self-reflection and peer-learning based on comprehensive execution feedback. For evaluation, we propose a dual-metric system to decouple static generation proficiency from evolutionary potential. Extensive experiments reveal that an agent's evolutionary potential is not strictly correlated with its initial proficiency. Our analysis further shows that current agents struggle to concurrently leverage both peer-learning and self-reflection for effective performance gains. Furthermore, the results validate CATArena's high extensibility and robustness to task variance, establishing it as a continuous and reliable standard for assessing the evolutionary capability of LLM code agents.
@misc{fu2026catarenaevaluatingevolutionarycapabilities,
  title         = {CATArena: Evaluating Evolutionary Capabilities of Code Agents via Iterative Tournaments},
  author        = {Fu, Lingyue and Ding, Xin and Pan, Linyue and Zhu, Yaoming and Zhang, Shao and Qiu, Lin and Liu, Weiwen and Zhang, Weinan and Cao, Xuezhi and Cai, Xunliang and Ding, Jiaxin and Yu, Yong},
  year          = {2026},
  archiveprefix = {arXiv},
  primaryclass  = {cs.AI},
  url           = {https://arxiv.org/abs/2510.26852}
}