Note
This project was originally named VeriGUI. Because our initial data collection focused on web-based tasks that primarily involve information-seeking rather than GUI interaction, we now release this part as the standalone VeriWeb benchmark; desktop and other GUI-oriented scenarios will be released as a separate benchmark (in progress). We apologize for any resulting confusion.
- Updates
- Overview
- Key Features
- Installation
- Running Agents
- Evaluation
- Project Structure
- Visualize Tool
- Citation
- Contact
- Contributors
- License
- [Oct 23, 2025] We have released the updated 302 web-based task trajectories!
- [Jul 21, 2025] We have released the first batch of 130 web-based task trajectories!
Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web-based information-seeking tasks. However, existing efforts mainly focus on single-fact retrieval and rely on outcome-only verification, thereby limiting their scalability in realistic knowledge-intensive scenarios that involve long-horizon web tasks requiring large-scale retrieval and synthesis of information from diverse sources.
In this work, we introduce VeriWeb, a novel verifiable long-chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions:
- (1) Long-chain complexity, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop reasoning;
- (2) Subtask-level verifiability, where tasks are decomposed into a sequence of interdependent verifiable subtasks. This structure enables diverse exploration strategies within each subtask while ensuring that each subtask-level answer remains fixed and verifiable.
The benchmark consists of 302 tasks across five real-world domains, each with a complete trajectory demonstration annotated by human experts. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long-horizon web tasks, highlighting the need for more powerful agentic information-seeking capabilities.
- 302 realistic information-seeking tasks across 5 real-world domains
- Long-chain web trajectories decomposed into multiple interdependent subtasks
- Tasks combine breadth-oriented search and depth-oriented search
- Agents must retrieve, reason, and synthesize evidence from diverse web pages
- Fine-grained evaluation at each intermediate subtask, not only the final outcome
- Fixed, verifiable target outputs for every subtask while supporting diverse exploration strategies
- Each subtask can serve as an independent starting point, enabling evaluation at different stages of a task
- Rich supervision signals for diagnosing failure modes
- All tasks and trajectories carefully created and annotated by human experts
- High-quality task instructions, subtask decompositions, and answer annotations
- Each task includes a complete human demonstration with detailed observation and action logs
```bash
# Only for evaluating
pip install openai tqdm

# Run agents
pip install openai tqdm camel-ai[all] browser-use
```

We provide some example agents under the `agents` directory. You can run these agents by executing the following command:
```bash
python agents/some_agent.py
```

The dataset of VeriWeb is located at `data`. The format of the dataset is described in detail in the following sections.
```json
[
{
"id": "1", // index id
"name": "V1_3", // name of the task
"type": "global", // type of the task, global or causal
"instruction": "xxxxx", // instruction for the task
"answer": "xxxxx", // expected answer for the task, in JSON format
},
......
]
```
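For reference, the tasks can be loaded with a few lines of Python. The snippet below is a minimal sketch that assumes the cleaned dataset sits at `data/data.json`, as listed in the project structure further down:

```python
import json

# Load the VeriWeb tasks (path taken from the project structure section below).
with open("data/data.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

for task in tasks:
    # Each entry carries an id, a name, a type ("global" or "causal"),
    # the task instruction, and the expected answer in JSON format.
    print(task["id"], task["name"], task["type"])
```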
The evaluation script `evaluate.py` can be used to evaluate the performance of agents using LLM-as-a-judge. It expects a JSON file in the following format:

```json
[
{
"id": "1", // index id
"name": "V1_3", // name of the task
"type": "global", // type of the task, global or causal
"instruction": "xxxxx", // instruction for the task
"answer": "xxxxx", // expected answer for the task, in JSON format
"prediction": "xxxxx", // agent's predicted result
"nsteps": 10, // number of steps taken by the agent
},
......
]
```
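One way to produce a file in this format is to run an agent over every task and record its output and step count. The sketch below is illustrative only: `run_agent` is a hypothetical placeholder for whichever agent you use (e.g., one of the scripts in `agents/`), assumed to return the predicted answer and the number of steps taken.

```python
import json

def run_agent(instruction):
    """Hypothetical placeholder: invoke your agent on the task instruction
    and return (prediction, nsteps)."""
    raise NotImplementedError

with open("data/data.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

results = []
for task in tasks:
    prediction, nsteps = run_agent(task["instruction"])
    results.append({
        "id": task["id"],
        "name": task["name"],
        "type": task["type"],
        "instruction": task["instruction"],
        "answer": task["answer"],
        "prediction": prediction,
        "nsteps": nsteps,
    })

# Write the predictions in the format expected by evaluate.py.
with open("veriWeb_prediction.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```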
With this file, you can run the evaluation script to get the performance of the agent:

```bash
python evaluate.py --input_file veriWeb_prediction.json --output_file output.json
```
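Internally, `evaluate.py` relies on LLM-as-a-judge scoring. The snippet below is only a conceptual sketch of that idea, not the actual implementation; the prompt wording, the judge model (`gpt-4o-mini` here), and the numeric scoring scheme are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(instruction, answer, prediction):
    """Ask an LLM to grade a prediction against the reference answer.
    Prompt and scoring are illustrative, not those used by evaluate.py."""
    prompt = (
        "You are grading a web information-seeking task.\n"
        f"Task instruction: {instruction}\n"
        f"Reference answer: {answer}\n"
        f"Agent prediction: {prediction}\n"
        "Reply with a single number between 0 and 1 indicating correctness."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; substitute your own
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```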
Then, you can use `calc_avg.py` to calculate the average score of the evaluation results:

```bash
python calc_avg.py --input_file output.json
```
The directory structure of the project is defined as follows:

```
agent-workflow-devkit/
├── agents/              # Agent implementations
│   ├── deepresearch.py  # Deepresearch agent example
│   ├── search.py        # Search engine agent example
│   ├── browseruse.py    # Browser-use agent example
│   └── owl.py           # Multi-agent system example
├── data/                # Dataset files
│   ├── data.json        # Cleaned data
│   └── original.json    # Original data
├── evaluated/           # Evaluation results
├── predictions/         # Model predictions
├── evaluate.py          # Evaluation script
├── batch_evaluate.py    # Batch evaluation
├── calc_avg.py          # Calculate averages
└── utils.py             # Utility functions
```
- Open VeriGUI.2077ai.org
- Select the corresponding task data folder
- View the visualization results
- Interactive event timeline visualization
- Support for various event types (MOUSE_DRAG, MOUSE_UP, TAB_CHANGE, etc.)
- Video playback synchronization
- Jump to specific actions functionality
If you find VeriWeb useful in your research, please cite our paper:
```bibtex
@article{verigui2025,
  title={VeriGUI: Verifiable Long-Chain GUI Dataset},
  author={Shunyu Liu and Minghao Liu and Huichi Zhou and Zhenyu Cui and Yang Zhou and Yuhao Zhou and Wendong Fan and Ge Zhang and Jiajun Shi and Weihao Xuan and Jiaxing Huang and Shuang Luo and Fang Wu and Heli Qi and Qingcheng Zeng and Ziqi Ren and Jialiang Gao and Jindi Lv and Junjie Wang and Aosong Feng and Heng Zhou and Wangchunshu Zhou and Zhenfei Yin and Wenlong Zhang and Guohao Li and Wenhao Yu and Irene Li and Lei Ma and Lei Bai and Qunshu Lin and Mingli Song and Dacheng Tao},
  journal={arXiv preprint arXiv:2508.04026},
  year={2025}
}
```

For questions, suggestions, or collaborations, please feel free to:
- Issues: GitHub Issues
We thank all contributors who have helped make VeriWeb possible. Special thanks to the research team and community members who provided valuable feedback and improvements.
This project is licensed under the Apache 2.0 License.
