VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking

Note

This project was originally named VeriGUI. As our initial data collection focused on web-based tasks that primarily involve information-seeking rather than GUI interaction, we now define this part as the standalone VeriWeb benchmark, while desktop and other GUI-oriented scenarios will be released as a separate benchmark (in progress). We apologize for any resulting confusion.

🌟 Updates

  • [Oct 23, 2025] 🔥 We have released an updated set of 302 web-based task trajectories!
  • [Jul 21, 2025] 🔥 We have released the first batch of 130 web-based task trajectories!

📖 Overview

Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web-based information-seeking tasks. However, existing efforts mainly focus on single-fact retrieval and rely on outcome-only verification, thereby limiting their scalability in realistic knowledge-intensive scenarios that involve long-horizon web tasks requiring large-scale retrieval and synthesis of information from diverse sources.

In this work, we introduce VeriWeb, a novel verifiable long-chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions:

  • (1) 🔗 Long-chain complexity, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop reasoning;
  • (2) ✅ Subtask-level verifiability, where tasks are decomposed into a sequence of interdependent, verifiable subtasks. This structure enables diverse exploration strategies within each subtask, while ensuring that each subtask-level answer remains fixed and verifiable.

The benchmark consists of 302 tasks across five real-world domains, each with a complete trajectory demonstration annotated by human experts. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long-horizon web tasks, highlighting the need for more powerful agentic information-seeking capabilities.

Figure: An overview of the VeriWeb benchmark across five domain-specific scenarios.

✨ Key Features

🔗 Long-Chain Complexity

  • 302 realistic information-seeking tasks across 5 real-world domains
  • Long-chain web trajectories decomposed into multiple interdependent subtasks
  • Tasks combine breadth-oriented search and depth-oriented search
  • Agents must retrieve, reason, and synthesize evidence from diverse web pages

✅ Subtask-Level Verifiability

  • Fine-grained evaluation at each intermediate subtask, not only the final outcome
  • Fixed, verifiable target outputs for every subtask while supporting diverse exploration strategies
  • Each subtask can serve as an independent starting point, enabling evaluation at different stages of a task
  • Rich supervision signals for diagnosing failure modes

🧑‍🎨 Human-Expert Annotation

  • All tasks and trajectories carefully created and annotated by human experts
  • High-quality task instructions, subtask decompositions, and answer annotations
  • Each task includes a complete human demonstration with detailed observation and action logs

🚀 Installation

# Evaluation only
pip install openai tqdm

# Running agents
pip install openai tqdm "camel-ai[all]" browser-use

🤖 Running Agents

We provide several example agents under the agents directory. You can run an agent by executing the following command:

python agents/some_agent.py
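
For example, to try the Browser-use agent example (a minimal sketch; it assumes your agent reads an OpenAI-compatible API key from the environment, which is an assumption rather than something documented here):

export OPENAI_API_KEY=your_key_here   # assumed environment variable
python agents/browseruse.py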

📊 Evaluation

The VeriWeb dataset is located in the data directory. Its format is as follows:

[
  {
    "id": "1",              // index id
    "name": "V1_3",         // name of the task
    "type": "global",       // type of the task, global or causal
    "instruction": "xxxxx", // instruction for the task
    "answer": "xxxxx",      // expected answer for the task, in JSON format
  },
  ......
]
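
As a quick reference, the dataset can be loaded with a few lines of Python (a minimal sketch; the path data/data.json comes from the project structure below, and the field names follow the format above):

import json

# Load the VeriWeb tasks (path taken from the project structure section).
with open("data/data.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

for task in tasks:
    # Each entry provides an id, name, task type, instruction, and expected answer.
    print(task["id"], task["name"], task["type"])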

The evaluation script evaluate.py evaluates agent performance using an LLM-as-a-judge. It expects a JSON file in the following format:

[
  {
    "id": "1",              // index id
    "name": "V1_3",         // name of the task
    "type": "global",       // type of the task, global or causal
    "instruction": "xxxxx", // instruction for the task
    "answer": "xxxxx",      // expected answer for the task, in JSON format
    "prediction": "xxxxx",  // agent's predicted result
    "nsteps": 10,           // number of steps taken by the agent
  },
  ......
]
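
A prediction file in this format can be assembled from the dataset plus your agent's outputs, for example (a minimal sketch; run_agent is a hypothetical placeholder for your own agent loop returning the predicted answer and the number of steps taken):

import json

def run_agent(instruction):
    # Hypothetical placeholder: call your agent here and return (prediction_text, num_steps).
    raise NotImplementedError

with open("data/data.json", "r", encoding="utf-8") as f:
    tasks = json.load(f)

results = []
for task in tasks:
    prediction, nsteps = run_agent(task["instruction"])
    results.append({**task, "prediction": prediction, "nsteps": nsteps})

# The file name matches the evaluation command shown below.
with open("veriWeb_prediction.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)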

With this file, you can run the evaluation script to get the performance of the agent:

python evaluate.py --input_file veriWeb_prediction.json --output_file output.json

Then, you can use calc_avg.py to calculate the average score of the evaluation results:

python calc_avg.py --input_file output.json

🗂️ Project Structure

The directory structure of the project is defined as follows:

VeriWeb/
├── agents/                 # Agent implementations
│   ├── deepresearch.py     # Deepresearch agent example
│   ├── search.py           # Search engine agent example
│   ├── browseruse.py       # Browser-use agent example
│   └── owl.py              # Multi-agent system example
├── data/                   # Dataset files
│   ├── data.json           # Cleaned data
│   └── original.json       # Original data
├── evaluated/              # Evaluation results
├── predictions/            # Model predictions
├── evaluate.py             # Evaluation script
├── batch_evaluate.py       # Batch evaluation
├── calc_avg.py             # Calculate averages
└── utils.py                # Utility functions

💻 Visualize Tool

Usage

  • Open VeriGUI.2077ai.org
  • Select the corresponding task data folder
  • View the visualization results

Features

  • Interactive event timeline visualization
  • Support for various event types (MOUSE_DRAG, MOUSE_UP, TAB_CHANGE, etc.)
  • Video playback synchronization
  • Jump to specific actions functionality

🎓 Citation

If you find VeriWeb useful in your research, please cite our paper:

@article{verigui2025,
  title={VeriGUI: Verifiable Long-Chain GUI Dataset},
  author={Shunyu Liu and Minghao Liu and Huichi Zhou and Zhenyu Cui and Yang Zhou and Yuhao Zhou and Wendong Fan and Ge Zhang and Jiajun Shi and Weihao Xuan and Jiaxing Huang and Shuang Luo and Fang Wu and Heli Qi and Qingcheng Zeng and Ziqi Ren and Jialiang Gao and Jindi Lv and Junjie Wang and Aosong Feng and Heng Zhou and Wangchunshu Zhou and Zhenfei Yin and Wenlong Zhang and Guohao Li and Wenhao Yu and Irene Li and Lei Ma and Lei Bai and Qunshu Lin and Mingli Song and Dacheng Tao},
  journal={arXiv preprint arXiv:2508.04026},
  year={2025}
}

📞 Contact

For questions, suggestions, or collaborations, please feel free to reach out.

👥 Contributors

We thank all contributors who have helped make VeriWeb possible. Special thanks to the research team and community members who provided valuable feedback and improvements.

📄 License

This project is licensed under the Apache 2.0 License.
