
Epic: Multi-Host Testing Support in Avocado-VT #4183

@YongxueHong

Description

Summary

This epic tracks the effort to introduce multi-host testing capabilities into the Avocado-VT framework. The goal is to extend the test scope from a single host to a multi-host cluster, enabling more complex and realistic test scenarios, such as live VM migration. This will be accomplished by creating a distributed architecture with a central Controller Node for test orchestration and multiple Worker Nodes for test execution.

Motivation

Modern virtualization products are typically deployed in complex, multi-host cluster environments to meet customer demands for high availability and features like VM migration. Customers almost always perform these critical operations on remote hosts. Currently, the Avocado-VT framework lacks native support for testing across multiple hosts. The existing workaround—simulating migration on a single local host—is insufficient as it fails to accurately replicate the real-world conditions and potential issues of a distributed environment. This feature is critical to closing that gap and ensuring our testing reflects customer deployments.

Key Goals

  • Enable Cluster Testing: Allow test cases to execute seamlessly across multiple physical hosts.
  • Decouple Control and Execution: Separate test orchestration (control plane) from task execution (data plane) for a more robust and scalable architecture.
  • Support Realistic Scenarios: Provide first-class support for testing critical multi-host features, starting with VM live migration.
  • Improve Resource Efficiency: Optimize the use of distributed resources to increase test throughput and capacity.

Proposed Architecture

The architecture is composed of a Controller Node, where the test logic runs, and one or more Worker Nodes, where the virtual machines are executed.

  • Controller Node: A single machine that runs the Avocado-VT test process. It hosts the logical modules responsible for managing the cluster (vt_cluster), providing high-level VM APIs (vt_vmm), and managing shared resources (vt_resmgr). The test code makes direct, in-process Python calls to these modules.
  • Worker Node: A remote machine that runs a lightweight vt_agent. The agent listens for commands from the Controller Node and performs actions locally, such as starting/stopping QEMU processes and managing local resources.
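
To make the Worker Node role concrete, here is a minimal sketch of what a worker-side agent could look like, using Python's standard xmlrpc.server as a stand-in transport. The actual vt_agent transport, service names, and method signatures are not defined by this epic (and a real channel would need authentication and encryption), so everything below is illustrative only.

```python
# Hypothetical sketch of a worker-side agent; the real vt_agent transport,
# service names, and signatures are not defined by this epic.
from xmlrpc.server import SimpleXMLRPCServer
import subprocess


class AgentServices:
    """Local services the controller can invoke over RPC."""

    def run_command(self, cmdline):
        # Execute a shell command on the worker and return its output/status.
        proc = subprocess.run(cmdline, shell=True, capture_output=True, text=True)
        return {"rc": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}

    def start_vm(self, qemu_cmdline):
        # Start a QEMU process locally; return its PID so the controller
        # can track the VM instance on this node.
        proc = subprocess.Popen(qemu_cmdline, shell=True)
        return {"pid": proc.pid}


if __name__ == "__main__":
    # The listen address/port would come from the cluster configuration.
    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_instance(AgentServices())
    server.serve_forever()
```

On the Controller Node, vt_cluster would hold one such connection per registered worker and relay commands to it.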

Visual Architecture

Component Diagram:
This diagram shows that the test itself runs on the Controller Node, making direct API calls to the management libraries (vt_vmm, vt_resmgr), which in turn use the core vt_cluster module to orchestrate the remote Worker Nodes.

+=================================================================+
|                      CONTROLLER NODE                            |
|                                                                 |
|   +---------------------------------------------------------+   |
|   |           Test Code (Avocado-VT Test Process)           |   |
|   +---------------------------------------------------------+   |
|                         |                                       |
|                         | (Direct, In-Process Python API Calls) |
|                         v                                       |
|   +-------------------------+      +--------------------------+ |
|   | vt_vmm                  |      | vt_resmgr                | |
|   | (VM Management Library) |      | (Resource Mgmt Library)  | |
|   +-------------------------+      +--------------------------+ |
|               |                            |                    |
|               +------------+---------------+                    |
|                            | (Internal Calls)                   |
|                            v                                    |
|   +-----------------------------------------------------------+ |
|   |                 vt_cluster (Core Controller)              | |
|   |-----------------------------------------------------------| |
|   | - Manages Agents, State, and Task Dispatch                | |
|   +-----------------------------------------------------------+ |
|                                                                 |
+=================================================================+
                          |                      |
            (Network Communication: RPC)         |
                          |                      |
       +------------------+----------------------+------------------+
       |                                         |                  |
       v                                         v                  v
+----------------------+           +----------------------+   +----------------------+
|    Worker Node 1     |           |    Worker Node 2     |   |    Worker Node N     |
|----------------------|           |----------------------|   |----------------------|
| +------------------+ |           | +------------------+ |   | +------------------+ |
| |   Worker Agent   | |           | |   Worker Agent   | |   | |   Worker Agent   | |
| |   (`vt_agent`)   | |           | |   (`vt_agent`)   | |   | |   (`vt_agent`)   | |
| +------------------+ |           | +------------------+ |   | +------------------+ |
| - Local Task Exec    |           | - Local Task Exec    |   | - Local Task Exec    |
| - VM Lifecycle Mgmt  |           | - VM Lifecycle Mgmt  |   | - VM Lifecycle Mgmt  |
+----------------------+           +----------------------+   +----------------------+

Interaction Flow:
A typical test operation, like creating a VM, would follow this flow:

  1. A test calls the vt_vmm API to create a new VM.
  2. vt_vmm requests a suitable host from the vt_cluster Controller.
  3. vt_cluster selects an available worker node and reserves the required resources.
  4. vt_vmm sends the "create VM" command to the vt_cluster Controller, targeting the selected node.
  5. vt_cluster relays the command to the vt_agent on that node.
  6. vt_agent executes the command locally to start the QEMU process.
  7. vt_agent reports the status (success/failure, VM details) back to the vt_cluster Controller.
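
The following self-contained Python sketch models steps 1-7 above with stand-in classes. The real vt_vmm, vt_cluster, and vt_agent interfaces are still to be designed, so every class and method name here is an assumption used only to illustrate the division of responsibilities.

```python
# Illustrative stand-ins for vt_vmm / vt_cluster / vt_agent; the real module
# APIs are still to be designed, so every name here is an assumption.

class AgentProxy:
    """Controller-side handle to the vt_agent RPC endpoint on one worker."""

    def __init__(self, node_name):
        self.node_name = node_name

    def call(self, service, **kwargs):
        # Would perform the actual RPC to the worker; stubbed for the sketch.
        print(f"[{self.node_name}] {service}({kwargs})")
        return {"status": "success", "node": self.node_name}


class ClusterController:
    """Tracks worker nodes and relays commands to their agents (vt_cluster)."""

    def __init__(self, nodes):
        self.agents = {name: AgentProxy(name) for name in nodes}

    def select_node(self, requirements=None):
        # Step 3: pick an available worker and reserve resources on it.
        return next(iter(self.agents))

    def dispatch(self, node, service, **kwargs):
        # Step 5: relay the command to the vt_agent on the chosen node.
        return self.agents[node].call(service, **kwargs)


class VMManager:
    """High-level VM API used directly by test code (vt_vmm)."""

    def __init__(self, cluster):
        self.cluster = cluster

    def create_vm(self, name, qemu_cmdline):
        # Steps 2-4: ask the cluster for a node, then target it.
        node = self.cluster.select_node()
        # Steps 5-7: the agent starts QEMU locally and reports status back.
        return self.cluster.dispatch(node, "start_vm", name=name,
                                     cmdline=qemu_cmdline)


# Step 1: the test calls the vt_vmm API in-process on the Controller Node.
cluster = ClusterController(["worker1", "worker2"])
vmm = VMManager(cluster)
result = vmm.create_vm("vm1", "qemu-kvm -m 2048 ...")
print(result)
```

The important property is that the test only ever talks to vt_vmm in-process; node selection and the RPC to the worker agents stay hidden behind vt_cluster.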

Implementation Plan

This project is broken down into five core modules. Each module will be developed as a distinct component with a clear set of responsibilities:

  • 1. vt_agent (Worker Agent Framework)
    • Objective: Develop a lightweight agent to run on each worker node, enabling remote command and control.
    • Key Deliverables:
      • A secure and reliable communication channel to the vt_cluster controller.
      • A set of local services for executing tasks (e.g., QMP, shell) as directed by the controller.
  • 2. vt_cluster (Cluster Management Controller)
    • Objective: Create the central controller for managing the entire cluster of worker nodes.
    • Key Deliverables:
      • Node discovery, registration, and lifecycle management.
      • A foundational API for upper-layer modules to query and allocate nodes for tasks.
  • 3. vt_vmm (Distributed Virtual Machine Manager)
    • Objective: Abstract VM operations across the cluster, providing a unified interface for managing distributed VMs.
    • Key Deliverables:
      • High-level API to create, inspect, migrate, and destroy VMs on worker nodes by interacting with vt_cluster.
      • Orchestration logic for multi-host operations, starting with VM live migration.
  • 4. vt_resmgr (Distributed Resource Manager)
    • Objective: Manage resources, such as storage and networking, across the cluster (a minimal sketch of the unified-management idea follows this list).
    • Key Deliverables:
      • A unified infrastructure for registering, allocating, and releasing resources across the cluster.
      • A service for managing NFS and local-filesystem storage resources.
      • Integration with vt_vmm to ensure tests can allocate and use distributed resources seamlessly.
  • 5. vt_imgr (Distributed Image Manager)
    • Objective: Manage images, such as QEMU disk images, across the cluster.
    • Key Deliverables:
      • A unified infrastructure for managing images across the cluster.
      • A service for managing QEMU images, including high-level APIs to create, destroy, back up, restore, and clone them.
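
As referenced in the vt_resmgr item above, the sketch below illustrates the "unified management infrastructure" idea shared by vt_resmgr and vt_imgr: a single registry that maps a resource kind to a backend handler. All class and method names are assumptions for illustration, not the planned API.

```python
# A minimal sketch of the unified-management idea behind vt_resmgr/vt_imgr:
# one registry that maps resource kinds to backend handlers. Every name here
# is illustrative, not the planned API.

class NfsBackend:
    def allocate(self, spec):
        # Would export/mount an NFS path reachable by every worker node.
        return {"kind": "nfs", "path": f"/mnt/nfs/{spec['name']}"}

    def release(self, resource):
        print(f"releasing NFS resource {resource['path']}")


class LocalFsBackend:
    def allocate(self, spec):
        # Would create a file-backed volume on a single worker node.
        return {"kind": "localfs", "path": f"/var/lib/vt/{spec['name']}.qcow2"}

    def release(self, resource):
        print(f"releasing local volume {resource['path']}")


class ResourceManager:
    """Single entry point the rest of the framework (e.g. vt_vmm) would use."""

    def __init__(self):
        self._backends = {"nfs": NfsBackend(), "localfs": LocalFsBackend()}

    def request(self, kind, **spec):
        return self._backends[kind].allocate(spec)

    def release(self, resource):
        self._backends[resource["kind"]].release(resource)


resmgr = ResourceManager()
disk = resmgr.request("nfs", name="vm1-system-disk")
print(disk)
resmgr.release(disk)
```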
