diff --git a/README.md b/README.md
index 486843469..4a7d5bab9 100644
--- a/README.md
+++ b/README.md
@@ -141,7 +141,7 @@ UFO³ introduces **Galaxy**, a revolutionary multi-device orchestration framewor
| **Task Model** | Sequential ReAct Loop | DAG-based Constellation Workflows |
| **Scope** | Single device, multi-app | Multi-device, cross-platform |
| **Coordination** | HostAgent + AppAgents | ConstellationAgent + TaskOrchestrator |
-| **Device Support** | Windows Desktop | Windows, Linux, macOS, Android, Web |
+| **Device Support** | Windows Desktop | Windows, Linux, Android (more coming) |
| **Task Planning** | Application-level | Device-level with dependencies |
| **Execution** | Sequential | Parallel DAG execution |
| **Device Agent Role** | Standalone | Can serve as Galaxy device agent |
@@ -280,23 +280,26 @@ pip install -r requirements.txt
copy config\galaxy\agent.yaml.template config\galaxy\agent.yaml
# Edit and add your API keys
-# 3. Start device agents (with platform flags)
-# Windows:
-python -m ufo.server.app --port 5000
-python -m ufo.client.client --ws --ws-server ws://localhost:5000/ws --client-id windows_device_1 --platform windows
+# 3. Configure devices
+# Edit config\galaxy\devices.yaml to register your devices
-# Linux:
-python -m ufo.server.app --port 5001
-python -m ufo.client.client --ws --ws-server ws://localhost:5001/ws --client-id linux_device_1 --platform linux
+# 4. Start device agents (with platform flags)
+# Windows: Start server + client
+# Linux: Start server + MCP servers + client
+# Mobile (Android): Start server + MCP servers + client
+# See platform-specific guides for detailed setup
-# 4. Launch Galaxy
+# 5. Launch Galaxy
python -m galaxy --interactive
```
**📖 Complete Guide:**
- [Galaxy README](./galaxy/README.md) – Architecture & concepts
- [Online Quick Start](https://microsoft.github.io/UFO/getting_started/quick_start_galaxy/) – Step-by-step tutorial
-- [Configuration](https://microsoft.github.io/UFO/configuration/system/galaxy_devices/) – Device setup
+- [Windows Device Setup](https://microsoft.github.io/UFO/getting_started/quick_start_ufo2/) – Windows agent (UFO²) setup
+- [Linux Device Setup](https://microsoft.github.io/UFO/getting_started/quick_start_linux/) – Linux agent setup
+- [Mobile Device Setup](https://microsoft.github.io/UFO/getting_started/quick_start_mobile/) – Android agent setup
+- [Configuration](https://microsoft.github.io/UFO/configuration/system/galaxy_devices/) – Device pool configuration
diff --git a/README_ZH.md b/README_ZH.md
index 07b6969a1..dfb59730a 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -136,7 +136,7 @@ UFO³ 引入了 **Galaxy**,这是一个革命性的多设备编排框架,可
| **任务模型** | 顺序 ReAct 循环 | 基于 DAG 的星座工作流 |
| **范围** | 单设备,多应用 | 多设备,跨平台 |
| **协调** | HostAgent + AppAgents | ConstellationAgent + TaskOrchestrator |
-| **设备支持** | Windows 桌面 | Windows、Linux、macOS、Android、Web |
+| **设备支持** | Windows 桌面 | Windows、Linux、Android(更多平台即将推出) |
| **任务规划** | 应用程序级别 | 设备级别,带依赖关系 |
| **执行** | 顺序 | 并行 DAG 执行 |
| **设备智能体角色** | 独立 | 可作为 Galaxy 设备智能体 |
@@ -268,30 +268,33 @@ UFO² 扮演双重角色:**独立 Windows 自动化**和 Windows 平台的 **G
**用于跨设备编排**
```powershell
-# 1. 安装
+# 1. 安装依赖
pip install -r requirements.txt
# 2. 配置 ConstellationAgent
copy config\galaxy\agent.yaml.template config\galaxy\agent.yaml
-# 编辑并添加您的 API 密钥
+# 编辑配置文件,添加 API Key
-# 3. 启动设备智能体(带平台标志)
-# Windows:
-python -m ufo.server.app --port 5000
-python -m ufo.client.client --ws --ws-server ws://localhost:5000/ws --client-id windows_device_1 --platform windows
+# 3. 配置设备
+# 编辑 config\galaxy\devices.yaml 注册您的设备
-# Linux:
-python -m ufo.server.app --port 5001
-python -m ufo.client.client --ws --ws-server ws://localhost:5001/ws --client-id linux_device_1 --platform linux
+# 4. 启动设备智能体(带平台标志)
+# Windows: 启动服务器 + 客户端
+# Linux: 启动服务器 + MCP 服务器 + 客户端
+# Mobile (Android): 启动服务器 + MCP 服务器 + 客户端
+# 请参阅特定平台指南了解详细设置
-# 4. 启动 Galaxy
+# 5. 启动 Galaxy
python -m galaxy --interactive
```
**📖 完整指南:**
- [Galaxy 中文文档](./galaxy/README_ZH.md) – 架构和概念
- [在线快速入门](https://microsoft.github.io/UFO/getting_started/quick_start_galaxy/) – 分步教程
-- [配置](https://microsoft.github.io/UFO/configuration/system/galaxy_devices/) – 设备设置
+- [Windows 设备设置](https://microsoft.github.io/UFO/getting_started/quick_start_ufo2/)
+- [Linux 设备设置](https://microsoft.github.io/UFO/getting_started/quick_start_linux/)
+- [Mobile 设备设置](https://microsoft.github.io/UFO/getting_started/quick_start_mobile/) – Android 智能体设置
+- [配置](https://microsoft.github.io/UFO/configuration/system/galaxy_devices/) – 设备池配置
|
diff --git a/aip/messages.py b/aip/messages.py
index af739fcc4..97a76f19a 100644
--- a/aip/messages.py
+++ b/aip/messages.py
@@ -474,3 +474,82 @@ def validate_server_message(msg: ServerMessage) -> bool:
return False
return True
+
+
+# ============================================================================
+# Binary Transfer Message Types (New Feature)
+# ============================================================================
+
+
+class BinaryMetadata(BaseModel):
+ """
+ Metadata for binary data transfer.
+
+ This metadata is sent as a text frame before the actual binary data,
+ allowing receivers to prepare for and validate incoming binary transfers.
+ """
+
+ type: Literal["binary_data"] = "binary_data"
+ filename: Optional[str] = None
+ mime_type: Optional[str] = None
+ size: int = Field(..., description="Size of binary data in bytes")
+ checksum: Optional[str] = Field(
+ None, description="MD5 or SHA256 checksum for validation"
+ )
+ session_id: Optional[str] = None
+ description: Optional[str] = None
+ timestamp: Optional[str] = None
+ # Allow additional custom fields
+ model_config = ConfigDict(extra="allow")
+
+
+class FileTransferStart(BaseModel):
+ """
+ Message to initiate a chunked file transfer.
+
+ Sent before sending file chunks to inform the receiver about
+ the file details and transfer parameters.
+ """
+
+ type: Literal["file_transfer_start"] = "file_transfer_start"
+ filename: str = Field(..., description="Name of file being transferred")
+ size: int = Field(..., description="Total file size in bytes")
+ chunk_size: int = Field(..., description="Size of each chunk in bytes")
+ total_chunks: int = Field(..., description="Total number of chunks")
+ mime_type: Optional[str] = Field(None, description="MIME type of file")
+ session_id: Optional[str] = None
+ description: Optional[str] = None
+ # Allow additional custom fields
+ model_config = ConfigDict(extra="allow")
+
+
+class FileTransferComplete(BaseModel):
+ """
+ Message to signal completion of a chunked file transfer.
+
+ Sent after all file chunks have been transmitted, includes
+ checksum for validation.
+ """
+
+ type: Literal["file_transfer_complete"] = "file_transfer_complete"
+ filename: str = Field(..., description="Name of transferred file")
+ total_chunks: int = Field(..., description="Total chunks sent")
+ checksum: Optional[str] = Field(None, description="MD5 checksum of complete file")
+ session_id: Optional[str] = None
+ # Allow additional custom fields
+ model_config = ConfigDict(extra="allow")
+
+
+class ChunkMetadata(BaseModel):
+ """
+ Metadata for a single file chunk.
+
+ Sent with each chunk during chunked file transfer to track
+ chunk sequence and validate chunk integrity.
+ """
+
+ chunk_num: int = Field(..., description="Chunk sequence number (0-indexed)")
+ chunk_size: int = Field(..., description="Size of this chunk in bytes")
+ checksum: Optional[str] = Field(None, description="Checksum of this chunk")
+ # Allow additional custom fields
+ model_config = ConfigDict(extra="allow")
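The `BinaryMetadata` model above describes the text frame that precedes a binary frame. The following is a minimal, stdlib-only sketch of how such a metadata frame could be built and checked. The names `build_frames` and `check_frames` are hypothetical and not part of this change, and the choice of SHA-256 for the optional `checksum` field is an assumption (the model's description allows MD5 or SHA256).

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


def build_frames(
    data: bytes, metadata: Optional[Dict[str, Any]] = None
) -> Tuple[str, bytes]:
    """Return (metadata JSON for the text frame, payload for the binary frame)."""
    meta = dict(metadata or {})  # copy, so the caller's dict is left untouched
    meta.update(
        {
            "type": "binary_data",
            "size": len(data),
            "checksum": hashlib.sha256(data).hexdigest(),  # assumed algorithm
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    return json.dumps(meta), data


def check_frames(meta_json: str, data: bytes) -> Dict[str, Any]:
    """Validate a received payload against its metadata frame."""
    meta = json.loads(meta_json)
    if meta.get("type") != "binary_data":
        raise ValueError(f"Expected binary_data, got: {meta.get('type')}")
    if meta.get("size") != len(data):
        raise ValueError(
            f"Size mismatch: expected {meta.get('size')} bytes, got {len(data)} bytes"
        )
    checksum = meta.get("checksum")
    if checksum is not None and checksum != hashlib.sha256(data).hexdigest():
        raise ValueError("Checksum mismatch")
    return meta
```

Copying the caller's dictionary before calling `update()` keeps the helper free of side effects, which matters when the same metadata dict is reused across several sends.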
diff --git a/aip/protocol/base.py b/aip/protocol/base.py
index dbf4a3953..90dd97abf 100644
--- a/aip/protocol/base.py
+++ b/aip/protocol/base.py
@@ -232,6 +232,316 @@ async def close(self) -> None:
"""Close protocol and transport."""
await self.transport.close()
+ # ========================================================================
+ # Binary Message Handling (New Feature)
+ # ========================================================================
+
+ async def send_binary_message(
+ self, data: bytes, metadata: Optional[Dict[str, Any]] = None
+ ) -> None:
+ """
+ Send a binary message with optional metadata.
+
+ Uses a two-frame approach for structured binary transfers:
+ 1. Text frame with JSON metadata (filename, size, mime_type, checksum, etc.)
+ 2. Binary frame with actual file data
+
+ This approach allows receivers to prepare for incoming binary data
+ and validate it after reception.
+
+ :param data: Binary data to send (image, file, etc.)
+ :param metadata: Optional metadata dict with fields like:
+ - filename: str
+ - mime_type: str (e.g., "image/png", "application/pdf")
+ - size: int (will be auto-filled)
+ - checksum: str (optional, for validation)
+ - session_id: str (optional)
+ - custom fields as needed
+
+ :raises: ConnectionError if transport not connected
+ :raises: IOError if send fails
+
+ Example:
+ # Send an image with metadata
+ with open("screenshot.png", "rb") as f:
+ image_data = f.read()
+
+ await protocol.send_binary_message(
+ data=image_data,
+ metadata={
+ "filename": "screenshot.png",
+ "mime_type": "image/png",
+ "description": "Desktop screenshot"
+ }
+ )
+ """
+ import datetime
+ import json
+
+ try:
+ # 1. Prepare and send metadata as text frame
+ # Copy so that update() below does not mutate the caller's dict
+ meta = dict(metadata or {})
+ meta.update(
+ {
+ "type": "binary_data",
+ "size": len(data),
+ "timestamp": datetime.datetime.now(
+ datetime.timezone.utc
+ ).isoformat(),
+ }
+ )
+
+ meta_json = json.dumps(meta)
+ await self.transport.send(meta_json.encode("utf-8"))
+ self.logger.debug(f"Sent binary metadata: {meta}")
+
+ # 2. Send actual data as binary frame
+ await self.transport.send_binary(data)
+ self.logger.debug(f"Sent {len(data)} bytes of binary data")
+
+ except Exception as e:
+ self.logger.error(f"Error sending binary message: {e}")
+ raise
+
+ async def receive_binary_message(
+ self, validate_size: bool = True
+ ) -> tuple[bytes, Dict[str, Any]]:
+ """
+ Receive a binary message with metadata.
+
+ Expects a two-frame sequence:
+ 1. Text frame with JSON metadata
+ 2. Binary frame with actual data
+
+ :param validate_size: If True, validates received size matches metadata
+ :return: Tuple of (binary_data, metadata_dict)
+ :raises: ConnectionError if connection closed
+ :raises: IOError if receive fails
+ :raises: ValueError if size validation fails
+
+ Example:
+ # Receive a binary file
+ data, metadata = await protocol.receive_binary_message()
+
+ filename = metadata.get("filename", "received_file.bin")
+ with open(filename, "wb") as f:
+ f.write(data)
+
+ print(f"Received: {filename} ({len(data)} bytes)")
+ """
+ import json
+
+ try:
+ # 1. Receive metadata as text frame
+ meta_bytes = await self.transport.receive()
+ meta = json.loads(meta_bytes.decode("utf-8"))
+ self.logger.debug(f"Received binary metadata: {meta}")
+
+ # Validate metadata type
+ if meta.get("type") != "binary_data":
+ self.logger.warning(
+ f"Expected binary_data message, got: {meta.get('type')}"
+ )
+
+ # 2. Receive actual binary data
+ data = await self.transport.receive_binary()
+ self.logger.debug(f"Received {len(data)} bytes of binary data")
+
+ # 3. Validate size if requested
+ if validate_size and "size" in meta:
+ expected_size = meta["size"]
+ actual_size = len(data)
+ if actual_size != expected_size:
+ error_msg = (
+ f"Size mismatch: expected {expected_size} bytes, "
+ f"got {actual_size} bytes"
+ )
+ self.logger.error(error_msg)
+ raise ValueError(error_msg)
+
+ return data, meta
+
+ except Exception as e:
+ self.logger.error(f"Error receiving binary message: {e}")
+ raise
+
+ async def send_file(
+ self,
+ file_path: str,
+ chunk_size: int = 1024 * 1024, # 1MB chunks
+ compute_checksum: bool = True,
+ ) -> None:
+ """
+ Send a file in chunks (for large files).
+
+ Sends large files by splitting them into chunks and sending
+ a completion message with checksum for validation.
+
+ Protocol:
+ 1. Send file_transfer_start message (text frame)
+ 2. Send file chunks as binary messages
+ 3. Send file_transfer_complete message with checksum (text frame)
+
+ :param file_path: Path to file to send
+ :param chunk_size: Size of each chunk in bytes (default: 1MB)
+ :param compute_checksum: If True, computes and sends MD5 checksum
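+ :raises: ValueError if chunk_size is not positive
+ :raises: FileNotFoundError if file doesn't exist
+ :raises: IOError if send fails
+
+ Example:
+ # Send a large video file
+ await protocol.send_file(
+ "video.mp4",
+ chunk_size=2 * 1024 * 1024 # 2MB chunks
+ )
+ """
+ import hashlib
+ import json
+ import mimetypes
+ import os
+
+ # A zero chunk size would divide by zero below; a negative one would
+ # make f.read() return the whole file as a single chunk
+ if chunk_size <= 0:
+ raise ValueError(f"chunk_size must be positive, got {chunk_size}")
+
+ if not os.path.exists(file_path):
+ raise FileNotFoundError(f"File not found: {file_path}")
+
+ file_size = os.path.getsize(file_path)
+ file_name = os.path.basename(file_path)
+ # Ceiling division: a final partial chunk still counts as one chunk
+ total_chunks = (file_size + chunk_size - 1) // chunk_size
+
+ # Detect MIME type (None if it cannot be guessed from the file name)
+ mime_type, _ = mimetypes.guess_type(file_path)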
+
+ # Send file header (as JSON string)
+ header_msg = {
+ "type": "file_transfer_start",
+ "filename": file_name,
+ "size": file_size,
+ "chunk_size": chunk_size,
+ "total_chunks": total_chunks,
+ "mime_type": mime_type,
+ }
+ await self.transport.send(json.dumps(header_msg).encode("utf-8"))
+
+ # Send file in chunks
+ md5_hash = hashlib.md5() if compute_checksum else None
+
+ with open(file_path, "rb") as f:
+ chunk_num = 0
+
+ while True:
+ chunk = f.read(chunk_size)
+ if not chunk:
+ break
+
+ if md5_hash:
+ md5_hash.update(chunk)
+
+ await self.send_binary_message(
+ chunk, {"chunk_num": chunk_num, "chunk_size": len(chunk)}
+ )
+
+ chunk_num += 1
+ self.logger.info(f"Sent chunk {chunk_num}/{total_chunks}")
+
+ # Send completion with checksum (as JSON string)
+ completion_msg = {
+ "type": "file_transfer_complete",
+ "filename": file_name,
+ "total_chunks": chunk_num,
+ }
+
+ if md5_hash:
+ completion_msg["checksum"] = md5_hash.hexdigest()
+
+ await self.transport.send(json.dumps(completion_msg).encode("utf-8"))
+ self.logger.info(f"File transfer complete: {file_name}")
+
+ async def receive_file(
+ self, output_path: str, validate_checksum: bool = True
+ ) -> Dict[str, Any]:
+ """
+ Receive a file that was sent in chunks.
+
+ Receives a chunked file transfer and writes to the specified path.
+ Validates checksum if provided.
+
+ :param output_path: Path where received file should be saved
+ :param validate_checksum: If True, validates MD5 checksum
+ :return: Dictionary with transfer metadata (filename, size, checksum, etc.)
+ :raises: IOError if receive fails
+ :raises: ValueError if checksum validation fails
+
+ Example:
+ # Receive a file
+ metadata = await protocol.receive_file("downloads/received_video.mp4")
+ print(f"Received: {metadata['filename']} ({metadata['size']} bytes)")
+ """
+ import hashlib
+ import json
+ import os
+
+ # 1. Receive file header
+ header_bytes = await self.transport.receive()
+ header = json.loads(header_bytes.decode("utf-8"))
+
+ if header.get("type") != "file_transfer_start":
+ raise ValueError(f"Expected file_transfer_start, got: {header.get('type')}")
+
+ filename = header["filename"]
+ total_size = header["size"]
+ total_chunks = header["total_chunks"]
+
+ self.logger.info(
+ f"Receiving file: {filename} ({total_size} bytes, {total_chunks} chunks)"
+ )
+
+ # 2. Receive chunks and write to file
+ md5_hash = hashlib.md5() if validate_checksum else None
+ os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
+
+ with open(output_path, "wb") as f:
+ for chunk_num in range(total_chunks):
+ data, chunk_meta = await self.receive_binary_message()
+
+ if md5_hash:
+ md5_hash.update(data)
+
+ f.write(data)
+ self.logger.info(f"Received chunk {chunk_num + 1}/{total_chunks}")
+
+ # 3. Receive completion message
+ completion_bytes = await self.transport.receive()
+ completion = json.loads(completion_bytes.decode("utf-8"))
+
+ if completion.get("type") != "file_transfer_complete":
+ raise ValueError(
+ f"Expected file_transfer_complete, got: {completion.get('type')}"
+ )
+
+ # 4. Validate checksum
+ if validate_checksum and "checksum" in completion:
+ expected_checksum = completion["checksum"]
+ actual_checksum = md5_hash.hexdigest()
+
+ if actual_checksum != expected_checksum:
+ error_msg = (
+ f"Checksum mismatch: expected {expected_checksum}, "
+ f"got {actual_checksum}"
+ )
+ self.logger.error(error_msg)
+ raise ValueError(error_msg)
+
+ self.logger.info(f"Checksum validated: {actual_checksum}")
+
+ self.logger.info(f"File received successfully: {output_path}")
+
+ return {
+ "filename": filename,
+ "size": total_size,
+ "output_path": output_path,
+ "checksum": completion.get("checksum"),
+ }
+
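The chunked protocol in `send_file` and `receive_file` (start message, metadata-plus-chunk pairs, completion message with MD5) can be exercised without a network. The sketch below is an illustration only: it replaces the transport with an in-memory `asyncio.Queue`, and `send_chunked`, `receive_chunked`, and `demo` are hypothetical names that do not exist in the codebase.

```python
import asyncio
import hashlib
import json


async def send_chunked(
    queue: asyncio.Queue, name: str, payload: bytes, chunk_size: int
) -> None:
    """Emit a start frame, (metadata, chunk) frame pairs, then a completion frame."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")
    total_chunks = (len(payload) + chunk_size - 1) // chunk_size  # ceiling division
    await queue.put(json.dumps({
        "type": "file_transfer_start", "filename": name, "size": len(payload),
        "chunk_size": chunk_size, "total_chunks": total_chunks,
    }))
    md5 = hashlib.md5()
    for num in range(total_chunks):
        chunk = payload[num * chunk_size:(num + 1) * chunk_size]
        md5.update(chunk)
        # Two frames per chunk: JSON metadata (text), then the raw bytes (binary)
        await queue.put(json.dumps(
            {"type": "binary_data", "chunk_num": num, "size": len(chunk)}
        ))
        await queue.put(chunk)
    await queue.put(json.dumps({
        "type": "file_transfer_complete", "filename": name,
        "total_chunks": total_chunks, "checksum": md5.hexdigest(),
    }))


async def receive_chunked(queue: asyncio.Queue) -> bytes:
    """Reassemble the payload and verify per-chunk sizes and the final checksum."""
    header = json.loads(await queue.get())
    if header.get("type") != "file_transfer_start":
        raise ValueError(f"Expected file_transfer_start, got: {header.get('type')}")
    md5 = hashlib.md5()
    parts = []
    for _ in range(header["total_chunks"]):
        meta = json.loads(await queue.get())
        chunk = await queue.get()
        if meta["size"] != len(chunk):
            raise ValueError("Chunk size mismatch")
        md5.update(chunk)
        parts.append(chunk)
    done = json.loads(await queue.get())
    if done.get("type") != "file_transfer_complete":
        raise ValueError(f"Expected file_transfer_complete, got: {done.get('type')}")
    if done.get("checksum") != md5.hexdigest():
        raise ValueError("Checksum mismatch")
    data = b"".join(parts)
    if len(data) != header["size"]:
        raise ValueError("Total size mismatch")
    return data


async def demo() -> bytes:
    queue: asyncio.Queue = asyncio.Queue()  # unbounded, so sending never blocks
    payload = bytes(range(256)) * 10  # 2560 bytes, sent as 3 chunks of <= 1000
    await send_chunked(queue, "demo.bin", payload, chunk_size=1000)
    return await receive_chunked(queue)
```

Unlike `receive_file`, this sketch also compares the reassembled length with the `size` announced in the start message, which catches a truncated transfer even when no checksum is sent.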
class ProtocolMiddleware(ABC):
"""
diff --git a/aip/transport/adapters.py b/aip/transport/adapters.py
index 3ca9c36fa..350aada10 100644
--- a/aip/transport/adapters.py
+++ b/aip/transport/adapters.py
@@ -8,9 +8,12 @@
Uses the Adapter pattern to abstract away differences between:
- FastAPI WebSocket (server-side)
- websockets library (client-side)
+
+Supports both text and binary frame transmission for efficient file transfer.
"""
from abc import ABC, abstractmethod
+from typing import Union
from websockets import WebSocketClientProtocol
@@ -20,6 +23,7 @@ class WebSocketAdapter(ABC):
Abstract adapter for WebSocket operations.
Provides a consistent interface regardless of the underlying WebSocket implementation.
+ Supports both text frames (for JSON messages) and binary frames (for file transfer).
"""
@abstractmethod
@@ -42,6 +46,45 @@ async def receive(self) -> str:
"""
pass
+ @abstractmethod
+ async def send_bytes(self, data: bytes) -> None:
+ """
+ Send binary data through WebSocket.
+
+ Sends data as a binary WebSocket frame for efficient transmission
+ of images, files, and other binary content.
+
+ :param data: Binary data to send
+ :raises: Exception if send fails
+ """
+ pass
+
+ @abstractmethod
+ async def receive_bytes(self) -> bytes:
+ """
+ Receive binary data from WebSocket.
+
+ Expects a binary WebSocket frame. Raises an error if a text frame is received.
+
+ :return: Received binary data
+ :raises: ValueError if a text frame is received instead of binary
+ :raises: Exception if receive fails
+ """
+ pass
+
+ @abstractmethod
+ async def receive_auto(self) -> Union[str, bytes]:
+ """
+ Receive data and auto-detect frame type (text or binary).
+
+ This method automatically detects whether the received WebSocket frame
+ is text or binary and returns the appropriate type.
+
+ :return: Received data (str for text frames, bytes for binary frames)
+ :raises: Exception if receive fails
+ """
+ pass
+
@abstractmethod
async def close(self) -> None:
"""
@@ -64,6 +107,7 @@ class FastAPIWebSocketAdapter(WebSocketAdapter):
Adapter for FastAPI/Starlette WebSocket (server-side).
Used when the server accepts WebSocket connections from clients.
+ Supports both text and binary frame transmission.
"""
def __init__(self, websocket):
@@ -84,6 +128,38 @@ async def receive(self) -> str:
"""Receive text data via FastAPI WebSocket."""
return await self._ws.receive_text()
+ async def send_bytes(self, data: bytes) -> None:
+ """
+ Send binary data via FastAPI WebSocket.
+
+ FastAPI provides native send_bytes() method for binary frames.
+ """
+ await self._ws.send_bytes(data)
+
+ async def receive_bytes(self) -> bytes:
+ """
+ Receive binary data via FastAPI WebSocket.
+
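+ FastAPI provides native receive_bytes() method.
+ Raises ValueError if a text frame is received.
+ """
+ try:
+ return await self._ws.receive_bytes()
+ except KeyError as e:
+ # Starlette may raise KeyError when the frame has no "bytes" key;
+ # convert it to the ValueError promised by WebSocketAdapter
+ raise ValueError(
+ "Expected binary WebSocket frame, but received text frame."
+ ) from e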
+
+ async def receive_auto(self) -> Union[str, bytes]:
+ """
+ Auto-detect and receive text or binary data.
+
+ Uses FastAPI's receive() to get the raw message and extract
+ the appropriate data type.
+ """
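+ message = await self._ws.receive()
+ # A disconnect event carries neither "text" nor "bytes"
+ if message.get("type") == "websocket.disconnect":
+ raise ConnectionError(
+ f"WebSocket disconnected (code={message.get('code')})"
+ )
+ # ASGI allows either key to be present with a None value
+ if message.get("text") is not None:
+ return message["text"]
+ if message.get("bytes") is not None:
+ return message["bytes"]
+ raise ValueError(f"Unknown WebSocket message type: {message}")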
+
async def close(self) -> None:
"""Close FastAPI WebSocket connection."""
await self._ws.close()
@@ -100,6 +176,7 @@ class WebSocketsLibAdapter(WebSocketAdapter):
Adapter for websockets library (client-side).
Used when the client connects to a WebSocket server.
+ Supports both text and binary frame transmission.
"""
def __init__(self, websocket: WebSocketClientProtocol):
@@ -122,6 +199,38 @@ async def receive(self) -> str:
return received.decode("utf-8")
return received
+ async def send_bytes(self, data: bytes) -> None:
+ """
+ Send binary data via websockets library.
+
+ The websockets library automatically detects bytes type and sends
+ as a binary WebSocket frame.
+ """
+ await self._ws.send(data)
+
+ async def receive_bytes(self) -> bytes:
+ """
+ Receive binary data via websockets library.
+
+ Raises ValueError if a text frame is received instead of binary.
+ """
+ received = await self._ws.recv()
+ if isinstance(received, str):
+ raise ValueError(
+ "Expected binary WebSocket frame, but received text frame. "
+ f"Received data: {received[:100]}..."
+ )
+ return received
+
+ async def receive_auto(self) -> Union[str, bytes]:
+ """
+ Auto-detect and receive text or binary data.
+
+ The websockets library's recv() automatically returns the correct type
+ (str for text frames, bytes for binary frames).
+ """
+ return await self._ws.recv()
+
async def close(self) -> None:
"""Close websockets library connection."""
await self._ws.close()
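Frame-type auto-detection on the server side comes down to inspecting the ASGI receive event. As a hedged illustration, the standalone function below (the name `extract_payload` is hypothetical) shows one way to do that on a plain dict, assuming the event shapes defined by the ASGI WebSocket specification: `websocket.receive` with optional `text` and `bytes` keys, and `websocket.disconnect` with a `code`.

```python
from typing import Any, Dict, Union


def extract_payload(message: Dict[str, Any]) -> Union[str, bytes]:
    """Return the text or binary payload of an ASGI WebSocket receive event."""
    if message.get("type") == "websocket.disconnect":
        raise ConnectionError(f"WebSocket disconnected (code={message.get('code')})")
    # A missing key and a key set to None mean the same thing in ASGI,
    # so test for None rather than for key presence.
    text = message.get("text")
    if text is not None:
        return text
    data = message.get("bytes")
    if data is not None:
        return data
    raise ValueError(f"Unknown WebSocket message type: {message}")
```

Callers can then branch on `isinstance(payload, bytes)` in the same way the `receive_auto` docstring example does.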
diff --git a/aip/transport/websocket.py b/aip/transport/websocket.py
index b77065a16..42f048667 100644
--- a/aip/transport/websocket.py
+++ b/aip/transport/websocket.py
@@ -6,11 +6,12 @@
Implements the Transport interface using WebSockets.
Provides reliable, bidirectional, full-duplex communication over a single TCP connection.
+Supports both text frames (for JSON messages) and binary frames (for efficient file transfer).
"""
import asyncio
import logging
-from typing import Optional
+from typing import Optional, Union
import websockets
from websockets import WebSocketClientProtocol
@@ -29,12 +30,22 @@ class WebSocketTransport(Transport):
- Configurable timeouts
- Large message support (up to 100MB by default)
- Graceful connection shutdown
+ - Text and binary frame support for efficient data transfer
Usage:
+ # Text messages (JSON)
transport = WebSocketTransport(ping_interval=30, ping_timeout=180)
await transport.connect("ws://localhost:8000/ws")
await transport.send(b"Hello")
data = await transport.receive()
+
+ # Binary data (files, images)
+ await transport.send_binary(image_bytes)
+ binary_data = await transport.receive_binary()
+
+ # Auto-detect frame type
+ data = await transport.receive_auto() # Returns str or bytes
+
await transport.close()
"""
@@ -234,6 +245,167 @@ async def wait_closed(self) -> None:
await self._ws.wait_closed()
self._state = TransportState.DISCONNECTED
+ async def send_binary(self, data: bytes) -> None:
+ """
+ Send binary data through WebSocket as a binary frame.
+
+ This method sends raw binary data (images, files, etc.) without
+ text encoding overhead, providing maximum efficiency for binary transfers.
+
+ :param data: Binary bytes to send
+ :raises: ConnectionError if not connected
+ :raises: IOError if send fails
+
+ Example:
+ # Send an image file
+ with open("screenshot.png", "rb") as f:
+ image_data = f.read()
+ await transport.send_binary(image_data)
+ """
+ if not self.is_connected or self._adapter is None:
+ raise ConnectionError("Transport not connected")
+
+ if not self._adapter.is_open():
+ self._state = TransportState.DISCONNECTED
+ raise ConnectionError("WebSocket connection is closed")
+
+ try:
+ adapter_type = type(self._adapter).__name__
+ self.logger.debug(
+ f"Sending {len(data)} bytes (binary frame) via {adapter_type}"
+ )
+
+ await self._adapter.send_bytes(data)
+
+ self.logger.debug(f"✅ Sent {len(data)} bytes successfully")
+ except ConnectionClosed as e:
+ self._state = TransportState.DISCONNECTED
+ self.logger.debug(f"Connection closed during binary send: {e}")
+ raise ConnectionError(f"Connection closed: {e}") from e
+ except (ConnectionError, OSError) as e:
+ self._state = TransportState.ERROR
+ error_msg = str(e).lower()
+ if "closed" in error_msg or "not connected" in error_msg:
+ self.logger.debug(f"Cannot send binary (connection closed): {e}")
+ else:
+ self.logger.warning(f"Connection error sending binary data: {e}")
+ raise IOError(f"Failed to send binary data: {e}") from e
+ except Exception as e:
+ self._state = TransportState.ERROR
+ self.logger.error(f"Error sending binary data: {e}")
+ raise IOError(f"Failed to send binary data: {e}") from e
+
+ async def receive_binary(self) -> bytes:
+ """
+ Receive binary data from WebSocket as a binary frame.
+
+ This method expects a binary WebSocket frame and returns raw bytes.
+ Raises an error if a text frame is received.
+
+ :return: Received binary bytes
+ :raises: ConnectionError if connection closed
+ :raises: ValueError if a text frame is received instead of binary
+ :raises: IOError if receive fails
+
+ Example:
+ # Receive a binary file
+ file_data = await transport.receive_binary()
+ with open("received_file.bin", "wb") as f:
+ f.write(file_data)
+ """
+ if not self.is_connected or self._adapter is None:
+ raise ConnectionError("Transport not connected")
+
+ try:
+ adapter_type = type(self._adapter).__name__
+ self.logger.debug(
+ f"🔍 Attempting to receive binary data via {adapter_type}..."
+ )
+
+ data = await self._adapter.receive_bytes()
+
+ self.logger.debug(f"✅ Received {len(data)} bytes successfully")
+ return data
+ except ConnectionClosed as e:
+ self._state = TransportState.DISCONNECTED
+ self.logger.debug(f"Connection closed during binary receive: {e}")
+ raise ConnectionError(f"Connection closed: {e}") from e
+ except ValueError as e:
+ # Raised when expecting binary but got text frame
+ self.logger.error(f"Frame type mismatch: {e}")
+ raise
+ except (ConnectionError, OSError) as e:
+ self._state = TransportState.ERROR
+ error_msg = str(e).lower()
+ if "closed" in error_msg or "not connected" in error_msg:
+ self.logger.debug(f"Cannot receive binary (connection closed): {e}")
+ else:
+ self.logger.warning(f"Connection error receiving binary data: {e}")
+ raise IOError(f"Failed to receive binary data: {e}") from e
+ except Exception as e:
+ self._state = TransportState.ERROR
+ self.logger.error(f"Error receiving binary data: {e}")
+ raise IOError(f"Failed to receive binary data: {e}") from e
+
+ async def receive_auto(self) -> Union[bytes, str]:
+ """
+ Receive data and automatically detect frame type (text or binary).
+
+ This method receives a WebSocket frame and returns the appropriate type:
+ - str for text frames (JSON messages)
+ - bytes for binary frames (files, images)
+
+ :return: Received data (str for text frames, bytes for binary frames)
+ :raises: ConnectionError if connection closed
+ :raises: IOError if receive fails
+
+ Example:
+ data = await transport.receive_auto()
+ if isinstance(data, bytes):
+ # Handle binary data
+ print(f"Received {len(data)} bytes")
+ else:
+ # Handle text data
+ message = json.loads(data)
+ """
+ if not self.is_connected or self._adapter is None:
+ raise ConnectionError("Transport not connected")
+
+ try:
+ adapter_type = type(self._adapter).__name__
+ self.logger.debug(
+ f"🔍 Attempting to receive data (auto-detect) via {adapter_type}..."
+ )
+
+ data = await self._adapter.receive_auto()
+
+ if isinstance(data, bytes):
+ self.logger.debug(
+ f"✅ Received {len(data)} bytes (binary frame) successfully"
+ )
+ else:
+ self.logger.debug(
+ f"✅ Received {len(data)} chars (text frame) successfully"
+ )
+
+ return data
+ except ConnectionClosed as e:
+ self._state = TransportState.DISCONNECTED
+ self.logger.debug(f"Connection closed during receive: {e}")
+ raise ConnectionError(f"Connection closed: {e}") from e
+ except (ConnectionError, OSError) as e:
+ self._state = TransportState.ERROR
+ error_msg = str(e).lower()
+ if "closed" in error_msg or "not connected" in error_msg:
+ self.logger.debug(f"Cannot receive (connection closed): {e}")
+ else:
+ self.logger.warning(f"Connection error receiving data: {e}")
+ raise IOError(f"Failed to receive data: {e}") from e
+ except Exception as e:
+ self._state = TransportState.ERROR
+ self.logger.error(f"Error receiving data: {e}")
+ raise IOError(f"Failed to receive data: {e}") from e
+
@property
def websocket(self) -> Optional[WebSocketClientProtocol]:
"""
diff --git a/config/ufo/mcp.yaml b/config/ufo/mcp.yaml
index 0ef7842f2..442e26142 100644
--- a/config/ufo/mcp.yaml
+++ b/config/ufo/mcp.yaml
@@ -145,3 +145,20 @@ LinuxAgent:
port: 8010
path: "/mcp"
reset: false
+
+MobileAgent:
+ default:
+ data_collection:
+ - namespace: MobileDataCollector
+ type: http
+ host: "localhost"
+ port: 8020
+ path: "/mcp"
+ reset: false
+ action:
+ - namespace: MobileActionExecutor
+ type: http
+ host: "localhost"
+ port: 8021
+ path: "/mcp"
+ reset: false
diff --git a/config/ufo/third_party.yaml b/config/ufo/third_party.yaml
index 3766273a3..dd39c6874 100644
--- a/config/ufo/third_party.yaml
+++ b/config/ufo/third_party.yaml
@@ -3,7 +3,7 @@
# beyond the core Windows GUI automation
# Enabled Third-Party Agents
-ENABLED_THIRD_PARTY_AGENTS: ["HardwareAgent", "LinuxAgent"]
+ENABLED_THIRD_PARTY_AGENTS: ["HardwareAgent", "LinuxAgent", "MobileAgent"]
THIRD_PARTY_AGENT_CONFIG:
HardwareAgent:
@@ -19,3 +19,9 @@ THIRD_PARTY_AGENT_CONFIG:
APPAGENT_PROMPT: "ufo/prompts/third_party/linux_agent.yaml"
APPAGENT_EXAMPLE_PROMPT: "ufo/prompts/third_party/linux_agent_example.yaml"
INTRODUCTION: "For Linux Use Only."
+
+ MobileAgent:
+ AGENT_NAME: "MobileAgent"
+ APPAGENT_PROMPT: "ufo/prompts/third_party/mobile_agent.yaml"
+ APPAGENT_EXAMPLE_PROMPT: "ufo/prompts/third_party/mobile_agent_example.yaml"
+ INTRODUCTION: "For Android Mobile Device Control. Enables remote control and automation of Android devices via ADB and UI interactions."
diff --git a/documents/docs/galaxy/overview.md b/documents/docs/galaxy/overview.md
index 2595dd7dd..26738c969 100644
--- a/documents/docs/galaxy/overview.md
+++ b/documents/docs/galaxy/overview.md
@@ -364,6 +364,20 @@ devices:
logs_file_path: "/root/log/log1.txt"
auto_connect: true
max_retries: 5
+
+ - device_id: "mobile_agent_1"
+ server_url: "ws://localhost:5002/ws"
+ os: "android"
+ capabilities:
+ - "mobile"
+ - "adb"
+ - "ui_automation"
+ metadata:
+ os: "android"
+ performance: "medium"
+ device_type: "smartphone"
+ auto_connect: true
+ max_retries: 5
```
**`config/galaxy/constellation.yaml`** - Configure runtime settings:
@@ -387,19 +401,19 @@ See [Galaxy Configuration](../configuration/system/galaxy_devices.md) for comple
### 3. Start Device Agents
-On each device, launch the Agent Server:
+On each device, launch the Agent Server. For detailed setup instructions, see the respective quick start guides:
**On Windows:**
-```powershell
-# Start Agent Server on port 5005
-python -m ufo --mode agent-server --port 5005
-```
+
+See [Windows Agent (UFO²) Quick Start →](../getting_started/quick_start_ufo2.md)
**On Linux:**
-```bash
-# Start Agent Server on port 5001
-python -m ufo --mode agent-server --port 5001
-```
+
+See [Linux Agent Quick Start →](../getting_started/quick_start_linux.md)
+
+**On Mobile (Android):**
+
+See [Mobile Agent Quick Start →](../getting_started/quick_start_mobile.md)
### 4. Launch Galaxy Client
diff --git a/documents/docs/getting_started/quick_start_galaxy.md b/documents/docs/getting_started/quick_start_galaxy.md
index 3b8693699..66bc19755 100644
--- a/documents/docs/getting_started/quick_start_galaxy.md
+++ b/documents/docs/getting_started/quick_start_galaxy.md
@@ -118,8 +118,9 @@ Galaxy orchestrates **device agents** that execute tasks on individual machines.
|--------------|----------|---------------|-----------|
| **WindowsAgent (UFO²)** | Windows 10/11 | [UFO² as Galaxy Device](../ufo2/as_galaxy_device.md) | Desktop automation, Office apps, GUI operations |
| **LinuxAgent** | Linux | [Linux as Galaxy Device](../linux/as_galaxy_device.md) | Server management, CLI operations, log analysis |
+| **MobileAgent** | Android | [Mobile as Galaxy Device](../mobile/as_galaxy_device.md) | Mobile app automation, UI testing, device control |
-> **💡 Choose Your Devices:** You can use any combination of Windows and Linux agents. Galaxy will intelligently route tasks based on device capabilities.
+> **💡 Choose Your Devices:** You can use any combination of Windows, Linux, and Mobile agents. Galaxy will intelligently route tasks based on device capabilities.
### Quick Setup Overview
@@ -133,6 +134,7 @@ For each device agent you want to use, you need to:
- **For Windows devices (UFO²):** See [UFO² as Galaxy Device](../ufo2/as_galaxy_device.md) for complete step-by-step instructions.
- **For Linux devices:** See [Linux as Galaxy Device](../linux/as_galaxy_device.md) for complete step-by-step instructions.
+- **For Mobile devices:** See [Mobile as Galaxy Device](../mobile/as_galaxy_device.md) for complete step-by-step instructions.
### Example: Quick Windows Device Setup
@@ -171,6 +173,8 @@ python -m ufo.client.client \
python -m ufo.client.mcp.http_servers.linux_mcp_server
```
+> **💡 Note:** For detailed Mobile Agent setup with ADB and Android device configuration, see [Mobile Quick Start](quick_start_mobile.md).
+
---
## 🔌 Step 4: Configure Device Pool
@@ -236,6 +240,30 @@ devices:
description: "Development server for backend operations"
auto_connect: true
max_retries: 5
+
+ # Mobile Device (Android)
+ - device_id: "mobile_phone_1" # Must match --client-id
+ server_url: "ws://localhost:5001/ws" # Must match server WebSocket URL
+ os: "mobile"
+ capabilities:
+ - "mobile"
+ - "android"
+ - "ui_automation"
+ - "messaging"
+ - "camera"
+ - "location"
+ metadata:
+ os: "mobile"
+ device_type: "phone"
+ android_version: "13"
+ screen_size: "1080x2400"
+ installed_apps:
+ - "com.android.chrome"
+ - "com.google.android.apps.maps"
+ - "com.whatsapp"
+ description: "Android phone for mobile automation and testing"
+ auto_connect: true
+ max_retries: 5
```
> **⚠️ Critical:** IDs and URLs must match exactly:
@@ -638,8 +666,10 @@ capabilities:
- [UFO² as Galaxy Device](../ufo2/as_galaxy_device.md) - Complete Windows device setup
- [Linux as Galaxy Device](../linux/as_galaxy_device.md) - Complete Linux device setup
+- [Mobile as Galaxy Device](../mobile/as_galaxy_device.md) - Complete Android device setup
- [UFO² Overview](../ufo2/overview.md) - Windows desktop automation capabilities
- [Linux Agent Overview](../linux/overview.md) - Linux server automation capabilities
+- [Mobile Agent Overview](../mobile/overview.md) - Android mobile automation capabilities
### Configuration
diff --git a/documents/docs/getting_started/quick_start_mobile.md b/documents/docs/getting_started/quick_start_mobile.md
new file mode 100644
index 000000000..b093e6615
--- /dev/null
+++ b/documents/docs/getting_started/quick_start_mobile.md
@@ -0,0 +1,1478 @@
+# ⚡ Quick Start: Mobile Agent
+
+Get your Android device running as a UFO³ device agent in 10 minutes. This guide walks you through ADB setup, server/client configuration, and MCP service initialization for Android automation.
+
+> **📚 Documentation Navigation:**
+>
+> - **Architecture & Concepts:** [Mobile Agent Overview](../mobile/overview.md)
+> - **State Management:** [State Machine](../mobile/state.md)
+> - **Processing Pipeline:** [Processing Strategy](../mobile/strategy.md)
+> - **Available Commands:** [MCP Commands Reference](../mobile/commands.md)
+> - **Galaxy Integration:** [As Galaxy Device](../mobile/as_galaxy_device.md)
+
+---
+
+## 📋 Prerequisites
+
+Before you begin, ensure you have:
+
+- **Python 3.10+** installed on your computer
+- **UFO repository** cloned from [GitHub](https://github.com/microsoft/UFO)
+- **Android device** (physical device or emulator) with Android 5.0+ (API 21+)
+- **ADB (Android Debug Bridge)** installed and accessible
+- **USB debugging enabled** on your Android device (for physical devices)
+- **Network connectivity** between server and client machines
+- **LLM configured** in `config/ufo/agents.yaml` (see [Model Configuration](../configuration/models/overview.md))
+
+| Component | Minimum Version | Verification Command |
+|-----------|----------------|---------------------|
+| Python | 3.10 | `python --version` |
+| Android OS | 5.0 (API 21) | Check device settings |
+| ADB | Latest | `adb --version` |
+| LLM API Key | N/A | Check `config/ufo/agents.yaml` |
+
+> **⚠️ LLM Configuration Required:** The Mobile Agent uses the same LLM configuration as the AppAgent. Before starting, configure your LLM provider (OpenAI, Azure OpenAI, Gemini, Claude, etc.) and add your API keys to `config/ufo/agents.yaml`. See [Model Setup Guide](../configuration/models/overview.md) for detailed instructions.
+
+---
+
+## 📱 Step 0: Android Device Setup
+
+You can use either a **physical Android device** or an **Android emulator**. Choose the setup method that fits your needs.
+
+### Option A: Physical Android Device Setup
+
+#### 1. Enable Developer Options
+
+On your Android device:
+
+1. Open **Settings** → **About phone**
+2. Tap **Build number** 7 times
+3. You'll see "You are now a developer!"
+
+#### 2. Enable USB Debugging
+
+1. Go to **Settings** → **System** → **Developer options**
+2. Turn on **USB debugging**
+3. (Optional) Turn on **Stay awake** (device won't sleep while charging)
+
+#### 3. Connect Device to Computer
+
+**Via USB Cable:**
+
+```bash
+# Connect device via USB
+# On device, allow USB debugging when prompted
+
+# Verify connection
+adb devices
+```
+
+**Expected Output:**
+```
+List of devices attached
+XXXXXXXXXXXXXX device
+```
+
+**Via Wireless ADB (Android 11+):**
+
+```bash
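+# On device: Settings → Developer options → Wireless debugging
+# First connection only: tap "Pair device with pairing code", then on the computer run
+#   adb pair <ip>:<pairing-port>   (enter the code shown on the device)
+# Note the IP address and port shown on the Wireless debugging screen (e.g., 192.168.1.100:5555)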
+
+# On computer: Connect to device
+adb connect 192.168.1.100:5555
+
+# Verify connection
+adb devices
+```
+
+**Expected Output:**
+```
+List of devices attached
+192.168.1.100:5555 device
+```
+
+### Option B: Android Emulator Setup
+
+#### Option B1: Using Android Studio Emulator (Recommended)
+
+**Step 1: Install Android Studio**
+
+Download from: https://developer.android.com/studio
+
+**Windows:**
+```powershell
+# Download Android Studio installer
+# Run: android-studio-xxx.exe
+# Follow installation wizard
+```
+
+**macOS:**
+```bash
+# Download Android Studio DMG
+# Drag to Applications folder
+# Open Android Studio
+```
+
+**Linux:**
+```bash
+# Download Android Studio tarball
+tar -xzf android-studio-*.tar.gz
+cd android-studio/bin
+./studio.sh
+```
+
+**Step 2: Install Android SDK Components**
+
+1. Open Android Studio
+2. Go to **Tools** → **SDK Manager**
+3. Install:
+ - ✅ Android SDK Platform (API 33 or higher)
+ - ✅ Android SDK Platform-Tools
+ - ✅ Android SDK Build-Tools
+ - ✅ Android Emulator
+
+**Step 3: Create Virtual Device**
+
+1. In Android Studio, click **Device Manager** (phone icon)
+2. Click **Create Device**
+3. Select hardware:
+ - **Phone** category
+ - Choose **Pixel 6** or **Pixel 7** (recommended)
+ - Click **Next**
+
+4. Select system image:
+ - Choose **Release Name**: **Tiramisu** (Android 13, API 33) or newer
+ - Click **Download** if not installed
+ - Click **Next**
+
+5. Configure AVD:
+ - **AVD Name**: `Pixel_6_API_33` (or your choice)
+ - **Startup orientation**: Portrait
+ - **Graphics**: Automatic or Hardware
+ - Click **Finish**
+
+**Step 4: Start Emulator**
+
+**From Android Studio:**
+1. Open **Device Manager**
+2. Click ▶️ (Play button) next to your AVD
+
+**From Command Line:**
+```bash
+# List available emulators
+emulator -list-avds
+
+# Start emulator
+emulator -avd Pixel_6_API_33 &
+```
+
+**Step 5: Verify ADB Connection**
+
+```bash
+# Wait for emulator to fully boot (~1-2 minutes)
+adb devices
+```
+
+**Expected Output:**
+```
+List of devices attached
+emulator-5554 device
+```
+
+#### Option B2: Using Genymotion (Alternative)
+
+**Step 1: Install Genymotion**
+
+Download from: https://www.genymotion.com/download/
+
+```bash
+# Free personal edition available
+# Requires VirtualBox (auto-installed)
+```
+
+**Step 2: Create Virtual Device**
+
+1. Open Genymotion
+2. Click **+** (Add new device)
+3. Sign in with Genymotion account (free)
+4. Select device:
+ - **Google Pixel 6** or similar
+ - **Android 13.0** or newer
+5. Click **Install**
+6. Click **Start**
+
+**Step 3: Verify ADB Connection**
+
+```bash
+adb devices
+```
+
+**Expected Output:**
+```
+List of devices attached
+192.168.56.101:5555 device
+```
+
+### Verify Device is Ready
+
+Run this test to ensure device is accessible:
+
+```bash
+# Get device model
+adb shell getprop ro.product.model
+
+# Get Android version
+adb shell getprop ro.build.version.release
+
+# Test screenshot capability
+adb shell screencap -p /sdcard/test.png
+adb pull /sdcard/test.png .
+```
+
+If all commands succeed, your device is ready! ✅
+
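The readiness check above can also be scripted. The following is a minimal sketch that parses the text printed by `adb devices` and keeps only serials in the `device` state (as opposed to `offline` or `unauthorized`). The function names are illustrative and not part of UFO.

```python
def parse_adb_devices(output: str) -> dict[str, str]:
    """Map each device serial to its state from `adb devices` output."""
    devices: dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header line, and adb daemon notices ("* daemon ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def ready_devices(output: str) -> list[str]:
    """Return serials whose state is exactly "device" (authorized and online)."""
    return [serial for serial, state in parse_adb_devices(output).items()
            if state == "device"]


sample = (
    "List of devices attached\n"
    "emulator-5554\tdevice\n"
    "192.168.1.100:5555\tunauthorized\n"
)
print(ready_devices(sample))
```

In practice the `output` string would come from running `adb devices` (for example via `subprocess.run`), and an empty result means no device is ready for automation yet.
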
+---
+
+## 🔧 Step 1: Install ADB (Android Debug Bridge)
+
+ADB is essential for communicating with Android devices. Choose your platform:
+
+### Windows
+
+**Option 1: Install via Android Studio (Recommended)**
+
+ADB is included with Android Studio (see Step 0 Option B1).
+
+After installation, add to PATH:
+
+```powershell
+# Add Android SDK platform-tools to PATH
+# Default location (replace YourUsername with your Windows account name):
+$env:PATH += ";C:\Users\YourUsername\AppData\Local\Android\Sdk\platform-tools"
+
+# Test
+adb --version
+```
+
+**Option 2: Standalone ADB Installation**
+
+```powershell
+# Download platform-tools
+# https://developer.android.com/studio/releases/platform-tools
+
+# Extract to C:\adb
+# Add to PATH:
+$env:PATH += ";C:\adb"
+
+# Test
+adb --version
+```
+
+**Make PATH Permanent (Optional):**
+
+1. Open **System Properties** → **Environment Variables**
+2. Under **User variables**, edit **Path**
+3. Add: `C:\Users\YourUsername\AppData\Local\Android\Sdk\platform-tools`
+4. Click **OK**
+
+### macOS
+
+**Option 1: Via Homebrew (Recommended)**
+
+```bash
+# Install Homebrew (if not installed)
+/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+
+# Install ADB
+brew install android-platform-tools
+
+# Verify
+adb --version
+```
+
+**Option 2: Via Android Studio**
+
+ADB is included with Android Studio. Add to PATH:
+
+```bash
+# Add to ~/.zshrc or ~/.bash_profile
+export PATH="$PATH:$HOME/Library/Android/sdk/platform-tools"
+
+# Reload
+source ~/.zshrc
+
+# Test
+adb --version
+```
+
+### Linux
+
+**Ubuntu/Debian:**
+
+```bash
+sudo apt update
+sudo apt install -y adb
+
+# Verify
+adb --version
+```
+
+**Fedora/RHEL:**
+
+```bash
+sudo dnf install android-tools
+
+# Verify
+adb --version
+```
+
+**Arch Linux:**
+
+```bash
+sudo pacman -S android-tools
+
+# Verify
+adb --version
+```
+
+### Verify ADB Installation
+
+```bash
+adb version
+```
+
+**Expected Output:**
+```
+Android Debug Bridge version 1.0.41
+Version 34.0.5-10900879
+```
+
+---
+
+## 📦 Step 2: Install Python Dependencies
+
+Install all UFO dependencies:
+
+```bash
+cd /path/to/UFO
+pip install -r requirements.txt
+```
+
+**Verify installation:**
+
+```bash
+python -c "import ufo; print('✅ UFO installed successfully')"
+```
+
+> **Tip:** For production deployments, use a virtual environment:
+>
+> ```bash
+> python -m venv venv
+>
+> # Windows
+> venv\Scripts\activate
+>
+> # macOS/Linux
+> source venv/bin/activate
+>
+> pip install -r requirements.txt
+> ```
+
+---
+
+## 🖥️ Step 3: Start Device Agent Server
+
+**Server Component:** The Device Agent Server manages connections from Android devices and dispatches tasks.
+
+### Basic Server Startup
+
+On your computer (where Python is installed):
+
+```bash
+python -m ufo.server.app --port 5001 --platform mobile
+```
+
+**Expected Output:**
+
+```console
+INFO - Starting UFO Server on 0.0.0.0:5001
+INFO - Platform: mobile
+INFO - Log level: WARNING
+INFO: Started server process [12345]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Uvicorn running on http://0.0.0.0:5001 (Press CTRL+C to quit)
+```
+
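+Once you see "Uvicorn running", the server is listening on port 5001 on all network interfaces (`0.0.0.0`). Clients on the same machine connect to `ws://localhost:5001/ws`; clients on other machines use the server's IP address instead of `localhost`.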
+
+### Server Configuration Options
+
+| Argument | Default | Description | Example |
+|----------|---------|-------------|---------|
+| `--port` | `5000` | Server listening port | `--port 5001` |
+| `--host` | `0.0.0.0` | Bind address | `--host 127.0.0.1` |
+| `--platform` | Auto | Platform override | `--platform mobile` |
+| `--log-level` | `WARNING` | Logging verbosity | `--log-level DEBUG` |
+
+**Custom Configuration Examples:**
+
+```bash
+# Different port
+python -m ufo.server.app --port 8080 --platform mobile
+
+# Localhost only
+python -m ufo.server.app --host 127.0.0.1 --port 5001 --platform mobile
+
+# Debug mode
+python -m ufo.server.app --port 5001 --platform mobile --log-level DEBUG
+```
+
+### Verify Server is Running
+
+```bash
+curl http://localhost:5001/api/health
+```
+
+**Expected Response (when no clients connected):**
+
+```json
+{
+ "status": "healthy",
+ "online_clients": []
+}
+```
+
+> **💡 Tip:** The `online_clients` list will be empty until you start and connect the Mobile Client in Step 5.
+
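When the server is started from a script, it can help to wait for the health endpoint before launching the client. This is a minimal sketch that assumes the `/api/health` URL and the `"status": "healthy"` payload shown above. The helper names, retry count, and delay are illustrative choices, not UFO settings.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str = "http://localhost:5001/api/health") -> dict:
    """Fetch and decode the health endpoint's JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(fetch: Callable[[], dict] = fetch_health,
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll until the payload reports status "healthy" or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except OSError:
            # Connection refused or timed out: the server is probably still starting.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` with no arguments polls the default URL. Passing a custom `fetch` callable makes the loop easy to exercise without a running server.
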
+---
+
+## 🔌 Step 4: Start MCP Services (Android Machine)
+
+**MCP Service Component:** Two MCP servers provide Android device interaction capabilities. They must be running before starting the client.
+
+> **💡 Learn More:** For detailed documentation on all available MCP commands and their usage, see the [MCP Commands Reference](../mobile/commands.md).
+
+### Understanding the Two MCP Servers
+
+MobileAgent uses **two separate MCP servers** for different responsibilities:
+
+| Server | Port | Purpose | Tools |
+|--------|------|---------|-------|
+| **Data Collection** | 8020 | Screenshot, UI tree, device info, apps list | 5 read-only tools |
+| **Action** | 8021 | Touch actions, typing, app launching | 8 control tools |
+
+### Start Both MCP Servers
+
+**Recommended: Start Both Servers Together**
+
+On the machine with ADB access to your Android device:
+
+```bash
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --host localhost \
+ --data-port 8020 \
+ --action-port 8021 \
+ --server both
+```
+
+**Expected Output:**
+
+```console
+====================================================================
+UFO Mobile MCP Servers (Android)
+Android device control via ADB and Model Context Protocol
+====================================================================
+
+Using ADB: adb
+Checking ADB connection...
+
+List of devices attached
+emulator-5554 device
+
+✅ Found 1 connected device(s)
+====================================================================
+
+🚀 Starting both servers on localhost (shared state)
+ - Data Collection Server: localhost:8020
+ - Action Server: localhost:8021
+
+Note: Both servers share the same MobileServerState for caching
+
+✅ Starting both servers in same process (shared MobileServerState)
+ - Data Collection Server: localhost:8020
+ - Action Server: localhost:8021
+
+======================================================================
+Both servers share MobileServerState cache. Press Ctrl+C to stop.
+======================================================================
+
+✅ Data Collection Server thread started
+✅ Action Server thread started
+
+======================================================================
+Both servers are running. Press Ctrl+C to stop.
+======================================================================
+```
+
+**Alternative: Start Servers Separately**
+
+If needed, you can start each server in separate terminals:
+
+**Terminal 1: Data Collection Server**
+```bash
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --host localhost \
+ --data-port 8020 \
+ --server data
+```
+
+**Terminal 2: Action Server**
+```bash
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --host localhost \
+ --action-port 8021 \
+ --server action
+```
+
+> **⚠️ Important:** When running servers separately, they won't share cached state, which may impact performance. Running both together is recommended.
+
+### MCP Server Configuration Options
+
+| Argument | Default | Description | Example |
+|----------|---------|-------------|---------|
+| `--host` | `localhost` | Server host | `--host 127.0.0.1` |
+| `--data-port` | `8020` | Data collection server port | `--data-port 8020` |
+| `--action-port` | `8021` | Action server port | `--action-port 8021` |
+| `--server` | `both` | Which server(s) to start | `--server both` |
+| `--adb-path` | `adb` | Path to ADB executable | `--adb-path /path/to/adb` |
+
+### Verify MCP Servers are Running
+
+**Check Data Collection Server:**
+```bash
+curl http://localhost:8020/health
+```
+
+**Check Action Server:**
+```bash
+curl http://localhost:8021/health
+```
+
+Both should return a health status response indicating the server is operational.
+
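If `curl` is not available, a plain TCP connection test gives a rough equivalent. This sketch only checks that something is listening on the two default ports (8020 and 8021). It does not call the `/health` route, so treat a positive result as a hint rather than proof that the MCP servers are healthy. The function names are illustrative.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_mcp_ports(host: str = "localhost",
                    ports: tuple[int, ...] = (8020, 8021)) -> dict[int, bool]:
    """Report, per port, whether a listener accepted a connection."""
    return {port: port_is_open(host, port) for port in ports}
```

`check_mcp_ports()` returns a mapping such as `{8020: True, 8021: False}`, which makes it clear which of the two servers still needs to be started.
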
+### What if ADB is not in PATH?
+
+If ADB is not in your system PATH, specify the full path:
+
+**Windows:**
+```powershell
+python -m ufo.client.mcp.http_servers.mobile_mcp_server `
+  --adb-path "C:\Users\YourUsername\AppData\Local\Android\Sdk\platform-tools\adb.exe" `
+  --server both
+```
+
+**macOS:**
+```bash
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --adb-path "$HOME/Library/Android/sdk/platform-tools/adb" \
+ --server both
+```
+
+**Linux:**
+```bash
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --adb-path /usr/bin/adb \
+ --server both
+```
+
+---
+
+## 📱 Step 5: Start Device Agent Client
+
+**Client Component:** The Device Agent Client connects your Android device to the server and executes mobile automation tasks.
+
+### Basic Client Startup
+
+On your computer (same machine as MCP servers):
+
+```bash
+python -m ufo.client.client \
+ --ws \
+ --ws-server ws://localhost:5001/ws \
+ --client-id mobile_phone_1 \
+ --platform mobile
+```
+
+### Client Parameters Explained
+
+| Parameter | Required | Description | Example |
+|-----------|----------|-------------|---------|
+| `--ws` | ✅ Yes | Enable WebSocket mode | `--ws` |
+| `--ws-server` | ✅ Yes | Server WebSocket URL | `ws://localhost:5001/ws` |
+| `--client-id` | ✅ Yes | **Unique** device identifier | `mobile_phone_1` |
+| `--platform` | ✅ Yes | Platform type (must be `mobile`) | `--platform mobile` |
+
+> **⚠️ Critical Requirements:**
+>
+> 1. `--client-id` must be globally unique - No two devices can share the same ID
+> 2. `--platform mobile` is mandatory - Without this flag, the Mobile Agent won't work correctly
+> 3. Server address must be correct - Use actual server IP if not on localhost
+
+### Understanding the WebSocket URL
+
+The `--ws-server` parameter format is:
+
+```
+ws://<server-ip>:<port>/ws
+```
+
+Examples:
+
+| Scenario | WebSocket URL | Description |
+|----------|---------------|-------------|
+| **Same Machine** | `ws://localhost:5001/ws` | Server and client on same computer |
+| **Same Network** | `ws://192.168.1.100:5001/ws` | Server on local network |
+| **Remote Server** | `ws://203.0.113.50:5001/ws` | Server on internet (public IP) |
+
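A malformed URL is an easy mistake to make here, so it can be worth validating the value before passing it to `--ws-server`. The sketch below assumes the shape used in the examples above: the `ws` scheme, a host, an explicit port, and the `/ws` path. `check_ws_url` is a hypothetical helper, not part of the UFO client.

```python
from urllib.parse import urlsplit


def check_ws_url(url: str) -> tuple[str, int]:
    """Validate a server WebSocket URL and return its (host, port)."""
    parts = urlsplit(url)
    if parts.scheme != "ws":
        raise ValueError(f"expected ws:// scheme, got {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL is missing a host")
    if parts.port is None:
        raise ValueError("URL is missing an explicit port")
    if parts.path != "/ws":
        raise ValueError(f"expected path /ws, got {parts.path!r}")
    return parts.hostname, parts.port


print(check_ws_url("ws://192.168.1.100:5001/ws"))
```

The returned port can then be compared with the `--port` value the server was started with.
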
+### Connection Success Indicators
+
+**Client Logs:**
+
+```log
+INFO - Platform detected/specified: mobile
+INFO - UFO Client initialized for platform: mobile
+INFO - [WS] Connecting to ws://localhost:5001/ws (attempt 1/5)
+INFO - [WS] [AIP] Successfully registered as mobile_phone_1
+INFO - [WS] Heartbeat loop started (interval: 30s)
+```
+
+**Server Logs:**
+
+```log
+INFO - [WS] ✅ Registered device client: mobile_phone_1
+INFO - [WS] Device mobile_phone_1 platform: mobile
+```
+
+When you see "Successfully registered", the client is connected and ready to receive tasks. ✅
+
+### Verify Connection
+
+```bash
+# Check connected clients on server
+curl http://localhost:5001/api/clients
+```
+
+**Expected Response:**
+
+```json
+{
+ "online_clients": ["mobile_phone_1"]
+}
+```
+
+> **Note:** The response shows only client IDs. For detailed information about each client, check the server logs.
+
+---
+
+## 🎯 Step 6: Dispatch Tasks via HTTP API
+
+Once the server, client, and MCP services are all running, you can dispatch tasks to your Android device through the server's HTTP API.
+
+### API Endpoint
+
+```
+POST http://<server-ip>:<port>/api/dispatch
+```
+
+### Request Format
+
+```json
+{
+ "client_id": "mobile_phone_1",
+ "request": "Your natural language task description",
+ "task_name": "optional_task_identifier"
+}
+```
+
+### Example 1: Launch an App
+
+**Using cURL:**
+```bash
+curl -X POST http://localhost:5001/api/dispatch \
+ -H "Content-Type: application/json" \
+ -d '{
+ "client_id": "mobile_phone_1",
+ "request": "Open Google Chrome browser",
+ "task_name": "launch_chrome"
+ }'
+```
+
+**Using Python:**
+```python
+import requests
+
+response = requests.post(
+ "http://localhost:5001/api/dispatch",
+ json={
+ "client_id": "mobile_phone_1",
+ "request": "Open Google Chrome browser",
+ "task_name": "launch_chrome"
+ }
+)
+print(response.json())
+```
+
+**Successful Response:**
+
+```json
+{
+ "status": "dispatched",
+ "task_name": "launch_chrome",
+ "client_id": "mobile_phone_1",
+ "session_id": "550e8400-e29b-41d4-a716-446655440000"
+}
+```
+
+### Example 2: Search on Maps
+
+```bash
+curl -X POST http://localhost:5001/api/dispatch \
+ -H "Content-Type: application/json" \
+ -d '{
+ "client_id": "mobile_phone_1",
+ "request": "Open Google Maps and search for coffee shops nearby",
+ "task_name": "search_coffee"
+ }'
+```
+
+### Example 3: Type and Submit Text
+
+```bash
+curl -X POST http://localhost:5001/api/dispatch \
+ -H "Content-Type: application/json" \
+ -d '{
+ "client_id": "mobile_phone_1",
+ "request": "Open Chrome, search for weather forecast, and show me the results",
+ "task_name": "check_weather"
+ }'
+```
+
+### Example 4: Take Screenshot
+
+```bash
+curl -X POST http://localhost:5001/api/dispatch \
+ -H "Content-Type: application/json" \
+ -d '{
+ "client_id": "mobile_phone_1",
+ "request": "Take a screenshot of the current screen",
+ "task_name": "capture_screen"
+ }'
+```
+
+### Task Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant API as HTTP Client
+ participant Server as Agent Server
+ participant Client as Mobile Client
+ participant MCP as MCP Services
+ participant Device as Android Device
+
+ Note over API,Server: 1. Task Submission
+ API->>Server: POST /api/dispatch {client_id, request}
+ Server->>Server: Generate session_id
+ Server-->>API: {status: dispatched, session_id}
+
+ Note over Server,Client: 2. Task Assignment
+ Server->>Client: TASK_ASSIGNMENT (via WebSocket)
+ Client->>Client: Initialize Mobile Agent
+
+ Note over Client,MCP: 3. Data Collection
+ Client->>MCP: Capture screenshot
+ Client->>MCP: Get installed apps
+ Client->>MCP: Get UI controls
+ MCP->>Device: ADB commands
+ Device-->>MCP: Screenshot + Apps + Controls
+ MCP-->>Client: Visual context
+
+ Note over Client: 4. LLM Decision
+ Client->>Client: Construct prompt with screenshots
+ Client->>Client: Get action from LLM
+
+ Note over Client,MCP: 5. Action Execution
+ Client->>MCP: Execute mobile action (tap, swipe, launch_app, etc.)
+ MCP->>Device: ADB input commands
+ Device-->>MCP: Action result
+ MCP-->>Client: Success/Failure
+
+ Note over Client,Server: 6. Result Reporting
+ Client->>Server: TASK_RESULT {status, screenshots, actions}
+ Server-->>API: Task completed
+```
+
+### Request Parameters
+
+| Field | Required | Type | Description | Example |
+|-------|----------|------|-------------|---------|
+| `client_id` | ✅ Yes | string | Target mobile device ID (must match `--client-id`) | `"mobile_phone_1"` |
+| `request` | ✅ Yes | string | Natural language task description | `"Open Chrome"` |
+| `task_name` | ❌ Optional | string | Unique task identifier (auto-generated if omitted) | `"task_001"` |
+
+> **⚠️ Client Must Be Online:** If the `client_id` is not connected, you'll receive:
+> ```json
+> {
+> "detail": "Client not online"
+> }
+> ```
+>
+> Verify the client is connected:
+> ```bash
+> curl http://localhost:5001/api/clients
+> ```
+
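The request rules above can be captured in two small helpers: one builds the JSON body, the other checks the `/api/clients` response before dispatching. This is a sketch under the field names documented in this guide. The helper names are illustrative, and the HTTP calls themselves are left to the caller (for example with `requests`, as in Example 1).

```python
from typing import Optional


def build_dispatch_payload(client_id: str, request: str,
                           task_name: Optional[str] = None) -> dict:
    """Build the JSON body for POST /api/dispatch."""
    if not client_id or not request:
        raise ValueError("client_id and request are required")
    payload = {"client_id": client_id, "request": request}
    # task_name is optional; the guide states it is auto-generated if omitted.
    if task_name:
        payload["task_name"] = task_name
    return payload


def is_client_online(clients_response: dict, client_id: str) -> bool:
    """Check a decoded /api/clients response for the given client ID."""
    return client_id in clients_response.get("online_clients", [])


clients = {"online_clients": ["mobile_phone_1"]}
if is_client_online(clients, "mobile_phone_1"):
    print(build_dispatch_payload("mobile_phone_1", "Open Chrome", "task_001"))
```

Checking the client list first avoids the "Client not online" error for the common case, although a client can still disconnect between the check and the dispatch.
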
+---
+
+## 🌌 Step 7: Configure as UFO³ Galaxy Device
+
+To use the Mobile Agent as a managed device within the **UFO³ Galaxy** multi-tier framework, you need to register it in the `devices.yaml` configuration file.
+
+> **📖 Detailed Guide:** For comprehensive information on using Mobile Agent in Galaxy, including multi-device workflows and advanced configuration, see [Using Mobile Agent as Galaxy Device](../mobile/as_galaxy_device.md).
+
+### Device Configuration File
+
+The Galaxy configuration is located at:
+
+```
+config/galaxy/devices.yaml
+```
+
+### Add Mobile Agent Configuration
+
+Edit `config/galaxy/devices.yaml` and add your Mobile agent:
+
+```yaml
+devices:
+  - device_id: "mobile_phone_1"
+    server_url: "ws://localhost:5001/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "messaging"
+      - "maps"
+      - "camera"
+    metadata:
+      os: "mobile"
+      device_type: "phone"
+      android_version: "13"
+      screen_size: "1080x2400"
+      installed_apps:
+        - "com.android.chrome"
+        - "com.google.android.apps.maps"
+        - "com.whatsapp"
+      description: "Android phone for mobile automation"
+    auto_connect: true
+    max_retries: 5
+```
+
+### Configuration Fields Explained
+
+| Field | Required | Type | Description | Example |
+|-------|----------|------|-------------|---------|
+| `device_id` | ✅ Yes | string | **Must match client `--client-id`** | `"mobile_phone_1"` |
+| `server_url` | ✅ Yes | string | **Must match server WebSocket URL** | `"ws://localhost:5001/ws"` |
+| `os` | ✅ Yes | string | Operating system | `"mobile"` |
+| `capabilities` | ❌ Optional | list | Device capabilities | `["mobile", "android"]` |
+| `metadata` | ❌ Optional | dict | Custom metadata | See below |
+| `auto_connect` | ❌ Optional | boolean | Auto-connect on Galaxy startup | `true` |
+| `max_retries` | ❌ Optional | integer | Connection retry attempts | `5` |
+
+### Metadata Fields (Custom)
+
+The `metadata` section provides context to the LLM:
+
+| Field | Purpose | Example |
+|-------|---------|---------|
+| `device_type` | Phone, tablet, emulator | `"phone"` |
+| `android_version` | OS version | `"13"` |
+| `screen_size` | Resolution | `"1080x2400"` |
+| `installed_apps` | Available apps | `["com.android.chrome", ...]` |
+| `description` | Human-readable description | `"Personal phone"` |
+
+### Multiple Mobile Devices Example
+
+```yaml
+devices:
+  # Personal Phone
+  - device_id: "mobile_phone_personal"
+    server_url: "ws://192.168.1.100:5001/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "messaging"
+      - "whatsapp"
+      - "maps"
+    metadata:
+      os: "mobile"
+      device_type: "phone"
+      android_version: "13"
+      installed_apps:
+        - "com.whatsapp"
+        - "com.google.android.apps.maps"
+      description: "Personal Android phone"
+    auto_connect: true
+    max_retries: 5
+
+  # Work Phone
+  - device_id: "mobile_phone_work"
+    server_url: "ws://192.168.1.101:5002/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "email"
+      - "teams"
+    metadata:
+      os: "mobile"
+      device_type: "phone"
+      android_version: "12"
+      installed_apps:
+        - "com.microsoft.office.outlook"
+        - "com.microsoft.teams"
+      description: "Work Android phone"
+    auto_connect: true
+    max_retries: 5
+
+  # Tablet
+  - device_id: "mobile_tablet_home"
+    server_url: "ws://192.168.1.102:5003/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "tablet"
+      - "media"
+    metadata:
+      os: "mobile"
+      device_type: "tablet"
+      android_version: "13"
+      screen_size: "2560x1600"
+      installed_apps:
+        - "com.netflix.mediaclient"
+      description: "Home tablet for media"
+    auto_connect: true
+    max_retries: 5
+```
+
+### Critical Requirements
+
+> **⚠️ Configuration Validation - These fields MUST match exactly:**
+>
+> 1. **`device_id` in YAML** ↔ **`--client-id` in client command**
+> 2. **`server_url` in YAML** ↔ **`--ws-server` in client command**
+>
+> **If these don't match, Galaxy cannot control the device!**
+
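The two matching rules above are easy to check mechanically. The sketch below takes the already-parsed `devices` list (for instance loaded from `config/galaxy/devices.yaml` with a YAML parser of your choice, which is an assumption about your tooling) and compares it with the values passed to the client. `find_mismatches` is a hypothetical helper, not a Galaxy API.

```python
def find_mismatches(devices: list[dict], client_id: str, ws_server: str) -> list[str]:
    """Return human-readable problems; an empty list means the values match."""
    problems: list[str] = []
    entry = next((d for d in devices if d.get("device_id") == client_id), None)
    if entry is None:
        problems.append(f"no device entry with device_id {client_id!r}")
    elif entry.get("server_url") != ws_server:
        problems.append(
            f"server_url {entry.get('server_url')!r} does not match --ws-server {ws_server!r}"
        )
    return problems


devices = [{"device_id": "mobile_phone_1", "server_url": "ws://localhost:5001/ws"}]
print(find_mismatches(devices, "mobile_phone_1", "ws://localhost:5001/ws"))
```

The comparison is a strict string match on purpose, since the guide requires the values to match exactly.
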
+### Using Galaxy to Control Mobile Agents
+
+Once configured, launch Galaxy:
+
+```bash
+python -m galaxy --interactive
+```
+
+**Galaxy will:**
+1. ✅ Load device configuration from `config/galaxy/devices.yaml`
+2. ✅ Connect to all configured Android devices
+3. ✅ Orchestrate multi-device tasks
+4. ✅ Route tasks based on capabilities
+
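To make capability-based routing concrete, here is a simplified illustration: select the devices whose `capabilities` list contains every capability a task needs. This is not Galaxy's actual routing logic, which this guide does not describe in detail. The function name and the subset rule are assumptions for the sake of the example.

```python
def devices_with_capabilities(devices: list[dict], required: set[str]) -> list[str]:
    """Return device_ids whose capabilities include every required capability."""
    return [
        d["device_id"]
        for d in devices
        if required <= set(d.get("capabilities", []))
    ]


devices = [
    {"device_id": "mobile_phone_personal",
     "capabilities": ["mobile", "android", "messaging", "whatsapp", "maps"]},
    {"device_id": "mobile_phone_work",
     "capabilities": ["mobile", "android", "email", "teams"]},
    {"device_id": "mobile_tablet_home",
     "capabilities": ["mobile", "android", "tablet", "media"]},
]
print(devices_with_capabilities(devices, {"android", "email"}))
```

Descriptive capability lists in `devices.yaml` matter for the same reason: they are the information available for deciding which device should receive a task.
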
+> **ℹ️ Galaxy Documentation:** For detailed Galaxy usage, see:
+>
+> - [Galaxy Overview](../galaxy/overview.md)
+> - [Galaxy Quick Start](quick_start_galaxy.md)
+> - [Mobile Agent as Galaxy Device](../mobile/as_galaxy_device.md)
+
+---
+
+## Understanding Mobile Agent Internals
+
+Now that you have Mobile Agent running, you may want to understand how it works under the hood:
+
+### State Machine
+
+Mobile Agent uses a **3-state finite state machine** to manage task execution:
+
+- **CONTINUE** - Active execution, processing user requests
+- **FINISH** - Task completed successfully
+- **FAIL** - Unrecoverable error occurred
+
+Learn more: [State Machine Documentation](../mobile/state.md)
+
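As a rough mental model, the three states can be sketched as an enum plus a loop that keeps stepping while the state is CONTINUE. This is an illustration only and not the Mobile Agent's implementation. `AgentState`, `run_task`, and the `max_steps` safety limit are hypothetical names and choices.

```python
from enum import Enum
from typing import Callable


class AgentState(Enum):
    CONTINUE = "CONTINUE"  # active execution
    FINISH = "FINISH"      # task completed successfully
    FAIL = "FAIL"          # unrecoverable error


def run_task(step: Callable[[int], AgentState], max_steps: int = 20) -> AgentState:
    """Run `step` repeatedly until it leaves CONTINUE or the step budget is spent."""
    state = AgentState.CONTINUE
    rounds = 0
    while state is AgentState.CONTINUE:
        if rounds >= max_steps:
            # Assumption: treat an exhausted step budget as a failure.
            return AgentState.FAIL
        state = step(rounds)
        rounds += 1
    return state


# A toy step function that finishes on its third round.
print(run_task(lambda i: AgentState.FINISH if i >= 2 else AgentState.CONTINUE))
```

FINISH and FAIL are terminal in this model: once either is reached, the loop stops and the result is reported.
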
+### Processing Pipeline
+
+During the CONTINUE state, Mobile Agent executes a **4-phase pipeline**:
+
+1. **Data Collection** - Capture screenshots, get apps, collect UI controls
+2. **LLM Interaction** - Send visual context to LLM for decision making
+3. **Action Execution** - Execute mobile actions (tap, swipe, type, etc.)
+4. **Memory Update** - Record actions and results for context
+
+Learn more: [Processing Strategy Documentation](../mobile/strategy.md)
+
+### Available Commands
+
+Mobile Agent uses **13 MCP commands** across two servers:
+
+- **Data Collection Server (8020)**: 5 read-only commands
+- **Action Server (8021)**: 8 control commands
+
+Learn more: [MCP Commands Reference](../mobile/commands.md)
+
+---
+
+## 🐛 Common Issues & Troubleshooting
+
+### Issue 1: ADB Device Not Found
+
+**Error: No Devices Detected**
+
+Symptoms:
+```bash
+$ adb devices
+List of devices attached
+# Empty list
+```
+
+**Solutions:**
+
+**For Physical Devices:**
+
+1. **Check USB connection:**
+ - Use a different USB cable (some cables are charge-only)
+ - Try a different USB port on your computer
+ - Ensure USB debugging is enabled on device
+
+2. **Authorize computer on device:**
+ - Disconnect and reconnect USB
+ - On device, tap "Allow USB debugging" when prompted
+ - Check "Always allow from this computer"
+
+3. **Restart ADB server:**
+ ```bash
+ adb kill-server
+ adb start-server
+ adb devices
+ ```
+
+4. **Check USB driver (Windows):**
+ - Install Google USB Driver via Android Studio SDK Manager
+ - Or install device-specific driver from manufacturer
+
+**For Emulators:**
+
+1. **Wait for emulator to fully boot** (can take 1-2 minutes)
+
+2. **Restart emulator:**
+ - Close emulator completely
+ - Start emulator again from Android Studio or command line
+
+3. **Check emulator is running:**
+ ```bash
+ emulator -list-avds
+ emulator -avd Pixel_6_API_33
+ ```
+
+### Issue 2: MCP Server Cannot Connect to Device
+
+**Error: ADB Connection Failed**
+
+Symptoms:
+```log
+ERROR - Failed to execute ADB command
+ERROR - Device not accessible
+```
+
+**Solutions:**
+
+1. **Verify ADB connection first:**
+ ```bash
+ adb devices
+ ```
+ Device should show "device" status (not "offline" or "unauthorized")
+
+2. **Test ADB commands manually:**
+ ```bash
+ adb shell getprop ro.product.model
+ adb shell screencap -p /sdcard/test.png
+ ```
+
+3. **Restart MCP servers with an explicit ADB path:**
+ ```bash
+ # Kill existing servers
+ pkill -f mobile_mcp_server
+
+ # Start with explicit ADB path
+ python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --adb-path $(which adb) \
+ --server both
+ ```
+
+4. **Check device permissions:**
+ - Ensure USB debugging is still authorized
+ - Revoke and re-grant USB debugging authorization on device
+
+### Issue 3: Client Cannot Connect to Server
+
+**Error: Connection Refused or Failed**
+
+Symptoms:
+```log
+ERROR - [WS] Failed to connect to ws://localhost:5001/ws
+Connection refused
+```
+
+**Solutions:**
+
+1. **Verify server is running:**
+ ```bash
+ curl http://localhost:5001/api/health
+ ```
+
+ Should return:
+ ```json
+ {
+ "status": "healthy",
+ "online_clients": []
+ }
+ ```
+
+2. **Check server address:**
+ - If server and client are on different machines, use server's IP address
+ - Replace `localhost` with actual IP address (e.g., `ws://192.168.1.100:5001/ws`)
+ - Ensure the port number matches the server's `--port` argument
+
+3. **Check firewall settings:**
+ ```bash
+ # Windows: Allow port 5001
+ netsh advfirewall firewall add rule name="UFO Server" dir=in action=allow protocol=TCP localport=5001
+
+ # macOS: System Preferences → Security & Privacy → Firewall → Firewall Options
+
+ # Linux (Ubuntu):
+ sudo ufw allow 5001/tcp
+ ```
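
When the client and server run on different machines, it helps to derive the health-check URL from the same WebSocket address the client uses, so both point at the same host and port. A small sketch, assuming the `/api/health` path shown above; the function name `health_url_from_ws` is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def health_url_from_ws(ws_url: str) -> str:
    """Turn ws://host:port/ws into http://host:port/api/health."""
    parts = urlsplit(ws_url)
    scheme = {"ws": "http", "wss": "https"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"expected a ws:// or wss:// URL, got {ws_url!r}")
    # Keep host and port, swap only the scheme and the path.
    return urlunsplit((scheme, parts.netloc, "/api/health", "", ""))


print(health_url_from_ws("ws://192.168.1.100:5001/ws"))
```

The resulting URL can be passed to `curl` as in step 1; a refused connection there usually means an address, port, or firewall problem rather than a client problem.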
+
+### Issue 4: Missing `--platform mobile` Flag
+
+**Error: Incorrect Agent Type**
+
+Symptoms:
+- Client connects but cannot execute mobile commands
+- Server logs show wrong platform type
+- Tasks fail with "unsupported operation" errors
+
+**Solution:**
+
+Always include `--platform mobile` when starting the client:
+
+```bash
+# Wrong (missing platform)
+python -m ufo.client.client --ws --client-id mobile_phone_1
+
+# Correct
+python -m ufo.client.client \
+ --ws \
+ --client-id mobile_phone_1 \
+ --platform mobile
+```
+
+### Issue 5: Screenshot Capture Fails
+
+**Error: Cannot Capture Screenshot**
+
+Symptoms:
+```log
+ERROR - Failed to capture screenshot
+ERROR - screencap command failed
+```
+
+**Solutions:**
+
+1. **Test screenshot manually:**
+ ```bash
+ adb shell screencap -p /sdcard/test.png
+ adb pull /sdcard/test.png .
+ ```
+
+2. **Check device storage:**
+ ```bash
+ adb shell df -h /sdcard
+ ```
+ Ensure sufficient space on device
+
+3. **Check permissions:**
+ ```bash
+ adb shell ls -l /sdcard
+ ```
+
+4. **Try alternative screenshot method:**
+ ```bash
+ adb exec-out screencap -p > screenshot.png
+ ```
+
+### Issue 6: UI Controls Not Found
+
+**Error: Control Information Missing**
+
+Symptoms:
+```log
+WARNING - Failed to get UI controls
+WARNING - UI tree dump failed
+```
+
+**Solutions:**
+
+1. **Test UI dump manually:**
+ ```bash
+ adb shell uiautomator dump /sdcard/window_dump.xml
+ adb shell cat /sdcard/window_dump.xml
+ ```
+
+2. **Enable accessibility services:**
+ - Some apps require accessibility services for UI automation
+ - Settings → Accessibility → Enable required services
+
+3. **Update Android WebView:**
+ - Old WebView versions may cause UI dump issues
+ - Update via Play Store: Android System WebView
+
+4. **Restart device:**
+ ```bash
+ adb reboot
+ # Wait for device to restart
+ adb wait-for-device
+ ```
+
+### Issue 7: Emulator Too Slow
+
+**Error: Performance Issues**
+
+Symptoms:
+- Emulator lags or freezes
+- Actions take very long to execute
+- Timeouts occur frequently
+
+**Solutions:**
+
+1. **Enable Hardware Acceleration:**
+ - **Windows:** Ensure Hyper-V or Intel HAXM is enabled
+ - **macOS:** Hypervisor.framework is used automatically
+ - **Linux:** Install KVM
+
+2. **Allocate More Resources:**
+ - In Android Studio AVD Manager, edit AVD
+ - Increase RAM to 2048 MB or higher
+ - Increase VM heap to 512 MB
+ - Set Graphics to "Hardware - GLES 2.0"
+
+3. **Use x86_64 System Image:**
+ - Faster than ARM images
+ - Download x86_64 image in SDK Manager
+
+4. **Reduce Screen Resolution:**
+ - Edit AVD settings
+ - Choose lower resolution (e.g., 720x1280 instead of 1080x2400)
+
+### Issue 8: Multiple Devices Connected
+
+**Error: More Than One Device**
+
+Symptoms:
+```bash
+$ adb devices
+List of devices attached
+emulator-5554 device
+192.168.1.100:5555 device
+```
+
+**Solutions:**
+
+1. **Specify device for ADB:**
+ ```bash
+ # Use emulator
+ export ANDROID_SERIAL=emulator-5554
+
+ # Use physical device
+ export ANDROID_SERIAL=192.168.1.100:5555
+ ```
+
+2. **Disconnect other devices:**
+ ```bash
+ # Disconnect wireless device
+ adb disconnect 192.168.1.100:5555
+ ```
+
+3. **Run separate MCP servers:**
+ ```bash
+ # Server for emulator
+ ANDROID_SERIAL=emulator-5554 python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --data-port 8020 --action-port 8021 --server both
+
+ # Server for physical device
+ ANDROID_SERIAL=192.168.1.100:5555 python -m ufo.client.mcp.http_servers.mobile_mcp_server --data-port 8022 --action-port 8023 --server both
+ ```
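
With more than two devices, the port pairs are easy to mix up. The sketch below generates one command per device using the same flags as step 3; the pairing rule (two consecutive ports per device, starting at 8020) and the name `plan_servers` are assumptions for illustration, not a UFO convention.

```python
import shlex

MODULE = "ufo.client.mcp.http_servers.mobile_mcp_server"


def plan_servers(serials, first_port=8020):
    """Assign a (data, action) port pair to every ADB serial."""
    plans = []
    for index, serial in enumerate(serials):
        data_port = first_port + 2 * index
        argv = [
            "python", "-m", MODULE,
            "--data-port", str(data_port),
            "--action-port", str(data_port + 1),
            "--server", "both",
        ]
        plans.append({"env": {"ANDROID_SERIAL": serial}, "argv": argv})
    return plans


for plan in plan_servers(["emulator-5554", "192.168.1.100:5555"]):
    print(f"ANDROID_SERIAL={plan['env']['ANDROID_SERIAL']}", shlex.join(plan["argv"]))
```

Each printed line can be pasted into its own terminal, or the `env` and `argv` values can be handed to `subprocess.Popen`.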
+
+---
+
+## 📚 Next Steps
+
+You've successfully set up a Mobile Agent! Explore these topics to deepen your understanding:
+
+### Immediate Next Steps
+
+| Priority | Topic | Time | Link |
+|----------|-------|------|------|
+| 🥇 | **Mobile Agent Architecture** | 10 min | [Overview](../mobile/overview.md) |
+| 🥈 | **State Machine & Processing** | 15 min | [State Machine](../mobile/state.md) |
+| 🥉 | **MCP Commands Reference** | 15 min | [Commands](../mobile/commands.md) |
+
+### Advanced Topics
+
+| Topic | Description | Link |
+|-------|-------------|------|
+| **Processing Strategy** | 4-phase pipeline (Data, LLM, Action, Memory) | [Strategy](../mobile/strategy.md) |
+| **Galaxy Integration** | Multi-device orchestration with UFO³ | [As Galaxy Device](../mobile/as_galaxy_device.md) |
+| **MCP Protocol Details** | Deep dive into mobile interaction protocol | [Commands](../mobile/commands.md) |
+
+### Production Deployment
+
+| Best Practice | Description |
+|---------------|-------------|
+| **Persistent ADB** | Keep ADB connection stable for physical devices |
+| **Emulator Management** | Automate emulator lifecycle (start/stop/reset) |
+| **Screenshot Storage** | Configure log paths and cleanup policies in `config/ufo/system.yaml` |
+| **Security** | Use secure WebSocket (wss://) for remote deployments |
+
+> **💡 Learn More:** For comprehensive understanding of the Mobile Agent architecture and processing flow, see the [Mobile Agent Overview](../mobile/overview.md).
+
+---
+
+## ✅ Summary
+
+Congratulations! You've successfully:
+
+✅ Set up Android device (physical or emulator)
+✅ Installed ADB (Android Debug Bridge)
+✅ Installed Python dependencies
+✅ Started the Device Agent Server
+✅ Launched MCP services (data collection + action)
+✅ Connected Mobile Device Agent Client
+✅ Dispatched mobile automation tasks via HTTP API
+✅ (Optional) Configured device in Galaxy
+
+**Your Mobile Agent is Ready**
+
+You can now:
+
+- 📱 Automate Android apps remotely
+- 🖼️ Capture and analyze screenshots
+- 🎯 Interact with UI controls precisely
+- 🌌 Integrate with UFO³ Galaxy for cross-platform workflows
+
+**Start exploring mobile automation!** 🚀
+
+---
+
+## 💡 Pro Tips
+
+### Quick Start Command Summary
+
+**Start everything in order:**
+
+```bash
+# Terminal 1: Start server
+python -m ufo.server.app --port 5001 --platform mobile
+
+# Terminal 2: Start MCP services
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both
+
+# Terminal 3: Start client
+python -m ufo.client.client --ws --ws-server ws://localhost:5001/ws --client-id mobile_phone_1 --platform mobile
+
+# Terminal 4: Dispatch task
+curl -X POST http://localhost:5001/api/dispatch \
+ -H "Content-Type: application/json" \
+ -d '{"client_id": "mobile_phone_1", "request": "Open Chrome browser"}'
+```
+
+### Development Shortcuts
+
+**Create shell scripts for common operations:**
+
+**Windows (PowerShell):**
+```powershell
+# start-mobile-agent.ps1
+Start-Process powershell -ArgumentList "-NoExit", "-Command", "python -m ufo.server.app --port 5001 --platform mobile"
+Start-Sleep 2
+Start-Process powershell -ArgumentList "-NoExit", "-Command", "python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both"
+Start-Sleep 2
+Start-Process powershell -ArgumentList "-NoExit", "-Command", "python -m ufo.client.client --ws --ws-server ws://localhost:5001/ws --client-id mobile_phone_1 --platform mobile"
+```
+
+**macOS/Linux (Bash):**
+```bash
+#!/bin/bash
+# start-mobile-agent.sh
+
+# Start server in background
+python -m ufo.server.app --port 5001 --platform mobile &
+sleep 2
+
+# Start MCP services in background
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both &
+sleep 2
+
+# Start client in foreground
+python -m ufo.client.client --ws --ws-server ws://localhost:5001/ws --client-id mobile_phone_1 --platform mobile
+```
+
+Make the script executable and run it:
+```bash
+chmod +x start-mobile-agent.sh
+./start-mobile-agent.sh
+```
+
+### Testing Your Setup
+
+**Quick test to verify everything works:**
+
+```bash
+# Test 1: Check ADB
+adb devices
+# Should show your device
+
+# Test 2: Check Server
+curl http://localhost:5001/api/health
+# Should return JSON containing "status": "healthy"
+
+# Test 3: Check MCP
+curl http://localhost:8020/health
+curl http://localhost:8021/health
+# Should return health status
+
+# Test 4: Check Client
+curl http://localhost:5001/api/clients
+# Should show mobile_phone_1
+
+# Test 5: Dispatch simple task
+curl -X POST http://localhost:5001/api/dispatch \
+ -H "Content-Type: application/json" \
+ -d '{"client_id": "mobile_phone_1", "request": "Take a screenshot"}'
+# Should return dispatched status
+```
+
+**Happy Mobile Automation! 🎉**
diff --git a/documents/docs/index.md b/documents/docs/index.md
index 08ecba4f5..5f220c3a2 100644
--- a/documents/docs/index.md
+++ b/documents/docs/index.md
@@ -291,6 +291,8 @@ Start here if you're new to UFO³:
|-------|-------------|-----------|
| [Galaxy Quick Start](getting_started/quick_start_galaxy.md) | Set up multi-device orchestration in 10 minutes | 🌌 Galaxy |
| [UFO² Quick Start](getting_started/quick_start_ufo2.md) | Start automating Windows in 5 minutes | 🪟 UFO² |
+| [Linux Agent Quick Start](getting_started/quick_start_linux.md) | Automate Linux systems | 🐧 Linux |
+| [Mobile Agent Quick Start](getting_started/quick_start_mobile.md) | Automate Android devices via ADB | 📱 Mobile |
| [Choosing Your Path](choose_path.md) | Decision guide for selecting the right framework | Both |
### 🏗️ Core Architecture
diff --git a/documents/docs/mcp/local_servers.md b/documents/docs/mcp/local_servers.md
index f38c58d9f..5ef27020f 100644
--- a/documents/docs/mcp/local_servers.md
+++ b/documents/docs/mcp/local_servers.md
@@ -2,6 +2,8 @@
Local MCP servers run in-process with the UFO² agent, providing fast and efficient access to tools without network overhead. They are the most common server type for built-in functionality.
+**For remote HTTP servers** (BashExecutor, HardwareExecutor, MobileExecutor), see [Remote Servers](./remote_servers.md).
+
## Overview
UFO² includes several built-in local MCP servers organized by functionality. This page provides a quick reference - click each server name for complete documentation.
diff --git a/documents/docs/mcp/overview.md b/documents/docs/mcp/overview.md
index 34ed61499..a84d5a471 100644
--- a/documents/docs/mcp/overview.md
+++ b/documents/docs/mcp/overview.md
@@ -272,6 +272,7 @@ UFO² comes with several **built-in MCP servers** that cover common automation s
|-----------|---------|-----------|----------|
| **UICollector** | UI element detection | `get_control_info`, `take_screenshot`, `get_window_list` | Windows |
| **HardwareCollector** | Hardware information | `get_cpu_info`, `get_memory_info` | Cross-platform |
+| **MobileDataCollector** | Android device observation | `capture_screenshot`, `get_ui_tree`, `get_device_info`, `get_mobile_app_target_info` | Android (ADB) |
### Action Servers
@@ -285,6 +286,7 @@ UFO² comes with several **built-in MCP servers** that cover common automation s
| **PowerPointCOMExecutor** | PowerPoint automation | `insert_slide`, `add_text`, `format_shape` | Windows |
| **ConstellationEditor** | Multi-device coordination | `create_task`, `assign_device` | Cross-platform |
| **BashExecutor** | Linux commands | `execute_bash` | Linux |
+| **MobileExecutor** | Android device control | `tap`, `swipe`, `type_text`, `launch_app`, `click_control` | Android (ADB) |
!!!example "Tool Examples"
```python
@@ -349,6 +351,22 @@ HardwareAgent:
host: "localhost"
port: 8006
path: "/mcp"
+
+# MobileAgent: Android device automation
+MobileAgent:
+  default:
+    data_collection:
+      - namespace: MobileDataCollector
+        type: http # Remote server
+        host: "localhost"
+        port: 8020
+        path: "/mcp"
+    action:
+      - namespace: MobileExecutor
+        type: http
+        host: "localhost"
+        port: 8021
+        path: "/mcp"
```
**Configuration Hierarchy:**
@@ -497,6 +515,26 @@ HardwareAgent:
port: 8006
```
+### 5. Android Device Automation
+
+```yaml
+# Agent that automates Android devices via ADB
+MobileAgent:
+  default:
+    data_collection:
+      - namespace: MobileDataCollector
+        type: http
+        host: "localhost" # Or remote Android automation server
+        port: 8020
+        path: "/mcp"
+    action:
+      - namespace: MobileExecutor
+        type: http
+        host: "localhost"
+        port: 8021
+        path: "/mcp"
+```
+
## Getting Started
To start using MCP in UFO²:
diff --git a/documents/docs/mcp/remote_servers.md b/documents/docs/mcp/remote_servers.md
index ccb7d0c8a..f58b288a7 100644
--- a/documents/docs/mcp/remote_servers.md
+++ b/documents/docs/mcp/remote_servers.md
@@ -65,6 +65,23 @@ Stdio MCP servers run as child processes, communicating via stdin/stdout.
---
+### MobileExecutor
+
+**Type**: Action + Data Collection (HTTP deployment, dual-server)
+**Purpose**: Android device automation via ADB
+**Deployment**: HTTP servers on machine with ADB access
+**Agent**: MobileAgent
+**Ports**: 8020 (data collection), 8021 (action)
+**Tools**: 13+ tools for Android automation
+
+**Architecture**: Runs as **two HTTP servers** that share a singleton state manager for coordinated operations:
+- **Mobile Data Collection Server** (port 8020): Screenshots, UI tree, device info, app list, controls
+- **Mobile Action Server** (port 8021): Tap, swipe, type, launch apps, press keys, control clicks
+
+**[→ See complete MobileExecutor documentation](servers/mobile_executor.md)** for all Android automation tools, dual-server architecture, deployment instructions, and usage examples.
+
+---
+
## Configuration Reference
### HTTP Server Fields
@@ -134,6 +151,33 @@ python -m ufo.client.mcp.http_servers.linux_mcp_server --host 0.0.0.0 --port 801
See the [BashExecutor documentation](servers/bash_executor.md) for systemd service setup.
+### HTTP: Android Device Automation
+
+```yaml
+MobileAgent:
+  default:
+    data_collection:
+      - namespace: MobileDataCollector
+        type: http
+        host: "192.168.1.60" # Android automation server
+        port: 8020
+        path: "/mcp"
+    action:
+      - namespace: MobileExecutor
+        type: http
+        host: "192.168.1.60"
+        port: 8021
+        path: "/mcp"
+```
+
+**Server Start:**
+```bash
+# Start both servers (recommended - they share state)
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both --host 0.0.0.0 --data-port 8020 --action-port 8021
+```
+
+See the [MobileExecutor documentation](servers/mobile_executor.md) for complete deployment instructions and ADB setup.
+
### Stdio: Custom Python Server
```yaml
diff --git a/documents/docs/mcp/servers/mobile_executor.md b/documents/docs/mcp/servers/mobile_executor.md
new file mode 100644
index 000000000..d02d7f6fc
--- /dev/null
+++ b/documents/docs/mcp/servers/mobile_executor.md
@@ -0,0 +1,1418 @@
+# MobileExecutor Server
+
+## Overview
+
+**MobileExecutor** provides Android mobile device automation via ADB (Android Debug Bridge). It runs as **two separate HTTP servers** that share state for coordinated operations:
+
+- **Mobile Data Collection Server** (port 8020): Screenshots, UI tree, device info, app list, controls
+- **Mobile Action Server** (port 8021): Tap, swipe, type, launch apps, press keys
+
+**Server Type:** Action + Data Collection
+**Deployment:** HTTP (remote server, runs on machine with ADB)
+**Default Ports:** 8020 (data), 8021 (action)
+**LLM-Selectable:** ✅ Yes (action tools only)
+**Platform:** Android devices via ADB
+
+## Server Information
+
+| Property | Value |
+|----------|-------|
+| **Namespace** | `MobileDataCollector` (data), `MobileExecutor` (action) |
+| **Server Names** | `Mobile Data Collection MCP Server`, `Mobile Action MCP Server` |
+| **Platform** | Android (via ADB) |
+| **Tool Types** | `data_collection`, `action` |
+| **Deployment** | HTTP server (stateless with shared cache) |
+| **Architecture** | Dual-server with singleton state manager |
+
+## Architecture
+
+### Dual-Server Design
+
+The mobile MCP server uses a **dual-server architecture** similar to `linux_mcp_server.py`:
+
+```mermaid
+graph TB
+ Agent["Windows UFO² Agent"]
+
+ subgraph Process["Mobile MCP Servers (Same Process)"]
+ State["MobileServerState (Singleton Cache) • Apps cache • Controls cache • UI tree cache • Device info cache"]
+
+ DataServer["Data Collection Server Port 8020 • Screenshots • UI tree • Device info • App list • Controls"]
+
+ ActionServer["Action Server Port 8021 • Tap/Swipe • Type text • Launch app • Click control"]
+
+ State -.->|Shared Cache| DataServer
+ State -.->|Shared Cache| ActionServer
+ end
+
+ Device["Android Device (via ADB)"]
+
+ Agent -->|HTTP| DataServer
+ Agent -->|HTTP| ActionServer
+ DataServer -->|ADB Commands| Device
+ ActionServer -->|ADB Commands| Device
+
+ style Agent fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
+ style Process fill:#fafafa,stroke:#424242,stroke-width:2px
+ style State fill:#fff3e0,stroke:#f57c00,stroke-width:2px
+ style DataServer fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
+ style ActionServer fill:#fce4ec,stroke:#c2185b,stroke-width:2px
+ style Device fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
+```
+
+**Shared State Benefits:**
+
+- **Cache Coordination**: Action server can access controls cached by data server
+- **Performance**: Avoid duplicate ADB queries (UI tree, app list, etc.)
+- **State Consistency**: Both servers see same device state
+- **Resource Efficiency**: Single process, shared memory
+
+### State Management
+
+**MobileServerState** is a singleton that caches:
+
+| Cache | Duration | Purpose |
+|-------|----------|---------|
+| **Installed Apps** | 5 minutes | Package list for `get_mobile_app_target_info` |
+| **UI Controls** | 5 seconds | Control list for `get_app_window_controls_target_info` |
+| **UI Tree XML** | 5 seconds | Raw XML for `get_ui_tree` |
+| **Device Info** | 1 minute | Hardware specs for `get_device_info` |
+
+**Cache Invalidation:**
+
+- Automatically invalidated after interactions (tap, swipe, type)
+- Manually invalidated via `invalidate_cache` tool
+- Expired caches refreshed on next query
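
The caching rules above can be pictured with a small time-to-live cache. This is only a sketch of the idea, not the actual `MobileServerState` code: the class `TTLCache`, its method names, and the injected clock are assumptions, while the durations come from the table above.

```python
import time


class TTLCache:
    """Named entries that expire after a per-name time-to-live."""

    def __init__(self, ttls, clock=time.monotonic):
        self._ttls = dict(ttls)   # name -> lifetime in seconds
        self._clock = clock       # injectable so expiry can be tested
        self._entries = {}        # name -> (stored_at, value)

    def put(self, name, value):
        self._entries[name] = (self._clock(), value)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttls[name]:
            del self._entries[name]   # expired: caller must refresh
            return None
        return value

    def invalidate(self, *names):
        """Drop the given entries, or every entry when called without names."""
        for name in names or list(self._entries):
            self._entries.pop(name, None)


# Durations from the table above, in seconds.
TTLS = {"apps": 300, "controls": 5, "ui_tree": 5, "device_info": 60}

now = [0.0]                                  # fake clock for the demo
cache = TTLCache(TTLS, clock=lambda: now[0])
cache.put("controls", ["OK button"])
cache.put("apps", ["com.android.settings"])
now[0] = 6.0                                 # six seconds later
print(cache.get("controls"), cache.get("apps"))
```

In this sketch an action such as `tap` would call `cache.invalidate("controls", "ui_tree")`, which matches the automatic invalidation described above.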
+
+## Data Collection Tools
+
+Data collection tools are invoked automatically by the framework and are not selectable by the LLM.
+
+### capture_screenshot
+
+Capture screenshot from Android device.
+
+#### Parameters
+
+None
+
+#### Returns
+
+**Type**: `str`
+
+A data URI containing the base64-encoded PNG image, returned directly as a string (format: `data:image/png;base64,...`)
+
+#### Example
+
+```python
+result = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::capture_screenshot",
+ tool_name="capture_screenshot",
+ parameters={}
+ )
+])
+
+# result[0].data = "data:image/png;base64,iVBORw0KGgo..."
+```
+
+#### Implementation Details
+
+1. Captures screenshot on device (`screencap -p /sdcard/screen_temp.png`)
+2. Pulls image from device via ADB (`adb pull`)
+3. Encodes as base64
+4. Cleans up temporary files
+5. Returns data URI directly (matches `ui_mcp_server` format)
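
Steps 3 and 5 amount to a base64 round trip. A minimal sketch of the encoding and of how a caller might decode the result back to PNG bytes; the helper names are illustrative and the eight-byte PNG signature stands in for a real screenshot:

```python
import base64

PREFIX = "data:image/png;base64,"


def to_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URI."""
    return PREFIX + base64.b64encode(png_bytes).decode("ascii")


def from_data_uri(uri: str) -> bytes:
    """Decode a PNG data URI back to raw bytes."""
    if not uri.startswith(PREFIX):
        raise ValueError("not a PNG data URI")
    return base64.b64decode(uri[len(PREFIX):])


# Stand-in payload: the fixed 8-byte signature every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(PNG_SIGNATURE)
print(uri)
```

The decoded bytes can be written to a `.png` file or passed to an image library for inspection.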
+
+---
+
+### get_ui_tree
+
+Get the UI hierarchy tree in XML format.
+
+#### Parameters
+
+None
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "ui_tree": str, # XML content
+ "format": "xml",
+ # OR
+ "error": str # Error message if failed
+}
+```
+
+#### Example
+
+```python
+result = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_ui_tree",
+ tool_name="get_ui_tree",
+ parameters={}
+ )
+])
+
+# Parse XML to find elements
+import xml.etree.ElementTree as ET
+tree = ET.fromstring(result[0].data["ui_tree"])
+```
+
+#### Cache Behavior
+
+- Cached for 5 seconds
+- Automatically invalidated after interactions
+- Shared with `get_app_window_controls_target_info`
+
+---
+
+### get_device_info
+
+Get comprehensive Android device information.
+
+#### Parameters
+
+None
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "device_info": {
+ "model": str, # Device model
+ "android_version": str, # Android version (e.g., "13")
+ "sdk_version": str, # SDK version (e.g., "33")
+ "screen_size": str, # Screen resolution (e.g., "Physical size: 1080x2400")
+ "screen_density": str, # Screen density (e.g., "Physical density: 440")
+ "battery_level": str, # Battery percentage
+ "battery_status": str # Charging status
+ },
+ "from_cache": bool, # True if returned from cache
+ # OR
+ "error": str # Error message if failed
+}
+```
+
+#### Example
+
+```python
+result = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_device_info",
+ tool_name="get_device_info",
+ parameters={}
+ )
+])
+
+device = result[0].data["device_info"]
+print(f"Device: {device['model']}")
+print(f"Android: {device['android_version']}")
+print(f"Battery: {device['battery_level']}%")
+```
+
+#### Cache Behavior
+
+- Cached for 1 minute
+- Returns `from_cache: true` when using cached data
+
+---
+
+### get_mobile_app_target_info
+
+Get information about installed application packages as `TargetInfo` list.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `filter` | `str` | No | `""` | Filter pattern for package names (e.g., `"com.android"`) |
+| `include_system_apps` | `bool` | No | `False` | Whether to include system apps (default: only user apps) |
+| `force_refresh` | `bool` | No | `False` | Force refresh from device, ignoring cache |
+
+#### Returns
+
+**Type**: `List[TargetInfo]`
+
+```python
+[
+ TargetInfo(
+ kind=TargetKind.THIRD_PARTY_AGENT,
+ id="1", # Sequential ID
+ name="com.example.app", # Package name (displayed)
+ type="com.example.app" # Package name (stored)
+ ),
+ ...
+]
+```
+
+#### Example
+
+```python
+# Get all user-installed apps
+result = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_mobile_app_target_info",
+ tool_name="get_mobile_app_target_info",
+ parameters={"include_system_apps": False}
+ )
+])
+
+apps = result[0].data
+for app in apps:
+    print(f"ID: {app.id}, Package: {app.name}")
+
+# Filter by package name
+result = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_mobile_app_target_info",
+ tool_name="get_mobile_app_target_info",
+ parameters={"filter": "com.android", "include_system_apps": True}
+ )
+])
+```
+
+#### Cache Behavior
+
+- Cached for 5 minutes (only when no filter and `include_system_apps=False`)
+- Use `force_refresh=True` to bypass cache
+
+---
+
+### get_app_window_controls_target_info
+
+Get UI controls information as `TargetInfo` list.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `force_refresh` | `bool` | No | `False` | Force refresh from device, ignoring cache |
+
+#### Returns
+
+**Type**: `List[TargetInfo]`
+
+```python
+[
+ TargetInfo(
+ kind=TargetKind.CONTROL,
+ id="1", # Sequential ID
+ name="Button Name", # Control text or content-desc
+ type="Button", # Control class (short name)
+ rect=[x1, y1, x2, y2] # Bounding box [left, top, right, bottom]
+ ),
+ ...
+]
+```
+
+#### Example
+
+```python
+result = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ tool_name="get_app_window_controls_target_info",
+ parameters={}
+ )
+])
+
+controls = result[0].data
+for ctrl in controls:
+    print(f"ID: {ctrl.id}, Name: {ctrl.name}, Type: {ctrl.type}")
+    print(f"  Rect: {ctrl.rect}")
+```
+
+#### Control Selection Criteria
+
+Only **meaningful controls** are included:
+
+- Clickable controls (`clickable="true"`)
+- Long-clickable controls (`long-clickable="true"`)
+- Checkable controls (`checkable="true"`)
+- Scrollable controls (`scrollable="true"`)
+- Controls with text or content-desc
+- EditText and Button controls
+
+**Rect format**: `[left, top, right, bottom]` in pixels (matches `ui_mcp_server.py` bbox format)
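
The selection criteria can be expressed as a filter over the `uiautomator` XML. The sketch below is an approximation under stated assumptions: the server's real filter may differ, plain dicts stand in for `TargetInfo`, matching `EditText` and `Button` by short class name is a guess, and the sample XML is hand-written.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")
FLAGS = ("clickable", "long-clickable", "checkable", "scrollable")


def short_class(node):
    """android.widget.Button -> Button"""
    return node.get("class", "").rsplit(".", 1)[-1]


def is_meaningful(node):
    if any(node.get(flag) == "true" for flag in FLAGS):
        return True
    if node.get("text") or node.get("content-desc"):
        return True
    return short_class(node) in ("EditText", "Button")


def extract_controls(xml_text):
    controls = []
    for node in ET.fromstring(xml_text).iter("node"):
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None or not is_meaningful(node):
            continue
        controls.append({
            "id": str(len(controls) + 1),                 # sequential ID
            "name": node.get("text") or node.get("content-desc") or "",
            "type": short_class(node),
            "rect": [int(g) for g in match.groups()],     # [left, top, right, bottom]
        })
    return controls


sample = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc="" clickable="false" bounds="[0,0][1080,2400]">
    <node class="android.widget.Button" text="OK" content-desc="" clickable="true" bounds="[400,1100][680,1220]"/>
  </node>
</hierarchy>"""
print(extract_controls(sample))
```

The outer `FrameLayout` is dropped because it is neither interactive nor labeled, so only the `OK` button is reported.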
+
+#### Cache Behavior
+
+- Cached for 5 seconds
+- Automatically invalidated after interactions (tap, swipe, type)
+- Shared with action server for `click_control` and `type_text`
+
+---
+
+## Action Tools
+
+Action tools are LLM-selectable, state-modifying operations.
+
+### tap
+
+Tap/click at specified coordinates on the screen.
+
+#### Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `x` | `int` | ✅ Yes | X coordinate in pixels (from left) |
+| `y` | `int` | ✅ Yes | Y coordinate in pixels (from top) |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "action": str, # "tap(x, y)"
+ "output": str, # Command output
+ "error": str # Error message if failed
+}
+```
+
+#### Example
+
+```python
+# Tap at specific coordinates
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::tap",
+ tool_name="tap",
+ parameters={"x": 500, "y": 1200}
+ )
+])
+```
+
+#### Side Effects
+
+- Invalidates controls cache (UI likely changed)
+
+---
+
+### swipe
+
+Perform swipe gesture from start to end coordinates.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `start_x` | `int` | ✅ Yes | - | Starting X coordinate |
+| `start_y` | `int` | ✅ Yes | - | Starting Y coordinate |
+| `end_x` | `int` | ✅ Yes | - | Ending X coordinate |
+| `end_y` | `int` | ✅ Yes | - | Ending Y coordinate |
+| `duration` | `int` | No | `300` | Duration in milliseconds |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "action": str, # "swipe(x1,y1)->(x2,y2) in Nms"
+ "output": str,
+ "error": str
+}
+```
+
+#### Example
+
+```python
+# Swipe up (scroll down content)
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::swipe",
+ tool_name="swipe",
+ parameters={
+ "start_x": 500,
+ "start_y": 1500,
+ "end_x": 500,
+ "end_y": 500,
+ "duration": 300
+ }
+ )
+])
+
+# Swipe left (next page)
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::swipe",
+ tool_name="swipe",
+ parameters={
+ "start_x": 800,
+ "start_y": 1000,
+ "end_x": 200,
+ "end_y": 1000,
+ "duration": 200
+ }
+ )
+])
+```
+
+#### Side Effects
+
+- Invalidates controls cache (UI changed)
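
Hard-coded coordinates such as those in the example only suit one screen size. One option is to derive them from the `screen_size` string that `get_device_info` returns. A sketch, assuming that string contains `WIDTHxHEIGHT`; the helper names and the three-quarters to one-quarter travel are arbitrary choices:

```python
import re


def parse_screen_size(raw: str) -> tuple[int, int]:
    """Extract (width, height) from text such as "Physical size: 1080x2400"."""
    match = re.search(r"(\d+)\s*x\s*(\d+)", raw)
    if match is None:
        raise ValueError(f"no WIDTHxHEIGHT found in {raw!r}")
    return int(match.group(1)), int(match.group(2))


def vertical_swipe(raw_size: str, duration: int = 300) -> dict:
    """Build `swipe` parameters for a swipe up along the screen's center line."""
    width, height = parse_screen_size(raw_size)
    return {
        "start_x": width // 2, "start_y": height * 3 // 4,
        "end_x": width // 2, "end_y": height // 4,
        "duration": duration,
    }


print(vertical_swipe("Physical size: 1080x2400"))
```

The returned dict has the same keys as the `parameters` argument in the swipe example above.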
+
+---
+
+### type_text
+
+Type text into a specific input field control.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `text` | `str` | ✅ Yes | - | Text to input (spaces/special chars auto-escaped) |
+| `control_id` | `str` | ✅ Yes | - | Precise annotated ID from `get_app_window_controls_target_info` |
+| `control_name` | `str` | ✅ Yes | - | Precise name of control (must match `control_id`) |
+| `clear_current_text` | `bool` | No | `False` | Clear existing text before typing |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "action": str, # Full action description
+ "message": str, # Step-by-step messages
+ "control_info": {
+ "id": str,
+ "name": str,
+ "type": str
+ },
+ # OR
+ "error": str # Error message
+}
+```
+
+#### Example
+
+```python
+# 1. Get controls first
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ tool_name="get_app_window_controls_target_info",
+ parameters={}
+ )
+])
+
+# 2. Find search input field
+search_field = next(c for c in controls[0].data if "Search" in c.name)
+
+# 3. Type text
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::type_text",
+ tool_name="type_text",
+ parameters={
+ "text": "hello world",
+ "control_id": search_field.id,
+ "control_name": search_field.name,
+ "clear_current_text": True
+ }
+ )
+])
+```
+
+#### Workflow
+
+1. Verifies control exists in cache (requires prior `get_app_window_controls_target_info` call)
+2. Clicks control to focus it
+3. Optionally clears existing text (deletes up to 50 characters)
+4. Types text (spaces replaced with `%s`, `&` escaped)
+5. Invalidates controls cache
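
Step 4 exists because `adb shell input text` does not accept literal spaces and the device shell treats `&` specially. A sketch of the escaping, assuming a backslash is used for `&` (the workflow above only says it is escaped); `escape_for_input_text` and `input_text_argv` are hypothetical helpers:

```python
def escape_for_input_text(text: str) -> str:
    """Prepare text for `adb shell input text`."""
    # Escape "&" first, then encode spaces as %s, which `input text` types as a space.
    return text.replace("&", "\\&").replace(" ", "%s")


def input_text_argv(text: str) -> list[str]:
    """Build the ADB command line as an argument list (no local shell involved)."""
    return ["adb", "shell", "input", "text", escape_for_input_text(text)]


print(input_text_argv("hello world"))
```

Other shell metacharacters (quotes, `;`, `|`) would need similar treatment; this sketch covers only the two cases named in the workflow.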
+
+#### Side Effects
+
+- Clicks the control (may trigger navigation)
+- Modifies input field content
+- Invalidates controls cache
+
+---
+
+### launch_app
+
+Launch an application by package name or app ID.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `package_name` | `str` | ✅ Yes | - | Package name (e.g., `"com.android.settings"`) or app name |
+| `id` | `str` | No | `None` | Optional: Precise annotated ID from `get_mobile_app_target_info` |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "message": str,
+ "package_name": str, # Actual package launched
+ "output": str, # ADB monkey output
+ "error": str,
+ "warning": str, # Optional: name resolution warning
+ "app_info": { # Optional: if id provided
+ "id": str,
+ "name": str,
+ "package": str
+ }
+}
+```
+
+#### Example
+
+```python
+# Launch by package name
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::launch_app",
+ tool_name="launch_app",
+ parameters={"package_name": "com.android.settings"}
+ )
+])
+
+# Launch by app ID (from cache)
+apps = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_mobile_app_target_info",
+ tool_name="get_mobile_app_target_info",
+ parameters={}
+ )
+])
+
+settings_app = next(a for a in apps[0].data if "settings" in a.name.lower())
+
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::launch_app",
+ tool_name="launch_app",
+ parameters={
+ "package_name": settings_app.type, # Package from cache
+ "id": settings_app.id
+ }
+ )
+])
+
+# Launch by app name (auto-resolves package)
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::launch_app",
+ tool_name="launch_app",
+ parameters={"package_name": "Settings"} # Resolves to com.android.settings
+ )
+])
+```
+
+#### Name Resolution
+
+If `package_name` contains no `.` (so it is not in package format), the server:
+
+1. Searches installed packages for matching display name
+2. Returns resolved package with warning
+3. Fails if no match found
+
+#### Implementation
+
+Uses `adb shell monkey -p <package_name> -c android.intent.category.LAUNCHER 1`
+
+---
+
+### press_key
+
+Press a hardware or software key.
+
+#### Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `key_code` | `str` | ✅ Yes | Key code (e.g., `"KEYCODE_HOME"`, `"KEYCODE_BACK"`) |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "action": str, # "press_key(KEYCODE_X)"
+ "output": str,
+ "error": str
+}
+```
+
+#### Example
+
+```python
+# Press back button
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::press_key",
+ tool_name="press_key",
+ parameters={"key_code": "KEYCODE_BACK"}
+ )
+])
+
+# Press home button
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::press_key",
+ tool_name="press_key",
+ parameters={"key_code": "KEYCODE_HOME"}
+ )
+])
+
+# Press enter
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::press_key",
+ tool_name="press_key",
+ parameters={"key_code": "KEYCODE_ENTER"}
+ )
+])
+```
+
+#### Common Key Codes
+
+| Key Code | Description |
+|----------|-------------|
+| `KEYCODE_HOME` | Home button |
+| `KEYCODE_BACK` | Back button |
+| `KEYCODE_ENTER` | Enter/Return |
+| `KEYCODE_MENU` | Menu button |
+| `KEYCODE_POWER` | Power button |
+| `KEYCODE_VOLUME_UP` | Volume up |
+| `KEYCODE_VOLUME_DOWN` | Volume down |
+
+Full list: [Android KeyEvent](https://developer.android.com/reference/android/view/KeyEvent)
+
+---
+
+### click_control
+
+Click a UI control by its ID and name.
+
+#### Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `control_id` | `str` | ✅ Yes | Precise annotated ID from `get_app_window_controls_target_info` |
+| `control_name` | `str` | ✅ Yes | Precise name of control (must match `control_id`) |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "action": str, # Full action description
+ "message": str, # Success message with coordinates
+ "control_info": {
+ "id": str,
+ "name": str,
+ "type": str,
+ "rect": [int, int, int, int]
+ },
+ "warning": str, # Optional: name mismatch warning
+ # OR
+ "error": str # Error message
+}
+```
+
+#### Example
+
+```python
+# 1. Get controls
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ tool_name="get_app_window_controls_target_info",
+ parameters={}
+ )
+])
+
+# 2. Find OK button
+ok_button = next(c for c in controls[0].data if c.name == "OK")
+
+# 3. Click it
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::click_control",
+ tool_name="click_control",
+ parameters={
+ "control_id": ok_button.id,
+ "control_name": ok_button.name
+ }
+ )
+])
+```
+
+#### Workflow
+
+1. Retrieves control from cache by `control_id`
+2. Verifies name matches (warns if different)
+3. Calculates center position from bounding box
+4. Taps at center coordinates
+5. Invalidates controls cache
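+
+Step 3 of the workflow can be illustrated with a minimal sketch. It assumes `rect` is ordered `[left, top, right, bottom]` in pixels, which this page does not state; `rect_center` is a hypothetical helper, not part of the server API.
+
+```python
+from typing import Sequence, Tuple
+
+
+def rect_center(rect: Sequence[int]) -> Tuple[int, int]:
+    """Return the integer midpoint of a bounding box.
+
+    Assumes rect = [left, top, right, bottom]; adjust if the server
+    reports [x, y, width, height] instead.
+    """
+    left, top, right, bottom = rect
+    if right < left or bottom < top:
+        raise ValueError(f"Malformed rect: {list(rect)}")
+    return (left + right) // 2, (top + bottom) // 2
+
+
+# Hypothetical control_info["rect"] for a button near the bottom of the screen
+print(rect_center([100, 2000, 980, 2160]))  # (540, 2080)
+```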
+
+#### Side Effects
+
+- Taps the control (may trigger navigation)
+- Invalidates controls cache
+
+---
+
+### wait
+
+Wait for a specified number of seconds.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `seconds` | `float` | No | `1.0` | Number of seconds to wait (0-60 range) |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "action": str, # "wait(Ns)"
+ "message": str, # "Waited for N seconds"
+ # OR
+ "error": str # Error if invalid seconds
+}
+```
+
+#### Example
+
+```python
+# Wait 1 second
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::wait",
+ tool_name="wait",
+ parameters={"seconds": 1.0}
+ )
+])
+
+# Wait 500ms
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::wait",
+ tool_name="wait",
+ parameters={"seconds": 0.5}
+ )
+])
+
+# Wait 2.5 seconds
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::wait",
+ tool_name="wait",
+ parameters={"seconds": 2.5}
+ )
+])
+```
+
+#### Constraints
+
+- Minimum: 0 seconds
+- Maximum: 60 seconds
+- Use for UI transitions, animations, app loading
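+
+A caller can check the 0-60 second range before sending the call, which avoids a round trip that would only return an error. The sketch below is a hypothetical client-side helper (`validate_wait_seconds` is not part of the API). Whether the server rejects or clamps out-of-range values is not stated here, so the helper simply rejects them.
+
+```python
+MIN_WAIT_SECONDS = 0.0
+MAX_WAIT_SECONDS = 60.0
+
+
+def validate_wait_seconds(seconds: float = 1.0) -> float:
+    """Return seconds as a float if it lies in the documented 0-60 range."""
+    value = float(seconds)
+    if not (MIN_WAIT_SECONDS <= value <= MAX_WAIT_SECONDS):
+        raise ValueError(
+            f"seconds must be between {MIN_WAIT_SECONDS} and {MAX_WAIT_SECONDS}, got {value}"
+        )
+    return value
+
+
+# Build the parameters dict for an action::wait call
+parameters = {"seconds": validate_wait_seconds(2.5)}
+print(parameters)  # {'seconds': 2.5}
+```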
+
+---
+
+### invalidate_cache
+
+Manually invalidate cached data to force refresh on next query.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `cache_type` | `str` | No | `"all"` | Type of cache: `"controls"`, `"apps"`, `"ui_tree"`, `"device_info"`, `"all"` |
+
+#### Returns
+
+**Type**: `Dict[str, Any]`
+
+```python
+{
+ "success": bool,
+ "message": str, # Confirmation message
+ # OR
+ "error": str # Invalid cache_type
+}
+```
+
+#### Example
+
+```python
+# Invalidate all caches
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::invalidate_cache",
+ tool_name="invalidate_cache",
+ parameters={"cache_type": "all"}
+ )
+])
+
+# Invalidate only controls cache
+result = await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::invalidate_cache",
+ tool_name="invalidate_cache",
+ parameters={"cache_type": "controls"}
+ )
+])
+```
+
+#### Cache Types
+
+| Type | Description |
+|------|-------------|
+| `"controls"` | UI controls list |
+| `"apps"` | Installed apps list |
+| `"ui_tree"` | UI hierarchy XML |
+| `"device_info"` | Device information |
+| `"all"` | All caches |
+
+#### Use Cases
+
+- After manual device interaction (outside automation)
+- After app installation/uninstallation
+- When device state significantly changed
+- Before critical operations requiring fresh data
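+
+The cache types above can be pictured as a small keyed store with per-type expiry. This is an illustrative sketch only: `SimpleTTLCache`, its TTL values, and its method names are assumptions and do not reflect the actual `MobileServerState` implementation.
+
+```python
+import time
+from typing import Any, Dict, Optional, Tuple
+
+CACHE_TYPES = ("controls", "apps", "ui_tree", "device_info")
+
+
+class SimpleTTLCache:
+    """Toy per-type cache with time-based expiry and manual invalidation."""
+
+    def __init__(self, ttl_seconds: Dict[str, float]) -> None:
+        self._ttl = ttl_seconds
+        self._entries: Dict[str, Tuple[float, Any]] = {}
+
+    def put(self, cache_type: str, value: Any) -> None:
+        self._entries[cache_type] = (time.monotonic(), value)
+
+    def get(self, cache_type: str) -> Optional[Any]:
+        entry = self._entries.get(cache_type)
+        if entry is None:
+            return None
+        stored_at, value = entry
+        if time.monotonic() - stored_at > self._ttl[cache_type]:
+            del self._entries[cache_type]  # expired, force a fresh query
+            return None
+        return value
+
+    def invalidate(self, cache_type: str = "all") -> Dict[str, Any]:
+        if cache_type == "all":
+            self._entries.clear()
+        elif cache_type in CACHE_TYPES:
+            self._entries.pop(cache_type, None)
+        else:
+            return {"success": False, "error": f"Invalid cache_type: {cache_type}"}
+        return {"success": True, "message": f"Invalidated cache: {cache_type}"}
+
+
+# TTL values below are placeholders chosen for the example
+cache = SimpleTTLCache({"controls": 5, "apps": 300, "ui_tree": 5, "device_info": 60})
+cache.put("controls", [{"id": "1", "name": "OK"}])
+cache.invalidate("controls")
+print(cache.get("controls"))  # None
+```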
+
+---
+
+## Configuration
+
+### Client Configuration (UFO² Agent)
+
+```yaml
+# Option A: Windows agent controlling an Android device,
+# with the MCP servers on the same machine as the client
+MobileAgent:
+  default:
+    data_collection:
+      - namespace: MobileDataCollector
+        type: http
+        host: "localhost"  # Or remote machine IP
+        port: 8020
+        path: "/mcp"
+    action:
+      - namespace: MobileExecutor
+        type: http
+        host: "localhost"
+        port: 8021
+        path: "/mcp"
+
+# Option B: remote Android device.
+# Use this INSTEAD of Option A; a YAML mapping should not define MobileAgent twice.
+MobileAgent:
+  default:
+    data_collection:
+      - namespace: MobileDataCollector
+        type: http
+        host: "192.168.1.150"  # Android automation server
+        port: 8020
+        path: "/mcp"
+    action:
+      - namespace: MobileExecutor
+        type: http
+        host: "192.168.1.150"
+        port: 8021
+        path: "/mcp"
+```
+
+## Deployment
+
+### Prerequisites
+
+1. **ADB Installation**
+
+```bash
+# Windows (via Android SDK or standalone)
+# Download from: https://developer.android.com/studio/releases/platform-tools
+
+# Linux
+sudo apt-get install android-tools-adb
+
+# macOS (platform-tools is distributed as a Homebrew cask)
+brew install --cask android-platform-tools
+```
+
+2. **Android Device Setup**
+
+- Enable USB debugging in Developer Options
+- Connect device via USB or Wi-Fi
+- Verify connection: `adb devices`
+
+```bash
+# Check connected devices
+adb devices
+
+# Output:
+# List of devices attached
+# R5CR20XXXXX device
+```
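+
+The `adb devices` output can also be checked from a script. The sketch below parses the text format shown above; `parse_adb_devices` is a hypothetical helper. In practice the text would come from something like `subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout`.
+
+```python
+from typing import Dict
+
+
+def parse_adb_devices(output: str) -> Dict[str, str]:
+    """Map each serial to its state (e.g. 'device', 'unauthorized', 'offline')."""
+    devices: Dict[str, str] = {}
+    for line in output.splitlines():
+        line = line.strip()
+        # Skip blanks, the header line, and adb daemon startup messages
+        if not line or line.startswith("List of devices") or line.startswith("*"):
+            continue
+        parts = line.split()
+        if len(parts) >= 2:
+            devices[parts[0]] = parts[1]
+    return devices
+
+
+sample = "List of devices attached\nR5CR20XXXXX\tdevice\nemulator-5554\tunauthorized\n"
+states = parse_adb_devices(sample)
+ready = [serial for serial, state in states.items() if state == "device"]
+print(ready)  # ['R5CR20XXXXX']
+```
+
+A device listed as `unauthorized` usually means the "Allow USB debugging" prompt has not been accepted yet.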
+
+### Starting the Servers
+
+```bash
+# Start both servers (recommended)
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both --host 0.0.0.0 --data-port 8020 --action-port 8021
+
+# Output:
+# ==================================================
+# UFO Mobile MCP Servers (Android)
+# Android device control via ADB and Model Context Protocol
+# ==================================================
+# Using ADB: C:\...\adb.exe
+# Found 1 connected device(s)
+# ✅ Starting both servers in same process (shared MobileServerState)
+# - Data Collection Server: 0.0.0.0:8020
+# - Action Server: 0.0.0.0:8021
+# Both servers share MobileServerState cache. Press Ctrl+C to stop.
+
+# Start only data collection server
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server data --host 0.0.0.0 --data-port 8020
+
+# Start only action server
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server action --host 0.0.0.0 --action-port 8021
+```
+
+### Command-Line Arguments
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--server` | `both` | Which server(s): `data`, `action`, or `both` |
+| `--host` | `localhost` | Host to bind servers to |
+| `--data-port` | `8020` | Port for Data Collection Server |
+| `--action-port` | `8021` | Port for Action Server |
+| `--adb-path` | Auto-detect | Path to ADB executable |
+
+### ADB Path Detection
+
+The server auto-detects ADB from:
+
+1. Common installation paths:
+   - Windows: `C:\Users\{USER}\AppData\Local\Android\Sdk\platform-tools\adb.exe`
+   - Linux: `/usr/bin/adb`, `/usr/local/bin/adb`
+2. System PATH environment variable
+3. Fallback to `adb` command
+
+Override with `--adb-path`:
+
+```bash
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --adb-path "C:\custom\path\adb.exe"
+```
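+
+The lookup order described in this section could be approximated as follows. This is a sketch, not the server's code: `find_adb`, `candidate_adb_paths`, and the exact candidate list are assumptions.
+
+```python
+import os
+import shutil
+from typing import List, Optional
+
+
+def candidate_adb_paths() -> List[str]:
+    """Common install locations named in this section."""
+    home = os.path.expanduser("~")
+    return [
+        os.path.join(home, "AppData", "Local", "Android", "Sdk", "platform-tools", "adb.exe"),
+        "/usr/bin/adb",
+        "/usr/local/bin/adb",
+    ]
+
+
+def find_adb(explicit_path: Optional[str] = None) -> str:
+    """Resolve ADB: explicit override, known paths, PATH lookup, then bare 'adb'."""
+    if explicit_path:
+        return explicit_path
+    for path in candidate_adb_paths():
+        if os.path.isfile(path):
+            return path
+    return shutil.which("adb") or "adb"
+
+
+print(find_adb())                           # auto-detected path, or "adb"
+print(find_adb(r"C:\custom\path\adb.exe"))  # explicit override wins
+```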
+
+### Network Configuration
+
+**Local Development:**
+```bash
+# Servers on same machine as client
+--host localhost
+```
+
+**Remote Access:**
+```bash
+# Servers accessible from network
+--host 0.0.0.0
+```
+
+**Security:** Binding to `0.0.0.0` exposes both servers on every network interface, so use firewall rules to restrict access to trusted IPs.
+
+---
+
+## Best Practices
+
+### 1. Always Run Both Servers Together
+
+```bash
+# ✅ Good: Both servers in same process (shared state)
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both
+
+# ❌ Bad: Separate processes (no shared state)
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server data &
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --server action &
+```
+
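+**Why:** A shared `MobileServerState` lets the action server use the controls cached by the data server. In separate processes, each server keeps its own state, so the action server cannot see controls collected by the data server.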
+
+### 2. Get Controls Before Interaction
+
+```python
+# ✅ Good: Get controls first
+controls = await computer.run_data_collection([
+ MCPToolCall(tool_key="data_collection::get_app_window_controls_target_info", ...)
+])
+
+# Then click/type
+await computer.run_actions([
+ MCPToolCall(tool_key="action::click_control", parameters={"control_id": "5", ...})
+])
+
+# ❌ Bad: Click without getting controls
+await computer.run_actions([
+ MCPToolCall(tool_key="action::click_control", parameters={"control_id": "5", ...})
+])
+# Error: Control not found in cache
+```
+
+### 3. Use Control IDs, Not Coordinates
+
+```python
+# ✅ Good: Use click_control (reliable)
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::click_control",
+ parameters={"control_id": "3", "control_name": "Submit"}
+ )
+])
+
+# ⚠️ OK: Use tap only when control not available
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::tap",
+ parameters={"x": 500, "y": 1200}
+ )
+])
+```
+
+### 4. Handle Cache Expiration
+
+```python
+# Routine reads: reuse cached controls if they are still valid
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ parameters={"force_refresh": False} # Use cache if available
+ )
+])
+
+# For critical operations, force refresh
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ parameters={"force_refresh": True} # Always query device
+ )
+])
+```
+
+### 5. Wait After Actions
+
+```python
+# ✅ Good: Wait for UI to settle
+await computer.run_actions([
+ MCPToolCall(tool_key="action::tap", parameters={"x": 500, "y": 1200})
+])
+await computer.run_actions([
+ MCPToolCall(tool_key="action::wait", parameters={"seconds": 1.0})
+])
+
+# Get updated controls
+controls = await computer.run_data_collection([
+ MCPToolCall(tool_key="data_collection::get_app_window_controls_target_info", ...)
+])
+```
+
+### 6. Validate ADB Connection
+
+```python
+# Check device info before operations
+device_info = await computer.run_data_collection([
+ MCPToolCall(tool_key="data_collection::get_device_info", parameters={})
+])
+
+if device_info[0].is_error:
+    raise RuntimeError("No Android device connected")
+```
+
+---
+
+## Use Cases
+
+### 1. App Automation
+
+```python
+# Launch app
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::launch_app",
+ tool_name="launch_app",
+ parameters={"package_name": "com.example.app"}
+ )
+])
+
+# Wait for app to load
+await computer.run_actions([
+ MCPToolCall(tool_key="action::wait", parameters={"seconds": 2.0})
+])
+
+# Get controls
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ parameters={}
+ )
+])
+
+# Find and click button
+login_btn = next(c for c in controls[0].data if "Login" in c.name)
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::click_control",
+ parameters={
+ "control_id": login_btn.id,
+ "control_name": login_btn.name
+ }
+ )
+])
+```
+
+### 2. Form Filling
+
+```python
+# Get controls
+controls = await computer.run_data_collection([
+    MCPToolCall(
+        tool_key="data_collection::get_app_window_controls_target_info",
+        parameters={}
+    )
+])
+
+# Type username
+username_field = next(c for c in controls[0].data if "username" in c.name.lower())
+await computer.run_actions([
+    MCPToolCall(
+        tool_key="action::type_text",
+        tool_name="type_text",
+        parameters={
+            "text": "john.doe@example.com",
+            "control_id": username_field.id,
+            "control_name": username_field.name,
+            "clear_current_text": True
+        }
+    )
+])
+
+# Get updated controls (after typing)
+await computer.run_actions([
+    MCPToolCall(tool_key="action::wait", parameters={"seconds": 0.5})
+])
+controls = await computer.run_data_collection([
+    MCPToolCall(
+        tool_key="data_collection::get_app_window_controls_target_info",
+        parameters={"force_refresh": True}
+    )
+])
+
+# Type password
+password_field = next(c for c in controls[0].data if "password" in c.name.lower())
+await computer.run_actions([
+    MCPToolCall(
+        tool_key="action::type_text",
+        parameters={
+            "text": "SecureP@ssw0rd",
+            "control_id": password_field.id,
+            "control_name": password_field.name
+        }
+    )
+])
+
+# Refresh controls again before submitting (the previous action may have changed the UI)
+controls = await computer.run_data_collection([
+    MCPToolCall(
+        tool_key="data_collection::get_app_window_controls_target_info",
+        parameters={"force_refresh": True}
+    )
+])
+
+# Submit
+submit_btn = next(c for c in controls[0].data if "Submit" in c.name)
+await computer.run_actions([
+    MCPToolCall(
+        tool_key="action::click_control",
+        parameters={
+            "control_id": submit_btn.id,
+            "control_name": submit_btn.name
+        }
+    )
+])
+```
+
+### 3. Scrolling and Navigation
+
+```python
+# Swipe up to scroll down content
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::swipe",
+ tool_name="swipe",
+ parameters={
+ "start_x": 500,
+ "start_y": 1500,
+ "end_x": 500,
+ "end_y": 500,
+ "duration": 300
+ }
+ )
+])
+
+# Wait for scrolling to complete
+await computer.run_actions([
+ MCPToolCall(tool_key="action::wait", parameters={"seconds": 0.5})
+])
+
+# Get updated controls
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ parameters={"force_refresh": True}
+ )
+])
+```
+
+### 4. Device Testing
+
+```python
+# Get device info
+device_info = await computer.run_data_collection([
+ MCPToolCall(tool_key="data_collection::get_device_info", parameters={})
+])
+
+print(f"Testing on: {device_info[0].data['device_info']['model']}")
+print(f"Android: {device_info[0].data['device_info']['android_version']}")
+
+# Take screenshot before test
+screenshot_before = await computer.run_data_collection([
+ MCPToolCall(tool_key="data_collection::capture_screenshot", parameters={})
+])
+
+# Perform test actions
+# ...
+
+# Take screenshot after test
+screenshot_after = await computer.run_data_collection([
+ MCPToolCall(tool_key="data_collection::capture_screenshot", parameters={})
+])
+
+# Compare screenshots (external comparison logic)
+```
+
+---
+
+## Comparison with Other Servers
+
+| Feature | MobileExecutor | HardwareExecutor (Robot Arm) | AppUIExecutor (Windows) |
+|---------|----------------|------------------------------|-------------------------|
+| **Platform** | Android (ADB) | Cross-platform (Hardware) | Windows (UIA) |
+| **Controls** | ✅ XML-based | ❌ Coordinate-based | ✅ UIA-based |
+| **Screenshots** | ✅ ADB screencap | ✅ Hardware camera | ✅ Windows API |
+| **Deployment** | HTTP (dual-server) | HTTP (single-server) | Local (in-process) |
+| **State Management** | ✅ Shared singleton | ❌ Stateless | ❌ No caching |
+| **App Launch** | ✅ Package manager | ❌ Manual | ✅ Process spawn |
+| **Text Input** | ✅ ADB input | ✅ HID keyboard | ✅ UIA SetValue |
+| **Cache** | ✅ 5s-5min TTL | ❌ No cache | ❌ No cache |
+
+---
+
+## Troubleshooting
+
+### ADB Connection Issues
+
+```bash
+# Restart ADB server
+adb kill-server
+adb start-server
+
+# Check device connection
+adb devices
+
+# If no devices shown:
+# 1. Check USB cable
+# 2. Verify USB debugging enabled on device
+# 3. Accept "Allow USB debugging" prompt on device
+```
+
+### Server Not Starting
+
+```bash
+# Check if ports are in use (Windows)
+netstat -an | findstr "8020"
+netstat -an | findstr "8021"
+
+# Linux/macOS equivalent
+netstat -an | grep -E "8020|8021"
+
+# Change ports if needed
+python -m ufo.client.mcp.http_servers.mobile_mcp_server --data-port 8030 --action-port 8031
+```
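+
+A cross-platform alternative to `netstat` is to try binding the port from Python. `port_is_free` is a hypothetical helper; it checks only the given host interface at the moment of the call.
+
+```python
+import socket
+
+
+def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
+    """Return True if a TCP socket can currently bind to (host, port)."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        try:
+            sock.bind((host, port))
+        except OSError:
+            return False
+    return True
+
+
+# Default data collection and action ports from this guide
+for port in (8020, 8021):
+    print(port, "free" if port_is_free(port) else "in use")
+```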
+
+### Controls Not Found
+
+```python
+# Force refresh cache
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ parameters={"force_refresh": True}
+ )
+])
+
+# Or invalidate cache manually
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::invalidate_cache",
+ parameters={"cache_type": "controls"}
+ )
+])
+```
+
+### Text Input Fails
+
+```python
+# Ensure control is in cache
+controls = await computer.run_data_collection([
+ MCPToolCall(
+ tool_key="data_collection::get_app_window_controls_target_info",
+ parameters={}
+ )
+])
+
+# Verify control ID and name match
+field = next(c for c in controls[0].data if c.id == "5")
+print(f"Control name: {field.name}")
+
+# Use exact ID and name
+await computer.run_actions([
+ MCPToolCall(
+ tool_key="action::type_text",
+ parameters={
+ "text": "test",
+ "control_id": field.id,
+ "control_name": field.name
+ }
+ )
+])
+```
+
+---
+
+## Related Documentation
+
+- [HardwareExecutor](./hardware_executor.md) - Hardware control (robot arm, mobile devices)
+- [BashExecutor](./bash_executor.md) - Linux command execution
+- [AppUIExecutor](./app_ui_executor.md) - Windows UI automation
+- [Remote Servers](../remote_servers.md) - HTTP deployment guide
+- [Action Servers](../action.md) - Action server concepts
+- [Data Collection Servers](../data_collection.md) - Data collection overview
diff --git a/documents/docs/mobile/as_galaxy_device.md b/documents/docs/mobile/as_galaxy_device.md
new file mode 100644
index 000000000..05d0f561f
--- /dev/null
+++ b/documents/docs/mobile/as_galaxy_device.md
@@ -0,0 +1,698 @@
+# Using Mobile Agent as Galaxy Device
+
+Configure Mobile Agent as a sub-agent in UFO's Galaxy framework to enable cross-platform, multi-device task orchestration. Galaxy can coordinate Mobile agents alongside Windows and Linux devices to execute complex workflows spanning multiple systems and platforms.
+
+> **📖 Prerequisites:**
+>
+> Before configuring Mobile Agent in Galaxy, ensure you have:
+>
+> - Completed the [Mobile Agent Quick Start Guide](../getting_started/quick_start_mobile.md) - Learn how to set up server, MCP services, and client
+> - Read the [Mobile Agent Overview](overview.md) - Understand Mobile Agent's design and capabilities
+> - Reviewed the [Galaxy Overview](../galaxy/overview.md) - Understand multi-device orchestration
+
+## Overview
+
+The **Galaxy framework** provides multi-tier orchestration capabilities, allowing you to manage multiple device agents (Windows, Linux, Android, etc.) from a central ConstellationAgent. When configured as a Galaxy device, MobileAgent becomes a **sub-agent** that can:
+
+- Execute Android-specific subtasks assigned by Galaxy
+- Participate in cross-platform workflows (e.g., Windows + Android + Linux collaboration)
+- Report execution status back to the orchestrator
+- Be dynamically selected based on capabilities and metadata
+
+For detailed information about MobileAgent's design and capabilities, see [Mobile Agent Overview](overview.md).
+
+## Galaxy Architecture with Mobile Agent
+
+```mermaid
+graph TB
+ User[User Request]
+ Galaxy[Galaxy ConstellationAgent Orchestrator]
+
+ subgraph "Device Pool"
+ Win1[Windows Device 1 HostAgent]
+ Linux1[Linux Agent 1 CLI Executor]
+ Mobile1[Mobile Agent 1 Android Phone]
+ Mobile2[Mobile Agent 2 Android Tablet]
+ Mobile3[Mobile Agent 3 Android Emulator]
+ end
+
+ User -->|Complex Task| Galaxy
+ Galaxy -->|Windows Subtask| Win1
+ Galaxy -->|Linux Subtask| Linux1
+ Galaxy -->|Mobile Subtask| Mobile1
+ Galaxy -->|Mobile Subtask| Mobile2
+ Galaxy -->|Mobile Subtask| Mobile3
+
+ style Galaxy fill:#ffe1e1
+ style Mobile1 fill:#c8e6c9
+ style Mobile2 fill:#c8e6c9
+ style Mobile3 fill:#c8e6c9
+```
+
+Galaxy orchestrates:
+
+- **Task decomposition** - Break complex requests into platform-specific subtasks
+- **Device selection** - Choose appropriate devices based on capabilities
+- **Parallel execution** - Execute subtasks concurrently across devices
+- **Result aggregation** - Combine results from all devices
+
+---
+
+## Configuration Guide
+
+### Step 1: Configure Device in `devices.yaml`
+
+Add your Mobile agent(s) to the device list in `config/galaxy/devices.yaml`:
+
+#### Example Configuration
+
+```yaml
+devices:
+  - device_id: "mobile_phone_1"
+    server_url: "ws://192.168.1.100:5001/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "messaging"
+      - "camera"
+      - "location"
+    metadata:
+      os: "mobile"
+      device_type: "phone"
+      android_version: "13"
+      screen_size: "1080x2400"
+      installed_apps:
+        - "com.google.android.apps.maps"
+        - "com.whatsapp"
+        - "com.android.chrome"
+      description: "Personal Android phone for mobile tasks"
+    auto_connect: true
+    max_retries: 5
+```
+
+### Step 2: Understanding Configuration Fields
+
+| Field | Required | Type | Description |
+|-------|----------|------|-------------|
+| `device_id` | ✅ Yes | string | **Unique identifier** - must match client `--client-id` |
+| `server_url` | ✅ Yes | string | WebSocket URL - must match server endpoint |
+| `os` | ✅ Yes | string | Operating system - set to `"mobile"` |
+| `capabilities` | ❌ Optional | list | Skills/capabilities for task routing |
+| `metadata` | ❌ Optional | dict | Custom context for LLM-based task execution |
+| `auto_connect` | ❌ Optional | boolean | Auto-connect on Galaxy startup (default: `true`) |
+| `max_retries` | ❌ Optional | integer | Connection retry attempts (default: `5`) |
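+
+The required fields in the table can be checked before starting Galaxy. The sketch below works on an already-parsed entry (for example, one item of `devices` loaded with a YAML library). `validate_device_entry` is a hypothetical helper, and the `ws://`/`wss://` check is an assumption based on the examples in this guide.
+
+```python
+from typing import Any, Dict, List
+
+REQUIRED_FIELDS = ("device_id", "server_url", "os")
+
+
+def validate_device_entry(entry: Dict[str, Any], client_id: str, ws_server: str) -> List[str]:
+    """Return a list of problems; an empty list means the entry looks consistent."""
+    problems = [
+        f"missing required field: {name}" for name in REQUIRED_FIELDS if not entry.get(name)
+    ]
+    server_url = str(entry.get("server_url", ""))
+    if server_url and not server_url.startswith(("ws://", "wss://")):
+        problems.append("server_url should be a WebSocket URL")
+    if entry.get("device_id") and entry["device_id"] != client_id:
+        problems.append("device_id does not match --client-id")
+    if server_url and server_url != ws_server:
+        problems.append("server_url does not match --ws-server")
+    return problems
+
+
+entry = {
+    "device_id": "mobile_phone_1",
+    "server_url": "ws://192.168.1.100:5001/ws",
+    "os": "mobile",
+}
+print(validate_device_entry(entry, "mobile_phone_1", "ws://192.168.1.100:5001/ws"))  # []
+```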
+
+### Step 3: Capabilities-Based Task Routing
+
+Galaxy uses the `capabilities` field to intelligently route subtasks to appropriate devices. Define capabilities based on device features, installed apps, or task types.
+
+#### Example Capability Configurations
+
+**Personal Phone:**
+```yaml
+capabilities:
+ - "mobile"
+ - "android"
+ - "messaging"
+ - "whatsapp"
+ - "maps"
+ - "camera"
+ - "location"
+```
+
+**Work Phone:**
+```yaml
+capabilities:
+ - "mobile"
+ - "android"
+ - "email"
+ - "calendar"
+ - "office_apps"
+ - "vpn"
+```
+
+**Testing Emulator:**
+```yaml
+capabilities:
+ - "mobile"
+ - "android"
+ - "testing"
+ - "automation"
+ - "screenshots"
+```
+
+**Tablet:**
+```yaml
+capabilities:
+ - "mobile"
+ - "android"
+ - "tablet"
+ - "large_screen"
+ - "media"
+ - "reading"
+```
+
+### Step 4: Metadata for Contextual Execution
+
+The `metadata` field provides contextual information that the LLM uses when generating actions for the Mobile agent.
+
+#### Metadata Examples
+
+**Personal Phone Metadata:**
+```yaml
+metadata:
+ os: "mobile"
+ device_type: "phone"
+ android_version: "13"
+ sdk_version: "33"
+ screen_size: "1080x2400"
+ screen_density: "420"
+ installed_apps:
+ - "com.google.android.apps.maps"
+ - "com.whatsapp"
+ - "com.android.chrome"
+ - "com.spotify.music"
+ contacts:
+ - "John Doe"
+ - "Jane Smith"
+ description: "Personal Android phone with social and navigation apps"
+```
+
+**Work Device Metadata:**
+```yaml
+metadata:
+ os: "mobile"
+ device_type: "phone"
+ android_version: "12"
+ screen_size: "1080x2340"
+ installed_apps:
+ - "com.microsoft.office.outlook"
+ - "com.microsoft.teams"
+ - "com.slack"
+ vpn_configured: true
+ email_accounts:
+ - "work@company.com"
+ description: "Work phone with corporate apps and VPN"
+```
+
+**Testing Emulator Metadata:**
+```yaml
+metadata:
+ os: "mobile"
+ device_type: "emulator"
+ android_version: "14"
+ sdk_version: "34"
+ screen_size: "1080x1920"
+ installed_apps:
+ - "com.example.testapp"
+ adb_over_network: true
+ description: "Android emulator for app testing"
+```
+
+#### How Metadata is Used
+
+The LLM receives metadata in the system prompt, enabling context-aware action generation:
+
+- **App Availability**: LLM knows which apps can be launched
+- **Screen Size**: Informs swipe distances and touch coordinates
+- **Android Version**: Affects available features and UI patterns
+- **Device Type**: Phone vs tablet affects UI layout
+- **Custom Fields**: Any additional context you provide
+
+**Example**: With the personal phone metadata above, when the user requests "Navigate to restaurant", the LLM knows Maps is installed and can generate `launch_app(package_name="com.google.android.apps.maps")`.
+
+---
+
+## Multi-Device Configuration Example
+
+### Complete Galaxy Setup
+
+```yaml
+devices:
+  # Windows Desktop Agent
+  - device_id: "windows_desktop_1"
+    server_url: "ws://192.168.1.101:5000/ws"
+    os: "windows"
+    capabilities:
+      - "office_applications"
+      - "email"
+      - "web_browsing"
+    metadata:
+      os: "windows"
+      description: "Office productivity workstation"
+    auto_connect: true
+    max_retries: 5
+
+  # Linux Server Agent
+  - device_id: "linux_server_1"
+    server_url: "ws://192.168.1.102:5001/ws"
+    os: "linux"
+    capabilities:
+      - "server"
+      - "database"
+      - "api"
+    metadata:
+      os: "linux"
+      description: "Backend server"
+    auto_connect: true
+    max_retries: 5
+
+  # Personal Android Phone
+  - device_id: "mobile_phone_personal"
+    server_url: "ws://192.168.1.103:5002/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "messaging"
+      - "whatsapp"
+      - "maps"
+      - "camera"
+    metadata:
+      os: "mobile"
+      device_type: "phone"
+      android_version: "13"
+      screen_size: "1080x2400"
+      installed_apps:
+        - "com.google.android.apps.maps"
+        - "com.whatsapp"
+        - "com.android.chrome"
+      description: "Personal phone with social apps"
+    auto_connect: true
+    max_retries: 5
+
+  # Work Android Phone
+  - device_id: "mobile_phone_work"
+    server_url: "ws://192.168.1.104:5003/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "email"
+      - "calendar"
+      - "teams"
+    metadata:
+      os: "mobile"
+      device_type: "phone"
+      android_version: "12"
+      screen_size: "1080x2340"
+      installed_apps:
+        - "com.microsoft.office.outlook"
+        - "com.microsoft.teams"
+      description: "Work phone with corporate apps"
+    auto_connect: true
+    max_retries: 5
+
+  # Android Tablet
+  - device_id: "mobile_tablet_home"
+    server_url: "ws://192.168.1.105:5004/ws"
+    os: "mobile"
+    capabilities:
+      - "mobile"
+      - "android"
+      - "tablet"
+      - "media"
+      - "reading"
+    metadata:
+      os: "mobile"
+      device_type: "tablet"
+      android_version: "13"
+      screen_size: "2560x1600"
+      installed_apps:
+        - "com.netflix.mediaclient"
+        - "com.google.android.youtube"
+      description: "Tablet for media and entertainment"
+    auto_connect: true
+    max_retries: 5
+```
+
+---
+
+## Starting Galaxy with Mobile Agents
+
+### Prerequisites
+
+Ensure all components are running before starting Galaxy:
+
+1. ✅ Device Agent Servers running on all machines
+2. ✅ Device Agent Clients connected to their respective servers
+3. ✅ MCP Services running (both data collection and action servers)
+4. ✅ ADB accessible and Android devices connected
+5. ✅ USB debugging enabled on all Android devices
+6. ✅ LLM configured in `config/ufo/agents.yaml` or `config/galaxy/agent.yaml`
+
+### Launch Sequence
+
+#### Step 1: Prepare Android Devices
+
+```bash
+# Check ADB connection to all devices
+adb devices
+
+# Expected output:
+# List of devices attached
+# 192.168.1.103:5555 device
+# 192.168.1.104:5555 device
+# emulator-5554 device
+```
+
+**For Physical Devices:**
+1. Enable USB debugging in Developer Options
+2. Connect via USB or wireless ADB
+3. Accept ADB debugging prompt on device
+
+**For Emulators:**
+1. Start Android emulator
+2. ADB connects automatically
+
+#### Step 2: Start Device Agent Servers
+
+```bash
+# On machine hosting personal phone agent (192.168.1.103)
+python -m ufo.server.app --port 5002 --platform mobile
+
+# On machine hosting work phone agent (192.168.1.104)
+python -m ufo.server.app --port 5003 --platform mobile
+
+# On machine hosting tablet agent (192.168.1.105)
+python -m ufo.server.app --port 5004 --platform mobile
+```
+
+#### Step 3: Start MCP Servers for Each Device
+
+```bash
+# On machine hosting personal phone
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --host localhost \
+ --data-port 8020 \
+ --action-port 8021 \
+ --server both
+
+# On machine hosting work phone
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --host localhost \
+ --data-port 8022 \
+ --action-port 8023 \
+ --server both
+
+# On machine hosting tablet
+python -m ufo.client.mcp.http_servers.mobile_mcp_server \
+ --host localhost \
+ --data-port 8024 \
+ --action-port 8025 \
+ --server both
+```
+
+#### Step 4: Start Mobile Clients
+
+```bash
+# Personal phone client
+python -m ufo.client.client \
+ --ws \
+ --ws-server ws://192.168.1.103:5002/ws \
+ --client-id mobile_phone_personal \
+ --platform mobile
+
+# Work phone client
+python -m ufo.client.client \
+ --ws \
+ --ws-server ws://192.168.1.104:5003/ws \
+ --client-id mobile_phone_work \
+ --platform mobile
+
+# Tablet client
+python -m ufo.client.client \
+ --ws \
+ --ws-server ws://192.168.1.105:5004/ws \
+ --client-id mobile_tablet_home \
+ --platform mobile
+```
+
+#### Step 5: Launch Galaxy
+
+```bash
+# On your control machine (interactive mode)
+python -m galaxy --interactive
+```
+
+**Or launch with a specific request:**
+
+```bash
+python -m galaxy "Your cross-device task description here"
+```
+
+Galaxy will automatically connect to all configured devices and display the orchestration interface.
+
+---
+
+## Example Multi-Device Workflows
+
+### Workflow 1: Cross-Platform Productivity
+
+**User Request:**
+> "Get my meeting notes from email on work phone, summarize them on desktop, and send summary to team via WhatsApp on personal phone"
+
+**Galaxy Orchestration:**
+
+```mermaid
+sequenceDiagram
+ participant User
+ participant Galaxy
+ participant WorkPhone as Work Phone (Android)
+ participant Desktop as Windows Desktop
+ participant PersonalPhone as Personal Phone (Android)
+
+ User->>Galaxy: Request meeting workflow
+ Galaxy->>Galaxy: Decompose task
+
+ Note over Galaxy,WorkPhone: Subtask 1: Get notes from email
+ Galaxy->>WorkPhone: "Open Outlook and find meeting notes"
+ WorkPhone->>WorkPhone: Launch Outlook app
+ WorkPhone->>WorkPhone: Navigate to inbox
+ WorkPhone->>WorkPhone: Find meeting email
+ WorkPhone->>WorkPhone: Extract notes text
+ WorkPhone-->>Galaxy: Notes content
+
+ Note over Galaxy,Desktop: Subtask 2: Summarize on desktop
+ Galaxy->>Desktop: "Summarize meeting notes"
+ Desktop->>Desktop: Open Word
+ Desktop->>Desktop: Paste notes
+ Desktop->>Desktop: Generate summary
+ Desktop-->>Galaxy: Summary document
+
+ Note over Galaxy,PersonalPhone: Subtask 3: Send via WhatsApp
+ Galaxy->>PersonalPhone: "Send summary to team on WhatsApp"
+ PersonalPhone->>PersonalPhone: Launch WhatsApp
+ PersonalPhone->>PersonalPhone: Select team group
+ PersonalPhone->>PersonalPhone: Type summary message
+ PersonalPhone->>PersonalPhone: Send message
+ PersonalPhone-->>Galaxy: Message sent
+
+ Galaxy-->>User: Workflow completed
+```
+
+### Workflow 2: Mobile Testing Across Devices
+
+**User Request:**
+> "Test the new app on phone, tablet, and emulator, capture screenshots of each screen"
+
+**Galaxy Orchestration:**
+
+1. **Mobile Phone**: Install app, navigate through screens, capture screenshots
+2. **Mobile Tablet**: Install app (tablet layout), navigate screens, capture screenshots
+3. **Mobile Emulator**: Install app, run automated test suite, capture screenshots
+4. **Windows Desktop**: Aggregate screenshots, generate test report
+
+### Workflow 3: Location-Based Multi-Device Task
+
+**User Request:**
+> "Find nearest coffee shops on phone, book table using tablet, add calendar event on work phone"
+
+**Galaxy Orchestration:**
+
+1. **Personal Phone**: Launch Maps, search "coffee shops near me", get results
+2. **Tablet**: Open booking app, select coffee shop, book table
+3. **Work Phone**: Open Calendar, create event with location and time
+4. **Galaxy**: Aggregate confirmations and notify user
+
+---
+
+## Task Assignment Behavior
+
+### How Galaxy Routes Tasks to Mobile Agents
+
+Galaxy's ConstellationAgent uses several factors to select the appropriate mobile device for each subtask:
+
+| Factor | Description | Example |
+|--------|-------------|---------|
+| **Capabilities** | Match subtask requirements to device capabilities | `"messaging"` → Personal phone |
+| **OS Requirement** | Platform-specific tasks routed to correct OS | Mobile tasks → Mobile agents |
+| **Metadata Context** | Use device-specific apps and configurations | WhatsApp task → Device with WhatsApp |
+| **Device Type** | Phone vs tablet for different UI requirements | Media viewing → Tablet |
+| **Device Status** | Only assign to online, healthy devices | Skip offline or failing devices |
+| **Load Balancing** | Distribute tasks across similar devices | Round-robin across phones |
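+
+As a rough illustration of the capability and status factors in the table, the sketch below picks online devices whose capabilities cover a subtask's requirements. It is not Galaxy's actual selection logic; `select_devices` and the in-memory `pool` structure are assumptions for illustration.
+
+```python
+from typing import Any, Dict, List, Set
+
+
+def select_devices(devices: List[Dict[str, Any]], required: Set[str]) -> List[str]:
+    """Return IDs of online devices whose capabilities include every required one."""
+    return [
+        d["device_id"]
+        for d in devices
+        if d.get("status") == "online" and required <= set(d.get("capabilities", []))
+    ]
+
+
+pool = [
+    {"device_id": "mobile_phone_personal", "status": "online",
+     "capabilities": ["mobile", "android", "messaging", "whatsapp"]},
+    {"device_id": "mobile_phone_work", "status": "online",
+     "capabilities": ["mobile", "android", "email", "calendar"]},
+    {"device_id": "mobile_tablet_home", "status": "offline",
+     "capabilities": ["mobile", "android", "tablet", "media"]},
+]
+print(select_devices(pool, {"android", "messaging"}))  # ['mobile_phone_personal']
+print(select_devices(pool, {"media"}))                 # [] because the tablet is offline
+```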
+
+### Example Task Decomposition
+
+**User Request:**
+> "Check messages on personal phone, review calendar on work phone, and play video on tablet"
+
+**Galaxy Decomposition:**
+
+```yaml
+Task 1:
+ Description: "Check messages on WhatsApp"
+ Target: mobile_phone_personal
+ Reason: Has "whatsapp" capability and personal messaging apps
+
+Task 2:
+ Description: "Review today's calendar events"
+ Target: mobile_phone_work
+ Reason: Has "calendar" capability and work email/calendar
+
+Task 3:
+ Description: "Play video on YouTube"
+ Target: mobile_tablet_home
+ Reason: Has "media" capability and larger screen suitable for video
+```
+
+---
+
+## Critical Configuration Requirements
+
+!!!danger "Configuration Validation"
+ Ensure these match exactly or Galaxy cannot control the device:
+
+ - **Device ID**: `device_id` in `devices.yaml` must match `--client-id` in client command
+ - **Server URL**: `server_url` in `devices.yaml` must match `--ws-server` in client command
+ - **Platform**: Must include `--platform mobile` in client command
+ - **ADB Access**: Android device must be accessible via ADB
+ - **MCP Servers**: Both data collection and action servers must be running
+
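A quick consistency check can catch the first three mismatches before the client is started. This is a hedged sketch only: `find_config_mismatches` and the dictionary keys `client_id`, `ws_server`, and `platform` are hypothetical stand-ins for the parsed `devices.yaml` entry and the client's command-line flags, and the URL is an example value.

```python
def find_config_mismatches(device_entry, client_args):
    """Compare one devices.yaml entry with the flags passed to the client."""
    problems = []
    if device_entry.get("device_id") != client_args.get("client_id"):
        problems.append("device_id does not match --client-id")
    if device_entry.get("server_url") != client_args.get("ws_server"):
        problems.append("server_url does not match --ws-server")
    if client_args.get("platform") != "mobile":
        problems.append("--platform mobile is missing")
    return problems


# Example values only; substitute the entry and flags you actually use.
entry = {"device_id": "mobile_phone_personal",
         "server_url": "ws://192.168.1.103:5002/ws"}
args = {"client_id": "mobile_phone_personal",
        "ws_server": "ws://192.168.1.103:5002/ws",
        "platform": "mobile"}

print(find_config_mismatches(entry, args) or "configuration is consistent")
```

ADB access and MCP server health are runtime conditions, so they are not covered by this static comparison.
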
+---
+
+## Monitoring & Debugging
+
+### Verify Device Registration
+
+**Check Galaxy device pool:**
+
+```bash
+curl http://<galaxy-host>:5000/api/devices
+```
+
+**Expected response:**
+
+```json
+{
+ "devices": [
+ {
+ "device_id": "mobile_phone_personal",
+ "os": "mobile",
+ "status": "online",
+ "capabilities": ["mobile", "android", "messaging", "whatsapp", "maps"]
+ },
+ {
+ "device_id": "mobile_phone_work",
+ "os": "mobile",
+ "status": "online",
+ "capabilities": ["mobile", "android", "email", "calendar", "teams"]
+ }
+ ]
+}
+```
+
+### View Task Assignments
+
+Galaxy logs show task routing decisions:
+
+```log
+INFO - [Galaxy] Task decomposition: 3 subtasks created
+INFO - [Galaxy] Subtask 1 → mobile_phone_personal (capability match: messaging)
+INFO - [Galaxy] Subtask 2 → mobile_phone_work (capability match: calendar)
+INFO - [Galaxy] Subtask 3 → mobile_tablet_home (capability match: media)
+```
+
+### Troubleshooting Device Connection
+
+**Issue**: Mobile agent not appearing in Galaxy device pool
+
+**Diagnosis:**
+
+1. **Check ADB connection:**
+ ```bash
+ adb devices
+ ```
+
+2. **Verify client connection:**
+ ```bash
+ curl http://192.168.1.103:5002/api/clients
+ ```
+
+3. **Check `devices.yaml` configuration** matches client parameters
+
+4. **Review Galaxy logs** for connection errors
+
+5. **Ensure `auto_connect: true`** in `devices.yaml`
+
+6. **Check MCP servers** are running:
+ ```bash
+ curl http://localhost:8020/health # Data collection server
+ curl http://localhost:8021/health # Action server
+ ```
+
+---
+
+## Mobile-Specific Considerations
+
+### Screenshot Capture for Galaxy
+
+Mobile agents automatically capture screenshots during execution, which Galaxy can:
+
+- Display in orchestration UI
+- Include in execution reports
+- Use for debugging failed tasks
+- Share with other agents for context
+
+### Touch Coordinates Across Devices
+
+Different Android devices have different screen sizes and densities. Galaxy handles this by:
+
+- Using control IDs instead of absolute coordinates
+- Having each mobile agent handle device-specific coordinate calculations
+- Storing device resolution in metadata for reference
+
+### App Availability
+
+Galaxy can query `installed_apps` from metadata to:
+
+- Route tasks to devices with required apps
+- Skip devices missing necessary apps
+- Suggest app installation when needed
+
+---
+
+## Related Documentation
+
+- [Mobile Agent Overview](overview.md) - Architecture and design principles
+- [Mobile Agent Commands](commands.md) - MCP tools for device interaction
+- [Galaxy Overview](../galaxy/overview.md) - Multi-device orchestration framework
+- [Galaxy Quick Start](../getting_started/quick_start_galaxy.md) - Galaxy deployment guide
+- [Constellation Orchestrator](../galaxy/constellation_orchestrator/overview.md) - Task orchestration
+- [Galaxy Devices Configuration](../configuration/system/galaxy_devices.md) - Complete device configuration reference
+
+---
+
+## Summary
+
+Using Mobile Agent as a Galaxy device enables sophisticated multi-device orchestration:
+
+- **Cross-Platform Workflows**: Seamlessly combine Android, Windows, and Linux tasks
+- **Capability-Based Routing**: Galaxy selects the right device for each subtask
+- **Visual Context**: Screenshots provide rich execution tracing
+- **Parallel Execution**: Multiple mobile devices work concurrently
+- **Metadata-Aware**: LLM uses device-specific context (installed apps, screen size, etc.)
+- **Robust Caching**: Efficient ADB usage through smart caching strategies
+
+With Mobile Agent in Galaxy, you can orchestrate complex workflows spanning mobile apps, desktop applications, and server systems from a single unified interface.
diff --git a/documents/docs/mobile/commands.md b/documents/docs/mobile/commands.md
new file mode 100644
index 000000000..9c6eae475
--- /dev/null
+++ b/documents/docs/mobile/commands.md
@@ -0,0 +1,1006 @@
+# MobileAgent MCP Commands
+
+MobileAgent interacts with Android devices through MCP (Model Context Protocol) tools provided by two specialized MCP servers. These tools provide atomic building blocks for mobile task execution, isolating device-specific operations within the MCP server layer.
+
+> **📖 Related Documentation:**
+>
+> - [Mobile Agent Overview](overview.md) - Architecture and core responsibilities
+> - [State Machine](state.md) - FSM states and transitions
+> - [Processing Strategy](strategy.md) - How commands are orchestrated in the 4-phase pipeline
+> - [Quick Start Guide](../getting_started/quick_start_mobile.md) - Set up MCP servers for your device
+
+## Command Architecture
+
+### Dual MCP Server Design
+
+MobileAgent uses two separate MCP servers for different responsibilities:
+
+```mermaid
+graph LR
+ A[MobileAgent] --> B[Command Dispatcher]
+ B --> C[Data Collection Server Port 8020]
+ B --> D[Action Server Port 8021]
+
+ C --> E[ADB Commands screencap, uiautomator, pm list]
+ D --> F[ADB Commands input tap/swipe/text, monkey]
+
+ E --> G[Android Device]
+ F --> G
+
+ C -.Shared State.-> H[MobileServerState Singleton]
+ D -.Shared State.-> H
+```
+
+**Why Two Servers?**
+
+- **Separation of Concerns**: Data retrieval vs. device control
+- **Performance**: Data collection can cache aggressively, actions invalidate caches
+- **Security**: Different tools can have different permission levels
+- **Scalability**: Servers can run on different hosts if needed
+
+**Shared State**: Both servers share a singleton `MobileServerState` for:
+- Caching control information (5 seconds TTL)
+- Caching installed apps (5 minutes TTL)
+- Caching UI tree (5 seconds TTL)
+- Coordinating cache invalidation after actions
+
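The TTL-plus-invalidation behaviour described above can be sketched with a small cache class. This is an illustration under assumptions, not the `MobileServerState` implementation: the `TTLCache` class and its method names are hypothetical.

```python
import time


class TTLCache:
    """Tiny time-to-live cache; `clock` is injectable so it can be tested."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._value = None
        self._stored_at = None

    def set(self, value):
        self._value = value
        self._stored_at = self._clock()

    def get(self):
        """Return the cached value, or None if it is missing or expired."""
        if self._stored_at is None:
            return None
        if self._clock() - self._stored_at > self.ttl_seconds:
            self.invalidate()
            return None
        return self._value

    def invalidate(self):
        self._value = None
        self._stored_at = None


controls_cache = TTLCache(ttl_seconds=5)     # UI controls: 5 s
apps_cache = TTLCache(ttl_seconds=5 * 60)    # installed apps: 5 min

controls_cache.set([{"id": "1", "name": "Search"}])
print(controls_cache.get())   # still fresh
controls_cache.invalidate()   # what an action such as tap would trigger
print(controls_cache.get())
```

Keeping both caches in one shared object is what lets an action on one server invalidate data that the other server would otherwise serve stale.
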
+### Command Dispatcher
+
+The command dispatcher routes commands to the appropriate MCP server:
+
+```python
+from aip.messages import Command
+
+# Create data collection command
+command = Command(
+ tool_name="capture_screenshot",
+ parameters={},
+ tool_type="data_collection"
+)
+
+# Execute command via dispatcher
+results = await command_dispatcher.execute_commands([command])
+screenshot_url = results[0].result
+```
+
+---
+
+## Data Collection Server Tools (Port 8020)
+
+The Data Collection Server provides read-only tools for gathering device information.
+
+### 1. capture_screenshot - Capture Device Screenshot
+
+**Purpose**: Take a screenshot of the Android device and return it as a base64-encoded PNG data URI.
+
+#### Tool Specification
+
+```python
+tool_name = "capture_screenshot"
+parameters = {} # No parameters required
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant ADB
+ participant Device
+
+ Agent->>MCP: capture_screenshot()
+ MCP->>ADB: screencap -p /sdcard/screen_temp.png
+ ADB->>Device: Execute screenshot
+ Device-->>ADB: Screenshot saved
+
+ ADB->>Device: pull /sdcard/screen_temp.png
+ Device-->>ADB: PNG file
+
+ MCP->>MCP: Encode to base64
+ MCP->>ADB: rm /sdcard/screen_temp.png
+ MCP-->>Agent: data:image/png;base64,...
+```
+
+#### Result Format
+
+```python
+# Direct base64 data URI string (not a dict)
+"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA..."
+```
+
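Because the result is a plain data URI string, a caller that needs the raw image has to split off the header and decode the payload. A minimal sketch, assuming the `data:image/png;base64,` prefix shown above; the helper name `decode_png_data_uri` is hypothetical.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png_data_uri(data_uri):
    """Split a data URI and return the decoded PNG bytes."""
    header, _, payload = data_uri.partition(",")
    if header != "data:image/png;base64":
        raise ValueError(f"unexpected data URI header: {header!r}")
    image_bytes = base64.b64decode(payload)
    if not image_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG image")
    return image_bytes


# Build a stand-in URI from the PNG signature alone so the example runs
# without a device; a real result would carry a full image.
sample_uri = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE).decode("ascii")
png_bytes = decode_png_data_uri(sample_uri)
print(len(png_bytes))
```
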
+#### Use Cases
+
+| Use Case | Description |
+|----------|-------------|
+| **UI Analysis** | Understand current screen state |
+| **Visual Context** | Provide screenshots to LLM for decision making |
+| **Debugging** | Capture UI state at each step |
+| **Annotation Base** | Base image for control labeling |
+
+#### Error Handling
+
+```python
+# Failures return as exceptions
+try:
+ screenshot_url = await capture_screenshot()
+except Exception as e:
+ # "Failed to capture screenshot on device"
+ # "Failed to pull screenshot from device"
+ pass
+```
+
+---
+
+### 2. get_ui_tree - Get UI Hierarchy XML
+
+**Purpose**: Retrieve the complete UI hierarchy in XML format for detailed UI structure analysis.
+
+#### Tool Specification
+
+```python
+tool_name = "get_ui_tree"
+parameters = {} # No parameters required
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant ADB
+ participant Device
+
+ Agent->>MCP: get_ui_tree()
+ MCP->>ADB: uiautomator dump /sdcard/window_dump.xml
+ ADB->>Device: Dump UI hierarchy
+ Device-->>ADB: XML created
+
+ ADB->>Device: cat /sdcard/window_dump.xml
+ Device-->>ADB: XML content
+ ADB-->>MCP: XML string
+
+ MCP->>MCP: Cache UI tree (5s TTL)
+ MCP-->>Agent: UI tree dictionary
+```
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "ui_tree": """
+
+
+
+ ...
+
+ """,
+ "format": "xml"
+}
+```
+
+#### Use Cases
+
+- Advanced UI analysis requiring full hierarchy
+- Custom control parsing logic
+- Debugging UI structure
+- Extracting accessibility information
+
+---
+
+### 3. get_device_info - Get Device Information
+
+**Purpose**: Gather comprehensive device information including model, Android version, screen size, and battery status.
+
+#### Tool Specification
+
+```python
+tool_name = "get_device_info"
+parameters = {} # No parameters required
+```
+
+#### Information Collected
+
+| Info Type | ADB Command | Data Returned |
+|-----------|-------------|---------------|
+| **Model** | `getprop ro.product.model` | Device model name |
+| **Android Version** | `getprop ro.build.version.release` | Android version (e.g., "13") |
+| **SDK Version** | `getprop ro.build.version.sdk` | API level (e.g., "33") |
+| **Screen Size** | `wm size` | Resolution (e.g., "Physical size: 1080x2400") |
+| **Screen Density** | `wm density` | DPI (e.g., "Physical density: 420") |
+| **Battery Level** | `dumpsys battery` | Battery percentage |
+| **Battery Status** | `dumpsys battery` | Charging status |
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "device_info": {
+ "model": "Pixel 6",
+ "android_version": "13",
+ "sdk_version": "33",
+ "screen_size": "Physical size: 1080x2400",
+ "screen_density": "Physical density: 420",
+ "battery_level": "85",
+ "battery_status": "2" # 2 = Charging, 3 = Discharging
+ },
+ "from_cache": False # True if returned from cache
+}
+```
+
+**Caching**: Device info is cached for 60 seconds as it changes infrequently.
+
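`screen_size` and `screen_density` are returned as the raw `wm` output lines, so callers that need numbers must parse them. A small sketch of one way to do that; the helper names are hypothetical and the regular expressions assume the string shapes shown in the result above.

```python
import re


def parse_screen_size(raw):
    """Extract (width, height) in pixels from a `wm size` line."""
    match = re.search(r"(\d+)x(\d+)", raw)
    if match is None:
        raise ValueError(f"no resolution found in {raw!r}")
    return int(match.group(1)), int(match.group(2))


def parse_density(raw):
    """Extract the DPI value from a `wm density` line."""
    match = re.search(r"(\d+)", raw)
    if match is None:
        raise ValueError(f"no density found in {raw!r}")
    return int(match.group(1))


device_info = {
    "screen_size": "Physical size: 1080x2400",
    "screen_density": "Physical density: 420",
}
width, height = parse_screen_size(device_info["screen_size"])
dpi = parse_density(device_info["screen_density"])
print(width, height, dpi)
```
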
+---
+
+### 4. get_mobile_app_target_info - List Installed Apps
+
+**Purpose**: Retrieve the list of installed applications as TargetInfo objects.
+
+#### Tool Specification
+
+```python
+tool_name = "get_mobile_app_target_info"
+parameters = {
+ "filter": "", # Filter pattern (optional)
+ "include_system_apps": False, # Include system apps (default: False)
+ "force_refresh": False # Bypass cache (default: False)
+}
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant Cache
+ participant ADB
+ participant Device
+
+ Agent->>MCP: get_mobile_app_target_info(include_system_apps=False)
+
+ alt Cache Hit (not forced refresh)
+ MCP->>Cache: Check cache (5min TTL)
+ Cache-->>MCP: Cached app list
+ MCP-->>Agent: Apps from cache
+ else Cache Miss
+ MCP->>ADB: pm list packages -3
+ ADB->>Device: List user-installed packages
+ Device-->>ADB: Package list
+ ADB-->>MCP: Packages
+
+ MCP->>MCP: Parse to TargetInfo objects
+ MCP->>Cache: Update cache
+ MCP-->>Agent: App list
+ end
+```
+
+#### Result Format
+
+```python
+[
+ {
+ "id": "1",
+ "name": "com.android.chrome",
+ "package": "com.android.chrome"
+ },
+ {
+ "id": "2",
+ "name": "com.google.android.apps.maps",
+ "package": "com.google.android.apps.maps"
+ },
+ {
+ "id": "3",
+ "name": "com.whatsapp",
+ "package": "com.whatsapp"
+ }
+]
+```
+
+**Notes**:
+- `id`: Sequential number for LLM reference
+- `name`: Package name (the human-readable display name is not available from a plain ADB package listing)
+- `package`: Full package identifier
+
+**Caching**: Apps list is cached for 5 minutes to reduce overhead.
+
+---
+
+### 5. get_app_window_controls_target_info - Get UI Controls
+
+**Purpose**: Extract UI controls from current screen with IDs for precise interaction.
+
+#### Tool Specification
+
+```python
+tool_name = "get_app_window_controls_target_info"
+parameters = {
+ "force_refresh": False # Bypass cache (default: False)
+}
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant Cache
+ participant ADB
+ participant Device
+
+ Agent->>MCP: get_app_window_controls_target_info()
+
+ alt Cache Hit (not forced refresh)
+ MCP->>Cache: Check cache (5s TTL)
+ Cache-->>MCP: Cached controls
+ MCP-->>Agent: Controls from cache
+ else Cache Miss
+ MCP->>ADB: uiautomator dump /sdcard/window_dump.xml
+ ADB->>Device: Dump UI
+ Device-->>ADB: XML file
+
+ ADB->>Device: cat /sdcard/window_dump.xml
+ Device-->>ADB: XML content
+ ADB-->>MCP: UI hierarchy
+
+ MCP->>MCP: Parse XML
+ MCP->>MCP: Filter meaningful controls
+ MCP->>MCP: Validate rectangles
+ MCP->>MCP: Assign sequential IDs
+ MCP->>Cache: Update cache
+ MCP-->>Agent: Controls list
+ end
+```
+
+#### Control Selection Criteria
+
+Controls are included if they meet any of these criteria:
+
+- `clickable="true"` - Can be tapped
+- `long-clickable="true"` - Supports long-press
+- `scrollable="true"` - Can be scrolled
+- `checkable="true"` - Checkbox or toggle
+- Has `text` or `content-desc` - Has label
+- Type includes "Edit", "Button" - Input or action element
+
+#### Rectangle Validation
+
+Controls with invalid rectangles are filtered out:
+
+```python
+# Bounds format: [left, top, right, bottom]
+# Valid rectangle must have:
+# - right > left (positive width)
+# - bottom > top (positive height)
+# - right and bottom non-zero (left and top may be 0)
+if right <= left or bottom <= top or right == 0 or bottom == 0:
+ skip_control() # Invalid rectangle
+```
+
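The selection criteria and rectangle check can be combined into one pass over the dumped hierarchy. The sketch below is an approximation under assumptions, not the server's parser: `extract_controls` is a hypothetical name, the sample XML is hand-written in the attribute style of a `uiautomator dump`, and the real filtering may differ in detail.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")
INTERACTIVE_ATTRS = ("clickable", "long-clickable", "scrollable", "checkable")


def extract_controls(xml_text):
    """Return controls that pass the selection and rectangle checks."""
    controls = []
    for node in ET.fromstring(xml_text.strip()).iter("node"):
        label = node.get("text") or node.get("content-desc") or ""
        short_type = node.get("class", "").rsplit(".", 1)[-1]
        interactive = any(node.get(a) == "true" for a in INTERACTIVE_ATTRS)
        typed = "Edit" in short_type or "Button" in short_type
        if not (interactive or label or typed):
            continue  # not a meaningful control
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # bounds missing or malformed
        left, top, right, bottom = map(int, match.groups())
        if right <= left or bottom <= top or right == 0 or bottom == 0:
            continue  # invalid rectangle
        controls.append({
            "id": str(len(controls) + 1),  # sequential ID
            "name": label,
            "type": short_type,
            "rect": [left, top, right, bottom],
        })
    return controls


SAMPLE_XML = """
<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc=""
        clickable="false" bounds="[0,0][1080,2400]">
    <node class="android.widget.EditText" text="Search" content-desc=""
          clickable="true" bounds="[48,96][912,192]" />
    <node class="android.widget.ImageButton" text="" content-desc="Search"
          clickable="true" bounds="[0,0][0,0]" />
  </node>
</hierarchy>
"""

for control in extract_controls(SAMPLE_XML):
    print(control)
```

Here the container is dropped because it is neither interactive nor labelled, and the zero-size button is dropped by the rectangle check, leaving only the search field.
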
+#### Result Format
+
+```python
+[
+ {
+ "id": "1",
+ "name": "Search",
+ "type": "EditText",
+ "rect": [48, 96, 912, 192] # [left, top, right, bottom] in pixels
+ },
+ {
+ "id": "2",
+ "name": "Search",
+ "type": "ImageButton",
+ "rect": [912, 96, 1032, 192]
+ },
+ {
+ "id": "3",
+ "name": "Maps",
+ "type": "TextView",
+ "rect": [0, 216, 1080, 360]
+ }
+]
+```
+
+**Caching**: Controls are cached for 5 seconds but **automatically invalidated** after any action (UI likely changed).
+
+---
+
+## Action Server Tools (Port 8021)
+
+The Action Server provides tools for device control and manipulation.
+
+### 6. tap - Tap at Coordinates
+
+**Purpose**: Perform tap/click action at specified screen coordinates.
+
+#### Tool Specification
+
+```python
+tool_name = "tap"
+parameters = {
+ "x": 480, # X coordinate (pixels from left)
+ "y": 240 # Y coordinate (pixels from top)
+}
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant ADB
+ participant Device
+
+ Agent->>MCP: tap(x=480, y=240)
+ MCP->>ADB: input tap 480 240
+ ADB->>Device: Inject tap event
+ Device-->>ADB: Success
+ ADB-->>MCP: Success
+
+ MCP->>MCP: Invalidate controls cache
+ MCP-->>Agent: Result
+```
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "action": "tap(480, 240)",
+ "output": "",
+ "error": ""
+}
+```
+
+**Cache Invalidation**: Automatically invalidates control cache after tap (UI likely changed).
+
+---
+
+### 7. swipe - Swipe Gesture
+
+**Purpose**: Perform swipe gesture from start to end coordinates.
+
+#### Tool Specification
+
+```python
+tool_name = "swipe"
+parameters = {
+ "start_x": 500,
+ "start_y": 1500,
+ "end_x": 500,
+ "end_y": 500,
+ "duration": 300 # milliseconds (default: 300)
+}
+```
+
+#### Common Use Cases
+
+| Use Case | Start | End | Description |
+|----------|-------|-----|-------------|
+| **Scroll Up** | (500, 1500) | (500, 500) | Swipe from bottom to top; content moves up, revealing items below |
+| **Scroll Down** | (500, 500) | (500, 1500) | Swipe from top to bottom; content moves down, revealing items above |
+| **Scroll Left** | (900, 600) | (100, 600) | Swipe from right to left; content moves left |
+| **Scroll Right** | (100, 600) | (900, 600) | Swipe from left to right; content moves right |
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "action": "swipe(500,1500)->(500,500) in 300ms",
+ "output": "",
+ "error": ""
+}
+```
+
+**Cache Invalidation**: Automatically invalidates control cache after swipe.
+
+---
+
+### 8. type_text - Type Text into Control
+
+**Purpose**: Type text into a specific input field control.
+
+#### Tool Specification
+
+```python
+tool_name = "type_text"
+parameters = {
+ "text": "hello world",
+ "control_id": "5", # REQUIRED: Control ID from get_app_window_controls_target_info
+ "control_name": "Search", # REQUIRED: Control name (must match)
+ "clear_current_text": False # Clear existing text first (default: False)
+}
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant Cache
+ participant ADB
+ participant Device
+
+ Agent->>MCP: type_text(text="hello", control_id="5", control_name="Search")
+
+ MCP->>Cache: Get control by ID
+ Cache-->>MCP: Control with rect
+
+ MCP->>MCP: Calculate center position
+ MCP->>ADB: input tap x y (focus control)
+ ADB->>Device: Tap input field
+
+ alt clear_current_text=True
+ MCP->>ADB: input keyevent KEYCODE_DEL (x50)
+ ADB->>Device: Delete existing text
+ end
+
+ MCP->>MCP: Escape text (spaces -> %s)
+ MCP->>ADB: input text hello%sworld
+ ADB->>Device: Type text
+ Device-->>ADB: Success
+
+ MCP->>MCP: Invalidate controls cache
+ MCP-->>Agent: Result
+```
+
+#### Important Notes
+
+!!!warning "Control ID Requirement"
+ The `control_id` parameter is **REQUIRED**. You must:
+
+ 1. Call `get_app_window_controls_target_info` first
+ 2. Identify the input field control
+ 3. Use its `id` and `name` in `type_text`
+
+ The tool will:
+ - Verify the control exists in cache
+ - Click the control to focus it
+ - Then type the text
+
+**Text Escaping**: Spaces are automatically converted to `%s` for Android input shell compatibility.
+
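The escaping step amounts to a single substitution before the text is handed to `input text`. A minimal sketch, assuming only spaces need handling as stated above; both helper names are hypothetical, and other shell-special characters would need extra care that is not shown here.

```python
def escape_for_adb_input(text):
    """Replace spaces with %s, the form `input text` accepts for a space."""
    return text.replace(" ", "%s")


def build_input_text_command(text):
    """Return an argument list that could be passed to subprocess.run."""
    return ["adb", "shell", "input", "text", escape_for_adb_input(text)]


print(build_input_text_command("hello world"))
```
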
+#### Result Format
+
+```python
+{
+ "success": True,
+ "action": "type_text(text='hello world', control_id='5', control_name='Search')",
+ "message": "Clicked control 'Search' at (480, 144) | Typed text: 'hello world'",
+ "control_info": {
+ "id": "5",
+ "name": "Search",
+ "type": "EditText"
+ }
+}
+```
+
+---
+
+### 9. launch_app - Launch Application
+
+**Purpose**: Launch an application by package name or app ID.
+
+#### Tool Specification
+
+```python
+tool_name = "launch_app"
+parameters = {
+ "package_name": "com.google.android.apps.maps", # Package name
+ "id": "2" # Optional: App ID from get_mobile_app_target_info
+}
+```
+
+#### Usage Modes
+
+**Mode 1: Launch by package name**
+
+```python
+launch_app(package_name="com.android.settings")
+```
+
+**Mode 2: Launch from cached app list**
+
+```python
+# First call get_mobile_app_target_info to cache apps
+# Then use app ID from the list
+launch_app(package_name="com.android.settings", id="5")
+```
+
+**Mode 3: Launch by app name (fuzzy search)**
+
+```python
+# If package_name doesn't contain ".", search by name
+launch_app(package_name="Maps") # Finds "com.google.android.apps.maps"
+```
+
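The three modes can be summarised as one resolution function. This is a sketch under assumptions rather than the server's code: `resolve_package` is a hypothetical name, and the fuzzy search is approximated here as a case-insensitive substring match.

```python
def resolve_package(package_name, app_id=None, cached_apps=(), installed=()):
    """Pick the package to launch, following the three modes above."""
    if app_id is not None:
        # Mode 2: use the cached list produced by get_mobile_app_target_info.
        for app in cached_apps:
            if app["id"] == app_id:
                return app["package"]
        raise LookupError(f"app id {app_id!r} not found in cache")
    if "." not in package_name:
        # Mode 3: treat the value as an app name and search installed packages.
        needle = package_name.lower()
        for package in installed:
            if needle in package.lower():
                return package
        raise LookupError(f"no installed package matches {package_name!r}")
    # Mode 1: the value already looks like a package name.
    return package_name


cached = [{"id": "2", "name": "com.google.android.apps.maps",
           "package": "com.google.android.apps.maps"}]
installed = ["com.android.chrome", "com.google.android.apps.maps"]

print(resolve_package("com.android.settings"))
print(resolve_package("com.google.android.apps.maps", app_id="2", cached_apps=cached))
print(resolve_package("Maps", installed=installed))
```
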
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant ADB
+ participant Device
+
+ Agent->>MCP: launch_app(package_name="com.google.android.apps.maps")
+
+ alt ID provided
+ MCP->>MCP: Verify ID in cache
+ MCP->>MCP: Get package from cache
+ else Name only (no dots)
+ MCP->>ADB: pm list packages
+ MCP->>MCP: Search for matching package
+ end
+
+ MCP->>ADB: monkey -p com.google.android.apps.maps -c android.intent.category.LAUNCHER 1
+ ADB->>Device: Launch app
+ Device-->>ADB: App started
+ ADB-->>MCP: Success
+ MCP-->>Agent: Result
+```
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "message": "Launched com.google.android.apps.maps",
+ "package_name": "com.google.android.apps.maps",
+ "output": "Events injected: 1",
+ "error": "",
+ "app_info": { # If ID was provided
+ "id": "2",
+ "name": "com.google.android.apps.maps",
+ "package": "com.google.android.apps.maps"
+ }
+}
+```
+
+---
+
+### 10. press_key - Press Hardware/Software Key
+
+**Purpose**: Press a hardware or software key for navigation and system actions.
+
+#### Tool Specification
+
+```python
+tool_name = "press_key"
+parameters = {
+ "key_code": "KEYCODE_BACK" # Key code name
+}
+```
+
+#### Common Key Codes
+
+| Key Code | Description | Use Case |
+|----------|-------------|----------|
+| `KEYCODE_HOME` | Home button | Return to home screen |
+| `KEYCODE_BACK` | Back button | Navigate back |
+| `KEYCODE_MENU` | Menu button | Open options menu |
+| `KEYCODE_ENTER` | Enter key | Submit form |
+| `KEYCODE_DEL` | Delete key | Delete character |
+| `KEYCODE_APP_SWITCH` | Recent apps | Switch between apps |
+| `KEYCODE_POWER` | Power button | Lock screen |
+| `KEYCODE_VOLUME_UP` | Volume up | Increase volume |
+| `KEYCODE_VOLUME_DOWN` | Volume down | Decrease volume |
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "action": "press_key(KEYCODE_BACK)",
+ "output": "",
+ "error": ""
+}
+```
+
+---
+
+### 11. click_control - Click Control by ID
+
+**Purpose**: Click a UI control by its ID from the cached control list.
+
+#### Tool Specification
+
+```python
+tool_name = "click_control"
+parameters = {
+ "control_id": "5", # REQUIRED: Control ID from get_app_window_controls_target_info
+ "control_name": "Search Button" # REQUIRED: Control name (must match)
+}
+```
+
+#### Execution Flow
+
+```mermaid
+sequenceDiagram
+ participant Agent
+ participant MCP
+ participant Cache
+ participant ADB
+ participant Device
+
+ Agent->>MCP: click_control(control_id="5", control_name="Search")
+
+ MCP->>Cache: Get control by ID "5"
+ Cache-->>MCP: Control with rect [48,96,912,192]
+
+ MCP->>MCP: Verify name matches
+ MCP->>MCP: Calculate center: x=(48+912)/2, y=(96+192)/2
+
+ MCP->>ADB: input tap 480 144
+ ADB->>Device: Tap at (480, 144)
+ Device-->>ADB: Success
+
+ MCP->>MCP: Invalidate controls cache
+ MCP-->>Agent: Result
+```
+
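The center calculation in the flow above is integer arithmetic on the cached rectangle. A short sketch, with `rect_center` as a hypothetical helper name:

```python
def rect_center(rect):
    """Return the integer center (x, y) of [left, top, right, bottom]."""
    left, top, right, bottom = rect
    return (left + right) // 2, (top + bottom) // 2


x, y = rect_center([48, 96, 912, 192])
print(f"input tap {x} {y}")  # prints "input tap 480 144"
```
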
+#### Result Format
+
+```python
+{
+ "success": True,
+ "action": "click_control(id=5, name=Search)",
+ "message": "Clicked control 'Search' at (480, 144)",
+ "control_info": {
+ "id": "5",
+ "name": "Search",
+ "type": "EditText",
+ "rect": [48, 96, 912, 192]
+ }
+}
+```
+
+**Name Verification**: If the provided `control_name` doesn't match the cached control's name, a warning is returned but the action still executes using the ID.
+
+---
+
+### 12. wait - Wait/Sleep
+
+**Purpose**: Wait for a specified duration.
+
+#### Tool Specification
+
+```python
+tool_name = "wait"
+parameters = {
+ "seconds": 1.0 # Duration in seconds (can be decimal)
+}
+```
+
+#### Use Cases
+
+- Wait for app to load
+- Wait for animation to complete
+- Wait for UI transition
+- Pace actions for stability
+
+#### Examples
+
+```python
+wait(seconds=1.0) # Wait 1 second
+wait(seconds=0.5) # Wait 500ms
+wait(seconds=2.5) # Wait 2.5 seconds
+```
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "action": "wait(1.0s)",
+ "message": "Waited for 1.0 seconds"
+}
+```
+
+**Limits**:
+- Minimum: 0 seconds
+- Maximum: 60 seconds
+
+---
+
+### 13. invalidate_cache - Manual Cache Invalidation
+
+**Purpose**: Manually invalidate cached data to force refresh on next query.
+
+#### Tool Specification
+
+```python
+tool_name = "invalidate_cache"
+parameters = {
+ "cache_type": "all" # "controls", "apps", "ui_tree", "device_info", or "all"
+}
+```
+
+#### Cache Types
+
+| Cache Type | Description | Auto-Invalidated |
+|------------|-------------|------------------|
+| `controls` | UI controls list | ✓ After actions |
+| `apps` | Installed apps list | ✗ Never |
+| `ui_tree` | UI hierarchy XML | ✗ Never |
+| `device_info` | Device information | ✗ Never |
+| `all` | All caches | Varies |
+
+#### Result Format
+
+```python
+{
+ "success": True,
+ "message": "Controls cache invalidated"
+}
+```
+
+**Use Cases**:
+- Manually refresh apps list after installing/uninstalling
+- Force UI tree refresh after significant screen change
+- Debug caching issues
+
+---
+
+## Command Execution Pipeline
+
+### Atomic Building Blocks
+
+The MCP tools serve as atomic operations for mobile task execution:
+
+```mermaid
+graph TD
+ A[User Request] --> B[Data Collection Phase]
+ B --> B1[capture_screenshot]
+ B --> B2[get_mobile_app_target_info]
+ B --> B3[get_app_window_controls_target_info]
+
+ B1 --> C[LLM Reasoning]
+ B2 --> C
+ B3 --> C
+
+ C --> D{Select Action}
+ D -->|Launch| E[launch_app]
+ D -->|Type| F[type_text]
+ D -->|Click| G[click_control]
+ D -->|Swipe| H[swipe]
+ D -->|Navigate| I[press_key]
+ D -->|Wait| J[wait]
+
+ E --> K[Capture Result]
+ F --> K
+ G --> K
+ H --> K
+ I --> K
+ J --> K
+
+ K --> L[Update Memory]
+ L --> M{Task Complete?}
+ M -->|No| B
+ M -->|Yes| N[FINISH]
+```
+
+### Command Composition
+
+MobileAgent executes commands sequentially, building on previous results:
+
+```python
+# Round 1: Capture UI and launch app
+{
+ "action": {
+ "function": "launch_app",
+ "arguments": {"package_name": "com.google.android.apps.maps", "id": "2"}
+ }
+}
+# Result: Maps launched
+
+# Round 2: Capture new UI, identify search field
+{
+ "action": {
+ "function": "click_control",
+ "arguments": {"control_id": "5", "control_name": "Search"}
+ }
+}
+# Result: Search field focused
+
+# Round 3: Type query
+{
+ "action": {
+ "function": "type_text",
+ "arguments": {
+ "text": "restaurants",
+ "control_id": "5",
+ "control_name": "Search"
+ }
+ }
+}
+# Result: Text entered
+```
+
+---
+
+## Best Practices
+
+### Data Collection Tools
+
+- Use `get_app_window_controls_target_info` before every action to get fresh control IDs
+- Cache is your friend: don't force refresh unless necessary
+- Annotated screenshots help the LLM identify controls precisely
+
+### Action Tools
+
+!!!success "Action Best Practices"
+ - **Always** call `get_app_window_controls_target_info` before `click_control` or `type_text`
+ - Use control IDs instead of coordinates for robustness
+ - Add `wait` after actions that trigger UI changes (app launch, navigation)
+ - Check `success` field in results before considering action successful
+ - Use `press_key(KEYCODE_BACK)` for navigation instead of screen taps when possible
+
+### Caching
+
+- Controls cache: 5 seconds TTL, invalidated after actions
+- Apps cache: 5 minutes TTL, manually invalidate if apps change
+- Device info cache: 60 seconds TTL, useful for metadata
+
+### Error Handling
+
+```python
+# Always check success field
+result = await click_control(control_id="5", control_name="Search")
+if not result["success"]:
+ # Handle error: control not found, device disconnected, etc.
+ pass
+```
+
+---
+
+## Implementation Location
+
+The MCP server implementation can be found in:
+
+```
+ufo/client/mcp/http_servers/
+└── mobile_mcp_server.py
+```
+
+Key components:
+
+- `MobileServerState`: Singleton state manager for caching
+- `create_mobile_data_collection_server()`: Data collection server (port 8020)
+- `create_mobile_action_server()`: Action server (port 8021)
+
+---
+
+## Comparison with Other Agent Commands
+
+| Agent | Command Types | Execution Layer | Visual Context | Result Format |
+|-------|--------------|-----------------|----------------|---------------|
+| **MobileAgent** | UI + Apps + Touch | MCP (ADB) | ✓ Screenshots + Controls | success/message/control_info |
+| **LinuxAgent** | CLI + SysInfo | MCP (SSH) | ✗ Text-only | success/exit_code/stdout/stderr |
+| **AppAgent** | UI + API | Automator + MCP | ✓ Screenshots + Controls | UI state + API responses |
+
+MobileAgent's command set reflects the mobile environment:
+
+- **Touch-based**: tap, swipe instead of click, drag
+- **Visual**: Screenshots are essential for UI understanding
+- **App-centric**: Focus on app launching and switching
+- **Control-based**: Precise control IDs instead of coordinates
+- **Cached**: Aggressive caching to reduce ADB overhead
+
+---
+
+## Next Steps
+
+- [State Machine](state.md) - Understand how command execution fits into the FSM
+- [Processing Strategy](strategy.md) - See how commands are integrated into the 4-phase pipeline
+- [Overview](overview.md) - Return to MobileAgent architecture overview
+- [As Galaxy Device](as_galaxy_device.md) - Configure MobileAgent for multi-device workflows
diff --git a/documents/docs/mobile/overview.md b/documents/docs/mobile/overview.md
new file mode 100644
index 000000000..91aa1de34
--- /dev/null
+++ b/documents/docs/mobile/overview.md
@@ -0,0 +1,256 @@
+# MobileAgent: Android Task Executor
+
+**MobileAgent** is a specialized agent designed for executing tasks on Android mobile devices. It leverages the layered FSM architecture and server-client design to perform intelligent, iterative task execution in mobile environments through UI interaction and app control.
+
+**Quick Links:**
+
+- **New to Mobile Agent?** Start with the [Quick Start Guide](../getting_started/quick_start_mobile.md) - Set up your first Android device agent in 10 minutes
+- **Using as Sub-Agent in Galaxy?** See [Using Mobile Agent as Galaxy Device](as_galaxy_device.md)
+- **Deep Dive:** [State Machine](state.md) • [Processing Strategy](strategy.md) • [MCP Commands](commands.md)
+
+## Architecture Overview
+
+MobileAgent operates as a single-agent instance that interacts with Android devices through UI controls and app management. Unlike the two-tier architecture of UFO (HostAgent + AppAgent), MobileAgent uses a simplified single-agent model optimized for mobile device automation, similar to LinuxAgent but with visual interface support.
+
+## Core Responsibilities
+
+MobileAgent provides the following capabilities for Android device automation:
+
+### UI Interaction
+
+MobileAgent interprets user requests and translates them into appropriate UI interactions on Android devices through screenshot analysis and control manipulation.
+
+**Example:** User request "Search for restaurants on Maps" becomes:
+
+1. Take screenshot and identify app icons
+2. Launch Google Maps app
+3. Identify search field control
+4. Type "restaurants" into search field
+5. Tap search button
+
+### Visual Context Understanding
+
+The agent captures and analyzes device screenshots to understand the current UI state:
+
+- Screenshot capture (clean and annotated)
+- Control identification and labeling
+- UI hierarchy parsing
+- App detection and recognition
+
+### App Management
+
+MobileAgent can manage installed applications:
+
+- List installed apps (user apps and system apps)
+- Launch apps by package name or app name
+- Switch between apps
+- Monitor current app state
+
+### Iterative Task Execution
+
+MobileAgent executes tasks iteratively, evaluating execution outcomes at each step and determining the next action based on results and LLM reasoning.
+
+### Error Handling and Recovery
+
+The agent monitors action execution results and can adapt its strategy when errors occur, such as controls not found or apps not responding.
+
+## Key Characteristics
+
+- **Scope**: Single Android device (UI-based automation)
+- **Lifecycle**: One instance per task session
+- **Hierarchy**: Standalone agent (no child agents)
+- **Communication**: MCP server integration via ADB
+- **Control**: 3-state finite state machine with 4-phase processing pipeline
+- **Visual**: Screenshot-based UI understanding with control annotation
+
+## Execution Workflow
+
+```mermaid
+sequenceDiagram
+ participant User
+ participant MobileAgent
+ participant LLM
+ participant MCPServer
+ participant Android
+
+ User->>MobileAgent: "Search for restaurants on Maps"
+ MobileAgent->>MobileAgent: State: CONTINUE
+
+ MobileAgent->>MCPServer: Capture screenshot
+ MCPServer->>Android: Take screenshot via ADB
+ Android-->>MCPServer: Screenshot PNG
+ MCPServer-->>MobileAgent: Base64 screenshot
+
+ MobileAgent->>MCPServer: Get installed apps
+ MCPServer->>Android: List packages via ADB
+ Android-->>MCPServer: App list
+ MCPServer-->>MobileAgent: Installed apps
+
+ MobileAgent->>MCPServer: Get current controls
+ MCPServer->>Android: UI dump via ADB
+ Android-->>MCPServer: UI hierarchy XML
+ MCPServer-->>MobileAgent: Control list with IDs
+
+ MobileAgent->>LLM: Send prompt with screenshot + apps + controls
+ LLM-->>MobileAgent: Action: launch_app(package_name="com.google.android.apps.maps")
+
+ MobileAgent->>MCPServer: launch_app
+ MCPServer->>Android: Start app via ADB
+ Android-->>MCPServer: App launched
+ MCPServer-->>MobileAgent: Success
+
+ MobileAgent->>MobileAgent: Update memory
+ MobileAgent->>MobileAgent: State: CONTINUE
+
+ Note over MobileAgent: Next round with new screenshot
+
+ MobileAgent->>MCPServer: Capture new screenshot + controls
+ MobileAgent->>LLM: Prompt with new UI state
+ LLM-->>MobileAgent: Action: type_text(control_id="5", text="restaurants")
+
+ MobileAgent->>MCPServer: click_control + type_text
+ MCPServer->>Android: Execute actions via ADB
+ Android-->>MCPServer: Actions completed
+ MCPServer-->>MobileAgent: Success
+
+ MobileAgent->>MobileAgent: State: FINISH
+ MobileAgent-->>User: Task completed
+```
+
+## Comparison with Other Agents
+
+| Aspect | MobileAgent | LinuxAgent | AppAgent |
+|--------|-------------|------------|----------|
+| **Platform** | Android Mobile | Linux (CLI) | Windows Applications |
+| **States** | 3 (CONTINUE, FINISH, FAIL) | 3 states | 6 states |
+| **Architecture** | Single-agent | Single-agent | Child executor |
+| **Interface** | Mobile UI (touch-based) | Command-line | Desktop GUI |
+| **Processing Phases** | 4 phases (with DATA_COLLECTION) | 3 phases | 4 phases |
+| **Visual** | ✓ Screenshots + Annotations | ✗ Text-only | ✓ Screenshots + Annotations |
+| **MCP Tools** | UI controls + App management | CLI commands | UI controls + API |
+| **Input Method** | Touch (tap, swipe, type) | Keyboard commands | Mouse + Keyboard |
+| **Control Identification** | UI hierarchy + bounds | N/A | UI Automation API |
+
+## Design Principles
+
+MobileAgent exemplifies mobile-specific design considerations:
+
+- **Visual Context**: Screenshot-based UI understanding with control annotation for precise interaction
+- **Control Caching**: Efficient control information caching to reduce ADB overhead
+- **Touch-based Interaction**: Specialized actions for mobile gestures (tap, swipe, long-press)
+- **App-centric Navigation**: Focus on app launching and switching rather than window management
+- **Minimal State Set**: 3-state FSM for deterministic control flow
+- **Modular Strategies**: Clear separation between data collection, LLM interaction, action execution, and memory updates
+- **Traceable Execution**: Complete logging of screenshots, actions, and state transitions
+
+## Deep Dive Topics
+
+Explore the detailed architecture and implementation:
+
+- [State Machine](state.md) - 3-state FSM lifecycle and transitions
+- [Processing Strategy](strategy.md) - 4-phase pipeline (Data Collection, LLM, Action, Memory)
+- [MCP Commands](commands.md) - Mobile UI interaction and app management commands
+- [As Galaxy Device](as_galaxy_device.md) - Using Mobile Agent in multi-device workflows
+
+## Technology Stack
+
+### ADB (Android Debug Bridge)
+
+MobileAgent relies on ADB for all device interactions:
+
+- **Screenshot Capture**: `adb shell screencap` for visual context
+- **UI Hierarchy**: `adb shell uiautomator dump` for control information
+- **Touch Input**: `adb shell input tap/swipe` for user interaction
+- **Text Input**: `adb shell input text` for typing
+- **App Control**: `adb shell monkey` for app launching
+- **Device Info**: `adb shell getprop` for device properties
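+
+As a rough illustration of how these commands are shaped, the sketch below builds ADB argument lists for a tap, a swipe, and an app launch. The helper names (`adb_prefix`, `tap_command`, `swipe_command`, `launch_command`) are hypothetical and are not taken from the MobileAgent code base; the lists are only constructed here, not executed.
+
+```python
+from typing import List, Optional
+
+
+def adb_prefix(serial: Optional[str] = None) -> List[str]:
+    """Return the leading adb arguments, targeting one device if a serial is given."""
+    return ["adb"] + (["-s", serial] if serial else [])
+
+
+def tap_command(x: int, y: int, serial: Optional[str] = None) -> List[str]:
+    """Build `adb shell input tap X Y`."""
+    return adb_prefix(serial) + ["shell", "input", "tap", str(x), str(y)]
+
+
+def swipe_command(
+    x1: int, y1: int, x2: int, y2: int,
+    duration_ms: int = 300, serial: Optional[str] = None,
+) -> List[str]:
+    """Build `adb shell input swipe X1 Y1 X2 Y2 DURATION_MS`."""
+    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
+    return adb_prefix(serial) + ["shell", "input", "swipe"] + coords
+
+
+def launch_command(package: str, serial: Optional[str] = None) -> List[str]:
+    """Build a monkey invocation that starts the package's launcher activity."""
+    return adb_prefix(serial) + [
+        "shell", "monkey", "-p", package,
+        "-c", "android.intent.category.LAUNCHER", "1",
+    ]
+
+
+print(" ".join(tap_command(540, 1200)))
+print(" ".join(launch_command("com.google.android.apps.maps")))
+```
+
+Passing one of these lists to `subprocess.run` would execute it against a connected device. Keeping construction separate from execution makes the commands easy to log and test.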
+
+### MCP Server Architecture
+
+Two separate MCP servers handle different responsibilities:
+
+1. **Data Collection Server** (Port 8020):
+    - Screenshot capture
+    - UI tree retrieval
+    - App list collection
+    - Control information gathering
+    - Device information
+
+2. **Action Server** (Port 8021):
+    - Touch actions (tap, swipe)
+    - Text input
+    - App launching
+    - Key press events
+    - Control clicking
+
+Both servers share a singleton `MobileServerState` for efficient caching and coordination.
+
+## Use Cases
+
+MobileAgent is ideal for:
+
+- **Mobile App Testing**: Automated UI testing across different apps
+- **Cross-App Workflows**: Tasks spanning multiple mobile applications
+- **Data Entry**: Automated form filling and text input
+- **App Navigation**: Exploring and interacting with mobile UIs
+- **Mobile Productivity**: Automating repetitive mobile tasks
+- **Cross-Device Workflows**: As a device agent in Galaxy multi-device orchestration
+
+!!!tip "Galaxy Integration"
+    MobileAgent can serve as a device agent in Galaxy's multi-device orchestration framework, executing Android-specific tasks as part of cross-platform workflows alongside Windows and Linux devices.
+
+    See [Using Mobile Agent as Galaxy Device](as_galaxy_device.md) for configuration details.
+
+## Requirements
+
+### Hardware
+
+- Android device or emulator
+- USB connection (for physical devices) or network connection (for emulators)
+- USB debugging enabled on the device
+
+### Software
+
+- ADB (Android Debug Bridge) installed and accessible
+- Android device with API level 21+ (Android 5.0+)
+- Python 3.8+
+- Required Python packages (see requirements.txt)
+
+## Implementation Location
+
+The MobileAgent implementation can be found in:
+
+```
+ufo/
+├── agents/
+│   ├── agent/
+│   │   └── customized_agent.py               # MobileAgent class definition
+│   ├── states/
+│   │   └── mobile_agent_state.py             # State machine implementation
+│   └── processors/
+│       ├── customized/
+│       │   └── customized_agent_processor.py # MobileAgentProcessor
+│       └── strategies/
+│           └── mobile_agent_strategy.py      # Processing strategies
+├── prompter/
+│   └── customized/
+│       └── mobile_agent_prompter.py          # Prompt construction
+├── module/
+│   └── sessions/
+│       └── mobile_session.py                 # Session management
+└── client/
+    └── mcp/
+        └── http_servers/
+            └── mobile_mcp_server.py          # MCP server implementation
+```
+
+## Next Steps
+
+To understand MobileAgent's complete architecture:
+
+1. [State Machine](state.md) - Learn about the 3-state FSM
+2. [Processing Strategy](strategy.md) - Understand the 4-phase pipeline
+3. [MCP Commands](commands.md) - Explore mobile UI interaction commands
+4. [As Galaxy Device](as_galaxy_device.md) - Configure for multi-device workflows
+
+For deployment and configuration, see the Quick Start Guide (coming soon).
diff --git a/documents/docs/mobile/state.md b/documents/docs/mobile/state.md
new file mode 100644
index 000000000..18a616b6c
--- /dev/null
+++ b/documents/docs/mobile/state.md
@@ -0,0 +1,403 @@
+# MobileAgent State Machine
+
+MobileAgent uses a **3-state finite state machine (FSM)** to manage Android device task execution flow. The minimal state set captures essential execution progression while maintaining simplicity and predictability. States transition based on LLM decisions and action execution results.
+
+> **📖 Related Documentation:**
+>
+> - [Mobile Agent Overview](overview.md) - Architecture and core responsibilities
+> - [Processing Strategy](strategy.md) - 4-phase pipeline execution in CONTINUE state
+> - [MCP Commands](commands.md) - Available mobile interaction commands
+> - [Quick Start Guide](../getting_started/quick_start_mobile.md) - Set up your first Mobile Agent
+
+## State Machine Architecture
+
+### State Enumeration
+
+```python
+class MobileAgentStatus(Enum):
+    """Store the status of the mobile agent"""
+    CONTINUE = "CONTINUE"  # Task is ongoing, requires further actions
+    FINISH = "FINISH"      # Task completed successfully
+    FAIL = "FAIL"          # Task cannot proceed, unrecoverable error
+```
+
+### State Management
+
+MobileAgent states are managed by `MobileAgentStateManager`, which implements the agent state registry pattern:
+
+```python
+class MobileAgentStateManager(AgentStateManager):
+    """Manages the states of the mobile agent"""
+    _state_mapping: Dict[str, Type[MobileAgentState]] = {}
+
+    @property
+    def none_state(self) -> AgentState:
+        return NoneMobileAgentState()
+```
+
+All MobileAgent states are registered using the `@MobileAgentStateManager.register` decorator, enabling dynamic state lookup by name.
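+
+The registration pattern can be illustrated with a minimal, self-contained sketch. The names `StateRegistry`, `DemoState`, `ContinueState`, and `FinishState` are hypothetical stand-ins rather than the actual UFO classes, and the real `AgentStateManager` may differ in detail.
+
+```python
+from typing import Dict, Type
+
+
+class DemoState:
+    """Hypothetical base class standing in for MobileAgentState."""
+
+    @classmethod
+    def name(cls) -> str:
+        raise NotImplementedError
+
+
+class StateRegistry:
+    """Hypothetical registry mapping state names to state classes."""
+
+    _state_mapping: Dict[str, Type[DemoState]] = {}
+
+    @classmethod
+    def register(cls, state_class: Type[DemoState]) -> Type[DemoState]:
+        # Store the class under the name it reports, then return it unchanged
+        # so the decorator does not alter the class definition.
+        cls._state_mapping[state_class.name()] = state_class
+        return state_class
+
+    @classmethod
+    def get_state(cls, status: str) -> DemoState:
+        # Look up the class by name and return a fresh instance.
+        return cls._state_mapping[status]()
+
+
+@StateRegistry.register
+class ContinueState(DemoState):
+    @classmethod
+    def name(cls) -> str:
+        return "CONTINUE"
+
+
+@StateRegistry.register
+class FinishState(DemoState):
+    @classmethod
+    def name(cls) -> str:
+        return "FINISH"
+
+
+state = StateRegistry.get_state("FINISH")
+print(type(state).__name__)
+```
+
+Because each class registers itself under the string returned by `name()`, a status string coming back from the LLM can be turned into a state object without a hand-written `if`/`elif` chain.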
+
+## State Transition Diagram
+
+```mermaid
+stateDiagram-v2
+ [*] --> CONTINUE: Start Task
+
+ CONTINUE --> CONTINUE: More Actions Needed (LLM returns CONTINUE)
+ CONTINUE --> FINISH: Task Complete (LLM returns FINISH)
+ CONTINUE --> FAIL: Unrecoverable Error (LLM returns FAIL or Exception)
+
+ FINISH --> [*]: Session Ends
+ FAIL --> FINISH: Cleanup
+
+ note right of CONTINUE
+ Active execution state:
+ - Capture screenshots
+ - Collect UI controls
+ - Get LLM decision
+ - Execute actions
+ - Update memory
+ end note
+
+ note right of FINISH
+ Terminal state:
+ - Task completed successfully
+ - Results available in memory
+ - Agent can be terminated
+ end note
+
+ note right of FAIL
+ Error terminal state:
+ - Unrecoverable error occurred
+ - Error details logged
+ - Transitions to FINISH for cleanup
+ end note
+```
+
+## State Definitions
+
+### 1. CONTINUE State
+
+**Purpose**: Active execution state where MobileAgent processes the user request and executes mobile actions.
+
+```python
+@MobileAgentStateManager.register
+class ContinueMobileAgentState(MobileAgentState):
+    """The class for the continue mobile agent state"""
+
+    async def handle(self, agent: "MobileAgent", context: Optional["Context"] = None):
+        """Execute the 4-phase processing pipeline"""
+        await agent.process(context)
+
+    def is_round_end(self) -> bool:
+        return False  # Round continues
+
+    def is_subtask_end(self) -> bool:
+        return False  # Subtask continues
+
+    @classmethod
+    def name(cls) -> str:
+        return MobileAgentStatus.CONTINUE.value
+```
+
+| Property | Value |
+|----------|-------|
+| **Type** | Active |
+| **Processor Executed** | ✓ Yes (4 phases) |
+| **Round Ends** | No |
+| **Subtask Ends** | No |
+| **Duration** | Single round |
+| **Next States** | CONTINUE, FINISH, FAIL |
+
+**Behavior**:
+
+1. **Data Collection Phase**:
+    - Captures device screenshot
+    - Retrieves installed apps list
+    - Collects current screen UI controls
+    - Creates annotated screenshot with control IDs
+
+2. **LLM Interaction Phase**:
+    - Constructs prompts with screenshots and control information
+    - Gets next action from LLM
+    - Parses and validates response
+
+3. **Action Execution Phase**:
+    - Executes mobile actions (tap, swipe, type, launch app, etc.)
+    - Captures execution results
+
+4. **Memory Update Phase**:
+    - Updates memory with screenshots and action results
+    - Stores control information for next round
+
+5. **State Determination**:
+    - Analyzes LLM response for next state
+
+**State Transition Logic**:
+
+- **CONTINUE → CONTINUE**: Task requires more actions to complete (e.g., need to navigate through multiple screens)
+- **CONTINUE → FINISH**: LLM determines task is complete (e.g., successfully filled form and submitted)
+- **CONTINUE → FAIL**: Unrecoverable error encountered (e.g., required app not installed, control not found after multiple attempts)
+
+### 2. FINISH State
+
+**Purpose**: Terminal state indicating successful task completion.
+
+```python
+@MobileAgentStateManager.register
+class FinishMobileAgentState(MobileAgentState):
+    """The class for the finish mobile agent state"""
+
+    def next_agent(self, agent: "MobileAgent") -> "MobileAgent":
+        return agent
+
+    def next_state(self, agent: "MobileAgent") -> MobileAgentState:
+        return FinishMobileAgentState()  # Remains in FINISH
+
+    def is_subtask_end(self) -> bool:
+        return True  # Subtask completed
+
+    def is_round_end(self) -> bool:
+        return True  # Round ends
+
+    @classmethod
+    def name(cls) -> str:
+        return MobileAgentStatus.FINISH.value
+```
+
+| Property | Value |
+|----------|-------|
+| **Type** | Terminal |
+| **Processor Executed** | ✗ No |
+| **Round Ends** | Yes |
+| **Subtask Ends** | Yes |
+| **Duration** | Permanent |
+| **Next States** | FINISH (no transition) |
+
+**Behavior**:
+
+- Signals task completion to session manager
+- No further processing occurs
+- Agent instance can be terminated
+- Screenshots and action history available in memory
+
+**FINISH is reached directly from CONTINUE when**:
+
+- All required mobile actions have been executed successfully
+- The LLM determines the user request has been fulfilled
+- Target UI state has been achieved (e.g., form submitted, information displayed)
+- No errors or exceptions occurred during execution
+
+FINISH is also entered from FAIL as a cleanup step. In that case the session manager still receives the failure status.
+
+### 3. FAIL State
+
+**Purpose**: Terminal state indicating task failure due to unrecoverable errors.
+
+```python
+@MobileAgentStateManager.register
+class FailMobileAgentState(MobileAgentState):
+    """The class for the fail mobile agent state"""
+
+    def next_agent(self, agent: "MobileAgent") -> "MobileAgent":
+        return agent
+
+    def next_state(self, agent: "MobileAgent") -> MobileAgentState:
+        return FinishMobileAgentState()  # Transitions to FINISH for cleanup
+
+    def is_round_end(self) -> bool:
+        return True  # Round ends
+
+    def is_subtask_end(self) -> bool:
+        return True  # Subtask failed
+
+    @classmethod
+    def name(cls) -> str:
+        return MobileAgentStatus.FAIL.value
+```
+
+| Property | Value |
+|----------|-------|
+| **Type** | Terminal (Error) |
+| **Processor Executed** | ✗ No |
+| **Round Ends** | Yes |
+| **Subtask Ends** | Yes |
+| **Duration** | Transitions to FINISH |
+| **Next States** | FINISH |
+
+**Behavior**:
+
+- Logs failure reason and context
+- Captures final screenshot for debugging
+- Transitions to FINISH state for cleanup
+- Session manager receives failure status
+
+!!!error "Failure Conditions"
+    FAIL state is reached when:
+
+    - **App Unavailable**: Required app is not installed or cannot be launched
+    - **Control Not Found**: Target UI control cannot be located after multiple attempts
+    - **Device Disconnected**: ADB connection lost during execution
+    - **Permission Denied**: Required permissions not granted on device
+    - **Timeout**: Actions take too long to complete
+    - **LLM Explicit Failure**: LLM explicitly indicates task cannot be completed
+    - **Repeated Action Failures**: Multiple consecutive actions fail
+
+**Error Recovery**:
+
+While FAIL is a terminal state, the error information is logged for debugging:
+
+```python
+# Example error logging in FAIL state
+agent.logger.error(f"Mobile task failed: {error_message}")
+agent.logger.debug(f"Last action: {last_action}")
+agent.logger.debug(f"Current screenshot saved to: {screenshot_path}")
+agent.logger.debug(f"UI controls at failure: {current_controls}")
+```
+
+## State Transition Rules
+
+### Transition Decision Logic
+
+State transitions out of **CONTINUE** are normally determined by the `status` field in the LLM's response (an exception raised during processing leads to FAIL instead):
+
+```python
+# LLM returns status in response
+parsed_response = {
+    "action": {
+        "function": "click_control",
+        "arguments": {"control_id": "5", "control_name": "Search"},
+        "status": "CONTINUE"  # or "FINISH" or "FAIL"
+    },
+    "thought": "Need to click the search button to proceed"
+}
+
+# Agent updates its status based on LLM decision
+agent.status = parsed_response["action"]["status"]
+next_state = MobileAgentStateManager().get_state(agent.status)
+```
+
+### Transition Matrix
+
+| Current State | Condition | Next State | Trigger |
+|---------------|-----------|------------|---------|
+| **CONTINUE** | LLM returns CONTINUE | CONTINUE | More actions needed (e.g., navigating multiple screens) |
+| **CONTINUE** | LLM returns FINISH | FINISH | Task completed (e.g., information found and displayed) |
+| **CONTINUE** | LLM returns FAIL | FAIL | Unrecoverable error (e.g., required control not available) |
+| **CONTINUE** | Exception raised | FAIL | System error (e.g., ADB disconnected) |
+| **FINISH** | Any | FINISH | No transition |
+| **FAIL** | Any | FINISH | Cleanup transition |
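+
+The matrix can be read as a small pure function. The sketch below is illustrative only: `Status` and `next_status` are hypothetical names, and treating an unrecognized status as FAIL is an assumption, not documented behavior.
+
+```python
+from enum import Enum
+from typing import Optional
+
+
+class Status(Enum):
+    CONTINUE = "CONTINUE"
+    FINISH = "FINISH"
+    FAIL = "FAIL"
+
+
+def next_status(
+    current: Status,
+    llm_status: Optional[str] = None,
+    exception_raised: bool = False,
+) -> Status:
+    """Return the next state according to the transition matrix."""
+    if current is Status.FINISH:
+        return Status.FINISH  # Terminal: no further transition
+    if current is Status.FAIL:
+        return Status.FINISH  # Cleanup transition
+    if exception_raised:
+        return Status.FAIL  # System error while processing
+    try:
+        return Status(llm_status)
+    except ValueError:
+        # Assumption: an unrecognized or missing status is treated as FAIL.
+        return Status.FAIL
+
+
+print(next_status(Status.CONTINUE, llm_status="FINISH"))
+print(next_status(Status.CONTINUE, exception_raised=True))
+```
+
+Keeping the decision in one function with no side effects makes each row of the matrix directly testable.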
+
+## State-Specific Processing
+
+### CONTINUE State Processing Pipeline
+
+When in CONTINUE state, MobileAgent executes the full 4-phase pipeline:
+
+```mermaid
+graph TD
+ A[CONTINUE State] --> B[Phase 1: Data Collection]
+ B --> B1[Capture Screenshot]
+ B1 --> B2[Get Installed Apps]
+ B2 --> B3[Get Current Controls]
+ B3 --> B4[Create Annotated Screenshot]
+
+ B4 --> C[Phase 2: LLM Interaction]
+ C --> C1[Construct Prompt with Visual Context]
+ C1 --> C2[Send to LLM]
+ C2 --> C3[Parse Response]
+
+ C3 --> D[Phase 3: Action Execution]
+ D --> D1[Execute Mobile Action]
+ D1 --> D2[Capture Result]
+
+ D2 --> E[Phase 4: Memory Update]
+ E --> E1[Store Screenshot]
+ E1 --> E2[Store Action Result]
+ E2 --> E3[Update Control Cache]
+
+ E3 --> F{Check Status}
+ F -->|CONTINUE| A
+ F -->|FINISH| G[FINISH State]
+ F -->|FAIL| H[FAIL State]
+```
+
+### Terminal States (FINISH / FAIL)
+
+Terminal states perform no processing:
+
+- **FINISH**: Clean termination, results and screenshots available in memory
+- **FAIL**: Error termination, error details and final screenshot logged
+
+## Deterministic Control Flow
+
+The 3-state design ensures deterministic, traceable execution:
+
+- **Predictable Behavior**: Every execution path is well-defined
+- **Debuggability**: State transitions are logged with screenshots for visual debugging
+- **Testability**: Finite state space simplifies testing
+- **Maintainability**: Simple state set reduces complexity
+- **Visual Traceability**: Screenshots at each state provide visual execution history
+
+## Comparison with Other Agents
+
+| Agent | States | Complexity | Visual | Use Case |
+|-------|--------|------------|--------|----------|
+| **MobileAgent** | 3 | Minimal | ✓ Screenshots | Android mobile automation |
+| **LinuxAgent** | 3 | Minimal | ✗ Text-only | Linux CLI task execution |
+| **AppAgent** | 6 | Moderate | ✓ Screenshots | Windows app automation |
+| **HostAgent** | 7 | High | ✓ Screenshots | Desktop orchestration |
+
+MobileAgent's minimal 3-state design reflects its focused scope: execute mobile UI actions to fulfill user requests. The simplified state machine eliminates unnecessary complexity while maintaining robust error handling and completion detection, similar to LinuxAgent but with visual context support.
+
+## Mobile-Specific Considerations
+
+### Screenshot-Based State Tracking
+
+Unlike LinuxAgent (text-based) or AppAgent (Windows UI Automation API), MobileAgent relies heavily on screenshots for state understanding:
+
+- Each CONTINUE round starts with a fresh screenshot
+- Annotated screenshots show control IDs for precise interaction
+- Screenshots are saved to memory for debugging and analysis
+- Visual context helps LLM understand current UI state
+
+### Control Caching
+
+MobileAgent caches control information to minimize ADB overhead:
+
+- Controls are cached for 5 seconds
+- Cache is invalidated after each action (UI likely changed)
+- Control dictionary enables quick lookup by ID
+- Reduces repeated UI tree parsing
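+
+A minimal sketch of this caching behavior, assuming a simple in-memory structure: the class `ControlCache` and its methods are hypothetical and do not reflect the actual `MobileServerState` implementation. The clock is injectable so that expiry can be demonstrated without sleeping.
+
+```python
+import time
+from typing import Callable, Dict, List, Optional
+
+
+class ControlCache:
+    """Hypothetical TTL cache for UI controls with lookup by control ID."""
+
+    def __init__(self, ttl_seconds: float = 5.0,
+                 clock: Callable[[], float] = time.monotonic) -> None:
+        self._ttl = ttl_seconds
+        self._clock = clock
+        self._controls: Optional[List[dict]] = None
+        self._by_id: Dict[str, dict] = {}
+        self._stored_at = 0.0
+
+    def put(self, controls: List[dict]) -> None:
+        """Store a fresh control list and rebuild the ID index."""
+        self._controls = list(controls)
+        self._by_id = {c["id"]: c for c in controls}
+        self._stored_at = self._clock()
+
+    def get(self) -> Optional[List[dict]]:
+        """Return cached controls, or None if empty or older than the TTL."""
+        if self._controls is None:
+            return None
+        if self._clock() - self._stored_at > self._ttl:
+            self.invalidate()
+            return None
+        return self._controls
+
+    def lookup(self, control_id: str) -> Optional[dict]:
+        """Return one control by ID if the cache is still valid."""
+        return self._by_id.get(control_id) if self.get() is not None else None
+
+    def invalidate(self) -> None:
+        """Drop cached data, e.g. after an action that likely changed the UI."""
+        self._controls = None
+        self._by_id = {}
+
+
+now = [0.0]  # Fake clock value for the demonstration
+cache = ControlCache(ttl_seconds=5.0, clock=lambda: now[0])
+cache.put([{"id": "1", "name": "Search", "type": "EditText"}])
+print(cache.lookup("1"))
+now[0] = 6.0  # Advance past the 5-second TTL
+print(cache.get())
+```
+
+In this sketch the caller is responsible for calling `invalidate()` after every executed action, which mirrors the rule that the UI has probably changed.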
+
+### Touch-Based Interaction
+
+The actions MobileAgent executes in each CONTINUE round are touch gestures rather than keyboard commands:
+
+- **Tap**: Primary interaction method
+- **Swipe**: For scrolling and gestures
+- **Type**: Text input (requires focused control)
+- **Long-press**: For context menus (planned)
+
+## Implementation Details
+
+The state machine implementation can be found in:
+
+```
+ufo/agents/states/mobile_agent_state.py
+```
+
+Key classes:
+
+- `MobileAgentStatus`: State enumeration (CONTINUE, FINISH, FAIL)
+- `MobileAgentStateManager`: State registry and lookup
+- `MobileAgentState`: Abstract base class
+- `ContinueMobileAgentState`: Active execution state with 4-phase pipeline
+- `FinishMobileAgentState`: Successful completion state
+- `FailMobileAgentState`: Error termination state
+- `NoneMobileAgentState`: Initial/undefined state
+
+## Next Steps
+
+- [Processing Strategy](strategy.md) - Understand the 4-phase processing pipeline executed in CONTINUE state
+- [MCP Commands](commands.md) - Explore mobile UI interaction and app management commands
+- [Overview](overview.md) - Return to MobileAgent architecture overview
diff --git a/documents/docs/mobile/strategy.md b/documents/docs/mobile/strategy.md
new file mode 100644
index 000000000..583228694
--- /dev/null
+++ b/documents/docs/mobile/strategy.md
@@ -0,0 +1,886 @@
+# MobileAgent Processing Strategy
+
+MobileAgent executes a **4-phase processing pipeline** in the **CONTINUE** state. Each phase handles a specific aspect of mobile task execution: data collection (screenshots and controls), LLM decision making, action execution, and memory recording. This design separates visual context gathering from prompt construction, LLM reasoning, mobile action execution, and state updates, enhancing modularity and traceability.
+
+> **📖 Related Documentation:**
+>
+> - [Mobile Agent Overview](overview.md) - Architecture and core responsibilities
+> - [State Machine](state.md) - FSM states (this strategy runs in CONTINUE state)
+> - [MCP Commands](commands.md) - Available commands used in each phase
+> - [Quick Start Guide](../getting_started/quick_start_mobile.md) - Set up your first Mobile Agent
+
+## Strategy Assembly
+
+Processing strategies are assembled and orchestrated by the `MobileAgentProcessor` class defined in `ufo/agents/processors/customized/customized_agent_processor.py`. The processor coordinates the 4-phase pipeline execution.
+
+### MobileAgentProcessor Overview
+
+The `MobileAgentProcessor` extends `CustomizedProcessor` and manages the Mobile-specific workflow:
+
+```python
+class MobileAgentProcessor(CustomizedProcessor):
+    """
+    Processor for Mobile Android MCP Agent.
+    Handles data collection, LLM interaction, and action execution for Android devices.
+    """
+
+    def _setup_strategies(self) -> None:
+        """Setup processing strategies for Mobile Agent."""
+
+        # Phase 1: Data Collection (composed strategy - fail_fast=True)
+        self.strategies[ProcessingPhase.DATA_COLLECTION] = ComposedStrategy(
+            strategies=[
+                MobileScreenshotCaptureStrategy(fail_fast=True),
+                MobileAppsCollectionStrategy(fail_fast=False),
+                MobileControlsCollectionStrategy(fail_fast=False),
+            ],
+            name="MobileDataCollectionStrategy",
+            fail_fast=True,
+        )
+
+        # Phase 2: LLM Interaction (critical - fail_fast=True)
+        self.strategies[ProcessingPhase.LLM_INTERACTION] = (
+            MobileLLMInteractionStrategy(fail_fast=True)
+        )
+
+        # Phase 3: Action Execution (graceful - fail_fast=False)
+        self.strategies[ProcessingPhase.ACTION_EXECUTION] = (
+            MobileActionExecutionStrategy(fail_fast=False)
+        )
+
+        # Phase 4: Memory Update (graceful - fail_fast=False)
+        self.strategies[ProcessingPhase.MEMORY_UPDATE] = (
+            AppMemoryUpdateStrategy(fail_fast=False)
+        )
+```
+
+### Strategy Registration
+
+| Phase | Strategy Class | fail_fast | Rationale |
+|-------|---------------|-----------|-----------|
+| **DATA_COLLECTION** | `ComposedStrategy` (3 sub-strategies) | ✓ True | Visual context is critical for mobile interaction |
+| **LLM_INTERACTION** | `MobileLLMInteractionStrategy` | ✓ True | Without a valid LLM response there is no action to execute |
+| **ACTION_EXECUTION** | `MobileActionExecutionStrategy` | ✗ False | Action failures can be handled gracefully |
+| **MEMORY_UPDATE** | `AppMemoryUpdateStrategy` | ✗ False | Memory failures shouldn't block execution |
+
+**Fail-Fast vs Graceful:**
+
+- **fail_fast=True**: Critical phases where errors should immediately transition to FAIL state
+- **fail_fast=False**: Non-critical phases where errors can be logged and execution continues
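+
+The difference between the two modes can be sketched with a simple runner. This is an illustration under stated assumptions, not the UFO processor: `run_phases` and the `Phase` tuple are hypothetical, and each phase is reduced to a plain callable.
+
+```python
+from typing import Callable, List, Tuple
+
+# (phase name, callable that runs the phase, fail_fast flag)
+Phase = Tuple[str, Callable[[], None], bool]
+
+
+def run_phases(phases: List[Phase]) -> Tuple[bool, List[str]]:
+    """Run phases in order; return (completed, collected error messages)."""
+    errors: List[str] = []
+    for name, run, fail_fast in phases:
+        try:
+            run()
+        except Exception as exc:
+            errors.append(f"{name}: {exc}")
+            if fail_fast:
+                # Critical phase failed: stop so the caller can move to FAIL.
+                return False, errors
+            # Non-critical phase failed: record the error and keep going.
+    return True, errors
+
+
+def failing_memory_update() -> None:
+    raise RuntimeError("disk full")
+
+
+completed, errors = run_phases([
+    ("DATA_COLLECTION", lambda: None, True),
+    ("LLM_INTERACTION", lambda: None, True),
+    ("ACTION_EXECUTION", lambda: None, False),
+    ("MEMORY_UPDATE", failing_memory_update, False),
+])
+print(completed, errors)
+```
+
+Here the memory update fails, but because its flag is `False` the pipeline still reports completion and only records the error.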
+
+## Four-Phase Pipeline
+
+### Pipeline Execution Flow
+
+```mermaid
+graph LR
+ A[CONTINUE State] --> B[Phase 1: Data Collection]
+ B --> C[Phase 2: LLM Interaction]
+ C --> D[Phase 3: Action Execution]
+ D --> E[Phase 4: Memory Update]
+ E --> F[Determine Next State]
+ F --> G{Status?}
+ G -->|CONTINUE| A
+ G -->|FINISH| H[FINISH State]
+ G -->|FAIL| I[FAIL State]
+```
+
+## Phase 1: Data Collection Strategy (Composed)
+
+**Purpose**: Gather comprehensive visual and structural information about the current mobile UI state.
+
+Phase 1 is a **composed strategy** consisting of three sub-strategies executed sequentially:
+
+1. **Screenshot Capture**: Take device screenshot
+2. **Apps Collection**: List installed applications
+3. **Controls Collection**: Extract UI hierarchy and annotate controls
+
+### Sub-Strategy 1.1: Screenshot Capture
+
+```python
+@depends_on("log_path", "session_step")
+@provides(
+    "clean_screenshot_path",
+    "clean_screenshot_url",
+    "annotated_screenshot_url",  # Initially None, set by Controls Collection
+    "screenshot_saved_time",
+)
+class MobileScreenshotCaptureStrategy(BaseProcessingStrategy):
+    """
+    Strategy for capturing Android device screenshots.
+    """
+```
+
+#### Workflow
+
+```mermaid
+sequenceDiagram
+ participant Strategy
+ participant MCP
+ participant ADB
+ participant Device
+
+ Strategy->>MCP: capture_screenshot command
+ MCP->>ADB: screencap -p /sdcard/screen_temp.png
+ ADB->>Device: Execute screenshot
+ Device-->>ADB: Screenshot saved
+
+ ADB->>Device: Pull screenshot
+ Device-->>ADB: PNG file
+ ADB-->>MCP: PNG data
+
+ MCP->>MCP: Encode to base64
+ MCP-->>Strategy: data:image/png;base64,...
+
+ Strategy->>Strategy: Save to log_path
+ Strategy-->>Agent: Screenshot URL + path
+```
+
+#### Output
+
+```python
+{
+    "clean_screenshot_path": "logs/.../action_step1.png",
+    "clean_screenshot_url": "data:image/png;base64,iVBORw0KGgoAAAANS...",
+    "annotated_screenshot_url": None,  # Set by Controls Collection
+    "screenshot_saved_time": 0.234  # seconds
+}
+```
+
+### Sub-Strategy 1.2: Apps Collection
+
+```python
+@depends_on("clean_screenshot_url")
+@provides("installed_apps", "apps_collection_time")
+class MobileAppsCollectionStrategy(BaseProcessingStrategy):
+    """
+    Strategy for collecting installed apps information from Android device.
+    """
+```
+
+#### Workflow
+
+```mermaid
+sequenceDiagram
+ participant Strategy
+ participant MCP
+ participant ADB
+ participant Device
+
+ Strategy->>MCP: get_mobile_app_target_info
+ MCP->>MCP: Check cache (5min TTL)
+
+ alt Cache Hit
+ MCP-->>Strategy: Cached app list
+ else Cache Miss
+ MCP->>ADB: pm list packages -3
+ ADB->>Device: List user apps
+ Device-->>ADB: Package list
+ ADB-->>MCP: Packages
+
+ MCP->>MCP: Parse to TargetInfo
+ MCP->>MCP: Update cache
+ MCP-->>Strategy: App list
+ end
+
+ Strategy-->>Agent: Installed apps
+```
+
+#### Output Format
+
+```python
+{
+    "installed_apps": [
+        {
+            "id": "1",
+            "name": "com.android.chrome",
+            "package": "com.android.chrome"
+        },
+        {
+            "id": "2",
+            "name": "com.google.android.apps.maps",
+            "package": "com.google.android.apps.maps"
+        },
+        ...
+    ],
+    "apps_collection_time": 0.156  # seconds
+}
+```
+
+**Caching**: Apps list is cached for 5 minutes to reduce ADB overhead, as installed apps rarely change during a session.
+
+### Sub-Strategy 1.3: Controls Collection
+
+```python
+@depends_on("clean_screenshot_url")
+@provides(
+    "current_controls",
+    "controls_collection_time",
+    "annotated_screenshot_url",
+    "annotated_screenshot_path",
+    "annotation_dict",
+)
+class MobileControlsCollectionStrategy(BaseProcessingStrategy):
+    """
+    Strategy for collecting current screen controls information from Android device.
+    Creates annotated screenshots with control labels.
+    """
+```
+
+#### Workflow
+
+```mermaid
+sequenceDiagram
+ participant Strategy
+ participant MCP
+ participant ADB
+ participant Device
+ participant Photographer
+
+ Strategy->>MCP: get_app_window_controls_target_info
+ MCP->>MCP: Check cache (5s TTL)
+
+ alt Cache Hit
+ MCP-->>Strategy: Cached controls
+ else Cache Miss
+ MCP->>ADB: uiautomator dump /sdcard/window_dump.xml
+ ADB->>Device: Dump UI hierarchy
+ Device-->>ADB: XML file
+
+ ADB->>Device: cat /sdcard/window_dump.xml
+ Device-->>ADB: XML content
+ ADB-->>MCP: UI hierarchy XML
+
+ MCP->>MCP: Parse XML
+ MCP->>MCP: Extract clickable controls
+ MCP->>MCP: Validate rectangles
+ MCP->>MCP: Assign IDs
+ MCP->>MCP: Update cache
+ MCP-->>Strategy: Controls list
+ end
+
+ Strategy->>Strategy: Convert to TargetInfo
+ Strategy->>Photographer: Create annotated screenshot
+ Photographer->>Photographer: Draw control IDs on screenshot
+ Photographer-->>Strategy: Annotated image
+
+ Strategy-->>Agent: Controls + Annotated screenshot
+```
+
+#### UI Hierarchy Parsing
+
+The strategy parses Android UI XML to extract meaningful controls:
+
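+```xml
+<!-- Illustrative uiautomator dump; attribute values are examples, most attributes omitted -->
+<hierarchy rotation="0">
+  <node class="android.widget.FrameLayout" clickable="false" bounds="[0,0][1080,2400]">
+    <node class="android.widget.EditText" text="Search" content-desc=""
+          clickable="true" bounds="[48,96][912,192]" />
+    <node class="android.widget.ImageButton" text="" content-desc="Search"
+          clickable="true" bounds="[912,96][1032,192]" />
+  </node>
+</hierarchy>
+```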
+
+**Control Selection Criteria**:
+
+- `clickable="true"` - Can be tapped
+- `long-clickable="true"` - Supports long-press
+- `scrollable="true"` - Can be scrolled
+- `checkable="true"` - Checkbox or toggle
+- Has `text` or `content-desc` - Has label
+- Type contains "Edit" or "Button" - Input or action element
+
+**Rectangle Validation**:
+
+Controls with invalid rectangles are filtered out:
+
+```python
+# Bounds format: [left, top, right, bottom]
+if right <= left or bottom <= top:
+    # Invalid: width or height is zero/negative
+    skip_control()
+```
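+
+To make the check concrete, here is a hedged sketch that parses the `bounds` attribute of a uiautomator node (formatted as `[left,top][right,bottom]`) and applies the same validity rule. The function name `parse_bounds` is hypothetical and is not taken from the MCP server code.
+
+```python
+import re
+from typing import List, Optional
+
+# Matches uiautomator bounds such as "[48,96][912,192]".
+_BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")
+
+
+def parse_bounds(bounds: str) -> Optional[List[int]]:
+    """Return [left, top, right, bottom], or None if malformed or degenerate."""
+    match = _BOUNDS_RE.fullmatch(bounds.strip())
+    if match is None:
+        return None  # Not in the expected format
+    left, top, right, bottom = (int(group) for group in match.groups())
+    if right <= left or bottom <= top:
+        return None  # Width or height is zero or negative
+    return [left, top, right, bottom]
+
+
+print(parse_bounds("[48,96][912,192]"))
+print(parse_bounds("[0,0][0,0]"))
+```
+
+Returning `None` for both malformed and degenerate rectangles lets the caller skip the control with a single check.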
+
+#### Output Format
+
+```python
+{
+    "current_controls": [
+        {
+            "id": "1",
+            "name": "Search",
+            "type": "EditText",
+            "rect": [48, 96, 912, 192]  # [left, top, right, bottom]
+        },
+        {
+            "id": "2",
+            "name": "Search",
+            "type": "ImageButton",
+            "rect": [912, 96, 1032, 192]
+        },
+        ...
+    ],
+    "annotated_screenshot_url": "data:image/png;base64,...",
+    "annotated_screenshot_path": "logs/.../action_step1_annotated.png",
+    "annotation_dict": {
+        "1": {"id": "1", "name": "Search", "type": "EditText", ...},
+        "2": {"id": "2", "name": "Search", "type": "ImageButton", ...},
+        ...
+    },
+    "controls_collection_time": 0.345  # seconds
+}
+```
+
+**Caching**: Controls are cached for 5 seconds, but the cache is invalidated after every action (UI likely changed).
+
+### Composed Strategy Execution
+
+The three sub-strategies are executed sequentially in a single composed strategy:
+
+```python
+ComposedStrategy(
+    strategies=[
+        MobileScreenshotCaptureStrategy(fail_fast=True),
+        MobileAppsCollectionStrategy(fail_fast=False),
+        MobileControlsCollectionStrategy(fail_fast=False),
+    ],
+    name="MobileDataCollectionStrategy",
+    fail_fast=True,  # Overall failure if screenshot capture fails
+)
+```
+
+**Execution Order**:
+
+1. Screenshot Capture (critical)
+2. Apps Collection (optional, continues on failure)
+3. Controls Collection (optional, continues on failure)
+
+---
+
+## Phase 2: LLM Interaction Strategy
+
+**Purpose**: Construct mobile-specific prompts with visual context and obtain next action from LLM.
+
+### Strategy Implementation
+
+```python
+@depends_on("installed_apps", "current_controls", "clean_screenshot_url")
+@provides(
+    "parsed_response",
+    "response_text",
+    "llm_cost",
+    "prompt_message",
+    "action",
+    "thought",
+    "comment",
+)
+class MobileLLMInteractionStrategy(AppLLMInteractionStrategy):
+    """
+    Strategy for LLM interaction with Mobile Agent specific prompting.
+    """
+```
+
+### Phase 2 Workflow
+
+```mermaid
+sequenceDiagram
+ participant Strategy
+ participant Agent
+ participant Prompter
+ participant LLM
+
+ Strategy->>Agent: Get previous plan
+ Strategy->>Agent: Get blackboard context
+ Agent-->>Strategy: Previous execution results
+
+ Strategy->>Prompter: Construct mobile prompt
+ Prompter->>Prompter: Build system message (APIs + examples)
+ Prompter->>Prompter: Add screenshot images
+ Prompter->>Prompter: Add annotated screenshot
+ Prompter->>Prompter: Add text prompt with context
+ Prompter-->>Strategy: Complete multimodal prompt
+
+ Strategy->>LLM: Send prompt
+ LLM-->>Strategy: Mobile action + status
+
+ Strategy->>Strategy: Parse response
+ Strategy->>Strategy: Validate action
+ Strategy-->>Agent: Parsed response + cost
+```
+
+### Prompt Construction
+
+The strategy constructs comprehensive multimodal prompts:
+
+```python
+prompt_message = agent.message_constructor(
+    dynamic_examples=[],  # Few-shot examples (optional)
+    dynamic_knowledge="",  # Retrieved knowledge (optional)
+    plan=plan,  # Previous execution plan
+    request=request,  # User request
+    installed_apps=installed_apps,  # Available apps
+    current_controls=current_controls,  # UI controls with IDs
+    screenshot_url=clean_screenshot_url,  # Clean screenshot
+    annotated_screenshot_url=annotated_screenshot_url,  # With control IDs
+    blackboard_prompt=blackboard_prompt,  # Shared context
+    last_success_actions=last_success_actions  # Successful actions
+)
+```
+
+### Multimodal Content Structure
+
+The prompt includes both visual and textual elements:
+
+```python
+user_content = [
+ # 1. Clean screenshot (for visual understanding)
+ {
+ "type": "image_url",
+ "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}
+ },
+
+ # 2. Annotated screenshot (for control identification)
+ {
+ "type": "image_url",
+ "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}
+ },
+
+ # 3. Text prompt with context
+ {
+ "type": "text",
+ "text": """
+ [Previous Plan]: [...]
+ [User Request]: Search for restaurants on Maps
+ [Installed Apps]: [
+ {"id": "1", "name": "com.google.android.apps.maps", ...},
+ ...
+ ]
+ [Current Screen Controls]: [
+ {"id": "1", "name": "Search", "type": "EditText", ...},
+ {"id": "2", "name": "Search", "type": "ImageButton", ...},
+ ...
+ ]
+ [Last Success Actions]: [...]
+ """
+ }
+]
+```
+
+### LLM Response Format
+
+The LLM returns a structured mobile action:
+
+```json
+{
+ "thought": "I need to launch Google Maps app first",
+ "action": {
+ "function": "launch_app",
+ "arguments": {
+ "package_name": "com.google.android.apps.maps",
+ "id": "1"
+ },
+ "status": "CONTINUE"
+ },
+ "comment": "Launching Maps to search for restaurants"
+}
+```
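+
+A response of this shape can be checked before it is handed to the execution phase. The sketch below is illustrative only: `validate_response` and `KNOWN_FUNCTIONS` are hypothetical names (the framework performs its own parsing internally), and the allowed function names are taken from the Phase 3 workflow diagram in this document.
+
+```python
+import json
+from typing import Any, Dict
+
+# Function names as listed in the Phase 3 workflow diagram (assumed complete).
+KNOWN_FUNCTIONS = {
+    "launch_app", "click_control", "type_text", "swipe", "tap", "press_key", "wait",
+}
+
+
+def validate_response(response_text: str) -> Dict[str, Any]:
+    """Parse the LLM reply and check the fields shown in the example above."""
+    data = json.loads(response_text)
+    for key in ("thought", "action", "comment"):
+        if key not in data:
+            raise ValueError(f"missing field: {key}")
+    action = data["action"]
+    if action.get("function") not in KNOWN_FUNCTIONS:
+        raise ValueError(f"unknown function: {action.get('function')}")
+    if not isinstance(action.get("arguments"), dict):
+        raise ValueError("arguments must be an object")
+    return data
+
+
+reply = """
+{
+  "thought": "I need to launch Google Maps app first",
+  "action": {
+    "function": "launch_app",
+    "arguments": {"package_name": "com.google.android.apps.maps", "id": "1"},
+    "status": "CONTINUE"
+  },
+  "comment": "Launching Maps to search for restaurants"
+}
+"""
+parsed = validate_response(reply)
+print(parsed["action"]["function"], parsed["action"]["status"])
+```
+
+Rejecting unknown function names at this point keeps malformed replies from reaching the device.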
+
+### Mobile-Specific Features
+
+**Visual Context Priority**: The LLM sees both the clean and the annotated screenshot, which gives it a better understanding of the UI than text-only descriptions.
+
+**Control ID References**: The annotated screenshot shows control IDs, allowing the LLM to reference UI elements precisely in its actions.
+
+**App Awareness**: The LLM knows which apps are installed, so it can select and launch the appropriate app.
+
+**Touch-Based Actions**: The LLM generates mobile-specific actions (tap, swipe, type) instead of desktop actions (click, drag, keyboard).
+
+---
+
+## Phase 3: Action Execution Strategy
+
+**Purpose**: Execute the mobile actions returned by the LLM and capture structured results.
+
+### Strategy Implementation
+
+```python
+class MobileActionExecutionStrategy(AppActionExecutionStrategy):
+    """
+    Strategy for executing actions in Mobile Agent.
+    """
+```
+
+### Phase 3 Workflow
+
+```mermaid
+sequenceDiagram
+ participant Strategy
+ participant MCP
+ participant ADB
+ participant Device
+
+ Strategy->>Strategy: Extract action from LLM response
+
+ alt launch_app
+ Strategy->>MCP: launch_app(package_name)
+ MCP->>ADB: monkey -p package_name
+ else click_control
+ Strategy->>MCP: click_control(control_id, control_name)
+ MCP->>MCP: Get control from cache
+ MCP->>MCP: Calculate center position
+ MCP->>ADB: input tap x y
+ else type_text
+ Strategy->>MCP: type_text(text, control_id, ...)
+ MCP->>ADB: input tap (focus control)
+ MCP->>ADB: input text (type)
+ else swipe
+ Strategy->>MCP: swipe(start_x, start_y, end_x, end_y)
+ MCP->>ADB: input swipe ...
+ else tap
+ Strategy->>MCP: tap(x, y)
+ MCP->>ADB: input tap x y
+ else press_key
+ Strategy->>MCP: press_key(key_code)
+ MCP->>ADB: input keyevent KEY_CODE
+ else wait
+ Strategy->>Strategy: asyncio.sleep(seconds)
+ end
+
+ ADB->>Device: Execute command
+ Device-->>ADB: Result
+ ADB-->>MCP: Success/Failure
+
+ MCP->>MCP: Invalidate controls cache
+ MCP-->>Strategy: Execution result
+
+ Strategy->>Strategy: Create action info
+ Strategy->>Strategy: Format for memory
+ Strategy-->>Agent: Execution results
+```
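+
+The MCP-to-ADB branch of the diagram can be pictured as a mapping from action names to `adb shell` argument lists. This is a hedged sketch rather than the MCP server's code: `adb_command` is a hypothetical helper, the parameter names follow the diagram, and the extra `monkey` arguments (`-c android.intent.category.LAUNCHER 1`) are an assumption about how the launch is issued. The function only builds the command and does not run it.
+
+```python
+from typing import Any, List
+
+
+def adb_command(function: str, **args: Any) -> List[str]:
+    """Build the adb argument list for a coordinate- or key-based action."""
+    base = ["adb", "shell"]
+    if function == "launch_app":
+        # Assumed launch form: send one launcher intent to the package.
+        return base + [
+            "monkey", "-p", args["package_name"],
+            "-c", "android.intent.category.LAUNCHER", "1",
+        ]
+    if function == "tap":
+        return base + ["input", "tap", str(args["x"]), str(args["y"])]
+    if function == "swipe":
+        coords = [str(args[k]) for k in ("start_x", "start_y", "end_x", "end_y")]
+        return base + ["input", "swipe"] + coords
+    if function == "press_key":
+        return base + ["input", "keyevent", str(args["key_code"])]
+    raise ValueError(f"unsupported function: {function}")
+
+
+print(adb_command("tap", x=480, y=144))
+print(adb_command("press_key", key_code="KEYCODE_HOME"))
+```
+
+Control-based actions such as `click_control` would first resolve the control to coordinates and then reduce to the `tap` case.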
+
+### Action Execution Flow
+
+```python
+# Extract parsed LLM response
+parsed_response: AppAgentResponse = context.get_local("parsed_response")
+command_dispatcher = context.global_context.command_dispatcher
+
+# Execute the action via MCP
+execution_results = await self._execute_app_action(
+    command_dispatcher,
+    parsed_response.action,
+)
+```
+
+### Result Capture
+
+Execution results are structured for downstream processing:
+
+```python
+{
+ "success": True,
+ "action": "click_control(id=5, name=Search)",
+ "message": "Clicked control 'Search' at (480, 144)",
+ "control_info": {
+ "id": "5",
+ "name": "Search",
+ "type": "EditText",
+ "rect": [48, 96, 912, 192]
+ }
+}
+```
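+
+The tap position in the `message` field is consistent with taking the center of `rect`. The sketch below assumes the rectangle is stored as `[left, top, right, bottom]`; `rect_center` is a hypothetical helper, not a function from the codebase.
+
+```python
+from typing import List, Tuple
+
+
+def rect_center(rect: List[int]) -> Tuple[int, int]:
+    """Return the integer center of a [left, top, right, bottom] rectangle."""
+    left, top, right, bottom = rect
+    return ((left + right) // 2, (top + bottom) // 2)
+
+
+print(rect_center([48, 96, 912, 192]))  # (480, 144)
+```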
+
+### Action Info Creation
+
+Results are formatted into `ActionCommandInfo` objects:
+
+```python
+actions = self._create_action_info(
+    parsed_response.action,
+    execution_results,
+)
+
+action_info = ListActionCommandInfo(actions)
+action_info.color_print() # Pretty print to console
+```
+
+### Cache Invalidation
+
+After each action, control caches are invalidated:
+
+```python
+# Mobile MCP server automatically invalidates caches after actions
+# This ensures next round gets fresh UI state
+mobile_state.invalidate_controls()
+```
+
+---
+
+## Phase 4: Memory Update Strategy
+
+**Purpose**: Persist execution results, screenshots, and control information into agent memory for future reference.
+
+### Strategy Implementation
+
+MobileAgent reuses the `AppMemoryUpdateStrategy` from the app agent framework:
+
+```python
+self.strategies[ProcessingPhase.MEMORY_UPDATE] = AppMemoryUpdateStrategy(
+    fail_fast=False  # Memory failures shouldn't stop the process
+)
+```
+
+### Phase 4 Workflow
+
+```mermaid
+sequenceDiagram
+ participant Strategy
+ participant Memory
+ participant Context
+
+ Strategy->>Context: Get execution results
+ Strategy->>Context: Get LLM response
+ Strategy->>Context: Get screenshots
+
+ Strategy->>Memory: Create memory item
+ Memory->>Memory: Store screenshots (clean + annotated)
+ Memory->>Memory: Store action details
+ Memory->>Memory: Store control information
+ Memory->>Memory: Store timestamp
+
+ Strategy->>Context: Update round result
+ Strategy-->>Agent: Memory updated
+```
+
+### Memory Structure
+
+Each execution round is stored as a memory item:
+
+```python
+{
+ "round": 1,
+ "request": "Search for restaurants on Maps",
+ "thought": "I need to launch Google Maps app first",
+ "action": {
+ "function": "launch_app",
+ "arguments": {
+ "package_name": "com.google.android.apps.maps",
+ "id": "1"
+ }
+ },
+ "result": {
+ "success": True,
+ "message": "Launched com.google.android.apps.maps"
+ },
+ "screenshots": {
+ "clean": "logs/.../action_step1.png",
+ "annotated": "logs/.../action_step1_annotated.png"
+ },
+ "controls": [
+ {"id": "1", "name": "Search", "type": "EditText", ...},
+ ...
+ ],
+ "status": "CONTINUE",
+ "timestamp": "2025-11-14T10:30:45"
+}
+```
+
+### Iterative Refinement
+
+Memory enables iterative refinement across rounds:
+
+1. **Round 1**: Launch Maps app → Maps opened
+2. **Round 2**: Click search field (using control ID from Round 1 screenshot)
+3. **Round 3**: Type "restaurants" → Text entered
+4. **Round 4**: Click search button → Results displayed
+
+Each round builds on previous results and screenshots stored in memory.
+
+### Visual Debugging
+
+Memory stores screenshots for each round, enabling visual debugging:
+
+- **Clean Screenshots**: Show actual device UI
+- **Annotated Screenshots**: Show control IDs used by LLM
+- **Action Sequence**: Visual trace of entire task execution
+
+---
+
+## Middleware Stack
+
+MobileAgent uses specialized middleware for logging:
+
+```python
+def _setup_middleware(self) -> None:
+    """Setup middleware pipeline for Mobile Agent"""
+    self.middleware_chain = [MobileLoggingMiddleware()]
+```
+
+### MobileLoggingMiddleware
+
+Provides enhanced logging specific to Mobile operations:
+
+```python
+class MobileLoggingMiddleware(AppAgentLoggingMiddleware):
+    """Specialized logging middleware for Mobile Agent"""
+
+    def starting_message(self, context: ProcessingContext) -> str:
+        request = context.get("request") or "Unknown Request"
+        return f"Completing the user request: [bold cyan]{request}[/bold cyan] on Mobile."
+```
+
+**Logged Information**:
+
+- User request
+- Screenshots captured (with paths)
+- Apps collected
+- Controls identified (with IDs)
+- Each mobile action executed
+- Action results
+- State transitions
+- LLM costs
+- Timing information
+
+---
+
+## Context Finalization
+
+After processing, the processor updates global context:
+
+```python
+def _finalize_processing_context(self, processing_context: ProcessingContext):
+    """Finalize processing context by updating ContextNames fields"""
+    super()._finalize_processing_context(processing_context)
+
+    try:
+        result = processing_context.get_local("result")
+        if result:
+            self.global_context.set(ContextNames.ROUND_RESULT, result)
+    except Exception as e:
+        self.logger.warning(f"Failed to update context: {e}")
+```
+
+This makes execution results available to:
+
+- Subsequent rounds (iterative execution)
+- Other agents (if part of multi-agent workflow)
+- Session manager (for monitoring and logging)
+
+---
+
+## Strategy Dependency Graph
+
+The four phases have clear dependencies:
+
+```mermaid
+graph TD
+ A[log_path + session_step] --> B[Phase 1.1: Screenshot Capture]
+ B --> C[clean_screenshot_url]
+
+ C --> D[Phase 1.2: Apps Collection]
+ D --> E[installed_apps]
+
+ C --> F[Phase 1.3: Controls Collection]
+ F --> G[current_controls]
+ F --> H[annotated_screenshot_url]
+ F --> I[annotation_dict]
+
+ E --> J[Phase 2: LLM Interaction]
+ G --> J
+ C --> J
+ H --> J
+ J --> K[parsed_response]
+ J --> L[llm_cost]
+
+ K --> M[Phase 3: Action Execution]
+ I --> M
+ M --> N[execution_result]
+ M --> O[action_info]
+
+ K --> P[Phase 4: Memory Update]
+ N --> P
+ O --> P
+ C --> P
+ H --> P
+ P --> Q[Memory Updated]
+
+ Q --> R[Next Round or Terminal State]
+```
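+
+The graph above can also be checked mechanically: every key a phase depends on should be provided by an earlier phase or by the initial context. The following is a small sketch under that reading. `PIPELINE` and `missing_dependencies` are hypothetical names, and the key sets are transcribed from the graph rather than from the `@depends_on`/`@provides` decorators in the source.
+
+```python
+from typing import Dict, List, Set, Tuple
+
+# (phase name, keys it depends on, keys it provides), read off the graph above.
+Phase = Tuple[str, Set[str], Set[str]]
+
+PIPELINE: List[Phase] = [
+    ("1.1 screenshot", {"log_path", "session_step"}, {"clean_screenshot_url"}),
+    ("1.2 apps", {"clean_screenshot_url"}, {"installed_apps"}),
+    ("1.3 controls", {"clean_screenshot_url"},
+     {"current_controls", "annotated_screenshot_url", "annotation_dict"}),
+    ("2 llm", {"installed_apps", "current_controls", "clean_screenshot_url",
+               "annotated_screenshot_url"}, {"parsed_response", "llm_cost"}),
+    ("3 action", {"parsed_response", "annotation_dict"},
+     {"execution_result", "action_info"}),
+    ("4 memory", {"parsed_response", "execution_result", "action_info",
+                  "clean_screenshot_url", "annotated_screenshot_url"}, set()),
+]
+
+
+def missing_dependencies(pipeline: List[Phase], initial: Set[str]) -> Dict[str, Set[str]]:
+    """Return, per phase, the keys it needs that nothing earlier provided."""
+    available = set(initial)
+    missing: Dict[str, Set[str]] = {}
+    for name, needs, provides in pipeline:
+        gap = needs - available
+        if gap:
+            missing[name] = gap
+        available |= provides
+    return missing
+
+
+print(missing_dependencies(PIPELINE, {"log_path", "session_step"}))  # {}
+```
+
+An empty result means the phase order satisfies every dependency; dropping or reordering a phase makes the affected keys show up in the returned mapping.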
+
+---
+
+## Modular Design Benefits
+
+The 4-phase strategy design provides:
+
+!!! success "Modularity Benefits"
+    - **Separation of Concerns**: Data collection, LLM reasoning, action execution, and memory are isolated
+    - **Visual Context**: Screenshots provide rich UI understanding beyond text descriptions
+    - **Testability**: Each phase can be tested independently with mocked data
+    - **Extensibility**: New data collection strategies can be added (e.g., accessibility info)
+    - **Reusability**: Memory strategy is shared with AppAgent
+    - **Maintainability**: Clear boundaries between perception, decision, and action
+    - **Traceability**: Each phase logs its operations independently with visual artifacts
+    - **Performance**: Caching strategies reduce ADB overhead
+
+---
+
+## Comparison with Other Agents
+
+| Agent | Phases | Data Collection | Visual | LLM | Action | Memory |
+|-------|--------|----------------|--------|-----|--------|--------|
+| **MobileAgent** | 4 | ✓ Screenshots + Controls + Apps | ✓ Multimodal | ✓ Mobile actions | ✓ Touch/swipe | ✓ Results + Screenshots |
+| **LinuxAgent** | 3 | ✗ On-demand | ✗ Text-only | ✓ CLI commands | ✓ Shell | ✓ Results |
+| **AppAgent** | 4 | ✓ Screenshots + UI | ✓ Multimodal | ✓ UI actions | ✓ GUI + API | ✓ Results + Screenshots |
+| **HostAgent** | 4 | ✓ Desktop snapshot | ✓ Multimodal | ✓ App selection | ✓ Orchestration | ✓ Results |
+
+MobileAgent's 4-phase pipeline includes a **DATA_COLLECTION** phase because:
+
+- Mobile UI requires visual context (screenshots)
+- Control identification needs UI hierarchy parsing
+- Touch targets need precise coordinates
+- Apps list informs available actions
+- Annotation creates visual correspondence between LLM and execution
+
+This reflects the visual, touch-based nature of mobile interaction.
+
+---
+
+## Implementation Location
+
+The strategy implementations can be found in:
+
+```
+ufo/agents/processors/
+├── customized/
+│   └── customized_agent_processor.py  # MobileAgentProcessor
+└── strategies/
+    └── mobile_agent_strategy.py       # Mobile-specific strategies
+```
+
+Key classes:
+
+- `MobileAgentProcessor`: Strategy orchestrator
+- `MobileScreenshotCaptureStrategy`: Screenshot capture via ADB
+- `MobileAppsCollectionStrategy`: Installed apps collection
+- `MobileControlsCollectionStrategy`: UI controls extraction and annotation
+- `MobileLLMInteractionStrategy`: Multimodal prompt construction and LLM interaction
+- `MobileActionExecutionStrategy`: Mobile action execution
+- `MobileLoggingMiddleware`: Enhanced logging
+
+---
+
+## Next Steps
+
+- [MCP Commands](commands.md) - Explore the mobile UI interaction and app management commands
+- [State Machine](state.md) - Understand the 3-state FSM that controls strategy execution
+- [Overview](overview.md) - Return to MobileAgent architecture overview
diff --git a/documents/docs/project_directory_structure.md b/documents/docs/project_directory_structure.md
index 2b44bef6d..42a66c9b7 100644
--- a/documents/docs/project_directory_structure.md
+++ b/documents/docs/project_directory_structure.md
@@ -224,6 +224,27 @@ Lightweight CLI-based agent for Linux devices that integrates with Galaxy as a t
---
+## 📱 Mobile Agent
+
+Android device automation agent that enables UI automation, app control, and mobile-specific operations through ADB integration.
+
+**Key Features**:
+- **UI Automation**: Touch, swipe, and text input via ADB
+- **Visual Context**: Screenshot capture and UI hierarchy analysis
+- **App Management**: Launch apps, navigate between applications
+- **Galaxy Integration**: Serve as mobile device in cross-platform workflows
+- **Platform Support**: Android devices (physical and emulators)
+
+**Configuration**: Configured in `config/ufo/third_party.yaml` under `THIRD_PARTY_AGENT_CONFIG.MobileAgent`
+
+**Mobile Agent Documentation:**
+
+- [Mobile Agent Overview](mobile/overview.md) - Architecture and capabilities
+- [Quick Start](getting_started/quick_start_mobile.md) - Setup and deployment
+- [As Galaxy Device](mobile/as_galaxy_device.md) - Integration with Galaxy
+
+---
+
## ⚙️ Configuration (`config/`)
Modular configuration system with type-safe schemas and auto-discovery.
@@ -334,24 +355,25 @@ Auto-generated execution logs organized by task and timestamp, including screens
---
-## 🎯 Galaxy vs UFO² vs Linux Agent: When to Use What?
+## 🎯 Galaxy vs UFO² vs Linux Agent vs Mobile Agent: When to Use What?
-| Aspect | Galaxy | UFO² | Linux Agent |
-|--------|--------|------|-------------|
-| **Scope** | Multi-device orchestration | Single-device Windows automation | Single-device Linux CLI |
-| **Use Cases** | Cross-platform workflows, distributed tasks | Desktop automation, Office tasks | Server management, CLI operations |
-| **Architecture** | DAG-based task workflows | Two-tier state machines | Simple CLI executor |
-| **Platform** | Orchestrator (platform-agnostic) | Windows | Linux |
-| **Complexity** | Complex multi-step workflows | Simple to moderate tasks | Simple command execution |
-| **Best For** | Cross-device collaboration | Windows desktop tasks | Linux server operations |
-| **Integration** | Orchestrates all agents | Can be Galaxy device | Can be Galaxy device |
+| Aspect | Galaxy | UFO² | Linux Agent | Mobile Agent |
+|--------|--------|------|-------------|--------------|
+| **Scope** | Multi-device orchestration | Single-device Windows automation | Single-device Linux CLI | Single-device Android automation |
+| **Use Cases** | Cross-platform workflows, distributed tasks | Desktop automation, Office tasks | Server management, CLI operations | Mobile app testing, UI automation |
+| **Architecture** | DAG-based task workflows | Two-tier state machines | Simple CLI executor | UI automation via ADB |
+| **Platform** | Orchestrator (platform-agnostic) | Windows | Linux | Android |
+| **Complexity** | Complex multi-step workflows | Simple to moderate tasks | Simple command execution | UI interaction and app control |
+| **Best For** | Cross-device collaboration | Windows desktop tasks | Linux server operations | Mobile app automation |
+| **Integration** | Orchestrates all agents | Can be Galaxy device | Can be Galaxy device | Can be Galaxy device |
**Choosing the Right Framework:**
- **Use Galaxy** when: Tasks span multiple devices/platforms, complex workflows with dependencies
- **Use UFO² Standalone** when: Single-device Windows automation, rapid prototyping
- **Use Linux Agent** when: Linux server/CLI operations needed in Galaxy workflows
-- **Best Practice**: Galaxy orchestrates UFO² (Windows) + Linux Agent (Linux) for cross-platform tasks
+- **Use Mobile Agent** when: Android device automation, mobile app testing, UI interactions
+- **Best Practice**: Galaxy orchestrates UFO² (Windows) + Linux Agent (Linux) + Mobile Agent (Android) for comprehensive cross-platform tasks
---
@@ -390,6 +412,7 @@ python -m ufo --task --config_path config/ufo/
- [Galaxy Quick Start](getting_started/quick_start_galaxy.md)
- [UFO² Quick Start](getting_started/quick_start_ufo2.md)
- [Linux Agent Quick Start](getting_started/quick_start_linux.md)
+- [Mobile Agent Quick Start](getting_started/quick_start_mobile.md)
- [Migration Guide](getting_started/migration_ufo2_to_galaxy.md)
### Galaxy Framework
@@ -410,6 +433,10 @@ python -m ufo --task --config_path config/ufo/
- [Linux Agent Overview](linux/overview.md)
- [As Galaxy Device](linux/as_galaxy_device.md)
+### Mobile Agent
+- [Mobile Agent Overview](mobile/overview.md)
+- [As Galaxy Device](mobile/as_galaxy_device.md)
+
### MCP System
- [MCP Overview](mcp/overview.md)
- [Local Servers](mcp/local_servers.md)
diff --git a/documents/docs/tutorials/creating_device_agent/configuration.md b/documents/docs/tutorials/creating_device_agent/configuration.md
index 142eaf884..c0bf43798 100644
--- a/documents/docs/tutorials/creating_device_agent/configuration.md
+++ b/documents/docs/tutorials/creating_device_agent/configuration.md
@@ -366,7 +366,8 @@ system: |-
user: |-
{{user_request}}
[See attached image]
- {{ui_tree}}
+ {{installed_apps}}
+ {{current_controls}}
{{last_success_actions}}
{{prev_plan}}
@@ -407,9 +408,9 @@ example2:
Find and tap the "Login" button on the current screen.
Response:
observation: |-
- The current screenshot shows a login screen with email and password input fields. There is a button with text "Login" visible near the bottom of the screen. According to the UI tree, the button is located at coordinates (540, 1650) with resource-id "com.example.app:id/login_button".
+ The current screenshot shows a login screen with email and password input fields. There is a button with text "Login" visible near the bottom of the screen. According to the current screen controls list, the button is located at coordinates (540, 1650) with resource-id "com.example.app:id/login_button".
thought: |-
- I can see the Login button in the UI tree. I'll tap it using the coordinates provided.
+ I can see the Login button in the controls list. I'll tap it using the coordinates provided.
action:
function: |-
tap_screen
diff --git a/documents/docs/tutorials/creating_device_agent/core_components.md b/documents/docs/tutorials/creating_device_agent/core_components.md
index 5b0c3bf15..4b03f4fc4 100644
--- a/documents/docs/tutorials/creating_device_agent/core_components.md
+++ b/documents/docs/tutorials/creating_device_agent/core_components.md
@@ -195,7 +195,6 @@ class MobileAgent(CustomizedAgent):
name: str,
main_prompt: str,
example_prompt: str,
- platform: str = "android", # Platform: "android" or "ios"
) -> None:
"""
Initialize the MobileAgent.
@@ -203,7 +202,6 @@ class MobileAgent(CustomizedAgent):
:param name: Agent instance name
:param main_prompt: Main prompt template path
:param example_prompt: Example prompt template path
- :param platform: Mobile platform ("android" or "ios")
"""
super().__init__(
name=name,
@@ -211,63 +209,74 @@ class MobileAgent(CustomizedAgent):
example_prompt=example_prompt,
process_name=None,
app_root_name=None,
- is_visual=True, # Mobile agents typically use screenshots
+ is_visual=None, # Visual mode set by processor based on data collection
)
- # Store platform information
- self._platform = platform
-
- # Initialize blackboard
+ # Initialize blackboard for multi-agent coordination
self._blackboard = Blackboard()
# Set default state
self.set_state(self.default_state)
+ # Track context provision
+ self._context_provision_executed = False
+
# Logger
self.logger = logging.getLogger(__name__)
self.logger.info(
- f"MobileAgent initialized for platform: {platform}"
+ f"Main prompt: {main_prompt}, Example prompt: {example_prompt}"
)
def get_prompter(
self, is_visual: bool, main_prompt: str, example_prompt: str
) -> MobileAgentPrompter:
- """Get the prompter for MobileAgent."""
+ """
+ Get the prompter for MobileAgent.
+
+ :param is_visual: Whether the agent uses visual mode (enabled for MobileAgent)
+ :param main_prompt: Main prompt template path
+ :param example_prompt: Example prompt template path
+ :return: MobileAgentPrompter instance
+ """
return MobileAgentPrompter(main_prompt, example_prompt)
@property
def default_state(self) -> ContinueMobileAgentState:
- """Get the default state."""
+ """
+ Get the default state.
+
+ :return: ContinueMobileAgentState instance
+ """
return ContinueMobileAgentState()
@property
def blackboard(self) -> Blackboard:
- """Get the blackboard."""
+ """
+ Get the blackboard for multi-agent coordination.
+
+ :return: Blackboard instance
+ """
return self._blackboard
-
- @property
- def platform(self) -> str:
- """Get the mobile platform (android/ios)."""
- return self._platform
```
### Key Differences from LinuxAgent
| Aspect | LinuxAgent | MobileAgent |
|--------|-----------|-------------|
-| **is_visual** | `None` (no screenshots) | `True` (UI screenshots needed) |
-| **Platform Tracking** | Not needed | `self._platform` stores "android"/"ios" |
+| **is_visual** | `None` (no screenshots by default) | `None` (visual mode managed by processor) |
+| **Platform Tracking** | Not needed | Not explicitly tracked (handled by device metadata) |
| **Processor** | `LinuxAgentProcessor` | `MobileAgentProcessor` |
| **Prompter** | `LinuxAgentPrompter` | `MobileAgentPrompter` |
| **Default State** | `ContinueLinuxAgentState` | `ContinueMobileAgentState` |
+| **Data Collection** | No screenshots | Screenshots, apps, UI controls via strategies |
!!! tip "Agent Class Best Practices"
- ✅ Always call `super().__init__()` first
- ✅ Initialize blackboard for multi-agent coordination
- - ✅ Set `is_visual=True` if your agent uses screenshots
+ - ✅ Set `is_visual=None` and let processor determine visual mode
- ✅ Use meaningful logger messages for debugging
- - ✅ Store platform-specific metadata as properties
- ✅ Keep initialization logic minimal (delegate to processor)
+ - ✅ Track context provision to avoid redundant operations
---
@@ -395,10 +404,14 @@ from ufo.agents.processors.strategies.customized_agent_processing_strategy impor
CustomizedScreenshotCaptureStrategy,
)
from ufo.agents.processors.strategies.mobile_agent_strategy import (
+ MobileScreenshotCaptureStrategy,
+ MobileAppsCollectionStrategy,
+ MobileControlsCollectionStrategy,
MobileActionExecutionStrategy,
MobileLLMInteractionStrategy,
MobileLoggingMiddleware,
)
+from ufo.agents.processors.strategies.processing_strategy import ComposedStrategy
if TYPE_CHECKING:
from ufo.agents.agent.customized_agent import MobileAgent
@@ -409,8 +422,8 @@ class MobileAgentProcessor(CustomizedProcessor):
Processor for MobileAgent.
Manages execution pipeline with mobile-specific strategies:
- - Data Collection: Screenshots and UI hierarchy
- - LLM Interaction: Mobile UI understanding
+ - Data Collection: Screenshots, installed apps, and UI controls
+ - LLM Interaction: Mobile UI understanding with visual context
- Action Execution: Touch gestures, swipes, taps
- Memory Update: Context tracking
"""
@@ -418,11 +431,15 @@ class MobileAgentProcessor(CustomizedProcessor):
def _setup_strategies(self) -> None:
"""Setup processing strategies for MobileAgent."""
- # Phase 1: Data Collection (screenshots + UI tree)
- self.strategies[ProcessingPhase.DATA_COLLECTION] = (
- CustomizedScreenshotCaptureStrategy(
- fail_fast=True # Stop if screenshot capture fails
- )
+ # Phase 1: Data Collection (compose multiple strategies)
+ self.strategies[ProcessingPhase.DATA_COLLECTION] = ComposedStrategy(
+ strategies=[
+ MobileScreenshotCaptureStrategy(fail_fast=True),
+ MobileAppsCollectionStrategy(fail_fast=False),
+ MobileControlsCollectionStrategy(fail_fast=False),
+ ],
+ name="MobileDataCollectionStrategy",
+ fail_fast=True, # Stop if critical data (screenshot) fails
)
# Phase 2: LLM Interaction (mobile UI understanding)
@@ -435,7 +452,7 @@ class MobileAgentProcessor(CustomizedProcessor):
# Phase 3: Action Execution (touch gestures)
self.strategies[ProcessingPhase.ACTION_EXECUTION] = (
MobileActionExecutionStrategy(
- fail_fast=False # Retry on action failures
+ fail_fast=False # Continue on action failures for retry
)
)
@@ -459,13 +476,9 @@ class MobileAgentProcessor(CustomizedProcessor):
try:
# Extract mobile-specific results
result = processing_context.get_local("result")
- ui_state = processing_context.get_local("ui_state")
if result:
self.global_context.set(ContextNames.ROUND_RESULT, result)
- if ui_state:
- # Store UI state for next round
- self.global_context.set("MOBILE_UI_STATE", ui_state)
except Exception as e:
self.logger.warning(f"Failed to finalize context: {e}")
@@ -1132,7 +1145,7 @@ if TYPE_CHECKING:
from ufo.agents.agent.customized_agent import MobileAgent
-@depends_on("request", "screenshot", "ui_tree")
+@depends_on("request", "clean_screenshot_url", "annotated_screenshot_url", "installed_apps", "current_controls")
@provides(
"parsed_response",
"response_text",
@@ -1146,7 +1159,7 @@ class MobileLLMInteractionStrategy(AppLLMInteractionStrategy):
"""
LLM interaction strategy for MobileAgent.
- Handles mobile UI screenshots and hierarchy for LLM understanding.
+ Handles mobile UI screenshots, installed apps, and screen controls for LLM understanding.
"""
def __init__(self, fail_fast: bool = True) -> None:
@@ -1159,32 +1172,40 @@ class MobileLLMInteractionStrategy(AppLLMInteractionStrategy):
try:
# Extract mobile-specific context
request = context.get("request")
- screenshot = context.get_local("screenshot")
- ui_tree = context.get_local("ui_tree")
+ screenshot_url = context.get_local("clean_screenshot_url")
+ annotated_screenshot_url = context.get_local("annotated_screenshot_url")
+ installed_apps = context.get_local("installed_apps", [])
+ current_controls = context.get_local("current_controls", [])
+
+ self.logger.info("Building Mobile Agent prompt")
- self.logger.info(f"Building Mobile Agent prompt for {agent.platform}")
+ # Get blackboard context (if multi-agent)
+ blackboard_prompt = []
+ if not agent.blackboard.is_empty():
+ blackboard_prompt = agent.blackboard.blackboard_to_prompt()
- # Build prompt with mobile context
+ # Construct prompt message with mobile-specific data
prompt_message = agent.message_constructor(
dynamic_examples=[],
dynamic_knowledge="",
plan=self._get_prev_plan(agent),
request=request,
- screenshot=screenshot,
- ui_tree=ui_tree,
- blackboard_prompt=(
- agent.blackboard.blackboard_to_prompt()
- if not agent.blackboard.is_empty() else []
- ),
+ installed_apps=installed_apps,
+ current_controls=current_controls,
+ screenshot_url=screenshot_url,
+ annotated_screenshot_url=annotated_screenshot_url,
+ blackboard_prompt=blackboard_prompt,
last_success_actions=self._get_last_success_actions(agent),
)
# Get LLM response
+ self.logger.info("Getting LLM response for Mobile Agent")
response_text, llm_cost = await self._get_llm_response(
agent, prompt_message
)
# Parse response
+ self.logger.info("Parsing Mobile Agent response")
parsed_response = self._parse_app_response(agent, response_text)
return ProcessingResult(
@@ -1421,7 +1442,7 @@ class MobileAgentPrompter(AppAgentPrompter):
"""
Prompter for MobileAgent.
- Handles mobile UI screenshots and hierarchy for LLM prompts.
+ Handles mobile UI screenshots, installed apps, and control information for LLM prompts.
"""
def __init__(
@@ -1455,16 +1476,18 @@ class MobileAgentPrompter(AppAgentPrompter):
self,
prev_plan: List[str],
user_request: str,
- ui_tree: str = "",
+ installed_apps: List[Dict[str, Any]],
+ current_controls: List[Dict[str, Any]],
retrieved_docs: str = "",
last_success_actions: List[Dict[str, Any]] = [],
) -> str:
"""
- Construct user prompt with mobile UI context.
+ Construct user prompt with mobile context.
:param prev_plan: Previous plan
:param user_request: User request
- :param ui_tree: Mobile UI hierarchy (XML/JSON)
+ :param installed_apps: List of installed apps on the device
+ :param current_controls: List of current screen controls
:param retrieved_docs: Retrieved docs
:param last_success_actions: Last actions
:return: User prompt string
@@ -1472,7 +1495,8 @@ class MobileAgentPrompter(AppAgentPrompter):
prompt = self.prompt_template["user"].format(
prev_plan=json.dumps(prev_plan),
user_request=user_request,
- ui_tree=ui_tree, # Mobile-specific
+ installed_apps=json.dumps(installed_apps),
+ current_controls=json.dumps(current_controls),
retrieved_docs=retrieved_docs,
last_success_actions=json.dumps(last_success_actions),
)
@@ -1483,47 +1507,83 @@ class MobileAgentPrompter(AppAgentPrompter):
self,
prev_plan: List[str],
user_request: str,
- screenshot: Any = None, # Mobile screenshot
- ui_tree: str = "",
+ installed_apps: List[Dict[str, Any]],
+ current_controls: List[Dict[str, Any]],
+ screenshot_url: str = None, # Clean screenshot (base64 URL)
+ annotated_screenshot_url: str = None, # Annotated screenshot (base64 URL)
retrieved_docs: str = "",
last_success_actions: List[Dict[str, Any]] = [],
) -> List[Dict[str, str]]:
"""
- Construct user content with screenshot for vision LLMs.
+ Construct user content with screenshots for vision LLMs.
:param prev_plan: Previous plan
:param user_request: User request
- :param screenshot: Screenshot image (base64 or path)
- :param ui_tree: UI hierarchy
+ :param installed_apps: List of installed apps
+ :param current_controls: List of current screen controls
+ :param screenshot_url: Clean screenshot (base64 URL)
+ :param annotated_screenshot_url: Annotated screenshot (base64 URL)
:param retrieved_docs: Retrieved docs
:param last_success_actions: Last actions
- :return: List of content dicts (text + image)
+ :return: List of content dicts (images + text)
"""
user_content = []
+ # Add screenshots if available (for vision LLMs)
+ if screenshot_url:
+ user_content.append({
+ "type": "image_url",
+ "image_url": {"url": screenshot_url},
+ })
+
+ if annotated_screenshot_url:
+ user_content.append({
+ "type": "image_url",
+ "image_url": {"url": annotated_screenshot_url},
+ })
+
# Add text prompt
user_content.append({
"type": "text",
"text": self.user_prompt_construction(
prev_plan=prev_plan,
user_request=user_request,
- ui_tree=ui_tree,
+ installed_apps=installed_apps,
+ current_controls=current_controls,
retrieved_docs=retrieved_docs,
last_success_actions=last_success_actions,
),
})
- # Add screenshot if available (for vision LLMs)
- if screenshot:
- user_content.append({
- "type": "image_url",
- "image_url": {
- "url": f"data:image/png;base64,{screenshot}"
- },
- })
-
return user_content
```
+
+!!! note "Note on user_content_construction"
+    The actual `user_content_construction` method is already shown above in the MobileAgentPrompter class.
+    It handles screenshot URLs and control information for vision LLMs.
### Prompter Best Practices
@@ -1533,6 +1593,8 @@ class MobileAgentPrompter(AppAgentPrompter):
- ✅ Use `user_content_construction()` for multi-modal content
- ✅ Format examples with `examples_prompt_helper()`
- ✅ Format APIs with `api_prompt_helper()`
+- ✅ Pass screenshots as base64 data URLs for vision model support
+- ✅ Include installed_apps and current_controls for mobile context
- ❌ Don't hardcode prompts - use YAML templates
---
@@ -1561,13 +1623,11 @@ class TestMobileAgent:
name="test_mobile_agent",
main_prompt="ufo/prompts/third_party/mobile_agent.yaml",
example_prompt="ufo/prompts/third_party/mobile_agent_example.yaml",
- platform="android",
)
def test_agent_initialization(self, agent):
"""Test agent initializes correctly."""
assert agent.name == "test_mobile_agent"
- assert agent.platform == "android"
assert agent.prompter is not None
assert agent.blackboard is not None
@@ -1602,7 +1662,6 @@ class TestMobileAgentPipeline:
name="test_agent",
main_prompt="ufo/prompts/third_party/mobile_agent.yaml",
example_prompt="ufo/prompts/third_party/mobile_agent_example.yaml",
- platform="android",
)
context = Context()
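The `user_content_construction` change in this file's diff moves the screenshots ahead of the text prompt and accepts two optional image URLs. A minimal sketch of that ordering follows. It assumes a standalone helper named `build_user_content` (hypothetical, not part of the repository) and an already-formatted prompt string, so it illustrates the shape of the returned list rather than the actual prompter method.

```python
from typing import Any, Dict, List, Optional


def build_user_content(
    prompt_text: str,
    screenshot_url: Optional[str] = None,
    annotated_screenshot_url: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return a multimodal content list: image parts first, then the text part.

    Hypothetical helper for illustration; the real method lives on the prompter.
    """
    content: List[Dict[str, Any]] = []
    # Only include images that were actually captured.
    for url in (screenshot_url, annotated_screenshot_url):
        if url:
            content.append({"type": "image_url", "image_url": {"url": url}})
    # The text prompt always comes last.
    content.append({"type": "text", "text": prompt_text})
    return content


parts = build_user_content(
    "Open Settings",
    screenshot_url="data:image/png;base64,AAAA",  # placeholder payload
)
print([part["type"] for part in parts])
```

When neither URL is supplied, the list contains only the text part, which is what a non-vision model would receive.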
diff --git a/documents/docs/tutorials/creating_device_agent/example_mobile_agent.md b/documents/docs/tutorials/creating_device_agent/example_mobile_agent.md
index 8fdbba21d..7e045a1c1 100644
--- a/documents/docs/tutorials/creating_device_agent/example_mobile_agent.md
+++ b/documents/docs/tutorials/creating_device_agent/example_mobile_agent.md
@@ -67,16 +67,32 @@ For now, study the **LinuxAgent** implementation as a complete reference:
processor_cls=MobileAgentProcessor
)
class MobileAgent(CustomizedAgent):
- def __init__(self, name, main_prompt, example_prompt, platform="android"):
+ def __init__(self, name, main_prompt, example_prompt):
super().__init__(name, main_prompt, example_prompt,
- process_name=None, app_root_name=None, is_visual=True)
- self._platform = platform
+ process_name=None, app_root_name=None, is_visual=None)
self._blackboard = Blackboard()
self.set_state(self.default_state)
+ self._context_provision_executed = False
@property
def default_state(self):
return ContinueMobileAgentState()
+
+ def message_constructor(
+ self,
+ dynamic_examples,
+ dynamic_knowledge,
+ plan,
+ request,
+ installed_apps,
+ current_controls,
+ screenshot_url=None,
+ annotated_screenshot_url=None,
+ blackboard_prompt=None,
+ last_success_actions=None,
+ ):
+ # Construct prompt for LLM with mobile-specific context
+ return self.prompter.prompt_construction(...)
```
## Related Documentation
diff --git a/documents/docs/tutorials/creating_device_agent/overview.md b/documents/docs/tutorials/creating_device_agent/overview.md
index d4ed243c7..a5c7c25b2 100644
--- a/documents/docs/tutorials/creating_device_agent/overview.md
+++ b/documents/docs/tutorials/creating_device_agent/overview.md
@@ -494,9 +494,10 @@ For experienced developers, here's a **minimal implementation checklist**:
class MobileAgent(CustomizedAgent):
def __init__(self, name, main_prompt, example_prompt):
super().__init__(name, main_prompt, example_prompt,
- process_name=None, app_root_name=None, is_visual=True)
+ process_name=None, app_root_name=None, is_visual=None)
self._blackboard = Blackboard()
self.set_state(self.default_state)
+ self._context_provision_executed = False
@property
def default_state(self):
@@ -510,6 +511,17 @@ class MobileAgent(CustomizedAgent):
class MobileAgentProcessor(CustomizedProcessor):
def _setup_strategies(self):
+ # Compose multiple data collection strategies
+ self.strategies[ProcessingPhase.DATA_COLLECTION] = ComposedStrategy(
+ strategies=[
+ MobileScreenshotCaptureStrategy(fail_fast=True),
+ MobileAppsCollectionStrategy(fail_fast=False),
+ MobileControlsCollectionStrategy(fail_fast=False),
+ ],
+ name="MobileDataCollectionStrategy",
+ fail_fast=True,
+ )
+
self.strategies[ProcessingPhase.LLM_INTERACTION] = (
MobileLLMInteractionStrategy(fail_fast=True)
)
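The `ComposedStrategy` hunk above mixes `fail_fast=True` and `fail_fast=False` sub-strategies. The sketch below shows one plausible reading of that flag: a failing step marked `fail_fast=True` aborts the whole run, while a failing step marked `fail_fast=False` is recorded and skipped. The `Step` class and `run_composed` function are hypothetical stand-ins, and the real `ComposedStrategy` semantics may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """Hypothetical stand-in for one data collection strategy."""

    name: str
    run: Callable[[], Dict[str, Any]]
    fail_fast: bool


def run_composed(steps: List[Step]) -> Dict[str, Any]:
    """Run steps in order and merge their outputs into one dict.

    Assumed semantics: a failing step with fail_fast=True re-raises and
    aborts the run; a failing step with fail_fast=False is logged in
    "errors" and the remaining steps still execute.
    """
    merged: Dict[str, Any] = {"errors": []}
    for step in steps:
        try:
            merged.update(step.run())
        except Exception as exc:
            if step.fail_fast:
                raise
            merged["errors"].append(f"{step.name}: {exc}")
    return merged


def failing_apps() -> Dict[str, Any]:
    # Simulates an optional collector that cannot reach the device.
    raise RuntimeError("package list unavailable")


result = run_composed(
    [
        Step("screenshot", lambda: {"screenshot": "data:image/png;base64,AAAA"}, fail_fast=True),
        Step("apps", failing_apps, fail_fast=False),
        Step("controls", lambda: {"controls": []}, fail_fast=False),
    ]
)
print(result["errors"])
```

Under this reading, the screenshot is treated as mandatory context while the app list and control list are best-effort.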
diff --git a/documents/mkdocs.yml b/documents/mkdocs.yml
index db2e3f310..ddad6b286 100644
--- a/documents/mkdocs.yml
+++ b/documents/mkdocs.yml
@@ -9,6 +9,7 @@ nav:
- Quick Start (UFO³ Agent Galaxy): getting_started/quick_start_galaxy.md
- Quick Start (UFO²): getting_started/quick_start_ufo2.md
- Quick Start (Linux Agent): getting_started/quick_start_linux.md
+ - Quick Start (Mobile Agent): getting_started/quick_start_mobile.md
- Migration UFO² → UFO³: getting_started/migration_ufo2_to_galaxy.md
- More Guidance: getting_started/more_guidance.md
- Configuration & Setup:
@@ -145,6 +146,12 @@ nav:
- State Machine: linux/state.md
- Processing Strategy: linux/strategy.md
- MCP Commands: linux/commands.md
+ - Mobile Agent:
+ - Overview: mobile/overview.md
+ - Using as Galaxy Device: mobile/as_galaxy_device.md
+ - State Machine: mobile/state.md
+ - Processing Strategy: mobile/strategy.md
+ - MCP Commands: mobile/commands.md
- Tutorials & Development:
- Creating Custom MCP Servers: tutorials/creating_mcp_servers.md
- Creating Custom Third-Party Agents: tutorials/creating_third_party_agents.md
@@ -227,6 +234,7 @@ nav:
- ConstellationEditor: mcp/servers/constellation_editor.md
- HardwareExecutor: mcp/servers/hardware_executor.md
- BashExecutor: mcp/servers/bash_executor.md
+ - MobileExecutor: mcp/servers/mobile_executor.md
- About:
- Contributing: about/CONTRIBUTING.md
- License: about/LICENSE.md
diff --git a/galaxy/README.md b/galaxy/README.md
index 9493a3489..a15842f98 100644
--- a/galaxy/README.md
+++ b/galaxy/README.md
@@ -600,8 +600,9 @@ UFO³ is designed as a **universal orchestration framework** that seamlessly int
**Multi-Platform Support:**
- 🪟 **Windows** — Desktop automation via UFO²
- 🐧 **Linux** — Server management, DevOps, data processing
-- 📱 **Mobile** — Extend to iOS/Android (coming soon)
+- 📱 **Android** — Mobile device automation via MCP
- 🌐 **Web** — Browser-based agents (coming soon)
+- 🍎 **macOS** — Desktop automation (coming soon)
- 🤖 **IoT/Embedded** — Edge devices and sensors (coming soon)
**Developer-Friendly:**
diff --git a/galaxy/README_ZH.md b/galaxy/README_ZH.md
index c6c99b82c..e61005c39 100644
--- a/galaxy/README_ZH.md
+++ b/galaxy/README_ZH.md
@@ -601,8 +601,9 @@ UFO³ 设计为**通用编排框架**,可无缝集成跨平台的异构设备
**多平台支持:**
- 🪟 **Windows** — 通过 UFO² 实现桌面自动化
- 🐧 **Linux** — 服务器管理、DevOps、数据处理
-- 📱 **移动** — 扩展到 iOS/Android(即将推出)
+- 📱 **Android** — 通过 MCP 实现移动设备自动化
- 🌐 **Web** — 基于浏览器的智能体(即将推出)
+- 🍎 **macOS** — 桌面自动化(即将推出)
- 🤖 **IoT/嵌入式** — 边缘设备和传感器(即将推出)
**开发者友好:**
diff --git a/tests/aip/test_binary_transfer.py b/tests/aip/test_binary_transfer.py
new file mode 100644
index 000000000..34d6bd338
--- /dev/null
+++ b/tests/aip/test_binary_transfer.py
@@ -0,0 +1,492 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+"""
+Unit Tests for AIP Binary Transfer
+
+Tests the binary transfer capabilities of the Agent Interaction Protocol,
+including adapters, transport, and protocol layers.
+"""
+
+import asyncio
+import os
+import tempfile
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+from aip.messages import (
+ BinaryMetadata,
+ ChunkMetadata,
+ FileTransferStart,
+ FileTransferComplete,
+)
+from aip.protocol import AIPProtocol
+from aip.transport import WebSocketTransport
+from aip.transport.adapters import (
+ FastAPIWebSocketAdapter,
+ WebSocketAdapter,
+ WebSocketsLibAdapter,
+)
+
+
+# ============================================================================
+# Adapter Tests
+# ============================================================================
+
+
+class TestWebSocketAdapterBinary:
+ """Test binary methods in WebSocket adapters"""
+
+ @pytest.mark.asyncio
+ async def test_fastapi_adapter_send_bytes(self):
+ """Test FastAPI adapter send_bytes method"""
+ mock_ws = MagicMock()
+ mock_ws.send_bytes = AsyncMock()
+ mock_ws.client_state = MagicMock()
+
+ adapter = FastAPIWebSocketAdapter(mock_ws)
+
+ test_data = b"test binary data"
+ await adapter.send_bytes(test_data)
+
+ mock_ws.send_bytes.assert_called_once_with(test_data)
+
+ @pytest.mark.asyncio
+ async def test_fastapi_adapter_receive_bytes(self):
+ """Test FastAPI adapter receive_bytes method"""
+ mock_ws = MagicMock()
+ test_data = b"received binary data"
+ mock_ws.receive_bytes = AsyncMock(return_value=test_data)
+
+ adapter = FastAPIWebSocketAdapter(mock_ws)
+ received = await adapter.receive_bytes()
+
+ assert received == test_data
+ mock_ws.receive_bytes.assert_called_once()
+
+ @pytest.mark.asyncio
+ async def test_fastapi_adapter_receive_auto_binary(self):
+ """Test FastAPI adapter receive_auto with binary frame"""
+ mock_ws = MagicMock()
+ test_data = b"binary frame"
+ mock_ws.receive = AsyncMock(return_value={"bytes": test_data})
+
+ adapter = FastAPIWebSocketAdapter(mock_ws)
+ received = await adapter.receive_auto()
+
+ assert received == test_data
+ assert isinstance(received, bytes)
+
+ @pytest.mark.asyncio
+ async def test_fastapi_adapter_receive_auto_text(self):
+ """Test FastAPI adapter receive_auto with text frame"""
+ mock_ws = MagicMock()
+ test_data = "text frame"
+ mock_ws.receive = AsyncMock(return_value={"text": test_data})
+
+ adapter = FastAPIWebSocketAdapter(mock_ws)
+ received = await adapter.receive_auto()
+
+ assert received == test_data
+ assert isinstance(received, str)
+
+ @pytest.mark.asyncio
+ async def test_websockets_lib_adapter_send_bytes(self):
+ """Test websockets library adapter send_bytes method"""
+ mock_ws = MagicMock()
+ mock_ws.send = AsyncMock()
+ mock_ws.closed = False
+
+ adapter = WebSocketsLibAdapter(mock_ws)
+
+ test_data = b"test binary data"
+ await adapter.send_bytes(test_data)
+
+ mock_ws.send.assert_called_once_with(test_data)
+
+ @pytest.mark.asyncio
+ async def test_websockets_lib_adapter_receive_bytes(self):
+ """Test websockets library adapter receive_bytes method"""
+ mock_ws = MagicMock()
+ test_data = b"received binary data"
+ mock_ws.recv = AsyncMock(return_value=test_data)
+ mock_ws.closed = False
+
+ adapter = WebSocketsLibAdapter(mock_ws)
+ received = await adapter.receive_bytes()
+
+ assert received == test_data
+ mock_ws.recv.assert_called_once()
+
+ @pytest.mark.asyncio
+ async def test_websockets_lib_adapter_receive_bytes_error(self):
+ """Test websockets library adapter receive_bytes with text frame (error)"""
+ mock_ws = MagicMock()
+ mock_ws.recv = AsyncMock(return_value="text frame") # Wrong type
+ mock_ws.closed = False
+
+ adapter = WebSocketsLibAdapter(mock_ws)
+
+ with pytest.raises(ValueError, match="Expected binary"):
+ await adapter.receive_bytes()
+
+ @pytest.mark.asyncio
+ async def test_websockets_lib_adapter_receive_auto(self):
+ """Test websockets library adapter receive_auto method"""
+ mock_ws = MagicMock()
+ test_data = b"auto-detected binary"
+ mock_ws.recv = AsyncMock(return_value=test_data)
+ mock_ws.closed = False
+
+ adapter = WebSocketsLibAdapter(mock_ws)
+ received = await adapter.receive_auto()
+
+ assert received == test_data
+ assert isinstance(received, bytes)
+
+
+# ============================================================================
+# Transport Tests
+# ============================================================================
+
+
+class TestWebSocketTransportBinary:
+ """Test binary methods in WebSocketTransport"""
+
+ @pytest.mark.asyncio
+ async def test_send_binary(self):
+ """Test send_binary method"""
+ mock_adapter = MagicMock(spec=WebSocketAdapter)
+ mock_adapter.send_bytes = AsyncMock()
+ mock_adapter.is_open = MagicMock(return_value=True)
+
+ transport = WebSocketTransport()
+ transport._adapter = mock_adapter
+ transport._state = transport._state.CONNECTED
+
+ test_data = b"test binary data"
+ await transport.send_binary(test_data)
+
+ mock_adapter.send_bytes.assert_called_once_with(test_data)
+
+ @pytest.mark.asyncio
+ async def test_send_binary_not_connected(self):
+ """Test send_binary when not connected"""
+ transport = WebSocketTransport()
+
+ with pytest.raises(ConnectionError, match="not connected"):
+ await transport.send_binary(b"test")
+
+ @pytest.mark.asyncio
+ async def test_receive_binary(self):
+ """Test receive_binary method"""
+ mock_adapter = MagicMock(spec=WebSocketAdapter)
+ test_data = b"received binary data"
+ mock_adapter.receive_bytes = AsyncMock(return_value=test_data)
+ mock_adapter.is_open = MagicMock(return_value=True)
+
+ transport = WebSocketTransport()
+ transport._adapter = mock_adapter
+ transport._state = transport._state.CONNECTED
+
+ received = await transport.receive_binary()
+
+ assert received == test_data
+ mock_adapter.receive_bytes.assert_called_once()
+
+ @pytest.mark.asyncio
+ async def test_receive_auto_binary(self):
+ """Test receive_auto with binary frame"""
+ mock_adapter = MagicMock(spec=WebSocketAdapter)
+ test_data = b"binary frame"
+ mock_adapter.receive_auto = AsyncMock(return_value=test_data)
+
+ transport = WebSocketTransport()
+ transport._adapter = mock_adapter
+ transport._state = transport._state.CONNECTED
+
+ received = await transport.receive_auto()
+
+ assert received == test_data
+ assert isinstance(received, bytes)
+
+ @pytest.mark.asyncio
+ async def test_receive_auto_text(self):
+ """Test receive_auto with text frame"""
+ mock_adapter = MagicMock(spec=WebSocketAdapter)
+ test_data = "text frame"
+ mock_adapter.receive_auto = AsyncMock(return_value=test_data)
+
+ transport = WebSocketTransport()
+ transport._adapter = mock_adapter
+ transport._state = transport._state.CONNECTED
+
+ received = await transport.receive_auto()
+
+ assert received == test_data
+ assert isinstance(received, str)
+
+
+# ============================================================================
+# Protocol Tests
+# ============================================================================
+
+
+class TestAIPProtocolBinary:
+ """Test binary message handling in AIPProtocol"""
+
+ @pytest.mark.asyncio
+ async def test_send_binary_message(self):
+ """Test send_binary_message method"""
+ mock_transport = MagicMock(spec=WebSocketTransport)
+ mock_transport.send = AsyncMock()
+ mock_transport.send_binary = AsyncMock()
+
+ protocol = AIPProtocol(mock_transport)
+
+ test_data = b"test binary content"
+ metadata = {
+ "filename": "test.bin",
+ "mime_type": "application/octet-stream",
+ }
+
+ await protocol.send_binary_message(test_data, metadata)
+
+ # Verify metadata was sent as text frame
+ assert mock_transport.send.called
+ sent_metadata = mock_transport.send.call_args[0][0]
+ assert b'"type": "binary_data"' in sent_metadata
+ assert b'"size": 19' in sent_metadata # len(test_data)
+
+ # Verify binary data was sent as binary frame
+ mock_transport.send_binary.assert_called_once_with(test_data)
+
+ @pytest.mark.asyncio
+ async def test_receive_binary_message(self):
+ """Test receive_binary_message method"""
+ import json
+
+ mock_transport = MagicMock(spec=WebSocketTransport)
+
+ # Prepare metadata
+ metadata = {
+ "type": "binary_data",
+ "filename": "test.bin",
+ "size": 19,
+ }
+ metadata_json = json.dumps(metadata).encode("utf-8")
+
+ # Prepare binary data
+ test_data = b"test binary content"
+
+ # Mock transport receive methods
+ mock_transport.receive = AsyncMock(return_value=metadata_json)
+ mock_transport.receive_binary = AsyncMock(return_value=test_data)
+
+ protocol = AIPProtocol(mock_transport)
+
+ received_data, received_metadata = await protocol.receive_binary_message()
+
+ assert received_data == test_data
+ assert received_metadata["filename"] == "test.bin"
+ assert received_metadata["size"] == 19
+
+ @pytest.mark.asyncio
+ async def test_receive_binary_message_size_validation_fail(self):
+ """Test receive_binary_message with size mismatch"""
+ import json
+
+ mock_transport = MagicMock(spec=WebSocketTransport)
+
+ # Metadata says 100 bytes, but we send 19
+ metadata = {
+ "type": "binary_data",
+ "size": 100, # Wrong size
+ }
+ metadata_json = json.dumps(metadata).encode("utf-8")
+ test_data = b"test binary content" # Only 19 bytes
+
+ mock_transport.receive = AsyncMock(return_value=metadata_json)
+ mock_transport.receive_binary = AsyncMock(return_value=test_data)
+
+ protocol = AIPProtocol(mock_transport)
+
+ with pytest.raises(ValueError, match="Size mismatch"):
+ await protocol.receive_binary_message(validate_size=True)
+
+ @pytest.mark.asyncio
+ async def test_send_file(self):
+ """Test send_file method"""
+ mock_transport = MagicMock(spec=WebSocketTransport)
+ mock_transport.send = AsyncMock()
+ mock_transport.send_binary = AsyncMock()
+
+ protocol = AIPProtocol(mock_transport)
+
+ # Create a temporary test file
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as temp_file:
+ temp_file.write(b"Test file content for chunked transfer" * 1000)
+ temp_file_path = temp_file.name
+
+ try:
+ await protocol.send_file(temp_file_path, chunk_size=1024)
+
+ # Verify file_transfer_start was sent
+ assert mock_transport.send.called
+ start_msg = mock_transport.send.call_args_list[0][0][0]
+ assert b'"type": "file_transfer_start"' in start_msg
+
+ # Verify chunks were sent
+ assert mock_transport.send_binary.called
+
+ # Verify file_transfer_complete was sent
+ complete_msg = mock_transport.send.call_args_list[-1][0][0]
+ assert b'"type": "file_transfer_complete"' in complete_msg
+
+ finally:
+ os.unlink(temp_file_path)
+
+ @pytest.mark.asyncio
+ async def test_receive_file(self):
+ """Test receive_file method"""
+ import json
+
+ mock_transport = MagicMock(spec=WebSocketTransport)
+
+ # Prepare file transfer messages
+ start_msg = {
+ "type": "file_transfer_start",
+ "filename": "test.bin",
+ "size": 2048,
+ "chunk_size": 1024,
+ "total_chunks": 2,
+ }
+
+ chunk1_meta = {"type": "binary_data", "chunk_num": 0, "size": 1024}
+ chunk2_meta = {"type": "binary_data", "chunk_num": 1, "size": 1024}
+
+ complete_msg = {
+ "type": "file_transfer_complete",
+ "filename": "test.bin",
+ "total_chunks": 2,
+ "checksum": "abc123",
+ }
+
+ # Mock transport to return messages in sequence
+ mock_transport.receive = AsyncMock(
+ side_effect=[
+ json.dumps(start_msg).encode("utf-8"),
+ json.dumps(chunk1_meta).encode("utf-8"),
+ json.dumps(chunk2_meta).encode("utf-8"),
+ json.dumps(complete_msg).encode("utf-8"),
+ ]
+ )
+
+ mock_transport.receive_binary = AsyncMock(
+ side_effect=[
+ b"A" * 1024, # chunk_num 0
+ b"B" * 1024, # chunk_num 1
+ ]
+ )
+
+ protocol = AIPProtocol(mock_transport)
+
+ # Receive file
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as temp_file:
+ output_path = temp_file.name
+
+ try:
+ metadata = await protocol.receive_file(output_path, validate_checksum=False)
+
+ assert metadata["filename"] == "test.bin"
+ assert metadata["size"] == 2048
+
+ # Verify file was written
+ with open(output_path, "rb") as f:
+ content = f.read()
+ assert len(content) == 2048
+ assert content[:1024] == b"A" * 1024
+ assert content[1024:] == b"B" * 1024
+
+ finally:
+ if os.path.exists(output_path):
+ os.unlink(output_path)
+
+
+# ============================================================================
+# Message Type Tests
+# ============================================================================
+
+
+class TestBinaryMessageTypes:
+ """Test binary message type definitions"""
+
+ def test_binary_metadata(self):
+ """Test BinaryMetadata model"""
+ metadata = BinaryMetadata(
+ filename="test.png",
+ mime_type="image/png",
+ size=1024,
+ checksum="abc123",
+ )
+
+ assert metadata.type == "binary_data"
+ assert metadata.filename == "test.png"
+ assert metadata.size == 1024
+
+ def test_file_transfer_start(self):
+ """Test FileTransferStart model"""
+ start_msg = FileTransferStart(
+ filename="large_file.bin",
+ size=10485760, # 10MB
+ chunk_size=1048576, # 1MB
+ total_chunks=10,
+ mime_type="application/octet-stream",
+ )
+
+ assert start_msg.type == "file_transfer_start"
+ assert start_msg.total_chunks == 10
+
+ def test_file_transfer_complete(self):
+ """Test FileTransferComplete model"""
+ complete_msg = FileTransferComplete(
+ filename="large_file.bin", total_chunks=10, checksum="def456"
+ )
+
+ assert complete_msg.type == "file_transfer_complete"
+ assert complete_msg.checksum == "def456"
+
+ def test_chunk_metadata(self):
+ """Test ChunkMetadata model"""
+ chunk = ChunkMetadata(chunk_num=5, chunk_size=1048576, checksum="chunk5hash")
+
+ assert chunk.chunk_num == 5
+ assert chunk.chunk_size == 1048576
+
+
+# ============================================================================
+# Integration Tests
+# ============================================================================
+
+
+class TestBinaryTransferIntegration:
+ """Integration tests for complete binary transfer scenarios"""
+
+ @pytest.mark.asyncio
+ async def test_full_binary_message_roundtrip(self):
+ """Test complete binary message send and receive"""
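+ # This test requires a real WebSocket connection and is not implemented yet.
+ # Skip explicitly so the placeholder is not reported as a passing test.
+ pytest.skip("Requires a real WebSocket connection")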
+
+ @pytest.mark.asyncio
+ async def test_full_file_transfer_roundtrip(self):
+ """Test complete file transfer send and receive"""
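+ # This test requires a real WebSocket connection and is not implemented yet.
+ # Skip explicitly so the placeholder is not reported as a passing test.
+ pytest.skip("Requires a real WebSocket connection")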
+
+
+if __name__ == "__main__":
+ pytest.main([__file__, "-v"])
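The protocol tests above imply a two-frame exchange: a JSON metadata frame carrying `type` and `size`, followed by the raw payload, with an optional size check on receipt. The following sketch reproduces that exchange in memory. The function names `encode_binary_message` and `decode_binary_message` are hypothetical and do not come from the `aip` package; only the field names and the "Size mismatch" error text are taken from the tests.

```python
import json
from typing import Any, Dict, Optional, Tuple


def encode_binary_message(
    data: bytes, metadata: Optional[Dict[str, Any]] = None
) -> Tuple[bytes, bytes]:
    """Return (metadata frame, payload frame) for one binary message."""
    header = {**(metadata or {}), "type": "binary_data", "size": len(data)}
    return json.dumps(header).encode("utf-8"), data


def decode_binary_message(
    header_bytes: bytes, payload: bytes, validate_size: bool = True
) -> Tuple[bytes, Dict[str, Any]]:
    """Parse the metadata frame and optionally check the declared size."""
    header = json.loads(header_bytes.decode("utf-8"))
    if header.get("type") != "binary_data":
        raise ValueError(f"Unexpected message type: {header.get('type')}")
    if validate_size and header.get("size") != len(payload):
        raise ValueError(
            f"Size mismatch: expected {header.get('size')}, got {len(payload)}"
        )
    return payload, header


header_frame, payload_frame = encode_binary_message(
    b"test binary content", {"filename": "test.bin"}
)
data, meta = decode_binary_message(header_frame, payload_frame)
print(meta["filename"], meta["size"])
```

Sending the size ahead of the payload lets the receiver reject a truncated or mismatched frame before handing the bytes to the caller.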
diff --git a/tests/integration/test_mobile_mcp_server.py b/tests/integration/test_mobile_mcp_server.py
new file mode 100644
index 000000000..5580d6674
--- /dev/null
+++ b/tests/integration/test_mobile_mcp_server.py
@@ -0,0 +1,555 @@
+"""
+Integration test for Mobile MCP Servers (Android)
+Tests the mobile data collection and action servers with an actual Android emulator/device.
+
+Prerequisites:
+- Android emulator or physical device must be running
+- ADB must be installed and accessible
+- Device must be connected and visible via 'adb devices'
+
+Usage:
+ pytest tests/integration/test_mobile_mcp_server.py -v
+
+Or run specific tests:
+ pytest tests/integration/test_mobile_mcp_server.py::TestMobileMCPServers::test_data_collection_server -v
+"""
+
+import asyncio
+import logging
+import subprocess
+import time
+from typing import Any, Dict, List, Optional
+
+import pytest
+
+from aip.messages import Command, ResultStatus
+from ufo.client.computer import CommandRouter, ComputerManager
+from ufo.client.mcp.mcp_server_manager import MCPServerManager
+
+
+# Configure logging for tests
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+)
+
+
+class TestMobileMCPServers:
+ """Integration tests for Mobile MCP Servers"""
+
+ @pytest.fixture(scope="class")
+ def check_adb_connection(self):
+ """Check if ADB is available and a device is connected"""
+ try:
+ result = subprocess.run(
+ ["adb", "devices"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+
+ if result.returncode != 0:
+ pytest.skip("ADB not found or not working properly")
+
+ devices = [line for line in result.stdout.split("\n") if "\tdevice" in line]
+ if not devices:
+ pytest.skip(
+ "No Android device/emulator connected. Please connect a device and run 'adb devices'"
+ )
+
+ print(f"\n✅ Found {len(devices)} connected device(s)")
+ return True
+
+ except FileNotFoundError:
+ pytest.skip(
+ "ADB not found in PATH. Please install Android SDK platform-tools."
+ )
+ except Exception as e:
+ pytest.skip(f"Error checking ADB: {e}")
+
+ @pytest.fixture(scope="class")
+ def mobile_agent_config(self):
+ """Configuration for MobileAgent with data collection and action servers"""
+ return {
+ "mcp": {
+ "MobileAgent": {
+ "default": {
+ "data_collection": [
+ {
+ "namespace": "MobileDataCollector",
+ "type": "http",
+ "host": "localhost",
+ "port": 8020,
+ "path": "/mcp",
+ "reset": False,
+ }
+ ],
+ "action": [
+ {
+ "namespace": "MobileActionExecutor",
+ "type": "http",
+ "host": "localhost",
+ "port": 8021,
+ "path": "/mcp",
+ "reset": False,
+ }
+ ],
+ }
+ }
+ }
+ }
+
+ @pytest.fixture(scope="class")
+ async def command_router(self, mobile_agent_config):
+ """Create CommandRouter with MobileAgent configuration"""
+ mcp_server_manager = MCPServerManager()
+ computer_manager = ComputerManager(mobile_agent_config, mcp_server_manager)
+ router = CommandRouter(computer_manager)
+
+ # Give servers time to initialize
+ await asyncio.sleep(1)
+
+ yield router
+
+ # Cleanup
+ computer_manager.reset()
+
+ @pytest.mark.asyncio
+ async def test_data_collection_server(self, check_adb_connection, command_router):
+ """Test data collection server tools"""
+
+ print("\n=== Testing Mobile Data Collection Server ===")
+
+ # Test 1: Get device info
+ print("\n📱 Test 1: Getting device information...")
+ commands = [
+ Command(
+ tool_name="get_device_info",
+ tool_type="data_collection",
+ parameters={},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ device_info = results[0].result
+ assert device_info is not None
+ assert "success" in device_info
+ assert device_info["success"] is True
+ assert "device_info" in device_info
+
+ print(f"✅ Device Info: {device_info['device_info']}")
+
+ # Test 2: Capture screenshot
+ print("\n📸 Test 2: Capturing screenshot...")
+ commands = [
+ Command(
+ tool_name="capture_screenshot",
+ tool_type="data_collection",
+ parameters={"format": "base64"},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ screenshot_result = results[0].result
+ assert screenshot_result is not None
+ assert screenshot_result["success"] is True
+ assert "image" in screenshot_result
+ assert screenshot_result["image"].startswith("data:image/png;base64,")
+
+ print(
+ f"✅ Screenshot captured: {screenshot_result['width']}x{screenshot_result['height']}"
+ )
+
+ # Test 3: Get UI tree
+ print("\n🌲 Test 3: Getting UI hierarchy tree...")
+ commands = [
+ Command(
+ tool_name="get_ui_tree",
+ tool_type="data_collection",
+ parameters={},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ ui_tree_result = results[0].result
+ assert ui_tree_result is not None
+ assert ui_tree_result["success"] is True
+ assert "ui_tree" in ui_tree_result
+ assert ui_tree_result["format"] == "xml"
+
+ print(f"✅ UI tree retrieved: {len(ui_tree_result['ui_tree'])} characters")
+
+ # Test 4: Get installed apps
+ print("\n📱 Test 4: Getting installed apps...")
+ commands = [
+ Command(
+ tool_name="get_mobile_app_target_info",
+ tool_type="data_collection",
+ parameters={"include_system_apps": False, "force_refresh": True},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ apps = results[0].result
+ assert apps is not None
+ assert isinstance(apps, list)
+
+ print(f"✅ Found {len(apps)} user-installed apps")
+ if apps:
+ print(f" Sample app: {apps[0]}")
+
+ # Test 5: Get UI controls
+ print("\n🎮 Test 5: Getting current screen controls...")
+ commands = [
+ Command(
+ tool_name="get_app_window_controls_target_info",
+ tool_type="data_collection",
+ parameters={"force_refresh": True},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ controls = results[0].result
+ assert controls is not None
+ assert isinstance(controls, list)
+
+ print(f"✅ Found {len(controls)} controls on current screen")
+ if controls:
+ print(f" Sample control: {controls[0]}")
+
+ @pytest.mark.asyncio
+ async def test_action_server(self, check_adb_connection, command_router):
+ """Test action server tools"""
+
+ print("\n=== Testing Mobile Action Server ===")
+
+ # Test 1: Press HOME key
+ print("\n🏠 Test 1: Pressing HOME key...")
+ commands = [
+ Command(
+ tool_name="press_key",
+ tool_type="action",
+ parameters={"key_code": "KEYCODE_HOME"},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ result = results[0].result
+ assert result is not None
+ assert result["success"] is True
+
+ print("✅ HOME key pressed successfully")
+
+ # Wait for animation
+ await asyncio.sleep(1)
+
+ # Test 2: Tap at center of screen
+ print("\n👆 Test 2: Tapping at screen center...")
+ commands = [
+ Command(
+ tool_name="tap",
+ tool_type="action",
+ parameters={"x": 540, "y": 960}, # Common center for 1080x1920
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ result = results[0].result
+ assert result is not None
+ assert result["success"] is True
+
+ print(f"✅ Tap executed: {result['action']}")
+
+ await asyncio.sleep(0.5)
+
+ # Test 3: Swipe gesture (scroll down)
+ print("\n👇 Test 3: Performing swipe gesture...")
+ commands = [
+ Command(
+ tool_name="swipe",
+ tool_type="action",
+ parameters={
+ "start_x": 540,
+ "start_y": 1200,
+ "end_x": 540,
+ "end_y": 600,
+ "duration": 300,
+ },
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ result = results[0].result
+ assert result is not None
+ assert result["success"] is True
+
+ print(f"✅ Swipe executed: {result['action']}")
+
+ await asyncio.sleep(0.5)
+
+ # Test 4: Invalidate cache
+ print("\n🗑️ Test 4: Invalidating cache...")
+ commands = [
+ Command(
+ tool_name="invalidate_cache",
+ tool_type="action",
+ parameters={"cache_type": "all"},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert len(results) == 1
+ assert results[0].status == ResultStatus.SUCCESS
+
+ result = results[0].result
+ assert result is not None
+ assert result["success"] is True
+
+ print(f"✅ Cache invalidated: {result['message']}")
+
+ @pytest.mark.asyncio
+ async def test_shared_state_between_servers(
+ self, check_adb_connection, command_router
+ ):
+ """Test that data collection and action servers share the same state"""
+
+ print("\n=== Testing Shared State Between Servers ===")
+
+ # Step 1: Get controls from data collection server (populates cache)
+ print("\n1️⃣ Getting controls from data collection server...")
+ commands = [
+ Command(
+ tool_name="get_app_window_controls_target_info",
+ tool_type="data_collection",
+ parameters={"force_refresh": True},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert results[0].status == ResultStatus.SUCCESS
+ controls = results[0].result
+
+ print(f"✅ Retrieved {len(controls)} controls (cache populated)")
+
+ # Step 2: Invalidate cache from action server
+ print("\n2️⃣ Invalidating cache from action server...")
+ commands = [
+ Command(
+ tool_name="invalidate_cache",
+ tool_type="action",
+ parameters={"cache_type": "controls"},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert results[0].status == ResultStatus.SUCCESS
+ print("✅ Cache invalidated from action server")
+
+ # Step 3: Get controls again - should refresh from device
+ print("\n3️⃣ Getting controls again from data collection server...")
+ commands = [
+ Command(
+ tool_name="get_app_window_controls_target_info",
+ tool_type="data_collection",
+ parameters={"force_refresh": False}, # Use cache if available
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ assert results[0].status == ResultStatus.SUCCESS
+ print("✅ Controls retrieved after invalidation (refresh from device is assumed, not asserted)")
+
+ @pytest.mark.asyncio
+ async def test_complete_workflow(self, check_adb_connection, command_router):
+ """Test a complete workflow: get controls -> click control"""
+
+ print("\n=== Testing Complete Workflow ===")
+
+ # Navigate to home screen first
+ print("\n🏠 Navigating to home screen...")
+ commands = [
+ Command(
+ tool_name="press_key",
+ tool_type="action",
+ parameters={"key_code": "KEYCODE_HOME"},
+ )
+ ]
+
+ await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ await asyncio.sleep(1)
+
+ # Get controls
+ print("\n📋 Getting current screen controls...")
+ commands = [
+ Command(
+ tool_name="get_app_window_controls_target_info",
+ tool_type="data_collection",
+ parameters={"force_refresh": True},
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ controls = results[0].result
+ print(f"✅ Found {len(controls)} controls")
+
+ # Find the first control with a name (a heuristic; clickability is not checked)
+ clickable_control = None
+ for control in controls:
+ if control.get("name"): # Has a name/label
+ clickable_control = control
+ break
+
+ if clickable_control:
+ print(
+ f"\n👆 Clicking control: {clickable_control.get('name')} (ID: {clickable_control.get('id')})"
+ )
+
+ commands = [
+ Command(
+ tool_name="click_control",
+ tool_type="action",
+ parameters={
+ "control_id": clickable_control.get("id"),
+ "control_name": clickable_control.get("name"),
+ },
+ )
+ ]
+
+ results = await command_router.execute(
+ agent_name="MobileAgent",
+ process_name="",
+ root_name="default",
+ commands=commands,
+ )
+
+ if results[0].status == ResultStatus.SUCCESS:
+ print(f"✅ Successfully clicked control")
+ else:
+ print(f"⚠️ Click failed: {results[0].error}")
+ else:
+ print("⚠️ No clickable controls with names found on current screen")
+
+
+if __name__ == "__main__":
+ # Run tests directly with:
+ # python tests/integration/test_mobile_mcp_server.py
+ # (requires a connected Android device and running mobile MCP servers)
+ print("=" * 70)
+ print("Mobile MCP Server Integration Tests")
+ print("=" * 70)
+ print("\n⚠️ Prerequisites:")
+ print(" 1. Android emulator or device must be running")
+ print(" 2. Run 'adb devices' to verify connection")
+ print(" 3. Start mobile MCP servers:")
+ print(" python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both")
+ print("\n" + "=" * 70)
+
+ pytest.main([__file__, "-v", "-s"])
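Review note on `test_shared_state_between_servers`: each step only asserts `ResultStatus.SUCCESS`, so the test cannot tell whether the third call really went back to the device. A minimal sketch of how a refetch could be made observable follows. It uses a purely hypothetical in-memory `ControlsCache` with a fetch counter; none of these names exist in the UFO codebase, and the real servers may share state differently.

```python
from typing import Callable, List, Optional


class ControlsCache:
    """Hypothetical stand-in for state shared by the data and action servers."""

    def __init__(self, fetch: Callable[[], List[dict]]) -> None:
        self._fetch = fetch  # expensive call, e.g. a UI dump from the device
        self._cached: Optional[List[dict]] = None
        self.fetch_count = 0  # lets a test observe real refetches

    def get(self, force_refresh: bool = False) -> List[dict]:
        # Fetch when forced or when nothing is cached; otherwise reuse the cache.
        if force_refresh or self._cached is None:
            self._cached = self._fetch()
            self.fetch_count += 1
        return self._cached

    def invalidate(self) -> None:
        self._cached = None


def demo() -> int:
    cache = ControlsCache(fetch=lambda: [{"id": "1", "name": "Settings"}])
    cache.get(force_refresh=True)   # step 1: populate the cache (fetch #1)
    cache.get()                     # served from cache, no fetch
    cache.invalidate()              # step 2: what the action server would trigger
    cache.get(force_refresh=False)  # step 3: cache is empty, so this refetches (fetch #2)
    return cache.fetch_count
```

If the servers exposed a comparable counter or timestamp, step 3 of the test could assert on it instead of printing a success message unconditionally.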
diff --git a/tests/integration/test_mobile_mcp_standalone.py b/tests/integration/test_mobile_mcp_standalone.py
new file mode 100644
index 000000000..5ece593e1
--- /dev/null
+++ b/tests/integration/test_mobile_mcp_standalone.py
@@ -0,0 +1,394 @@
+"""
+Standalone test for Mobile MCP Servers
+Tests mobile servers directly without full UFO infrastructure.
+
+Prerequisites:
+- Android emulator or physical device must be running
+- ADB must be installed and accessible
+- Mobile MCP servers must be running on ports 8020 (data) and 8021 (action)
+
+Start servers:
+ python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both
+
+Run test:
+ python tests/integration/test_mobile_mcp_standalone.py
+"""
+
+import asyncio
+import os
+import subprocess
+import sys
+from typing import Any, Dict
+
+from fastmcp import Client
+
+
+def find_adb():
+ """Auto-detect ADB path"""
+ common_paths = [
+ r"C:\Users\{}\AppData\Local\Android\Sdk\platform-tools\adb.exe".format(
+ os.environ.get("USERNAME", "")
+ ),
+ "/usr/bin/adb",
+ "/usr/local/bin/adb",
+ ]
+
+ for path in common_paths:
+ if os.path.exists(path):
+ return path
+
+ try:
+ result = subprocess.run(
+ ["where" if os.name == "nt" else "which", "adb"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+ if result.returncode == 0:
+ return result.stdout.strip().split("\n")[0]
+ except Exception:
+ pass
+
+ return "adb"
+
+
+async def check_adb_connection() -> bool:
+ """Check if ADB is available and a device is connected"""
+ adb_path = find_adb()
+
+ try:
+ result = subprocess.run(
+ [adb_path, "devices"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+
+ if result.returncode != 0:
+ print(f"❌ ADB not found or not working properly")
+ print(f" Tried: {adb_path}")
+ return False
+
+ devices = [line for line in result.stdout.split("\n") if "\tdevice" in line]
+ if not devices:
+ print("❌ No Android device/emulator connected")
+ print(" Please connect a device and run 'adb devices'")
+ return False
+
+ print(f"✅ Found {len(devices)} connected device(s)")
+ print(result.stdout)
+ return True
+
+ except FileNotFoundError:
+ print(f"❌ ADB not found: {adb_path}")
+ print(" Please install Android SDK platform-tools")
+ return False
+ except Exception as e:
+ print(f"❌ Error checking ADB: {e}")
+ return False
+
+
+async def test_data_collection_server():
+ """Test the Mobile Data Collection Server"""
+ print("\n" + "=" * 70)
+ print("Testing Mobile Data Collection Server (port 8020)")
+ print("=" * 70)
+
+ server_url = "http://localhost:8020/mcp"
+
+ try:
+ # FastMCP Client automatically detects HTTP from URL
+ async with Client(server_url) as client:
+ # Test 1: List available tools
+ print("\n📋 Listing available data collection tools...")
+ tools = await client.list_tools()
+ print(f"✅ Found {len(tools)} tools:")
+ for tool in tools:
+ print(f" - {tool.name}: {tool.description}")
+
+ # Test 2: Get device info
+ print("\n📱 Getting device information...")
+ result = await client.call_tool("get_device_info", {})
+ device_info = result.data
+ if device_info and device_info.get("success"):
+ info = device_info["device_info"]
+ print(f"✅ Device Info:")
+ print(f" Model: {info.get('model', 'N/A')}")
+ print(f" Android Version: {info.get('android_version', 'N/A')}")
+ print(f" Screen: {info.get('screen_size', 'N/A')}")
+ print(f" Battery: {info.get('battery_level', 'N/A')}")
+ else:
+ print(f"❌ Failed to get device info: {device_info}")
+
+ # Test 3: Capture screenshot
+ print("\n📸 Capturing screenshot...")
+ result = await client.call_tool("capture_screenshot", {"format": "base64"})
+ screenshot = result.data
+ if screenshot and screenshot.get("success"):
+ print(f"✅ Screenshot captured:")
+ print(f" Size: {screenshot['width']}x{screenshot['height']}")
+ print(f" Format: {screenshot['format']}")
+ print(f" Data length: {len(screenshot['image'])} chars")
+ else:
+ print(f"❌ Failed to capture screenshot: {screenshot}")
+
+ # Test 4: Get UI tree
+ print("\n🌲 Getting UI hierarchy tree...")
+ result = await client.call_tool("get_ui_tree", {})
+ ui_tree = result.data
+ if ui_tree and ui_tree.get("success"):
+ print(f"✅ UI tree retrieved:")
+ print(f" Length: {len(ui_tree['ui_tree'])} characters")
+ print(f" Format: {ui_tree['format']}")
+ else:
+ print(f"❌ Failed to get UI tree: {ui_tree}")
+
+ # Test 5: Get installed apps
+ print("\n📱 Getting installed apps...")
+ result = await client.call_tool(
+ "get_mobile_app_target_info",
+ {"include_system_apps": False, "force_refresh": True},
+ )
+ apps = result.data
+ if apps and isinstance(apps, list):
+ print(f"✅ Found {len(apps)} user-installed apps")
+ if apps:
+ print(f" Sample app: {apps[0].get('name', 'N/A')}")
+ else:
+ print(f"❌ Failed to get apps: {apps}")
+
+ # Test 6: Get UI controls
+ print("\n🎮 Getting current screen controls...")
+ result = await client.call_tool(
+ "get_app_window_controls_target_info", {"force_refresh": True}
+ )
+ controls = result.data
+ if controls and isinstance(controls, list):
+ print(f"✅ Found {len(controls)} controls on current screen")
+ if controls:
+ sample = controls[0]
+ print(f" Sample control:")
+ print(f" ID: {sample.get('id', 'N/A')}")
+ print(f" Name: {sample.get('name', 'N/A')}")
+ print(f" Type: {sample.get('type', 'N/A')}")
+ else:
+ print(f"❌ Failed to get controls: {controls}")
+
+ print("\n✅ Data Collection Server: all tool calls completed (check ❌ lines above for failures)")
+ return True
+
+ except Exception as e:
+ print(f"\n❌ Error testing data collection server: {e}")
+ import traceback
+
+ traceback.print_exc()
+ return False
+
+
+async def test_action_server():
+ """Test the Mobile Action Server"""
+ print("\n" + "=" * 70)
+ print("Testing Mobile Action Server (port 8021)")
+ print("=" * 70)
+
+ server_url = "http://localhost:8021/mcp"
+
+ try:
+ async with Client(server_url) as client:
+ # Test 1: List available tools
+ print("\n📋 Listing available action tools...")
+ tools = await client.list_tools()
+ print(f"✅ Found {len(tools)} tools:")
+ for tool in tools:
+ print(f" - {tool.name}: {tool.description}")
+
+ # Test 2: Press HOME key
+ print("\n🏠 Pressing HOME key...")
+ result = await client.call_tool("press_key", {"key_code": "KEYCODE_HOME"})
+ key_result = result.data
+ if key_result and key_result.get("success"):
+ print(f"✅ HOME key pressed: {key_result['action']}")
+ else:
+ print(f"❌ Failed to press HOME key: {key_result}")
+
+ await asyncio.sleep(1)
+
+ # Test 3: Tap at (540, 960), the center of a 1080x1920 screen (hard-coded)
+ print("\n👆 Tapping at screen center (540, 960)...")
+ result = await client.call_tool("tap", {"x": 540, "y": 960})
+ tap_result = result.data
+ if tap_result and tap_result.get("success"):
+ print(f"✅ Tap executed: {tap_result['action']}")
+ else:
+ print(f"❌ Failed to tap: {tap_result}")
+
+ await asyncio.sleep(0.5)
+
+ # Test 4: Swipe gesture
+ print("\n👇 Performing swipe gesture (scroll down)...")
+ result = await client.call_tool(
+ "swipe",
+ {
+ "start_x": 540,
+ "start_y": 1200,
+ "end_x": 540,
+ "end_y": 600,
+ "duration": 300,
+ },
+ )
+ swipe_result = result.data
+ if swipe_result and swipe_result.get("success"):
+ print(f"✅ Swipe executed: {swipe_result['action']}")
+ else:
+ print(f"❌ Failed to swipe: {swipe_result}")
+
+ await asyncio.sleep(0.5)
+
+ # Test 5: Invalidate cache
+ print("\n🗑️ Invalidating all caches...")
+ result = await client.call_tool("invalidate_cache", {"cache_type": "all"})
+ cache_result = result.data
+ if cache_result and cache_result.get("success"):
+ print(f"✅ Cache invalidated: {cache_result['message']}")
+ else:
+ print(f"❌ Failed to invalidate cache: {cache_result}")
+
+ print("\n✅ Action Server: all tool calls completed (check ❌ lines above for failures)")
+ return True
+
+ except Exception as e:
+ print(f"\n❌ Error testing action server: {e}")
+ import traceback
+
+ traceback.print_exc()
+ return False
+
+
+async def test_shared_state():
+ """Test that data and action servers share the same state"""
+ print("\n" + "=" * 70)
+ print("Testing Shared State Between Servers")
+ print("=" * 70)
+
+ data_url = "http://localhost:8020/mcp"
+ action_url = "http://localhost:8021/mcp"
+
+ try:
+ # Step 1: Get controls from data server (populates cache)
+ print("\n1️⃣ Getting controls from data collection server (populate cache)...")
+ async with Client(data_url) as data_client:
+ result = await data_client.call_tool(
+ "get_app_window_controls_target_info", {"force_refresh": True}
+ )
+ controls = result.data
+ if controls and isinstance(controls, list):
+ print(f"✅ Retrieved {len(controls)} controls (cache populated)")
+ else:
+ print(f"❌ Failed to get controls")
+ return False
+
+ # Step 2: Invalidate cache from action server
+ print("\n2️⃣ Invalidating cache from action server...")
+ async with Client(action_url) as action_client:
+ result = await action_client.call_tool(
+ "invalidate_cache", {"cache_type": "controls"}
+ )
+ cache_result = result.data
+ if cache_result and cache_result.get("success"):
+ print(
+ f"✅ Cache invalidated from action server: {cache_result['message']}"
+ )
+ else:
+ print(f"❌ Failed to invalidate cache")
+ return False
+
+ # Step 3: Get controls again from data server
+ # If shared state works, cache should be invalidated and will refresh
+ print("\n3️⃣ Getting controls again from data collection server...")
+ async with Client(data_url) as data_client:
+ result = await data_client.call_tool(
+ "get_app_window_controls_target_info",
+ {"force_refresh": False}, # Use cache if available
+ )
+ controls = result.data
+ if controls and isinstance(controls, list):
+ print(f"✅ Retrieved {len(controls)} controls")
+ print("✅ Controls retrieved after invalidation (the refetch itself is not asserted)")
+ else:
+ print(f"❌ Failed to get controls")
+ return False
+
+ print("\n✅ Shared State: TEST PASSED")
+ return True
+
+ except Exception as e:
+ print(f"\n❌ Error testing shared state: {e}")
+ import traceback
+
+ traceback.print_exc()
+ return False
+
+
+async def main():
+ """Main test runner"""
+ print("=" * 70)
+ print("Mobile MCP Server Standalone Tests")
+ print("=" * 70)
+
+ # Check prerequisites
+ print("\n📋 Checking prerequisites...")
+
+ if not await check_adb_connection():
+ print("\n❌ ADB connection check failed!")
+ print("\nPlease ensure:")
+ print(" 1. Android SDK platform-tools is installed")
+ print(" 2. ADB is in your PATH")
+ print(" 3. Android emulator or device is running")
+ print(" 4. Run 'adb devices' to verify connection")
+ return 1
+
+ print("\n⚠️ Make sure Mobile MCP servers are running:")
+ print(" python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both")
+ print("\nWaiting 3 seconds before starting tests...")
+ await asyncio.sleep(3)
+
+ # Run tests
+ results = []
+
+ try:
+ results.append(await test_data_collection_server())
+ except Exception as e:
+ print(f"❌ Data collection server test crashed: {e}")
+ results.append(False)
+
+ try:
+ results.append(await test_action_server())
+ except Exception as e:
+ print(f"❌ Action server test crashed: {e}")
+ results.append(False)
+
+ try:
+ results.append(await test_shared_state())
+ except Exception as e:
+ print(f"❌ Shared state test crashed: {e}")
+ results.append(False)
+
+ # Summary
+ print("\n" + "=" * 70)
+ print("TEST SUMMARY")
+ print("=" * 70)
+ passed = sum(results)
+ total = len(results)
+ print(f"\nPassed: {passed}/{total}")
+
+ if all(results):
+ print("\n🎉 ALL TESTS PASSED! 🎉")
+ return 0
+ else:
+ print("\n❌ SOME TESTS FAILED")
+ return 1
+
+
+if __name__ == "__main__":
+ exit_code = asyncio.run(main())
+ sys.exit(exit_code)
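Review note on `check_adb_connection`: the `"\tdevice" in line` filter counts matching lines but never extracts the serials. A small sketch of a stricter parser is shown below; `parse_adb_devices` is a hypothetical helper, and the sample output (including the serial `R58M123ABC`) is made up for illustration.

```python
from typing import List


def parse_adb_devices(output: str) -> List[str]:
    """Return serials of devices whose state is exactly 'device'."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Data lines look like "<serial>\t<state>"; the header and blank lines
        # either have no second field or a second field other than "device".
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


sample = (
    "List of devices attached\n"
    "emulator-5554\tdevice\n"
    "R58M123ABC\tunauthorized\n"
    "\n"
)
print(parse_adb_devices(sample))
```

Returning the serials also makes it possible to pass an explicit `-s <serial>` to later `adb` calls when more than one device is attached.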
diff --git a/tests/integration/verify_mobile_setup.py b/tests/integration/verify_mobile_setup.py
new file mode 100644
index 000000000..2ac5b95bb
--- /dev/null
+++ b/tests/integration/verify_mobile_setup.py
@@ -0,0 +1,248 @@
+"""
+Quick verification script for Mobile MCP Server setup
+Checks prerequisites without starting full tests.
+
+Usage:
+ python tests/integration/verify_mobile_setup.py
+"""
+
+import os
+import subprocess
+import sys
+
+
+def find_adb():
+ """Auto-detect ADB path"""
+ # Try common ADB locations
+ common_paths = [
+ r"C:\Users\{}\AppData\Local\Android\Sdk\platform-tools\adb.exe".format(
+ os.environ.get("USERNAME", "")
+ ),
+ "/usr/bin/adb",
+ "/usr/local/bin/adb",
+ ]
+
+ for path in common_paths:
+ if os.path.exists(path):
+ return path
+
+ # Try to find in PATH
+ try:
+ result = subprocess.run(
+ ["where" if os.name == "nt" else "which", "adb"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+ if result.returncode == 0:
+ return result.stdout.strip().split("\n")[0]
+ except Exception:
+ pass
+
+ return None
+
+
+def check_adb():
+ """Check if ADB is available and working"""
+ print("\n1️⃣ Checking ADB installation...")
+
+ adb_path = find_adb()
+
+ if not adb_path:
+ print(" ❌ ADB not found")
+ print(" Please install Android SDK platform-tools")
+ return False, None
+
+ print(f" ✅ ADB found: {adb_path}")
+
+ try:
+ result = subprocess.run(
+ [adb_path, "version"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+
+ if result.returncode == 0:
+ version = result.stdout.split("\n")[0]
+ print(f" ✅ ADB version: {version}")
+ return True, adb_path
+ else:
+ print(f" ❌ ADB error: {result.stderr}")
+ return False, adb_path
+
+ except Exception as e:
+ print(f" ❌ Error: {e}")
+ return False, adb_path
+
+
+def check_device(adb_path):
+ """Check if Android device is connected"""
+ print("\n2️⃣ Checking device connection...")
+
+ if not adb_path:
+ print(" ❌ ADB not available")
+ return False
+
+ try:
+ result = subprocess.run(
+ [adb_path, "devices"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+
+ if result.returncode != 0:
+ print(f" ❌ ADB devices command failed")
+ return False
+
+ lines = result.stdout.strip().split("\n")
+ devices = [line for line in lines if "\tdevice" in line]
+
+ if not devices:
+ print(" ❌ No devices connected")
+ print(" Please start an Android emulator or connect a device")
+ print("\n Commands to start emulator:")
+ print(" emulator -list-avds # List available emulators")
+ print(" emulator -avd <avd_name> # Start specific emulator")
+ return False
+
+ print(f" ✅ Found {len(devices)} connected device(s):")
+ for device_line in devices:
+ device_id = device_line.split("\t")[0]
+ print(f" - {device_id}")
+
+ return True
+
+ except Exception as e:
+ print(f" ❌ Error: {e}")
+ return False
+
+
+def check_device_info(adb_path):
+ """Get basic device information"""
+ print("\n3️⃣ Getting device information...")
+
+ if not adb_path:
+ print(" ❌ ADB not available")
+ return False
+
+ try:
+ # Get device model
+ result = subprocess.run(
+ [adb_path, "shell", "getprop", "ro.product.model"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+ model = result.stdout.strip() if result.returncode == 0 else "Unknown"
+
+ # Get Android version
+ result = subprocess.run(
+ [adb_path, "shell", "getprop", "ro.build.version.release"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+ android_version = result.stdout.strip() if result.returncode == 0 else "Unknown"
+
+ # Get screen size
+ result = subprocess.run(
+ [adb_path, "shell", "wm", "size"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+ screen_size = result.stdout.strip() if result.returncode == 0 else "Unknown"
+
+ print(f" ✅ Device Information:")
+ print(f" Model: {model}")
+ print(f" Android Version: {android_version}")
+ print(f" Screen Size: {screen_size}")
+
+ return True
+
+ except Exception as e:
+ print(f" ⚠️ Could not get device info: {e}")
+ return True # Non-critical
+
+
+def check_python_packages():
+ """Check if required Python packages are installed"""
+ print("\n4️⃣ Checking Python packages...")
+
+ required = ["fastmcp", "pydantic"]
+ missing = []
+
+ for package in required:
+ try:
+ __import__(package)
+ print(f" ✅ {package} installed")
+ except ImportError:
+ print(f" ❌ {package} NOT installed")
+ missing.append(package)
+
+ if missing:
+ print(f"\n Please install missing packages:")
+ print(f" pip install {' '.join(missing)}")
+ return False
+
+ return True
+
+
+def print_next_steps(all_ok):
+ """Print next steps based on verification results"""
+ print("\n" + "=" * 70)
+
+ if all_ok:
+ print("✅ ALL CHECKS PASSED - Ready to test Mobile MCP Servers!")
+ print("=" * 70)
+ print("\nNext steps:")
+ print("\n1. Start Mobile MCP Servers:")
+ print(
+ " python -m ufo.client.mcp.http_servers.mobile_mcp_server --server both"
+ )
+ print("\n2. Run standalone test:")
+ print(" python tests/integration/test_mobile_mcp_standalone.py")
+ print("\n3. Or run full integration test:")
+ print(" pytest tests/integration/test_mobile_mcp_server.py -v")
+ else:
+ print("❌ SOME CHECKS FAILED - Please fix the issues above")
+ print("=" * 70)
+ print("\nCommon fixes:")
+ print("- Install Android SDK platform-tools")
+ print("- Start Android emulator: emulator -avd <avd_name>")
+ print("- Enable USB debugging on physical device")
+ print("- Install missing Python packages")
+
+
+def main():
+ """Main verification function"""
+ print("=" * 70)
+ print("Mobile MCP Server Setup Verification")
+ print("=" * 70)
+
+ results = []
+
+ # Run checks
+ adb_ok, adb_path = check_adb()
+ results.append(adb_ok)
+
+ if adb_ok:
+ results.append(check_device(adb_path))
+ results.append(check_device_info(adb_path))
+ else:
+ results.append(False)
+ results.append(False)
+
+ results.append(check_python_packages())
+
+ # Summary
+ all_ok = all(results)
+ print_next_steps(all_ok)
+
+ return 0 if all_ok else 1
+
+
+if __name__ == "__main__":
+ sys.exit(main())
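Review note on `find_adb` and `check_python_packages`: both could lean on the standard library instead of spawning `where`/`which` and importing each package. The sketch below is one possible shape, not the project's code; `locate_executable` and `missing_packages` are hypothetical names, and `find_spec` is only used here for top-level package names.

```python
import importlib.util
import shutil
from typing import List, Optional


def locate_executable(name: str) -> Optional[str]:
    """Resolve an executable on PATH; returns None when it is absent."""
    # shutil.which honours PATHEXT on Windows, so "adb" also finds "adb.exe".
    return shutil.which(name)


def missing_packages(required: List[str]) -> List[str]:
    """List top-level packages that cannot be found, without importing them."""
    return [pkg for pkg in required if importlib.util.find_spec(pkg) is None]


print(locate_executable("adb"))
print(missing_packages(["json", "no_such_pkg_for_ufo_demo"]))
```

The hard-coded SDK paths in `find_adb` would still be useful as a fallback when `adb` is installed but not on `PATH`.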
diff --git a/ufo/agents/agent/customized_agent.py b/ufo/agents/agent/customized_agent.py
index 86ea8506b..119d8e0f4 100644
--- a/ufo/agents/agent/customized_agent.py
+++ b/ufo/agents/agent/customized_agent.py
@@ -7,9 +7,12 @@
CustomizedProcessor,
HardwareAgentProcessor,
LinuxAgentProcessor,
+ MobileAgentProcessor,
)
from ufo.agents.states.linux_agent_state import ContinueLinuxAgentState
+from ufo.agents.states.mobile_agent_state import ContinueMobileAgentState
from ufo.prompter.customized.linux_agent_prompter import LinuxAgentPrompter
+from ufo.prompter.customized.mobile_agent_prompter import MobileAgentPrompter
@AgentRegistry.register(
@@ -142,3 +145,127 @@ def blackboard(self) -> Blackboard:
:return: The blackboard.
"""
return self._blackboard
+
+
+@AgentRegistry.register(
+ agent_name="MobileAgent", third_party=True, processor_cls=MobileAgentProcessor
+)
+class MobileAgent(CustomizedAgent):
+ """
+ MobileAgent is a specialized agent that interacts with Android mobile devices.
+ """
+
+ def __init__(
+ self,
+ name: str,
+ main_prompt: str,
+ example_prompt: str,
+ ) -> None:
+ """
+ Initialize the MobileAgent.
+ :param name: The name of the agent.
+ :param main_prompt: The main prompt file path.
+ :param example_prompt: The example prompt file path.
+ """
+ super().__init__(
+ name=name,
+ main_prompt=main_prompt,
+ example_prompt=example_prompt,
+ process_name=None,
+ app_root_name=None,
+ is_visual=None,
+ )
+ self._blackboard = Blackboard()
+ self.set_state(self.default_state)
+
+ self._context_provision_executed = False
+ self.logger = logging.getLogger(__name__)
+
+ self.logger.info(
+ f"Main prompt: {main_prompt}, Example prompt: {example_prompt}"
+ )
+
+ def get_prompter(
+ self, is_visual: bool, main_prompt: str, example_prompt: str
+ ) -> MobileAgentPrompter:
+ """
+ Get the prompter for the agent.
+ :param is_visual: Whether the agent is visual or not. (Enabled for MobileAgent; not passed to the prompter.)
+ :param main_prompt: The main prompt file path.
+ :param example_prompt: The example prompt file path.
+ :return: The prompter instance.
+ """
+ return MobileAgentPrompter(main_prompt, example_prompt)
+
+ @property
+ def default_state(self) -> ContinueMobileAgentState:
+ """
+ Get the default state.
+ """
+ return ContinueMobileAgentState()
+
+ def message_constructor(
+ self,
+ dynamic_examples: List[str],
+ dynamic_knowledge: str,
+ plan: List[str],
+ request: str,
+ installed_apps: List[Dict[str, Any]],
+ current_controls: List[Dict[str, Any]],
+ screenshot_url: str = None,
+ annotated_screenshot_url: str = None,
+ blackboard_prompt: List[Dict[str, str]] = None,
+ last_success_actions: List[Dict[str, Any]] = None,
+ ) -> List[Dict[str, Union[str, List[Dict[str, str]]]]]:
+ """
+ Construct the prompt message for the MobileAgent.
+ :param dynamic_examples: The dynamic examples retrieved from demonstrations.
+ :param dynamic_knowledge: The dynamic knowledge retrieved from knowledge base.
+ :param plan: The plan list.
+ :param request: The overall user request.
+ :param installed_apps: The list of installed apps on the device.
+ :param current_controls: The list of current screen controls.
+ :param screenshot_url: The clean screenshot URL (base64).
+ :param annotated_screenshot_url: The annotated screenshot URL (base64).
+ :param blackboard_prompt: The prompt message from the blackboard.
+ :param last_success_actions: The list of successful actions in the last step.
+ :return: The prompt message.
+ """
+ if blackboard_prompt is None:
+ blackboard_prompt = []
+ if last_success_actions is None:
+ last_success_actions = []
+
+ mobile_agent_prompt_system_message = self.prompter.system_prompt_construction(
+ dynamic_examples
+ )
+
+ mobile_agent_prompt_user_message = self.prompter.user_content_construction(
+ prev_plan=plan,
+ user_request=request,
+ installed_apps=installed_apps,
+ current_controls=current_controls,
+ screenshot_url=screenshot_url,
+ annotated_screenshot_url=annotated_screenshot_url,
+ retrieved_docs=dynamic_knowledge,
+ last_success_actions=last_success_actions,
+ )
+
+ if blackboard_prompt:
+ mobile_agent_prompt_user_message = (
+ blackboard_prompt + mobile_agent_prompt_user_message
+ )
+
+ mobile_agent_prompt_message = self.prompter.prompt_construction(
+ mobile_agent_prompt_system_message, mobile_agent_prompt_user_message
+ )
+
+ return mobile_agent_prompt_message
+
+ @property
+ def blackboard(self) -> Blackboard:
+ """
+ Get the blackboard.
+ :return: The blackboard.
+ """
+ return self._blackboard
diff --git a/ufo/agents/processors/customized/customized_agent_processor.py b/ufo/agents/processors/customized/customized_agent_processor.py
index f8ac67e75..d0372a5be 100644
--- a/ufo/agents/processors/customized/customized_agent_processor.py
+++ b/ufo/agents/processors/customized/customized_agent_processor.py
@@ -31,6 +31,15 @@
LinuxLLMInteractionStrategy,
LinuxLoggingMiddleware,
)
+from ufo.agents.processors.strategies.mobile_agent_strategy import (
+ MobileScreenshotCaptureStrategy,
+ MobileAppsCollectionStrategy,
+ MobileControlsCollectionStrategy,
+ MobileLLMInteractionStrategy,
+ MobileActionExecutionStrategy,
+ MobileLoggingMiddleware,
+)
+from ufo.agents.processors.strategies.processing_strategy import ComposedStrategy
from ufo.module.context import Context, ContextNames
@@ -128,3 +137,59 @@ def _finalize_processing_context(
except Exception as e:
self.logger.warning(f"Failed to update ContextNames from results: {e}")
+
+
+class MobileAgentProcessor(CustomizedProcessor):
+ """
+ Processor for Mobile Android MCP Agent.
+ Handles data collection, LLM interaction, and action execution for Android devices.
+ """
+
+ def _setup_strategies(self) -> None:
+ """Setup processing strategies for Mobile Agent."""
+
+ # Data collection strategies - compose multiple strategies into one
+ self.strategies[ProcessingPhase.DATA_COLLECTION] = ComposedStrategy(
+ strategies=[
+ MobileScreenshotCaptureStrategy(fail_fast=True),
+ MobileAppsCollectionStrategy(fail_fast=False),
+ MobileControlsCollectionStrategy(fail_fast=False),
+ ],
+ name="MobileDataCollectionStrategy",
+ fail_fast=True,
+ )
+
+ # LLM interaction strategy (depends on all collected data)
+ self.strategies[ProcessingPhase.LLM_INTERACTION] = MobileLLMInteractionStrategy(
+ fail_fast=True
+ )
+
+ # Action execution strategy
+ self.strategies[ProcessingPhase.ACTION_EXECUTION] = (
+ MobileActionExecutionStrategy(fail_fast=False)
+ )
+
+ # Memory update strategy
+ self.strategies[ProcessingPhase.MEMORY_UPDATE] = AppMemoryUpdateStrategy(
+ fail_fast=False
+ )
+
+ def _setup_middleware(self) -> None:
+ """Setup middleware pipeline for Mobile Agent."""
+ # Use Mobile logging middleware for proper request display
+ self.middleware_chain = [MobileLoggingMiddleware()]
+
+ def _finalize_processing_context(
+ self, processing_context: ProcessingContext
+ ) -> None:
+ """
+ Finalize processing context by updating existing ContextNames fields.
+ :param processing_context: The processing context to finalize.
+ """
+ super()._finalize_processing_context(processing_context)
+ try:
+ result = processing_context.get_local("result")
+ if result:
+ self.global_context.set(ContextNames.ROUND_RESULT, result)
+ except Exception as e:
+ self.logger.warning(f"Failed to update ContextNames from results: {e}")
diff --git a/ufo/agents/processors/schemas/target.py b/ufo/agents/processors/schemas/target.py
index e902f8a90..4cae546a2 100644
--- a/ufo/agents/processors/schemas/target.py
+++ b/ufo/agents/processors/schemas/target.py
@@ -25,7 +25,7 @@ class TargetInfo(BaseModel):
id: Optional[str] = None # The ID of the target (only valid at current step)
type: Optional[str] = None # The type of the target (e.g., process, app, etc.)
rect: Optional[List[int]] = (
- None # The rectangle of the target [left, top, width, height]
+ None # The rectangle of the target [left, top, right, bottom]
)
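Review note on the `rect` comment change: with the layout documented as `[left, top, right, bottom]`, any consumer that needs a size or a tap point has to derive it. A minimal sketch, assuming integer pixel coordinates; `rect_size_and_center` is a hypothetical helper, not part of `TargetInfo`.

```python
from typing import List, Tuple


def rect_size_and_center(rect: List[int]) -> Tuple[int, int, Tuple[int, int]]:
    """Derive (width, height, center) from a [left, top, right, bottom] rect."""
    left, top, right, bottom = rect
    width, height = right - left, bottom - top
    center = (left + width // 2, top + height // 2)
    return width, height, center


print(rect_size_and_center([100, 200, 300, 600]))
```

Callers that previously read the last two entries as width and height would compute wrong tap coordinates, so they are worth auditing alongside this comment fix.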
diff --git a/ufo/agents/processors/strategies/linux_agent_strategy.py b/ufo/agents/processors/strategies/linux_agent_strategy.py
index 77cdfa581..34bbdf0b0 100644
--- a/ufo/agents/processors/strategies/linux_agent_strategy.py
+++ b/ufo/agents/processors/strategies/linux_agent_strategy.py
@@ -240,6 +240,11 @@ def starting_message(self, context: ProcessingContext) -> str:
:return: Starting message string
"""
- request = context.get_local("request")
-
- return f"Completing the user request [{request}] on Linux."
+ # Try both global and local context for request
+ request = (
+ context.get("request") or context.get_local("request") or "Unknown Request"
+ )
+
+ return (
+ f"Completing the user request: [bold cyan]{request}[/bold cyan] on Linux."
+ )
diff --git a/ufo/agents/processors/strategies/mobile_agent_strategy.py b/ufo/agents/processors/strategies/mobile_agent_strategy.py
new file mode 100644
index 000000000..fe24485e0
--- /dev/null
+++ b/ufo/agents/processors/strategies/mobile_agent_strategy.py
@@ -0,0 +1,838 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+"""
+Mobile Agent Processing Strategies - data collection, LLM interaction, and action execution for Android devices.
+
+This module contains data collection strategies for Mobile Agent including:
+- Screenshot capture (clean and annotated)
+- Installed apps information collection
+- Current screen controls information collection
+"""
+
+import traceback
+from typing import TYPE_CHECKING, List, Dict, Any
+
+from ufo import utils
+from ufo.agents.processors.context.processing_context import (
+ ProcessingContext,
+ ProcessingPhase,
+ ProcessingResult,
+)
+from ufo.agents.processors.core.strategy_dependency import depends_on, provides
+from ufo.agents.processors.app_agent_processor import AppAgentLoggingMiddleware
+from ufo.agents.processors.strategies.app_agent_processing_strategy import (
+ AppLLMInteractionStrategy,
+ AppActionExecutionStrategy,
+)
+from ufo.agents.processors.strategies.processing_strategy import BaseProcessingStrategy
+from ufo.automator.ui_control.screenshot import PhotographerFacade
+from config.config_loader import get_ufo_config
+from aip.messages import Command, ResultStatus, Result
+from ufo.module.dispatcher import BasicCommandDispatcher
+from ufo.agents.processors.schemas.actions import (
+ ListActionCommandInfo,
+ ActionCommandInfo,
+)
+from ufo.llm.response_schema import AppAgentResponse
+from ufo.agents.processors.schemas.target import TargetInfo, TargetKind
+
+# Load configuration
+ufo_config = get_ufo_config()
+
+if TYPE_CHECKING:
+ from ufo.agents.agent.customized_agent import MobileAgent
+
+
+@depends_on("log_path", "session_step")
+@provides(
+ "clean_screenshot_path",
+ "clean_screenshot_url",
+ "annotated_screenshot_url",
+ "screenshot_saved_time",
+)
+class MobileScreenshotCaptureStrategy(BaseProcessingStrategy):
+ """
+ Strategy for capturing Android device screenshots.
+
+ This strategy handles:
+ - Device screenshot capture via MCP server
+ - Screenshot path management and storage
+ - Performance timing for screenshot operations
+ """
+
+ def __init__(self, fail_fast: bool = True) -> None:
+ """
+ Initialize screenshot capture strategy.
+ :param fail_fast: Whether to raise exceptions immediately on errors
+ """
+ super().__init__(name="mobile_screenshot_capture", fail_fast=fail_fast)
+
+ async def execute(
+ self, agent: "MobileAgent", context: ProcessingContext
+ ) -> ProcessingResult:
+ """
+ Execute screenshot capture for Mobile Agent.
+ :param agent: The MobileAgent instance
+ :param context: Processing context
+ :return: ProcessingResult with screenshot paths and timing
+ """
+ try:
+ import time
+
+ start_time = time.time()
+
+ # Extract context variables with validation
+ log_path = context.get("log_path")
+ session_step = context.get("session_step", 0)
+ command_dispatcher = context.global_context.command_dispatcher
+
+ # Validate required context variables
+ if log_path is None:
+ raise ValueError("log_path is required but not found in context")
+ if command_dispatcher is None:
+ raise ValueError(
+ "command_dispatcher is required but not found in global context"
+ )
+
+ # Step 1: Capture clean screenshot
+ self.logger.info("Capturing Android device screenshot")
+
+ clean_screenshot_path = f"{log_path}action_step{session_step}.png"
+
+ clean_screenshot_url = await self._capture_screenshot(
+ clean_screenshot_path, command_dispatcher
+ )
+
+ # Step 2: Leave the annotated screenshot unset at this stage
+ annotated_screenshot_url = None
+ # Note: MobileControlsCollectionStrategy generates the annotated
+ # screenshot later, once control rects are known
+
+ screenshot_time = time.time() - start_time
+
+ return ProcessingResult(
+ success=True,
+ data={
+ "clean_screenshot_path": clean_screenshot_path,
+ "clean_screenshot_url": clean_screenshot_url,
+ "annotated_screenshot_url": annotated_screenshot_url,
+ "screenshot_saved_time": screenshot_time,
+ },
+ phase=ProcessingPhase.DATA_COLLECTION,
+ )
+
+ except Exception as e:
+ error_msg = f"Screenshot capture failed: {str(e)}"
+ self.logger.error(error_msg)
+ return self.handle_error(e, ProcessingPhase.DATA_COLLECTION, context)
+
+ async def _capture_screenshot(
+ self, save_path: str, command_dispatcher: BasicCommandDispatcher
+ ) -> str:
+ """
+ Capture Android device screenshot via MCP server.
+ :param save_path: The path for saving screenshot
+ :param command_dispatcher: Command dispatcher for executing commands
+ :return: The base64 URL of the screenshot
+ """
+ try:
+ if not command_dispatcher:
+ raise ValueError("Command dispatcher not available")
+
+ # Execute capture_screenshot command via MCP server
+ result = await command_dispatcher.execute_commands(
+ [
+ Command(
+ tool_name="capture_screenshot",
+ parameters={},
+ tool_type="data_collection",
+ )
+ ]
+ )
+
+ if (
+ not result
+ or not result[0].result
+ or result[0].status != ResultStatus.SUCCESS
+ ):
+ raise ValueError("Failed to capture screenshot")
+
+ # Extract image data from result - now it's directly a base64 string
+ clean_screenshot_url = result[0].result
+
+ # Save screenshot to file
+ utils.save_image_string(clean_screenshot_url, save_path)
+ self.logger.info(f"Screenshot saved to: {save_path}")
+
+ return clean_screenshot_url
+
+ except Exception as e:
+ raise RuntimeError(f"Failed to capture screenshot: {e}") from e
+
+
+@depends_on("clean_screenshot_url")
+@provides("installed_apps", "apps_collection_time")
+class MobileAppsCollectionStrategy(BaseProcessingStrategy):
+ """
+ Strategy for collecting installed apps information from Android device.
+
+ This strategy handles:
+ - Fetching installed apps via MCP server
+ - Converting app data (dict or TargetInfo) into prompt-ready dictionaries
+ - Falling back to an empty list when the fetch fails
+ """
+
+ def __init__(self, fail_fast: bool = True) -> None:
+ """
+ Initialize apps collection strategy.
+ :param fail_fast: Whether to raise exceptions immediately on errors
+ """
+ super().__init__(name="mobile_apps_collection", fail_fast=fail_fast)
+
+ async def execute(
+ self, agent: "MobileAgent", context: ProcessingContext
+ ) -> ProcessingResult:
+ """
+ Execute apps collection for Mobile Agent.
+ :param agent: The MobileAgent instance
+ :param context: Processing context
+ :return: ProcessingResult with installed apps list
+ """
+ try:
+ import time
+
+ start_time = time.time()
+
+ command_dispatcher = context.global_context.command_dispatcher
+
+ if command_dispatcher is None:
+ raise ValueError(
+ "command_dispatcher is required but not found in global context"
+ )
+
+ # Fetch installed apps via MCP server
+ self.logger.info("Fetching installed apps from Android device")
+
+ result = await command_dispatcher.execute_commands(
+ [
+ Command(
+ tool_name="get_mobile_app_target_info",
+ parameters={"include_system_apps": False},
+ tool_type="data_collection",
+ )
+ ]
+ )
+
+ if not result or result[0].status != ResultStatus.SUCCESS:
+ if not result:
+ self.logger.warning("No result returned from MCP server")
+ else:
+ self.logger.warning(
+ f"MCP server returned error. Status: {result[0].status}, Error: {getattr(result[0], 'error', 'N/A')}"
+ )
+ self.logger.warning("Failed to fetch installed apps, using empty list")
+ installed_apps = []
+ else:
+ # Parse the result - MCP returns dictionaries or TargetInfo objects
+ # result[0].result could be an empty list [] which is valid
+ apps_data = result[0].result or []
+ if isinstance(apps_data, list):
+ installed_apps = []
+ for app in apps_data:
+ # Handle both dict and TargetInfo objects
+ if isinstance(app, dict):
+ installed_apps.append(self._dict_to_app_dict(app))
+ else:
+ installed_apps.append(self._target_info_to_dict(app))
+ else:
+ installed_apps = []
+
+ apps_time = time.time() - start_time
+
+ self.logger.info(f"Collected {len(installed_apps)} installed apps")
+
+ return ProcessingResult(
+ success=True,
+ data={
+ "installed_apps": installed_apps,
+ "apps_collection_time": apps_time,
+ },
+ phase=ProcessingPhase.DATA_COLLECTION,
+ )
+
+ except Exception as e:
+ error_msg = f"Apps collection failed: {str(e)}"
+ self.logger.error(error_msg)
+ return self.handle_error(e, ProcessingPhase.DATA_COLLECTION, context)
+
+ def _target_info_to_dict(self, target_info: TargetInfo) -> Dict[str, Any]:
+ """
+ Convert TargetInfo object to dictionary for prompt.
+ :param target_info: TargetInfo object
+ :return: Dictionary representation
+ """
+ return {
+ "id": target_info.id,
+ "name": target_info.name,
+ "package": target_info.type,
+ }
+
+ def _dict_to_app_dict(self, app_dict: Dict[str, Any]) -> Dict[str, Any]:
+ """
+ Convert MCP returned dictionary to app dictionary for prompt.
+ :param app_dict: Dictionary from MCP server
+ :return: Dictionary representation for prompt
+ """
+ return {
+ "id": app_dict.get("id", ""),
+ "name": app_dict.get("name", ""),
+ "package": app_dict.get("type", ""),
+ }
+
+
+@depends_on("clean_screenshot_url")
+@provides(
+ "current_controls",
+ "controls_collection_time",
+ "annotated_screenshot_url",
+ "annotated_screenshot_path",
+ "annotation_dict",
+)
+class MobileControlsCollectionStrategy(BaseProcessingStrategy):
+ """
+ Strategy for collecting current screen controls information from Android device.
+
+ This strategy handles:
+ - Fetching current screen UI controls via MCP server
+ - Filtering out controls with malformed or zero-area rectangles
+ - Building an annotation dict keyed by control ID
+ - Creating annotated screenshots with control labels
+ """
+
+ def __init__(self, fail_fast: bool = True) -> None:
+ """
+ Initialize controls collection strategy.
+ :param fail_fast: Whether to raise exceptions immediately on errors
+ """
+ super().__init__(name="mobile_controls_collection", fail_fast=fail_fast)
+ self.photographer = PhotographerFacade()
+
+ async def execute(
+ self, agent: "MobileAgent", context: ProcessingContext
+ ) -> ProcessingResult:
+ """
+ Execute controls collection for Mobile Agent.
+ :param agent: The MobileAgent instance
+ :param context: Processing context
+ :return: ProcessingResult with current screen controls list
+ """
+ try:
+ import time
+
+ start_time = time.time()
+
+ command_dispatcher = context.global_context.command_dispatcher
+
+ if command_dispatcher is None:
+ raise ValueError(
+ "command_dispatcher is required but not found in global context"
+ )
+
+ # Fetch current screen controls via MCP server
+ self.logger.info("Fetching current screen controls from Android device")
+
+ result = await command_dispatcher.execute_commands(
+ [
+ Command(
+ tool_name="get_app_window_controls_target_info",
+ parameters={},
+ tool_type="data_collection",
+ )
+ ]
+ )
+
+ if not result or result[0].status != ResultStatus.SUCCESS:
+ if not result:
+ self.logger.warning("No result returned from MCP server")
+ else:
+ self.logger.warning(
+ f"MCP server returned error. Status: {result[0].status}, Error: {getattr(result[0], 'error', 'N/A')}"
+ )
+ self.logger.warning(
+ "Failed to fetch current controls, using empty list"
+ )
+ current_controls = []
+ else:
+ # Parse the result - MCP returns dictionaries or TargetInfo objects
+ # result[0].result could be an empty list [] which is valid
+ controls_data = result[0].result or []
+ if isinstance(controls_data, list):
+ current_controls = []
+ for control in controls_data:
+ # Handle both dict and TargetInfo objects
+ if isinstance(control, dict):
+ control_dict = self._dict_to_control_dict(control)
+ # Only add if it has a valid rect
+ if control_dict is not None:
+ current_controls.append(control_dict)
+ else:
+ control_dict = self._target_info_to_dict(control)
+ if control_dict is not None:
+ current_controls.append(control_dict)
+ else:
+ current_controls = []
+
+ controls_time = time.time() - start_time
+
+ self.logger.info(f"Collected {len(current_controls)} screen controls")
+
+ # Generate annotated screenshot with control IDs and annotation dict
+ annotated_screenshot_url = None
+ annotated_screenshot_path = None
+ annotation_dict = {}
+
+ if len(current_controls) > 0:
+ clean_screenshot_path = context.get_local("clean_screenshot_path")
+ log_path = context.get_local("log_path")
+ session_step = context.get_local("session_step", 0)
+
+ if clean_screenshot_path and log_path:
+ annotated_screenshot_path = (
+ f"{log_path}action_step{session_step}_annotated.png"
+ )
+
+ # Convert controls to TargetInfo objects for photographer
+ # Use current_controls which are already validated dictionaries
+ target_info_list = self._controls_to_target_info_list(
+ current_controls
+ )
+
+ # Create annotation dict
+ annotation_dict = {
+ control.get("id"): control
+ for control in current_controls
+ if "id" in control
+ }
+
+ # Generate annotated screenshot using photographer
+ annotated_screenshot_url = self._save_annotated_screenshot(
+ clean_screenshot_path,
+ target_info_list,
+ annotated_screenshot_path,
+ )
+
+ if annotated_screenshot_url:
+ self.logger.info(
+ f"Created annotated screenshot with {len(current_controls)} controls"
+ )
+ else:
+ self.logger.warning("Failed to create annotated screenshot")
+
+ return ProcessingResult(
+ success=True,
+ data={
+ "current_controls": current_controls,
+ "controls_collection_time": controls_time,
+ "annotated_screenshot_url": annotated_screenshot_url,
+ "annotated_screenshot_path": annotated_screenshot_path,
+ "annotation_dict": annotation_dict,
+ },
+ phase=ProcessingPhase.DATA_COLLECTION,
+ )
+
+ except Exception as e:
+ error_msg = f"Controls collection failed: {str(e)}"
+ self.logger.error(error_msg)
+ return self.handle_error(e, ProcessingPhase.DATA_COLLECTION, context)
+
+ def _target_info_to_dict(self, target_info: TargetInfo) -> Dict[str, Any]:
+ """
+ Convert TargetInfo object to dictionary for prompt.
+ :param target_info: TargetInfo object
+ :return: Dictionary representation
+ """
+ result = {
+ "id": target_info.id,
+ "name": target_info.name,
+ "type": target_info.type,
+ }
+ if target_info.rect:
+ result["rect"] = target_info.rect
+ return result
+
+ def _dict_to_control_dict(self, control_dict: Dict[str, Any]) -> Dict[str, Any] | None:
+ """
+ Convert MCP returned dictionary to control dictionary for prompt.
+ Validates rectangle and returns None if invalid.
+ :param control_dict: Dictionary from MCP server
+ :return: Dictionary representation for prompt, or None if invalid
+ """
+ rect = control_dict.get("rect")
+
+ # Validate rectangle if present
+ # rect format is [left, top, right, bottom] (bbox format)
+ if rect:
+ if not isinstance(rect, list) or len(rect) < 4:
+ self.logger.debug(
+ f"Skipping control with malformed rect: {control_dict.get('id')}"
+ )
+ return None
+
+ left, top, right, bottom = rect[0], rect[1], rect[2], rect[3]
+
+ # Check if dimensions are valid (right > left and bottom > top)
+ if right <= left or bottom <= top:
+ self.logger.debug(
+ f"Skipping control with invalid dimensions: {control_dict.get('id')} - "
+ f"rect={rect} (right={right}, left={left}, bottom={bottom}, top={top})"
+ )
+ return None
+
+ result = {
+ "id": control_dict.get("id", ""),
+ "name": control_dict.get("name", ""),
+ "type": control_dict.get("type", ""),
+ }
+ if rect:
+ result["rect"] = rect
+ return result
+
+ def _controls_to_target_info_list(self, controls_data: List) -> List[TargetInfo]:
+ """
+ Convert control dictionaries to TargetInfo objects.
+ Filters out controls with invalid rectangles.
+ :param controls_data: List of control dictionaries or TargetInfo objects
+ :return: List of TargetInfo objects with valid rectangles
+ """
+ target_info_list = []
+ invalid_count = 0
+
+ for control in controls_data:
+ if isinstance(control, dict):
+ rect = control.get("rect")
+
+ # Validate rectangle: [left, top, right, bottom] (bbox format)
+ # Skip if rect is None, empty, or has invalid dimensions
+ if rect and len(rect) >= 4:
+ left, top, right, bottom = rect[0], rect[1], rect[2], rect[3]
+
+ # Check if dimensions are valid (right > left and bottom > top)
+ if right > left and bottom > top:
+ # Create TargetInfo from dict
+ target_info = TargetInfo(
+ kind=TargetKind.CONTROL,
+ id=control.get("id", ""),
+ name=control.get("name", ""),
+ type=control.get("type", ""),
+ rect=rect,
+ )
+ target_info_list.append(target_info)
+ else:
+ invalid_count += 1
+ self.logger.debug(
+ f"Skipping control with invalid dimensions: {control.get('id')} - rect={rect}"
+ )
+ else:
+ invalid_count += 1
+ self.logger.debug(
+ f"Skipping control without valid rect: {control.get('id')}"
+ )
+
+ elif isinstance(control, TargetInfo):
+ # Validate TargetInfo rect as well
+ if control.rect and len(control.rect) >= 4:
+ left, top, right, bottom = (
+ control.rect[0],
+ control.rect[1],
+ control.rect[2],
+ control.rect[3],
+ )
+ if right > left and bottom > top:
+ target_info_list.append(control)
+ else:
+ invalid_count += 1
+ else:
+ invalid_count += 1
+
+ if invalid_count > 0:
+ self.logger.warning(
+ f"Filtered out {invalid_count} controls with invalid rectangles"
+ )
+
+ return target_info_list
+
+ def _save_annotated_screenshot(
+ self,
+ clean_screenshot_path: str,
+ target_list: List[TargetInfo],
+ save_path: str,
+ ) -> str | None:
+ """
+ Save annotated screenshot using photographer.
+ :param clean_screenshot_path: Path to the clean screenshot
+ :param target_list: List of TargetInfo objects
+ :param save_path: The saved path of the annotated screenshot
+ :return: The annotated image string (base64 URL), or None if annotation fails
+ """
+ try:
+ # For mobile, we don't have application_window_info, so create a dummy one
+ # The photographer will use the full screenshot
+ dummy_window_info = TargetInfo(
+ kind=TargetKind.WINDOW,
+ id="mobile_screen",
+ name="Mobile Screen",
+ type="mobile",
+ )
+
+ self.photographer.capture_app_window_screenshot_with_target_list(
+ application_window_info=dummy_window_info,
+ target_list=target_list,
+ path=clean_screenshot_path,
+ save_path=save_path,
+ highlight_bbox=True,
+ )
+
+ annotated_screenshot_url = self.photographer.encode_image_from_path(
+ save_path
+ )
+ return annotated_screenshot_url
+ except Exception as e:
+ import traceback
+
+ self.logger.error(f"Failed to save annotated screenshot: {str(e)}")
+ self.logger.error(traceback.format_exc())
+ return None
+
+
+@depends_on("installed_apps", "current_controls", "clean_screenshot_url")
+@provides(
+ "parsed_response",
+ "response_text",
+ "llm_cost",
+ "prompt_message",
+ "action",
+ "thought",
+ "comment",
+)
+class MobileLLMInteractionStrategy(AppLLMInteractionStrategy):
+ """
+ Strategy for LLM interaction with Mobile Agent specific prompting.
+
+ This strategy handles:
+ - Context-aware prompt construction with mobile-specific data
+ - Screenshot and control information integration in prompts
+ - LLM interaction with retry logic
+ - Response parsing and validation
+ """
+
+ def __init__(self, fail_fast: bool = True) -> None:
+ """
+ Initialize Mobile Agent LLM interaction strategy.
+ :param fail_fast: Whether to raise exceptions immediately on errors
+ """
+ super().__init__(fail_fast=fail_fast)
+
+ async def execute(
+ self, agent: "MobileAgent", context: ProcessingContext
+ ) -> ProcessingResult:
+ """
+ Execute LLM interaction for Mobile Agent.
+ :param agent: The MobileAgent instance
+ :param context: Processing context with mobile device data
+ :return: ProcessingResult with parsed response and cost
+ """
+ try:
+ request = context.get("request")
+ installed_apps = context.get_local("installed_apps", [])
+ current_controls = context.get_local("current_controls", [])
+ clean_screenshot_url = context.get_local("clean_screenshot_url")
+ annotated_screenshot_url = context.get_local("annotated_screenshot_url")
+ plan = self._get_prev_plan(agent)
+
+ # Build comprehensive prompt
+ self.logger.info("Building Mobile Agent prompt")
+
+ # Get blackboard context
+ blackboard_prompt = []
+ if not agent.blackboard.is_empty():
+ blackboard_prompt = agent.blackboard.blackboard_to_prompt()
+
+ prompt_message = agent.message_constructor(
+ dynamic_examples=[],
+ dynamic_knowledge="",
+ plan=plan,
+ request=request,
+ installed_apps=installed_apps,
+ current_controls=current_controls,
+ screenshot_url=clean_screenshot_url,
+ annotated_screenshot_url=annotated_screenshot_url,
+ blackboard_prompt=blackboard_prompt,
+ last_success_actions=self._get_last_success_actions(agent=agent),
+ )
+
+ # Get LLM response
+ self.logger.info("Getting LLM response for Mobile Agent")
+ response_text, llm_cost = await self._get_llm_response(
+ agent, prompt_message
+ )
+
+ # Parse and validate response
+ self.logger.info("Parsing Mobile Agent response")
+ parsed_response = self._parse_app_response(agent, response_text)
+
+ # Extract structured data
+ structured_data = parsed_response.model_dump()
+
+ return ProcessingResult(
+ success=True,
+ data={
+ "parsed_response": parsed_response,
+ "response_text": response_text,
+ "llm_cost": llm_cost,
+ "prompt_message": prompt_message,
+ **structured_data,
+ },
+ phase=ProcessingPhase.LLM_INTERACTION,
+ )
+
+ except Exception as e:
+ error_msg = f"Mobile LLM interaction failed: {str(e)}"
+ self.logger.error(error_msg)
+ return self.handle_error(e, ProcessingPhase.LLM_INTERACTION, context)
+
+
+class MobileActionExecutionStrategy(AppActionExecutionStrategy):
+ """
+ Strategy for executing actions in Mobile Agent.
+
+ This strategy handles:
+ - Action execution based on parsed LLM response
+ - Result capturing and error handling
+ """
+
+ def __init__(self, fail_fast: bool = True) -> None:
+ """
+ Initialize Mobile action execution strategy.
+ :param fail_fast: Whether to raise exceptions immediately on errors
+ """
+ super().__init__(fail_fast=fail_fast)
+
+ async def execute(
+ self, agent: "MobileAgent", context: ProcessingContext
+ ) -> ProcessingResult:
+ """
+ Execute Mobile Agent actions.
+ :param agent: The MobileAgent instance
+ :param context: Processing context with response and control data
+ :return: ProcessingResult with execution results
+ """
+ try:
+ # Step 1: Extract context variables
+ parsed_response: AppAgentResponse = context.get_local("parsed_response")
+ command_dispatcher = context.global_context.command_dispatcher
+
+ if not parsed_response:
+ return ProcessingResult(
+ success=True,
+ data={"message": "No response available for action execution"},
+ phase=ProcessingPhase.ACTION_EXECUTION,
+ )
+
+ # Execute the action
+ execution_results = await self._execute_app_action(
+ command_dispatcher, parsed_response.action
+ )
+
+ # Create action info for memory
+ actions = self._create_action_info(
+ parsed_response.action,
+ execution_results,
+ )
+
+ # Print action info
+ action_info = ListActionCommandInfo(actions)
+ action_info.color_print()
+
+ # Create control log
+ control_log = action_info.get_target_info()
+
+ status = (
+ parsed_response.action.status
+ if isinstance(parsed_response.action, ActionCommandInfo)
+ else action_info.status
+ )
+
+ return ProcessingResult(
+ success=True,
+ data={
+ "execution_result": execution_results,
+ "action_info": action_info,
+ "control_log": control_log,
+ "status": status,
+ },
+ phase=ProcessingPhase.ACTION_EXECUTION,
+ )
+
+ except Exception as e:
+ error_msg = f"Mobile action execution failed: {str(traceback.format_exc())}"
+ self.logger.error(error_msg)
+ return self.handle_error(e, ProcessingPhase.ACTION_EXECUTION, context)
+
+ def _create_action_info(
+ self,
+ actions: ActionCommandInfo | List[ActionCommandInfo],
+ execution_results: List[Result],
+ ) -> List[ActionCommandInfo]:
+ """
+ Create action information for memory tracking.
+ :param actions: The action or list of actions
+ :param execution_results: Execution results
+ :return: List of ActionCommandInfo objects
+ """
+ try:
+ if not actions:
+ actions = []
+ if not execution_results:
+ execution_results = []
+
+ if isinstance(actions, ActionCommandInfo):
+ actions = [actions]
+
+ assert len(execution_results) == len(
+ actions
+ ), "Mismatch in actions and execution results length"
+
+ for i, action in enumerate(actions):
+ action.result = execution_results[i]
+
+ if not action.function:
+ action.function = "no_action"
+
+ return actions
+
+ except Exception as e:
+ self.logger.warning(f"Failed to create action info: {str(e)}")
+ return []
+
+
+class MobileLoggingMiddleware(AppAgentLoggingMiddleware):
+ """
+ Specialized logging middleware for Mobile Agent with enhanced contextual information.
+ """
+
+ def starting_message(self, context: ProcessingContext) -> str:
+ """
+ Return the starting message of the agent.
+ :param context: Processing context with round and step information
+ :return: Starting message string
+ """
+
+ # Try both global and local context for request
+ request = (
+ context.get("request") or context.get_local("request") or "Unknown Request"
+ )
+
+ return (
+ f"Completing the user request: [bold cyan]{request}[/bold cyan] on Mobile."
+ )
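
The rectangle checks in `_dict_to_control_dict` and `_controls_to_target_info_list` above share one rule: a rect is `[left, top, right, bottom]` and is usable only when `right > left` and `bottom > top`. A minimal standalone sketch of that rule follows. `is_valid_bbox` and the sample controls are hypothetical names for illustration and are not part of this PR; accepting tuples as well as lists is an assumption.

```python
from typing import Any, Optional, Sequence


def is_valid_bbox(rect: Optional[Sequence[Any]]) -> bool:
    """Return True if rect looks like [left, top, right, bottom] with positive area."""
    # Hypothetical helper: reject missing, non-sequence, or too-short rects.
    if not isinstance(rect, (list, tuple)) or len(rect) < 4:
        return False
    left, top, right, bottom = rect[:4]
    # A usable box needs positive width and positive height.
    return right > left and bottom > top


# Sample data (made up): one good control, one zero-width control, one without a rect.
controls = [
    {"id": "1", "name": "OK", "type": "Button", "rect": [0, 0, 100, 50]},
    {"id": "2", "name": "Hidden", "type": "View", "rect": [10, 10, 10, 40]},
    {"id": "3", "name": "NoRect", "type": "View"},
]

# Only controls with a valid box can be drawn on the annotated screenshot.
drawable = [c for c in controls if is_valid_bbox(c.get("rect"))]
print([c["id"] for c in drawable])
```

One difference in the diff is worth noting: `_dict_to_control_dict` keeps a control that has no rect at all, whereas `_controls_to_target_info_list` drops it, so such a control can appear in `current_controls` without being drawn on the annotated screenshot.
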
diff --git a/ufo/agents/states/mobile_agent_state.py b/ufo/agents/states/mobile_agent_state.py
new file mode 100644
index 000000000..7dbad35e6
--- /dev/null
+++ b/ufo/agents/states/mobile_agent_state.py
@@ -0,0 +1,262 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+from __future__ import annotations
+
+from enum import Enum
+from typing import TYPE_CHECKING, Dict, Optional, Type
+
+from ufo.agents.states.basic import AgentState, AgentStateManager
+from config.config_loader import get_ufo_config
+from ufo.module.context import Context
+
+# Avoid circular import
+if TYPE_CHECKING:
+ from ufo.agents.agent.customized_agent import MobileAgent
+
+
+ufo_config = get_ufo_config()
+
+
+class MobileAgentStatus(Enum):
+ """
+ Store the status of the mobile agent.
+ """
+
+ FINISH = "FINISH"
+ CONTINUE = "CONTINUE"
+ FAIL = "FAIL"
+
+
+class MobileAgentStateManager(AgentStateManager):
+
+ _state_mapping: Dict[str, Type[MobileAgentState]] = {}
+
+ @property
+ def none_state(self) -> AgentState:
+ """
+ The none state of the state manager.
+ """
+ return NoneMobileAgentState()
+
+
+class MobileAgentState(AgentState):
+ """
+ The abstract class for the mobile agent state.
+ """
+
+ async def handle(
+ self, agent: "MobileAgent", context: Optional["Context"] = None
+ ) -> None:
+ """
+ Handle the agent for the current step.
+ :param agent: The agent for the current step.
+ :param context: The context for the agent and session.
+ """
+ pass
+
+ @classmethod
+ def agent_class(cls) -> Type[MobileAgent]:
+ """
+ The agent class of the state.
+ :return: The agent class.
+ """
+
+ # Avoid circular import
+ from ufo.agents.agent.customized_agent import MobileAgent
+
+ return MobileAgent
+
+ def next_agent(self, agent: "MobileAgent") -> "MobileAgent":
+ """
+ Get the agent for the next step.
+ :param agent: The agent for the current step.
+ :return: The agent for the next step.
+ """
+ return agent
+
+ def next_state(self, agent: "MobileAgent") -> MobileAgentState:
+ """
+ Get the next state of the agent.
+ :param agent: The agent for the current step.
+ :return: The state for the next step.
+ """
+
+ status = agent.status
+ state = MobileAgentStateManager().get_state(status)
+ return state
+
+ def is_round_end(self) -> bool:
+ """
+ Check if the round ends.
+ :return: True if the round ends, False otherwise.
+ """
+ return False
+
+
+@MobileAgentStateManager.register
+class FinishMobileAgentState(MobileAgentState):
+ """
+ The class for the finish mobile agent state.
+ """
+
+ def next_agent(self, agent: "MobileAgent") -> "MobileAgent":
+ """
+ Get the agent for the next step.
+ :param agent: The agent for the current step.
+ :return: The agent for the next step.
+ """
+ return agent
+
+ def next_state(self, agent: "MobileAgent") -> MobileAgentState:
+ """
+ Get the next state of the agent.
+ :param agent: The agent for the current step.
+ :return: The state for the next step.
+ """
+ return FinishMobileAgentState()
+
+ def is_subtask_end(self) -> bool:
+ """
+ Check if the subtask ends.
+ :return: True if the subtask ends, False otherwise.
+ """
+ return True
+
+ def is_round_end(self) -> bool:
+ """
+ Check if the round ends.
+ :return: True if the round ends, False otherwise.
+ """
+ return True
+
+ @classmethod
+ def name(cls) -> str:
+ """
+ The class name of the state.
+ :return: The name of the state.
+ """
+ return MobileAgentStatus.FINISH.value
+
+
+@MobileAgentStateManager.register
+class ContinueMobileAgentState(MobileAgentState):
+ """
+ The class for the continue mobile agent state.
+ """
+
+ async def handle(
+ self, agent: "MobileAgent", context: Optional["Context"] = None
+ ) -> None:
+ """
+ Handle the agent for the current step.
+ :param agent: The agent for the current step.
+ :param context: The context for the agent and session.
+ """
+
+ await agent.process(context)
+
+ def is_subtask_end(self) -> bool:
+ """
+ Check if the subtask ends.
+ :return: True if the subtask ends, False otherwise.
+ """
+ return False
+
+ @classmethod
+ def name(cls) -> str:
+ """
+ The class name of the state.
+ :return: The name of the state.
+ """
+ return MobileAgentStatus.CONTINUE.value
+
+
+@MobileAgentStateManager.register
+class FailMobileAgentState(MobileAgentState):
+ """
+ The class for the fail mobile agent state.
+ """
+
+ def next_agent(self, agent: "MobileAgent") -> "MobileAgent":
+ """
+ Get the agent for the next step.
+ :param agent: The agent for the current step.
+ :return: The agent for the next step.
+ """
+ return agent
+
+ def next_state(self, agent: "MobileAgent") -> MobileAgentState:
+ """
+ Get the next state of the agent.
+ :param agent: The agent for the current step.
+ :return: The state for the next step.
+ """
+ return FinishMobileAgentState()
+
+ def is_round_end(self) -> bool:
+ """
+ Check if the round ends.
+ :return: True if the round ends, False otherwise.
+ """
+ return True
+
+ def is_subtask_end(self) -> bool:
+ """
+ Check if the subtask ends.
+ :return: True if the subtask ends, False otherwise.
+ """
+ return True
+
+ @classmethod
+ def name(cls) -> str:
+ """
+ The class name of the state.
+ :return: The name of the state.
+ """
+ return MobileAgentStatus.FAIL.value
+
+
+@MobileAgentStateManager.register
+class NoneMobileAgentState(MobileAgentState):
+ """
+ The class for the none mobile agent state.
+ """
+
+ def next_agent(self, agent: "MobileAgent") -> "MobileAgent":
+ """
+ Get the agent for the next step.
+ :param agent: The agent for the current step.
+ :return: The agent for the next step.
+ """
+ return agent
+
+ def next_state(self, agent: "MobileAgent") -> MobileAgentState:
+ """
+ Get the next state of the agent.
+ :param agent: The agent for the current step.
+ :return: The state for the next step.
+ """
+ return FinishMobileAgentState()
+
+ def is_subtask_end(self) -> bool:
+ """
+ Check if the subtask ends.
+ :return: True if the subtask ends, False otherwise.
+ """
+ return True
+
+ def is_round_end(self) -> bool:
+ """
+ Check if the round ends.
+ :return: True if the round ends, False otherwise.
+ """
+ return True
+
+ @classmethod
+ def name(cls) -> str:
+ """
+ The class name of the state.
+ :return: The name of the state.
+ """
+ return ""
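
The state classes above form a small status-driven state machine: CONTINUE keeps processing, while FINISH, FAIL, and the none state all end the round. The sketch below imitates that lookup with a hypothetical `Registry` class. It is a stand-in for illustration only and does not reproduce the real `AgentStateManager` API in `ufo.agents.states.basic`; in particular, falling back to the none state for an unknown status is an assumption.

```python
from typing import Dict, Type


class State:
    """Hypothetical base state; does not end the round by default."""

    @classmethod
    def name(cls) -> str:
        return ""

    def is_round_end(self) -> bool:
        return False


class Registry:
    """Hypothetical registry mapping a status string to a state class."""

    _states: Dict[str, Type[State]] = {}

    @classmethod
    def register(cls, state_cls: Type[State]) -> Type[State]:
        # Used as a class decorator: index the class by its status name.
        cls._states[state_cls.name()] = state_cls
        return state_cls

    @classmethod
    def get_state(cls, status: str) -> State:
        # Assumption: unknown statuses fall back to the none state.
        return cls._states.get(status, NoneState)()


@Registry.register
class ContinueState(State):
    @classmethod
    def name(cls) -> str:
        return "CONTINUE"


@Registry.register
class FinishState(State):
    @classmethod
    def name(cls) -> str:
        return "FINISH"

    def is_round_end(self) -> bool:
        return True


@Registry.register
class FailState(FinishState):
    @classmethod
    def name(cls) -> str:
        return "FAIL"


@Registry.register
class NoneState(FinishState):
    @classmethod
    def name(cls) -> str:
        return ""


# Walk a few statuses and show which state handles each one.
for status in ["CONTINUE", "FINISH", "FAIL", "UNKNOWN"]:
    state = Registry.get_state(status)
    print(status, type(state).__name__, state.is_round_end())
```
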
diff --git a/ufo/client/client.py b/ufo/client/client.py
index 3cf4010ad..c2b42aa39 100644
--- a/ufo/client/client.py
+++ b/ufo/client/client.py
@@ -57,15 +57,15 @@
"--platform",
dest="platform",
default=None,
- choices=["windows", "linux"],
- help="Platform override (windows or linux). If not specified, auto-detected from system.",
+ choices=["windows", "linux", "mobile"],
+ help="Platform override (windows, linux, or mobile). If not specified, auto-detected from system.",
)
args = parser.parse_args()
# Auto-detect platform if not specified
if args.platform is None:
detected_platform = platform_module.system().lower()
 if detected_platform in ["windows", "linux"]:
args.platform = detected_platform
else:
# Fallback to windows for unsupported platforms
diff --git a/ufo/client/mcp/http_servers/mobile_mcp_server.py b/ufo/client/mcp/http_servers/mobile_mcp_server.py
new file mode 100644
index 000000000..a513b9bb4
--- /dev/null
+++ b/ufo/client/mcp/http_servers/mobile_mcp_server.py
@@ -0,0 +1,1521 @@
+#!/usr/bin/env python3
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+"""
+Mobile MCP Servers
+Provides two MCP servers:
+1. Mobile Data Collection Server - for data retrieval operations (screenshots, UI tree, device info, etc.)
+2. Mobile Action Server - for device control actions (tap, swipe, type, launch app, etc.)
+Both servers share the same MobileServerState for coordinated operations.
+Similar to linux_mcp_server.py structure with two separate servers on different ports.
+"""
+
+import argparse
+import asyncio
+import base64
+import os
+import subprocess
+import tempfile
+import xml.etree.ElementTree as ET
+from typing import Annotated, Any, Dict, List, Optional
+from fastmcp import FastMCP
+from pydantic import Field
+from ufo.agents.processors.schemas.target import TargetInfo, TargetKind
+
+
+# Singleton Mobile server state
+class MobileServerState:
+ """
+ Singleton state manager for Mobile MCP Servers.
+ Caches app and control information to avoid repeated ADB queries.
+ Shared between Data Collection and Action servers.
+ """
+
+ _instance = None
+ _initialized = False
+
+ def __new__(cls):
+ if cls._instance is None:
+ cls._instance = super(MobileServerState, cls).__new__(cls)
+ return cls._instance
+
+ def __init__(self):
+ if not self._initialized:
+ # Cache for installed apps (List[TargetInfo])
+ self.installed_apps: Optional[List[TargetInfo]] = None
+ self.installed_apps_timestamp: Optional[float] = None
+
+ # Cache for current screen controls (List[TargetInfo])
+ self.current_controls: Optional[List[TargetInfo]] = None
+ self.current_controls_timestamp: Optional[float] = None
+
+ # Cache for UI tree XML
+ self.ui_tree_xml: Optional[str] = None
+ self.ui_tree_timestamp: Optional[float] = None
+
+ # Cache for device info
+ self.device_info: Optional[Dict[str, Any]] = None
+ self.device_info_timestamp: Optional[float] = None
+
+ # Control dictionary for quick lookup by ID
+ self.control_dict: Optional[Dict[str, TargetInfo]] = None
+
+ # Cache expiration times (seconds)
+ self.apps_cache_duration = 300 # 5 minutes for apps list
+ self.controls_cache_duration = 5 # 5 seconds for screen controls
+ self.ui_tree_cache_duration = 5 # 5 seconds for UI tree
+ self.device_info_cache_duration = 60 # 1 minute for device info
+
+ MobileServerState._initialized = True
+
+ def set_installed_apps(self, apps: List[TargetInfo]) -> None:
+ """Cache the installed apps list."""
+ import time
+
+ self.installed_apps = apps
+ self.installed_apps_timestamp = time.time()
+
+ def get_installed_apps(self) -> Optional[List[TargetInfo]]:
+ """Get cached installed apps if not expired."""
+ import time
+
+ if self.installed_apps is None or self.installed_apps_timestamp is None:
+ return None
+
+ if time.time() - self.installed_apps_timestamp > self.apps_cache_duration:
+ return None # Cache expired
+
+ return self.installed_apps
+
+ def set_current_controls(self, controls: List[TargetInfo]) -> None:
+ """Cache the current screen controls and build control dictionary."""
+ import time
+
+ self.current_controls = controls
+ self.current_controls_timestamp = time.time()
+
+ # Build control dictionary for quick lookup
+ self.control_dict = {control.id: control for control in controls}
+
+ def get_current_controls(self) -> Optional[List[TargetInfo]]:
+ """Get cached screen controls if not expired."""
+ import time
+
+ if self.current_controls is None or self.current_controls_timestamp is None:
+ return None
+
+ if time.time() - self.current_controls_timestamp > self.controls_cache_duration:
+ return None # Cache expired
+
+ return self.current_controls
+
+ def get_control_by_id(self, control_id: str) -> Optional[TargetInfo]:
+ """Get a control by its ID from cache."""
+ if self.control_dict is None:
+ return None
+ return self.control_dict.get(control_id)
+
+ def set_ui_tree(self, xml: str) -> None:
+ """Cache the UI tree XML."""
+ import time
+
+ self.ui_tree_xml = xml
+ self.ui_tree_timestamp = time.time()
+
+ def get_ui_tree(self) -> Optional[str]:
+ """Get cached UI tree if not expired."""
+ import time
+
+ if self.ui_tree_xml is None or self.ui_tree_timestamp is None:
+ return None
+
+ if time.time() - self.ui_tree_timestamp > self.ui_tree_cache_duration:
+ return None # Cache expired
+
+ return self.ui_tree_xml
+
+ def set_device_info(self, info: Dict[str, Any]) -> None:
+ """Cache the device information."""
+ import time
+
+ self.device_info = info
+ self.device_info_timestamp = time.time()
+
+ def get_device_info(self) -> Optional[Dict[str, Any]]:
+ """Get cached device info if not expired."""
+ import time
+
+ if self.device_info is None or self.device_info_timestamp is None:
+ return None
+
+ if time.time() - self.device_info_timestamp > self.device_info_cache_duration:
+ return None # Cache expired
+
+ return self.device_info
+
+ def invalidate_controls(self) -> None:
+ """Invalidate the controls cache (e.g., after screen change)."""
+ self.current_controls = None
+ self.current_controls_timestamp = None
+ self.control_dict = None
+
+ def invalidate_ui_tree(self) -> None:
+ """Invalidate the UI tree cache."""
+ self.ui_tree_xml = None
+ self.ui_tree_timestamp = None
+
+ def invalidate_all(self) -> None:
+ """Invalidate all caches."""
+ self.installed_apps = None
+ self.installed_apps_timestamp = None
+ self.current_controls = None
+ self.current_controls_timestamp = None
+ self.ui_tree_xml = None
+ self.ui_tree_timestamp = None
+ self.device_info = None
+ self.device_info_timestamp = None
+ self.control_dict = None
+
+
+# Helper function for resolving an app name to an installed package
+async def _search_app_by_name(
+    app_name: str, adb_path: str, include_system_apps: bool = True
+) -> Optional[str]:
+    """Internal helper: find an installed package whose package name matches app_name (display labels are not queried)."""
+ try:
+ # Get package list
+ list_cmd = [adb_path, "shell", "pm", "list", "packages"]
+ if not include_system_apps:
+ list_cmd.append("-3")
+
+ proc = await asyncio.create_subprocess_exec(
+ *list_cmd,
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, _ = await proc.communicate()
+
+ if proc.returncode != 0:
+ return None
+
+ # Parse packages
+ packages = []
+ for line in stdout.decode("utf-8").split("\n"):
+ if line.startswith("package:"):
+ pkg = line.replace("package:", "").strip()
+ packages.append(pkg)
+
+ # Search for matching packages (simple heuristic)
+ # First try: exact match in package name parts
+ for pkg in packages:
+ parts = pkg.split(".")
+ if any(app_name.lower() == part.lower() for part in parts):
+ return pkg
+
+ # Second try: partial match in package name
+ for pkg in packages:
+ if app_name.lower() in pkg.lower():
+ return pkg
+
+ return None
+
+ except Exception:
+ return None
+
+
+def create_mobile_data_collection_server(
+ host: str = "", port: int = 8020, adb_path: Optional[str] = None
+) -> None:
+ """
+    Create and run (blocking) an MCP server for Mobile data collection operations.
+    Handles: screenshots, UI tree, device info, app list, controls list.
+ """
+
+ if adb_path is None:
+ adb_path = "adb"
+
+ # Initialize shared state manager
+ mobile_state = MobileServerState()
+
+ mcp = FastMCP(
+ "Mobile Data Collection MCP Server",
+ instructions="MCP server for retrieving Android device information via ADB (screenshots, UI tree, device info, etc.).",
+ stateless_http=False,
+ json_response=True,
+ host=host,
+ port=port,
+ )
+
+ # ========================================
+ # Data Collection Tool 1: Capture Screenshot
+ # ========================================
+ @mcp.tool()
+ async def capture_screenshot() -> Annotated[
+ str,
+ Field(
+ description="Base64 encoded image data URI of the screenshot (data:image/png;base64,...)"
+ ),
+ ]:
+ """
+ Capture screenshot from Android device.
+ Returns base64-encoded image data URI directly (matching ui_mcp_server format).
+ """
+        tmp_path = None
+        try:
+            # Create temp file for screenshot
+            with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
+                tmp_path = tmp.name
+
+            # Capture screenshot on device
+            proc = await asyncio.create_subprocess_exec(
+                adb_path,
+                "shell",
+                "screencap",
+                "-p",
+                "/sdcard/screen_temp.png",
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            await proc.communicate()
+
+            if proc.returncode != 0:
+                raise Exception("Failed to capture screenshot on device")
+
+            # Pull screenshot from device
+            proc = await asyncio.create_subprocess_exec(
+                adb_path,
+                "pull",
+                "/sdcard/screen_temp.png",
+                tmp_path,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            await proc.communicate()
+
+            if proc.returncode != 0:
+                raise Exception("Failed to pull screenshot from device")
+
+            # Clean up device temp file (awaited so the process is reaped)
+            proc = await asyncio.create_subprocess_exec(
+                adb_path,
+                "shell",
+                "rm",
+                "/sdcard/screen_temp.png",
+                stdout=asyncio.subprocess.DEVNULL,
+                stderr=asyncio.subprocess.DEVNULL,
+            )
+            await proc.wait()
+
+            # Read and encode as base64
+            with open(tmp_path, "rb") as f:
+                img_data = base64.b64encode(f.read()).decode()
+
+            # Return base64 data URI directly (like ui_mcp_server)
+            return f"data:image/png;base64,{img_data}"
+
+        except Exception as e:
+            raise Exception(f"Error capturing screenshot: {str(e)}") from e
+        finally:
+            # Remove the local temp file on success and on failure
+            if tmp_path and os.path.exists(tmp_path):
+                os.unlink(tmp_path)
+
+ # ========================================
+ # Data Collection Tool 2: Get UI Tree
+ # ========================================
+ @mcp.tool()
+ async def get_ui_tree() -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'ui_tree' (str XML), 'format' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Get the UI hierarchy tree in XML format.
+ Useful for finding element positions and properties.
+ """
+ try:
+ # Generate UI dump on device
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "uiautomator",
+ "dump",
+ "/sdcard/window_dump.xml",
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ await proc.communicate()
+
+ if proc.returncode != 0:
+ return {"success": False, "error": "Failed to dump UI hierarchy"}
+
+ # Read XML content
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "cat",
+ "/sdcard/window_dump.xml",
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
+ if proc.returncode == 0:
+ xml_content = stdout.decode("utf-8")
+ # Cache the UI tree
+ mobile_state.set_ui_tree(xml_content)
+
+ return {
+ "success": True,
+ "ui_tree": xml_content,
+ "format": "xml",
+ }
+ else:
+ return {"success": False, "error": stderr.decode("utf-8")}
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Data Collection Tool 3: Get Device Info
+ # ========================================
+ @mcp.tool()
+ async def get_device_info() -> Annotated[
+ Dict[str, Any],
+ Field(
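+            description="Dictionary with keys: 'success' (bool), 'device_info' (dict with model, android_version, sdk_version, screen_size, screen_density, battery_level, battery_status), 'from_cache' (bool), or 'error' (str)"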
+ ),
+ ]:
+ """
+ Get comprehensive Android device information.
+ Includes model, Android version, screen resolution, battery status.
+ Uses cache to improve performance.
+ """
+ try:
+ # Check cache first
+ cached_info = mobile_state.get_device_info()
+ if cached_info is not None:
+ return {"success": True, "device_info": cached_info, "from_cache": True}
+
+ info = {}
+
+ # Device model
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "getprop",
+ "ro.product.model",
+ stdout=asyncio.subprocess.PIPE,
+ )
+ stdout, _ = await proc.communicate()
+ info["model"] = stdout.decode("utf-8").strip()
+
+ # Android version
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "getprop",
+ "ro.build.version.release",
+ stdout=asyncio.subprocess.PIPE,
+ )
+ stdout, _ = await proc.communicate()
+ info["android_version"] = stdout.decode("utf-8").strip()
+
+ # SDK version
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "getprop",
+ "ro.build.version.sdk",
+ stdout=asyncio.subprocess.PIPE,
+ )
+ stdout, _ = await proc.communicate()
+ info["sdk_version"] = stdout.decode("utf-8").strip()
+
+ # Screen size
+ proc = await asyncio.create_subprocess_exec(
+ adb_path, "shell", "wm", "size", stdout=asyncio.subprocess.PIPE
+ )
+ stdout, _ = await proc.communicate()
+ info["screen_size"] = stdout.decode("utf-8").strip()
+
+ # Screen density
+ proc = await asyncio.create_subprocess_exec(
+ adb_path, "shell", "wm", "density", stdout=asyncio.subprocess.PIPE
+ )
+ stdout, _ = await proc.communicate()
+ info["screen_density"] = stdout.decode("utf-8").strip()
+
+ # Battery info
+ proc = await asyncio.create_subprocess_exec(
+ adb_path, "shell", "dumpsys", "battery", stdout=asyncio.subprocess.PIPE
+ )
+ stdout, _ = await proc.communicate()
+ battery_output = stdout.decode("utf-8")
+
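+            # Parse battery level and status. Match the key exactly so that
+            # other lines that merely contain "level:" or "status:" are ignored.
+            for line in battery_output.split("\n"):
+                key, _, value = line.strip().partition(":")
+                if key == "level":
+                    info["battery_level"] = value.strip()
+                elif key == "status":
+                    info["battery_status"] = value.strip()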
+
+ # Cache the device info
+ mobile_state.set_device_info(info)
+
+ return {"success": True, "device_info": info, "from_cache": False}
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Data Collection Tool 4: Get Mobile App Target Info
+ # ========================================
+ @mcp.tool()
+ async def get_mobile_app_target_info(
+ filter: Annotated[
+ str,
+ Field(
+ description="Filter pattern for package names (optional, e.g., 'com.android')"
+ ),
+ ] = "",
+ include_system_apps: Annotated[
+ bool,
+ Field(
+ description="Whether to include system apps (default: False, only show user-installed apps)"
+ ),
+ ] = False,
+ force_refresh: Annotated[
+ bool,
+ Field(
+ description="Force refresh from device, ignoring cache (default: False)"
+ ),
+ ] = False,
+ ) -> Annotated[
+ List[TargetInfo],
+ Field(
+ description="List of TargetInfo objects representing installed applications"
+ ),
+ ]:
+ """
+        Get installed application packages as a TargetInfo list.
+        Each entry carries the package name in both `name` and `type`;
+        display labels and versions are not queried.
+        Uses cache to improve performance (cache duration: 5 minutes).
+ """
+        try:
+            # Check cache first (only if no filter and not forcing refresh).
+            # The cache is reused only when it was built with the same
+            # include_system_apps setting.
+            if not filter and not force_refresh:
+                cached_apps = mobile_state.get_installed_apps()
+                cached_scope = getattr(
+                    mobile_state, "installed_apps_include_system", None
+                )
+                if cached_apps is not None and cached_scope == include_system_apps:
+                    return cached_apps
+
+            # Get package list
+            list_cmd = ["packages", "-3"] if not include_system_apps else ["packages"]
+            proc = await asyncio.create_subprocess_exec(
+                adb_path,
+                "shell",
+                "pm",
+                "list",
+                *list_cmd,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            stdout, stderr = await proc.communicate()
+
+            if proc.returncode != 0:
+                raise Exception(f"Failed to list packages: {stderr.decode('utf-8')}")
+
+            # Parse package list
+            packages = []
+            for line in stdout.decode("utf-8").split("\n"):
+                if line.startswith("package:"):
+                    pkg = line.replace("package:", "").strip()
+                    if not filter or filter in pkg:
+                        packages.append(pkg)
+
+            # Build a TargetInfo for each package
+            target_info_list = []
+            for i, pkg in enumerate(packages):
+                target_info = TargetInfo(
+                    kind=TargetKind.THIRD_PARTY_AGENT,
+                    id=str(i + 1),
+                    name=pkg,  # Package name (labels are not resolved)
+                    type=pkg,  # Store package name in type field
+                )
+                target_info_list.append(target_info)
+
+            # Cache the result (only if no filter), remembering its scope
+            if not filter:
+                mobile_state.set_installed_apps(target_info_list)
+                mobile_state.installed_apps_include_system = include_system_apps
+
+            return target_info_list
+
+        except Exception as e:
+            raise Exception(f"Failed to get mobile app target info: {str(e)}") from e
+
+ # ========================================
+ # Data Collection Tool 5: Get App Window Controls Target Info
+ # ========================================
+ @mcp.tool()
+ async def get_app_window_controls_target_info(
+ force_refresh: Annotated[
+ bool,
+ Field(
+ description="Force refresh from device, ignoring cache (default: False)"
+ ),
+ ] = False,
+ ) -> Annotated[
+ List[TargetInfo],
+ Field(
+ description="List of TargetInfo objects representing UI controls on the current screen"
+ ),
+ ]:
+ """
+ Get UI controls information as TargetInfo list.
+ Returns a list of TargetInfo objects for all meaningful controls on the screen.
+ Each control has an id that can be used with action tools.
+ Uses cache to improve performance (cache duration: 5 seconds).
+ """
+ try:
+ # Check cache first
+ if not force_refresh:
+ cached_controls = mobile_state.get_current_controls()
+ if cached_controls is not None:
+ return cached_controls
+
+ # Get UI tree XML
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "uiautomator",
+ "dump",
+ "/sdcard/window_dump.xml",
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+            await proc.communicate()
+
+            if proc.returncode != 0:
+                # Do not fall through to reading a stale dump file
+                return []
+
+ # Read XML content
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "cat",
+ "/sdcard/window_dump.xml",
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
+ if proc.returncode != 0:
+ return []
+
+ xml_content = stdout.decode("utf-8")
+
+ # Cache the UI tree XML
+ mobile_state.set_ui_tree(xml_content)
+
+ # Parse XML and extract controls
+ root = ET.fromstring(xml_content)
+ controls_target_info = []
+ control_id = 1
+
+ def parse_node(node):
+ nonlocal control_id
+
+ # Extract attributes
+ attribs = node.attrib
+
+ # Parse bounds [x,y][x2,y2]
+ bounds_str = attribs.get("bounds", "")
+ rect = None
+ if bounds_str:
+ try:
+ # Parse bounds like "[0,0][1080,100]"
+ import re
+
+ coords = re.findall(r"\[(\d+),(\d+)\]", bounds_str)
+ if len(coords) == 2:
+ x1, y1 = int(coords[0][0]), int(coords[0][1])
+ x2, y2 = int(coords[1][0]), int(coords[1][1])
+
+ # Validate coordinates: x2 must be >= x1 and y2 must be >= y1
+ # Some controls have invalid bounds, skip them
+ if x2 >= x1 and y2 >= y1 and x2 > 0 and y2 > 0:
+ # Use bbox format [left, top, right, bottom] to match ui_mcp_server.py
+ rect = [x1, y1, x2, y2]
+ except Exception:
+ pass
+
+ # Get control name (text or content-desc)
+ control_name = attribs.get("text") or attribs.get("content-desc") or ""
+
+ # Get control type (short class name)
+ control_type = attribs.get("class", "").split(".")[-1]
+
+ # Only add meaningful controls
+ is_meaningful = (
+ attribs.get("clickable") == "true"
+ or attribs.get("long-clickable") == "true"
+ or attribs.get("checkable") == "true"
+ or attribs.get("scrollable") == "true"
+ or control_name
+ or "Edit" in control_type
+ or "Button" in control_type
+ )
+
+ if is_meaningful and rect:
+ # Create TargetInfo object
+ target_info = TargetInfo(
+ kind=TargetKind.CONTROL,
+ id=str(control_id),
+ name=control_name or control_type,
+ type=control_type,
+ rect=rect,
+ )
+ controls_target_info.append(target_info)
+ control_id += 1
+
+ # Recursively parse children
+ for child in node:
+ parse_node(child)
+
+ # Start parsing from root
+ parse_node(root)
+
+ # Cache the controls
+ mobile_state.set_current_controls(controls_target_info)
+
+ return controls_target_info
+
+ except Exception as e:
+ import traceback
+
+ print(f"Error in get_app_window_controls_target_info: {str(e)}")
+ print(traceback.format_exc())
+ return []
+
+ mcp.run(transport="streamable-http")
+
+
+def create_mobile_action_server(
+ host: str = "", port: int = 8021, adb_path: Optional[str] = None
+) -> None:
+ """
+ Create an MCP server for Mobile action operations.
+ Handles: tap, swipe, type_text, launch_app, press_key, click_control, wait, invalidate_cache.
+ """
+
+ if adb_path is None:
+ adb_path = "adb"
+
+ # Get shared state manager (singleton)
+ mobile_state = MobileServerState()
+
+ mcp = FastMCP(
+ "Mobile Action MCP Server",
+ instructions="MCP server for controlling Android devices via ADB (tap, swipe, type, launch apps, etc.).",
+ stateless_http=False,
+ json_response=True,
+ host=host,
+ port=port,
+ )
+
+ # ========================================
+ # Action Tool 1: Tap/Click
+ # ========================================
+ @mcp.tool()
+ async def tap(
+ x: Annotated[int, Field(description="X coordinate to tap (pixels from left)")],
+ y: Annotated[int, Field(description="Y coordinate to tap (pixels from top)")],
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'action' (str), 'output' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Tap/click at specified coordinates on the screen.
+ Coordinates are in pixels, origin (0,0) is top-left corner.
+ Automatically invalidates controls cache after interaction.
+ """
+ try:
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "tap",
+ str(x),
+ str(y),
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
+ # Invalidate controls cache after interaction
+ if proc.returncode == 0:
+ mobile_state.invalidate_controls()
+
+ return {
+ "success": proc.returncode == 0,
+ "action": f"tap({x}, {y})",
+ "output": stdout.decode("utf-8") if stdout else "",
+ "error": stderr.decode("utf-8") if stderr else "",
+ }
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 2: Swipe
+ # ========================================
+ @mcp.tool()
+ async def swipe(
+ start_x: Annotated[int, Field(description="Starting X coordinate")],
+ start_y: Annotated[int, Field(description="Starting Y coordinate")],
+ end_x: Annotated[int, Field(description="Ending X coordinate")],
+ end_y: Annotated[int, Field(description="Ending Y coordinate")],
+ duration: Annotated[
+ int, Field(description="Duration of swipe in milliseconds (default 300)")
+ ] = 300,
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'action' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Perform swipe gesture from start to end coordinates.
+ Useful for scrolling, dragging, and gesture navigation.
+ Automatically invalidates controls cache after interaction.
+ """
+ try:
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "swipe",
+ str(start_x),
+ str(start_y),
+ str(end_x),
+ str(end_y),
+ str(duration),
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
+ # Invalidate controls cache after swipe (UI likely changed)
+ if proc.returncode == 0:
+ mobile_state.invalidate_controls()
+
+ return {
+ "success": proc.returncode == 0,
+ "action": f"swipe({start_x},{start_y})->({end_x},{end_y}) in {duration}ms",
+ "output": stdout.decode("utf-8") if stdout else "",
+ "error": stderr.decode("utf-8") if stderr else "",
+ }
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 3: Type Text
+ # ========================================
+ @mcp.tool()
+ async def type_text(
+ text: Annotated[
+ str,
+ Field(
+ description="Text to input. Spaces and special characters are automatically escaped."
+ ),
+ ],
+ control_id: Annotated[
+ str,
+ Field(
+ description="REQUIRED: The precise annotated ID of the control to type into (from get_app_window_controls_target_info). The control will be clicked before typing to ensure focus."
+ ),
+ ],
+ control_name: Annotated[
+ str,
+ Field(
+ description="REQUIRED: The precise name of the control to type into, must match the selected control_id."
+ ),
+ ],
+ clear_current_text: Annotated[
+ bool,
+ Field(
+                description="Whether to clear existing text before typing. If True, sends up to 50 KEYCODE_DEL (backspace) key events first."
+ ),
+ ] = False,
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'action' (str), 'message' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Type text into a specific input field control.
+ Always clicks the target control first to ensure it's focused before typing.
+
+ Usage:
+ type_text(text="hello world", control_id="5", control_name="Search")
+
+ Steps:
+ 1. Call get_app_window_controls_target_info to get the list of controls
+ 2. Identify the input field control (EditText, etc.)
+ 3. Call type_text with the control's id and name
+ 4. The tool will click the control, then type the text
+
+        Note: Spaces are encoded as %s for `adb shell input text`, so a literal "%s" in the text is typed as a space.
+ """
+ try:
+ messages = []
+
+ # Verify control exists in cache
+ target_control = mobile_state.get_control_by_id(control_id)
+
+ if not target_control:
+ return {
+ "success": False,
+ "error": f"Control with ID '{control_id}' not found. Please call get_app_window_controls_target_info first.",
+ }
+
+ # Verify name matches (optional warning)
+ if target_control.name != control_name:
+ messages.append(
+ f"Warning: Control ID {control_id} has name '{target_control.name}', but provided name is '{control_name}'. Using ID {control_id}."
+ )
+
+ # Click the control to focus it
+ rect = target_control.rect
+ if not rect:
+ return {
+ "success": False,
+ "error": f"Control '{control_id}' has no rectangle information",
+ }
+
+ # rect format is [left, top, right, bottom] (bbox format)
+ center_x = (rect[0] + rect[2]) // 2 # (left + right) / 2
+ center_y = (rect[1] + rect[3]) // 2 # (top + bottom) / 2
+
+ # Execute tap to focus
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "tap",
+ str(center_x),
+ str(center_y),
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ await proc.communicate()
+
+ if proc.returncode != 0:
+ return {
+ "success": False,
+ "error": f"Failed to click control at ({center_x}, {center_y})",
+ }
+
+ messages.append(
+ f"Clicked control '{target_control.name or target_control.type}' at ({center_x}, {center_y})"
+ )
+
+ # Small delay to let the input field focus
+ await asyncio.sleep(0.2)
+
+ # Clear existing text if requested
+ if clear_current_text:
+ # Delete characters
+ for _ in range(50): # Clear up to 50 characters
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "keyevent",
+ "KEYCODE_DEL",
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ await proc.communicate()
+
+ messages.append("Cleared existing text")
+
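+            # Encode spaces as %s for `input text`, then quote the argument:
+            # adb joins its arguments into a single command line for the
+            # device shell, so metacharacters must not reach it unquoted
+            import shlex
+
+            escaped_text = shlex.quote(text.replace(" ", "%s"))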
+
+ # Type the text
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "text",
+ escaped_text,
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
+ if proc.returncode != 0:
+ return {
+ "success": False,
+ "error": f"Failed to type text: {stderr.decode('utf-8')}",
+ }
+
+ messages.append(f"Typed text: '{text}'")
+
+ # Invalidate controls cache after typing (UI state may have changed)
+ mobile_state.invalidate_controls()
+
+ action_desc = f"type_text(text='{text}', control_id='{control_id}', control_name='{control_name}')"
+
+ return {
+ "success": True,
+ "action": action_desc,
+ "message": " | ".join(messages),
+ "control_info": {
+ "id": target_control.id,
+ "name": target_control.name,
+ "type": target_control.type,
+ },
+ }
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 4: Launch App
+ # ========================================
+ @mcp.tool()
+ async def launch_app(
+ package_name: Annotated[
+ str,
+ Field(
+ description="Package name of the app to launch (e.g., 'com.android.settings')"
+ ),
+ ],
+ id: Annotated[
+ Optional[str],
+ Field(
+ description="Optional: The precise annotated ID of the app from get_mobile_app_target_info."
+ ),
+ ] = None,
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'message' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Launch an application by package name or app ID.
+
+ Usage modes:
+ 1. Launch by package name: launch_app(package_name="com.android.settings")
+ 2. Launch from cached app list: launch_app(package_name="com.android.settings", id="5")
+
+ When using id, the function will verify the package name matches the cached app info.
+ """
+ try:
+ actual_package_name = package_name
+ warning = None
+ app_info = None
+
+ # If id is provided, get app from cache
+ if id:
+ # Try to get from cache
+ cached_apps = mobile_state.get_installed_apps()
+
+ if cached_apps is None:
+                    return {
+                        "success": False,
+                        "error": "App cache is empty. Please call get_mobile_app_target_info first.",
+                    }
+
+ # Find the app by id
+ target_app = None
+ for app in cached_apps:
+ if app.id == id:
+ target_app = app
+ break
+
+ if not target_app:
+ return {
+ "success": False,
+ "error": f"App with ID '{id}' not found in cached app list.",
+ }
+
+ # The app's 'type' field contains the package name
+ actual_package_name = target_app.type
+ app_info = {
+ "id": target_app.id,
+ "name": target_app.name,
+ "package": target_app.type,
+ }
+
+ # Verify package_name matches (optional warning)
+ if package_name != actual_package_name:
+ warning = f"Warning: Provided package_name '{package_name}' differs from cached package '{actual_package_name}'. Using cached package from ID {id}."
+
+ # If no id and input doesn't look like a package name, search by app name
+ elif "." not in package_name:
+ found_package = await _search_app_by_name(
+ package_name, adb_path, include_system_apps=True
+ )
+
+ if not found_package:
+ return {
+ "success": False,
+ "error": f"No app found with name containing '{package_name}'. Try using full package name.",
+ }
+
+ actual_package_name = found_package
+ warning = (
+ f"Resolved '{package_name}' to package '{actual_package_name}'"
+ )
+
+ # Launch the app using package name
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "monkey",
+ "-p",
+ actual_package_name,
+ "-c",
+ "android.intent.category.LAUNCHER",
+ "1",
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
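+            output = stdout.decode("utf-8")
+            success = "Events injected:" in output or proc.returncode == 0
+
+            # Launching an app replaces the screen, so cached controls are stale
+            if success:
+                mobile_state.invalidate_controls()
+
+            result = {
+                "success": success,
+                "message": (
+                    f"Launched {actual_package_name}"
+                    if success
+                    else f"Failed to launch {actual_package_name}"
+                ),
+                "package_name": actual_package_name,
+                "output": output,
+                "error": stderr.decode("utf-8") if stderr else "",
+            }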
+
+ if warning:
+ result["warning"] = warning
+
+ if id and app_info:
+ result["app_info"] = app_info
+
+ return result
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 5: Press Key
+ # ========================================
+ @mcp.tool()
+ async def press_key(
+ key_code: Annotated[
+ str,
+ Field(
+ description="Key code to press. Common codes: KEYCODE_HOME, KEYCODE_BACK, KEYCODE_ENTER, KEYCODE_MENU"
+ ),
+ ],
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'action' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Press a hardware or software key.
+ Useful for navigation (back, home) and system actions.
+ """
+ try:
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "keyevent",
+ key_code,
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
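+            stdout, stderr = await proc.communicate()
+
+            # Keys such as BACK or HOME change the screen; drop cached controls
+            if proc.returncode == 0:
+                mobile_state.invalidate_controls()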
+
+ return {
+ "success": proc.returncode == 0,
+ "action": f"press_key({key_code})",
+ "output": stdout.decode("utf-8") if stdout else "",
+ "error": stderr.decode("utf-8") if stderr else "",
+ }
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 6: Click Control by ID
+ # ========================================
+ @mcp.tool()
+ async def click_control(
+ control_id: Annotated[
+ str,
+ Field(
+ description="The precise annotated ID of the control to click (from get_app_window_controls_target_info)"
+ ),
+ ],
+ control_name: Annotated[
+ str,
+ Field(
+ description="The precise name of the control to click, must match the selected control_id"
+ ),
+ ],
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'action' (str), 'message' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Click a UI control by its id and name.
+ First call get_app_window_controls_target_info to get the list of controls,
+ then use the id and name to click the desired control.
+ """
+ try:
+ # Try to get control from cache
+ target_control = mobile_state.get_control_by_id(control_id)
+
+ if not target_control:
+ return {
+ "success": False,
+ "error": f"Control with ID '{control_id}' not found. Please call get_app_window_controls_target_info first.",
+ }
+
+ # Verify name matches
+ name_verified = target_control.name == control_name
+ warning = None
+ if not name_verified:
+ warning = f"Warning: Control ID {control_id} has name '{target_control.name}', but provided name is '{control_name}'. Clicking control {control_id}."
+
+ # Get control center position
+ rect = target_control.rect
+ if not rect:
+ return {
+ "success": False,
+ "error": f"Control '{control_id}' has no rectangle information",
+ }
+
+ # rect format is [left, top, right, bottom] (bbox format)
+ center_x = (rect[0] + rect[2]) // 2 # (left + right) / 2
+ center_y = (rect[1] + rect[3]) // 2 # (top + bottom) / 2
+
+ # Execute tap
+ proc = await asyncio.create_subprocess_exec(
+ adb_path,
+ "shell",
+ "input",
+ "tap",
+ str(center_x),
+ str(center_y),
+ stdout=asyncio.subprocess.PIPE,
+ stderr=asyncio.subprocess.PIPE,
+ )
+ stdout, stderr = await proc.communicate()
+
+ control_name_actual = target_control.name or target_control.type
+
+ # Invalidate controls cache after interaction
+ mobile_state.invalidate_controls()
+
+ result = {
+ "success": proc.returncode == 0,
+ "action": f"click_control(id={control_id}, name={control_name})",
+ "message": f"Clicked control '{control_name_actual}' at ({center_x}, {center_y})",
+ "control_info": {
+ "id": target_control.id,
+ "name": target_control.name,
+ "type": target_control.type,
+ "rect": target_control.rect,
+ },
+ }
+
+ if warning:
+ result["warning"] = warning
+
+ return result
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 7: Wait
+ # ========================================
+ @mcp.tool()
+ async def wait(
+ seconds: Annotated[
+ float,
+ Field(
+ description="Number of seconds to wait (can be decimal, e.g., 0.5 for 500ms)"
+ ),
+ ] = 1.0,
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'action' (str), 'message' (str)"
+ ),
+ ]:
+ """
+ Wait for a specified number of seconds.
+ Useful for waiting for UI transitions, animations, or app loading.
+ Examples:
+ - wait(seconds=1.0) - Wait 1 second
+ - wait(seconds=0.5) - Wait 500 milliseconds
+ - wait(seconds=2.5) - Wait 2.5 seconds
+ """
+ try:
+ if seconds < 0:
+ return {
+ "success": False,
+ "error": "Wait time must be non-negative",
+ }
+
+ if seconds > 60:
+ return {
+ "success": False,
+ "error": "Wait time cannot exceed 60 seconds",
+ }
+
+ await asyncio.sleep(seconds)
+
+ return {
+ "success": True,
+ "action": f"wait({seconds}s)",
+ "message": f"Waited for {seconds} seconds",
+ }
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ # ========================================
+ # Action Tool 8: Invalidate Cache
+ # ========================================
+ @mcp.tool()
+ async def invalidate_cache(
+ cache_type: Annotated[
+ str,
+ Field(
+ description="Type of cache to invalidate: 'controls', 'apps', 'ui_tree', 'device_info', or 'all'"
+ ),
+ ] = "all",
+ ) -> Annotated[
+ Dict[str, Any],
+ Field(
+ description="Dictionary with keys: 'success' (bool), 'message' (str), or 'error' (str)"
+ ),
+ ]:
+ """
+ Manually invalidate cached data to force refresh on next query.
+ Useful when you know the state has changed significantly.
+ """
+ try:
+ if cache_type == "controls":
+ mobile_state.invalidate_controls()
+ message = "Controls cache invalidated"
+ elif cache_type == "apps":
+ mobile_state.installed_apps = None
+ mobile_state.installed_apps_timestamp = None
+ message = "Apps cache invalidated"
+ elif cache_type == "ui_tree":
+ mobile_state.invalidate_ui_tree()
+ message = "UI tree cache invalidated"
+ elif cache_type == "device_info":
+ mobile_state.device_info = None
+ mobile_state.device_info_timestamp = None
+ message = "Device info cache invalidated"
+ elif cache_type == "all":
+ mobile_state.invalidate_all()
+ message = "All caches invalidated"
+ else:
+ return {
+ "success": False,
+ "error": f"Invalid cache_type: {cache_type}. Must be 'controls', 'apps', 'ui_tree', 'device_info', or 'all'",
+ }
+
+ return {"success": True, "message": message}
+
+ except Exception as e:
+ return {"success": False, "error": str(e)}
+
+ mcp.run(transport="streamable-http")
+
+
+def _detect_adb_path() -> str:
+ """Auto-detect ADB path or return 'adb' to use from PATH."""
+ # Try common ADB locations
+ common_paths = [
+ r"C:\Users\{}\AppData\Local\Android\Sdk\platform-tools\adb.exe".format(
+ os.environ.get("USERNAME", "")
+ ),
+ "/usr/bin/adb",
+ "/usr/local/bin/adb",
+ ]
+ for path in common_paths:
+ if os.path.exists(path):
+ return path
+
+ # Try to find in PATH
+ try:
+ result = subprocess.run(
+ ["where" if os.name == "nt" else "which", "adb"],
+ capture_output=True,
+ text=True,
+ timeout=5,
+ )
+ if result.returncode == 0:
+ return result.stdout.strip().split("\n")[0]
+ except Exception:  # lookup tool missing or timed out; fall back to PATH
+ pass
+
+ return "adb" # Fallback to PATH
+
+
+def _run_both_servers_sync(host: str, data_port: int, action_port: int, adb_path: str):
+ """
+ Run both data collection and action servers in the same process using threading.
+ This allows them to share the same MobileServerState singleton.
+
+ Note: MobileServerState uses the singleton pattern, so the same instance
+ is shared across threads in the same process. This is critical for `click_control`
+ to access controls cached by `get_app_window_controls_target_info`.
+ """
+ import threading
+ import time
+
+ print("\n✅ Starting both servers in the same process (shared MobileServerState)")
+ print(f" - Data Collection Server: {host}:{data_port}")
+ print(f" - Action Server: {host}:{action_port}")
+ print("\n" + "=" * 70)
+ print("Both servers share MobileServerState cache. Press Ctrl+C to stop.")
+ print("=" * 70 + "\n")
+
+ # Create threads for both servers
+ data_thread = threading.Thread(
+ target=create_mobile_data_collection_server,
+ kwargs={"host": host, "port": data_port, "adb_path": adb_path},
+ name="DataCollectionServer",
+ daemon=False,
+ )
+
+ action_thread = threading.Thread(
+ target=create_mobile_action_server,
+ kwargs={"host": host, "port": action_port, "adb_path": adb_path},
+ name="ActionServer",
+ daemon=False,
+ )
+
+ # Start both server threads
+ data_thread.start()
+ print("✅ Data Collection Server thread started")
+
+ time.sleep(0.5) # Small delay between starts
+
+ action_thread.start()
+ print("✅ Action Server thread started")
+
+ print("\n" + "=" * 70)
+ print("Both servers are running. Press Ctrl+C to stop.")
+ print("=" * 70 + "\n")
+
+ try:
+ # Wait for both threads
+ data_thread.join()
+ action_thread.join()
+ except KeyboardInterrupt:
+ print("\n\nShutting down servers...")
+ # FastMCP servers should handle shutdown gracefully
+ data_thread.join(timeout=5)
+ action_thread.join(timeout=5)
+ print("✅ Servers stopped")
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Mobile MCP Servers for Android")
+ parser.add_argument(
+ "--data-port", type=int, default=8020, help="Port for Data Collection Server"
+ )
+ parser.add_argument(
+ "--action-port", type=int, default=8021, help="Port for Action Server"
+ )
+ parser.add_argument("--host", default="localhost", help="Host to bind servers to")
+ parser.add_argument(
+ "--adb-path",
+ default=None,
+ help="Path to ADB executable (auto-detected if not specified)",
+ )
+ parser.add_argument(
+ "--server",
+ choices=["data", "action", "both"],
+ default="both",
+ help="Which server(s) to start: 'data', 'action', or 'both'",
+ )
+ args = parser.parse_args()
+
+ # Auto-detect ADB if not provided
+ adb = args.adb_path or _detect_adb_path()
+
+ print("=" * 70)
+ print("UFO Mobile MCP Servers (Android)")
+ print("Android device control via ADB and Model Context Protocol")
+ print("=" * 70)
+ print(f"\nUsing ADB: {adb}")
+ print("\nChecking ADB connection...")
+
+ # Test ADB connection
+ try:
+ result = subprocess.run(
+ [adb, "devices"], capture_output=True, text=True, timeout=5
+ )
+ print(result.stdout)
+
+ if "List of devices" in result.stdout:
+ devices = [line for line in result.stdout.split("\n") if "\tdevice" in line]
+ if devices:
+ print(f"✅ Found {len(devices)} connected device(s)")
+ else:
+ print(
+ "⚠️ No devices connected. Please connect an Android device or emulator."
+ )
+ else:
+ print("⚠️ ADB not working properly. Please check ADB installation.")
+ except Exception as e:
+ print(f"❌ Error checking ADB: {e}")
+ print(" Servers will start but may not function properly.")
+
+ print("=" * 70)
+
+ if args.server == "both":
+ # Run both servers in one process, one thread each, so that they share
+ # the same MobileServerState singleton (each server runs its own event loop)
+
+ print(f"\n🚀 Starting both servers on {args.host} (shared state)")
+ print(f" - Data Collection Server: port {args.data_port}")
+ print(f" - Action Server: port {args.action_port}")
+ print("\nNote: Both servers share the same MobileServerState for caching")
+
+ # Run both servers concurrently in the same process with shared state
+ _run_both_servers_sync(args.host, args.data_port, args.action_port, adb)
+
+ elif args.server == "data":
+ print(f"\n🚀 Starting Data Collection Server on {args.host}:{args.data_port}")
+ create_mobile_data_collection_server(
+ host=args.host, port=args.data_port, adb_path=adb
+ )
+
+ elif args.server == "action":
+ print(f"\n🚀 Starting Action Server on {args.host}:{args.action_port}")
+ create_mobile_action_server(host=args.host, port=args.action_port, adb_path=adb)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/ufo/client/ufo_client.py b/ufo/client/ufo_client.py
index dd1f08dc5..21da9f83c 100644
--- a/ufo/client/ufo_client.py
+++ b/ufo/client/ufo_client.py
@@ -29,7 +29,7 @@ def __init__(
:param mcp_server_manager: Instance of MCPServerManager to manage MCP servers
:param computer_manager: Instance of ComputerManager to manage Computer instances
:param client_id: Optional client ID for the UFO client
- :param platform: Platform type ('windows' or 'linux'). Auto-detected if not specified.
+ :param platform: Platform type ('windows', 'linux', or 'mobile'). Auto-detected if not specified.
"""
self.mcp_server_manager = mcp_server_manager
self.computer_manager = computer_manager
diff --git a/ufo/client/websocket.py b/ufo/client/websocket.py
index ae0ed1900..19a7bc9d0 100644
--- a/ufo/client/websocket.py
+++ b/ufo/client/websocket.py
@@ -76,9 +76,13 @@ async def connect_and_listen(self):
)
break
- self.logger.info(
- f"[WS] Connecting to {self.ws_url} (attempt {self.retry_count + 1}/{self.max_retries})"
- )
+ # Log a short message on the first attempt; include the attempt count on retries
+ if self.retry_count == 0:
+ self.logger.info(f"[WS] Connecting to {self.ws_url}...")
+ else:
+ self.logger.info(
+ f"[WS] Reconnecting... (attempt {self.retry_count + 1}/{self.max_retries})"
+ )
# Reset connection state before attempting to connect
self.connected_event.clear()
@@ -123,15 +127,25 @@ async def connect_and_listen(self):
await self._maybe_retry()
# Loop continues automatically
- except Exception as e:
- self.logger.error(
- f"[WS] Unexpected error: {e}. Will retry.", exc_info=True
+ except ConnectionRefusedError as e:
+ # Common error - don't show full traceback
+ self.logger.warning(
+ f"[WS] Connection refused: Server not available at {self.ws_url}"
)
self.connected_event.clear()
self.retry_count += 1
await self._maybe_retry()
# Loop continues automatically
+ except Exception as e:
+ # Show error type and message without full traceback for connection errors
+ error_type = type(e).__name__
+ self.logger.warning(f"[WS] Connection error ({error_type}): {e}")
+ self.connected_event.clear()
+ self.retry_count += 1
+ await self._maybe_retry()
+ # Loop continues automatically
+
async def register_client(self):
"""
Send client_id and device system information to server upon connection.
@@ -396,13 +410,12 @@ async def _maybe_retry(self):
if self.retry_count < self.max_retries:
wait_time = 2**self.retry_count
self.logger.info(
- f"[WS] 🔄 Reconnecting in {wait_time}s... "
- f"(attempt {self.retry_count}/{self.max_retries})"
+ f"[WS] Retrying in {wait_time}s... ({self.retry_count}/{self.max_retries})"
)
await asyncio.sleep(wait_time)
else:
- self.logger.warning(
- f"[WS] ⚠️ Retry limit reached ({self.retry_count}/{self.max_retries})"
+ self.logger.error(
+ f"[WS] ❌ Max retries reached ({self.max_retries}). Please check if server is running at {self.ws_url}"
)
def is_connected(self) -> bool:
diff --git a/ufo/module/session_pool.py b/ufo/module/session_pool.py
index e9200b970..21f7b4d5c 100644
--- a/ufo/module/session_pool.py
+++ b/ufo/module/session_pool.py
@@ -21,6 +21,7 @@
if TYPE_CHECKING:
from aip.protocol.task_execution import TaskExecutionProtocol
from ufo.module.sessions.linux_session import LinuxSession, LinuxServiceSession
+from ufo.module.sessions.mobile_session import MobileSession, MobileServiceSession
ufo_config = get_ufo_config()
@@ -91,7 +92,7 @@ def create_session(
:param mode: The mode of the task.
:param plan: The plan file or folder path (for follower/batch modes).
:param request: The user request.
- :param platform_override: Override platform detection ('windows' or 'linux').
+ :param platform_override: Override platform detection ('windows', 'linux', or 'mobile').
:param kwargs: Additional platform-specific parameters:
- application_name: Target application (for Linux sessions)
- websocket: WebSocket connection (for service sessions)
@@ -103,6 +104,8 @@ def create_session(
return self._create_windows_session(task, mode, plan, request, **kwargs)
elif current_platform == "linux":
return self._create_linux_session(task, mode, plan, request, **kwargs)
+ elif current_platform == "mobile":
+ return self._create_mobile_session(task, mode, plan, request, **kwargs)
else:
raise NotImplementedError(
f"Platform {current_platform} is not supported yet."
@@ -220,6 +223,50 @@ def _create_linux_session(
f"Supported modes: normal, normal_operator, service"
)
+ def _create_mobile_session(
+ self, task: str, mode: str, plan: str, request: str = "", **kwargs
+ ) -> List[BaseSession]:
+ """
+ Create Mobile Android-specific sessions.
+ :param task: The name of current task.
+ :param mode: The mode of the task.
+ :param plan: The plan file or folder path (not used for normal/service modes).
+ :param request: The user request.
+ :param kwargs: Additional parameters:
+ - task_protocol: AIP TaskExecutionProtocol instance (for service mode)
+ :return: The created Mobile session list.
+ """
+ if mode in ["normal", "normal_operator"]:
+ self.logger.info(f"Creating a normal Mobile session for mode: {mode}")
+ return [
+ MobileSession(
+ task=task,
+ should_evaluate=ufo_config.system.eva_session,
+ id=0,
+ request=request,
+ mode=mode,
+ )
+ ]
+ elif mode == "service":
+ self.logger.info(f"Creating a Mobile service session for mode: {mode}")
+ return [
+ MobileServiceSession(
+ task=task,
+ should_evaluate=ufo_config.system.eva_session,
+ id=0,
+ request=request,
+ task_protocol=kwargs.get("task_protocol"),
+ )
+ ]
+ # TODO: Add Mobile follower and batch modes if needed
+ # elif mode == "follower":
+ # return self._create_mobile_follower_session(...)
+ else:
+ raise ValueError(
+ f"The {mode} mode is not supported on Mobile yet. "
+ f"Supported modes: normal, normal_operator, service"
+ )
+
def create_service_session(
self,
task: str,
@@ -236,7 +283,7 @@ def create_service_session(
:param id: Session ID.
:param request: User request.
:param task_protocol: AIP TaskExecutionProtocol instance.
- :param platform_override: Override platform detection ('windows' or 'linux').
+ :param platform_override: Override platform detection ('windows', 'linux', or 'mobile').
:return: Platform-specific service session.
"""
current_platform = platform_override or platform.system().lower()
@@ -259,6 +306,15 @@ def create_service_session(
request=request,
task_protocol=task_protocol,
)
+ elif current_platform == "mobile":
+ self.logger.info("Creating Mobile service session")
+ return MobileServiceSession(
+ task=task,
+ should_evaluate=should_evaluate,
+ id=id,
+ request=request,
+ task_protocol=task_protocol,
+ )
else:
raise NotImplementedError(
f"Service session not supported on {current_platform}"
diff --git a/ufo/module/sessions/mobile_session.py b/ufo/module/sessions/mobile_session.py
new file mode 100644
index 000000000..e64918510
--- /dev/null
+++ b/ufo/module/sessions/mobile_session.py
@@ -0,0 +1,156 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+"""
+Mobile Android-specific session implementations.
+ This module provides session types for the Android mobile platform that do not require a HostAgent.
+"""
+
+import logging
+from typing import Optional, TYPE_CHECKING
+
+from ufo.client.mcp.mcp_server_manager import MCPServerManager
+from config.config_loader import get_ufo_config
+from ufo.module import interactor
+from ufo.module.basic import BaseRound
+from ufo.module.context import ContextNames
+from ufo.module.dispatcher import LocalCommandDispatcher, WebSocketCommandDispatcher
+from ufo.module.sessions.platform_session import MobileBaseSession
+
+if TYPE_CHECKING:
+ from aip.protocol.task_execution import TaskExecutionProtocol
+
+ufo_config = get_ufo_config()
+
+
+class MobileSession(MobileBaseSession):
+ """
+ A session for UFO on the Android mobile platform.
+ Unlike Windows sessions, Mobile sessions don't use a HostAgent.
+ They work directly with MobileAgent for device control.
+ """
+
+ def __init__(
+ self,
+ task: str,
+ should_evaluate: bool,
+ id: int,
+ request: str = "",
+ mode: str = "normal",
+ ) -> None:
+ """
+ Initialize a Mobile session.
+ :param task: The name of current task.
+ :param should_evaluate: Whether to evaluate the session.
+ :param id: The id of the session.
+ :param request: The user request of the session.
+ :param mode: The mode of the task.
+ """
+ self._mode = mode
+ self._init_request = request
+ super().__init__(task, should_evaluate, id)
+ self.logger = logging.getLogger(__name__)
+
+ def _init_context(self) -> None:
+ """
+ Initialize the context for Mobile session.
+ """
+ super()._init_context()
+
+ self.context.set(ContextNames.MODE, self._mode)
+
+ # Initialize Mobile-specific command dispatcher
+ mcp_server_manager = MCPServerManager()
+ command_dispatcher = LocalCommandDispatcher(self, mcp_server_manager)
+ self.context.attach_command_dispatcher(command_dispatcher)
+
+ def create_new_round(self) -> Optional[BaseRound]:
+ """
+ Create a new round for Mobile session.
+ Since there's no host agent, directly create app-level rounds.
+ """
+ request = self.next_request()
+
+ if self.is_finished():
+ return None
+
+ round = BaseRound(
+ request=request,
+ agent=self._agent,
+ context=self.context,
+ should_evaluate=ufo_config.system.eva_round,
+ id=self.total_rounds,
+ )
+
+ self.add_round(round.id, round)
+ return round
+
+ def next_request(self) -> str:
+ """
+ Get the request for the mobile agent.
+ :return: The request for the mobile agent.
+ """
+ if self.total_rounds == 0:
+ if self._init_request:
+ return self._init_request
+ else:
+ return interactor.first_request()
+ else:
+ request, iscomplete = interactor.new_request()
+ if iscomplete:
+ self._finish = True
+ return request
+
+ def request_to_evaluate(self) -> str:
+ """
+ Get the request to evaluate.
+ :return: The request(s) to evaluate.
+ """
+ # For Mobile session, collect requests from all rounds
+ if self.current_round and hasattr(self.current_round.agent, "blackboard"):
+ request_memory = self.current_round.agent.blackboard.requests
+ return request_memory.to_json()
+ return self._init_request
+
+
+class MobileServiceSession(MobileSession):
+ """
+ A session for the UFO service on the Android mobile platform.
+ Similar to Windows ServiceSession but without HostAgent - works directly with MobileAgent.
+ Communicates via AIP protocols for remote control and monitoring.
+ This enables server-client architecture for mobile device control.
+ """
+
+ def __init__(
+ self,
+ task: str,
+ should_evaluate: bool,
+ id: Optional[str] = None,
+ request: str = "",
+ task_protocol: Optional["TaskExecutionProtocol"] = None,
+ ):
+ """
+ Initialize the Mobile service session.
+ :param task: The task name for the session.
+ :param should_evaluate: Whether to evaluate the session.
+ :param id: The ID of the session.
+ :param request: The user request for the session.
+ :param task_protocol: AIP TaskExecutionProtocol instance for remote communication.
+ """
+ self.task_protocol = task_protocol
+ super().__init__(
+ task=task, should_evaluate=should_evaluate, id=id, request=request
+ )
+
+ def _init_context(self) -> None:
+ """
+ Initialize the context for Mobile service session.
+ Uses WebSocket-based dispatcher for remote communication.
+ """
+ super()._init_context()
+
+ # Use WebSocket dispatcher for service mode (server-client communication)
+ command_dispatcher = WebSocketCommandDispatcher(
+ self, protocol=self.task_protocol
+ )
+ self.context.attach_command_dispatcher(command_dispatcher)
diff --git a/ufo/module/sessions/platform_session.py b/ufo/module/sessions/platform_session.py
index 6bbe08940..e57d9650c 100644
--- a/ufo/module/sessions/platform_session.py
+++ b/ufo/module/sessions/platform_session.py
@@ -9,7 +9,7 @@
from typing import Optional
-from ufo.agents.agent.customized_agent import LinuxAgent
+from ufo.agents.agent.customized_agent import LinuxAgent, MobileAgent
from ufo.agents.agent.host_agent import AgentFactory, HostAgent
from config.config_loader import get_ufo_config
from ufo.module.basic import BaseRound, BaseSession
@@ -93,3 +93,54 @@ def reset(self) -> None:
This includes resetting any Linux-specific agents and session state.
"""
self._agent.set_state(self._agent.default_state)
+
+
+class MobileBaseSession(BaseSession):
+ """
+ Base class for all Android mobile-based sessions.
+ Mobile sessions don't use a HostAgent, working directly with MobileAgent.
+ This provides a simpler, single-tier architecture for mobile device control.
+ """
+
+ def _init_agents(self) -> None:
+ """
+ Initialize Mobile-specific agents.
+ Mobile sessions don't require a HostAgent - they work directly with MobileAgent.
+ This method intentionally leaves _host_agent as None.
+ """
+ # No host agent for Mobile
+ self._host_agent = None
+ # Mobile-specific agent initialization
+ self._agent: MobileAgent = AgentFactory.create_agent(
+ "MobileAgent",
+ "MobileAgent",
+ ufo_config.system.third_party_agent_config["MobileAgent"][
+ "APPAGENT_PROMPT"
+ ],
+ ufo_config.system.third_party_agent_config["MobileAgent"][
+ "APPAGENT_EXAMPLE_PROMPT"
+ ],
+ )
+
+ def evaluation(self) -> None:
+ """
+ Evaluation logic for Mobile sessions.
+ """
+ # Implement evaluation logic specific to Mobile sessions
+ self.logger.warning("Evaluation not yet implemented for Mobile sessions.")
+ pass
+
+ def save_log_to_markdown(self) -> None:
+ """
+ Save the log of the session to markdown file.
+ """
+ # Implement markdown logging specific to Mobile sessions
+ self.logger.warning("Markdown logging not yet implemented for Mobile sessions.")
+ pass
+
+ def reset(self) -> None:
+ """
+ Reset the session state for a new session.
+ This includes resetting any Mobile-specific agents and session state.
+ """
+ self._agent.set_state(self._agent.default_state)
diff --git a/ufo/prompter/customized/mobile_agent_prompter.py b/ufo/prompter/customized/mobile_agent_prompter.py
new file mode 100644
index 000000000..4a7da83fa
--- /dev/null
+++ b/ufo/prompter/customized/mobile_agent_prompter.py
@@ -0,0 +1,178 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+import json
+from typing import Any, Dict, List
+
+from config.config_loader import get_ufo_config
+from ufo.prompter.agent_prompter import AppAgentPrompter
+
+
+class MobileAgentPrompter(AppAgentPrompter):
+ """
+ The MobileAgentPrompter class is the prompter for the Mobile Android agent.
+ """
+
+ def __init__(
+ self,
+ prompt_template: str,
+ example_prompt_template: str,
+ ):
+ """
+ Initialize the MobileAgentPrompter.
+ :param prompt_template: The path of the prompt template.
+ :param example_prompt_template: The path of the example prompt template.
+ """
+ super().__init__(None, prompt_template, example_prompt_template)
+ self.api_prompt_template = None
+
+ def system_prompt_construction(self, additional_examples: List[Dict[str, Any]] = []) -> str:
+ """
+ Construct the system prompt for the mobile agent.
+ :param additional_examples: The additional examples added to the prompt.
+ :return: The system prompt for the mobile agent.
+ """
+
+ apis = self.api_prompt_helper(verbose=1)
+ examples = self.examples_prompt_helper(additional_examples=additional_examples)
+
+ return self.prompt_template["system"].format(apis=apis, examples=examples)
+
+ def user_prompt_construction(
+ self,
+ prev_plan: List[str],
+ user_request: str,
+ installed_apps: List[Dict[str, Any]],
+ current_controls: List[Dict[str, Any]],
+ retrieved_docs: str = "",
+ last_success_actions: List[Dict[str, Any]] = [],
+ ) -> str:
+ """
+ Construct the user prompt for action selection.
+ :param prev_plan: The previous plan.
+ :param user_request: The user request.
+ :param installed_apps: The list of installed apps on the device.
+ :param current_controls: The list of current screen controls.
+ :param retrieved_docs: The retrieved documents.
+ :param last_success_actions: The list of successful actions in the last step.
+ :return: The prompt for action selection.
+ """
+ prompt = self.prompt_template["user"].format(
+ prev_plan=json.dumps(prev_plan),
+ user_request=user_request,
+ installed_apps=json.dumps(installed_apps),
+ current_controls=json.dumps(current_controls),
+ retrieved_docs=retrieved_docs,
+ last_success_actions=json.dumps(last_success_actions),
+ )
+
+ return prompt
+
+ def user_content_construction(
+ self,
+ prev_plan: List[str],
+ user_request: str,
+ installed_apps: List[Dict[str, Any]],
+ current_controls: List[Dict[str, Any]],
+ screenshot_url: str = None,
+ annotated_screenshot_url: str = None,
+ retrieved_docs: str = "",
+ last_success_actions: List[Dict[str, Any]] = [],
+ ) -> List[Dict[str, Any]]:
+ """
+ Construct the prompt content for LLMs with screenshots and control information.
+ :param prev_plan: The previous plan.
+ :param user_request: The user request.
+ :param installed_apps: The list of installed apps on the device.
+ :param current_controls: The list of current screen controls.
+ :param screenshot_url: The clean screenshot URL (base64).
+ :param annotated_screenshot_url: The annotated screenshot URL (base64).
+ :param retrieved_docs: The retrieved documents.
+ :param last_success_actions: The list of successful actions in the last step.
+ :return: The prompt content for LLMs.
+ """
+
+ user_content = []
+
+ # Add screenshots if available
+ if screenshot_url:
+ user_content.append(
+ {
+ "type": "image_url",
+ "image_url": {"url": screenshot_url},
+ }
+ )
+
+ if annotated_screenshot_url:
+ user_content.append(
+ {
+ "type": "image_url",
+ "image_url": {"url": annotated_screenshot_url},
+ }
+ )
+
+ # Add text prompt
+ user_content.append(
+ {
+ "type": "text",
+ "text": self.user_prompt_construction(
+ prev_plan=prev_plan,
+ user_request=user_request,
+ installed_apps=installed_apps,
+ current_controls=current_controls,
+ retrieved_docs=retrieved_docs,
+ last_success_actions=last_success_actions,
+ ),
+ }
+ )
+
+ return user_content
+
+ def examples_prompt_helper(
+ self,
+ header: str = "## Response Examples",
+ separator: str = "Example",
+ additional_examples: List[Dict[str, Any]] = [],
+ ) -> str:
+ """
+ Construct the prompt for examples.
+ :param header: The header of the prompt.
+ :param separator: The separator of the prompt.
+ :param additional_examples: The additional examples added to the prompt.
+ :return: The prompt for examples.
+ """
+
+ template = """
+ [User Request]:
+ {request}
+ [Response]:
+ {response}"""
+
+ example_dict = [
+ self.example_prompt_template[key]
+ for key in self.example_prompt_template.keys()
+ if key.startswith("example")
+ ] + additional_examples
+
+ example_list = []
+
+ for example in example_dict:
+ example_str = template.format(
+ request=example.get("Request"),
+ response=json.dumps(example.get("Response")),
+ )
+ example_list.append(example_str)
+
+ return self.retrieved_documents_prompt_helper(header, separator, example_list)
+
+ def api_prompt_helper(self, verbose: int = 1) -> str:
+ """
+ Construct the prompt for APIs.
+ :param verbose: The verbosity level.
+ :return: The prompt for APIs.
+ """
+ if self.api_prompt_template is None:
+ raise ValueError(
+ "API prompt template is not set. Call create_api_prompt_template first."
+ )
+ return self.api_prompt_template
diff --git a/ufo/prompts/third_party/mobile_agent.yaml b/ufo/prompts/third_party/mobile_agent.yaml
new file mode 100644
index 000000000..518c11d8e
--- /dev/null
+++ b/ufo/prompts/third_party/mobile_agent.yaml
@@ -0,0 +1,80 @@
+version: 1.0
+
+system: |-
+ You are **MobileAgent**, the UFO framework's intelligent agent for executing and reasoning about Android mobile device operations.
+ Your goal is to **complete the entire User Request** by interacting with the Android device using available touch, swipe, and app control APIs.
+
+ ## Capabilities
+ - Capture and analyze Android device screenshots to understand the current screen state.
+ - Interact with UI controls (tap, swipe, type text) to navigate apps and complete tasks.
+ - Launch applications and navigate between apps.
+ - Retrieve device information including installed apps and current screen controls.
+ - Execute actions based on annotated control IDs from the UI analysis.
+
+ ## Current Device Context
+ You have access to:
+ - **Screenshot**: A visual representation of the current screen (when provided).
+ - **Installed Apps**: A list of installed applications on the device (provided in user prompt).
+ - **Current Screen Controls**: A list of UI controls on the current screen with their IDs (provided in user prompt).
+
+ ## Task Status
+ After each step, decide the overall status of the **User Request**:
+ - `CONTINUE` — the request is partially complete; further actions are required.
+ - `FINISH` — the request has been successfully fulfilled; no further actions are needed.
+ - `FAIL` — the request cannot be completed due to invalid state, app crashes, or repeated ineffective attempts.
+
+ ## Response Format
+ Always respond **only** with valid JSON that strictly follows the structure below.
+ Your output must be directly parseable by `json.loads()` — no markdown, comments, or extra text.
+
+ Required JSON keys:
+
+ {{
+ "observation": str, "",
+ "thought": str, "",
+ "action": {{
+ "function": str, "",
+ "arguments": Dict[str, Any], The dictionary of arguments {{ "": "" }}, for the function. Use an empty dictionary if no arguments are needed or if no execution is needed.
+ "status": str, ""
+ }},
+ "plan": List[str], "",
+ "result": str, ""
+ }}
+
+ ## Operational Rules
+ - Always analyze the **screenshot** and **current screen controls** before deciding your action.
+ - Use control IDs from the current_controls list when interacting with specific UI elements.
+ - When tapping controls, prefer using `click_control` with control_id and control_name over raw `tap` with coordinates.
+ - When typing text, use `type_text` with control_id if targeting a specific input field.
+ - Use `launch_app` with the correct package name or app ID from the installed_apps list.
+ - Use `swipe` for scrolling or navigation gestures.
+ - Use `press_key` for hardware/system keys (BACK, HOME, ENTER, etc.).
+ - Use `wait` to pause execution when waiting for UI transitions, animations, app loading, or network responses. Common wait times: 0.5-1.0 seconds for quick transitions, 1-3 seconds for app launches or heavy UI changes.
+ - Do **not** ask for user confirmation or additional input.
+ - Review previous actions to avoid repeating ineffective or failed commands.
+ - If the screen state doesn't change after multiple similar actions, consider trying a different approach or declare FAIL.
+ - When the User Request is completed, set `"status": "FINISH"` and provide a comprehensive summary in the `"result"` field.
+
+ ## Actions
+ - You are able to use the following APIs to interact with the Android device.
+ {apis}
+
+ ## Examples
+ - Below are some examples for your reference. Only use them as guidance and do not copy them directly.
+ {examples}
+
+ ## Final Reminder
+ Please observe the **screenshot**, **installed apps**, **current screen controls**, and previous steps carefully to decide your next action.
+ The control IDs in current_controls correspond to the annotated elements you may see in the screenshot.
+ Think step-by-step, act carefully, and output only the required JSON structure.
+ Any invalid JSON or extra text will crash the system.
+
+
+user: |-
+ {user_request}
+ {last_success_actions}
+ {prev_plan}
+ {installed_apps}
+ {current_controls}
+ {retrieved_docs}
+
diff --git a/ufo/prompts/third_party/mobile_agent_example.yaml b/ufo/prompts/third_party/mobile_agent_example.yaml
new file mode 100644
index 000000000..5a2c417a0
--- /dev/null
+++ b/ufo/prompts/third_party/mobile_agent_example.yaml
@@ -0,0 +1,201 @@
+version: 1.0
+
+example1:
+ Request: |-
+ Open the Chrome browser app on the Android device.
+ Response:
+ observation: |-
+ The device screen shows the home screen with various app icons. The installed apps list includes Chrome with package name 'com.android.chrome' and ID '5'.
+ thought: |-
+ To open Chrome browser, I should use the launch_app function with the Chrome package name from the installed apps list.
+ action:
+ function: |-
+ launch_app
+ arguments:
+ {"package_name": "com.android.chrome", "id": "5"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Launch the Chrome browser app.
+ - (2) Verify that Chrome has opened by checking the next screenshot.
+ - (3) Mark the request as FINISH once Chrome is successfully opened.
+ result: |-
+ Request in CONTINUE state. Successfully initiated launch_app command for Chrome browser (com.android.chrome, ID: 5). The app launch command has been submitted. No execution result received yet. Remaining steps: (1) await app launch completion, (2) verify Chrome browser is now visible on screen through screenshot observation, (3) confirm successful app opening by checking for Chrome UI elements (address bar, tabs, menu), and (4) mark task as FINISH with confirmation that Chrome browser is now open and ready for use.
+
+example2:
+ Request: |-
+ Type "hello world" in the search box on the current screen.
+ Response:
+ observation: |-
+ The screen shows a search interface with a text input field. The current_controls list includes a control with ID '12', name 'Search', type 'EditText', at coordinates [50, 200, 900, 100].
+ thought: |-
+ The type_text function focuses the target field before typing, so I can type directly into the search box without a separate click. I'll use the control ID from the current_controls list for precise interaction.
+ action:
+ function: |-
+ type_text
+ arguments:
+ {"text": "hello world", "control_id": "12", "control_name": "Search"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Type the text in the search box using the control ID.
+ - (2) Verify the text appears in the search box from the next screenshot.
+ - (3) Mark as FINISH if the text is successfully entered.
+ result: |-
+ Request in CONTINUE state. Successfully initiated type_text command to input "hello world" into the search box (control_id: 12, control_name: Search). The type_text command will automatically click the search field to focus it before typing. No execution result received yet. Remaining steps: (1) await text input completion, (2) verify from screenshot that "hello world" now appears in the search box control, (3) confirm the text input was successful by observing the search field contains the expected text, and (4) mark task as FINISH with confirmation that the search query has been entered.
+
+example3:
+ Request: |-
+ Click the "Submit" button on the form.
+ Response:
+ observation: |-
+ The screen displays a form with several input fields and a submit button. The current_controls list shows a Button control with ID '25', name 'Submit', type 'Button', located at [300, 800, 400, 100].
+ thought: |-
+ I should use click_control with the exact control ID and name to tap the Submit button, which is more reliable than using raw coordinates.
+ action:
+ function: |-
+ click_control
+ arguments:
+ {"control_id": "25", "control_name": "Submit"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Click the Submit button using its control ID.
+ - (2) Observe the result in the next screenshot (form submission, navigation to new screen, or error message).
+ - (3) Mark as FINISH if the form submission completes successfully.
+ result: |-
+ Request in CONTINUE state. Successfully initiated click_control command for the Submit button (control_id: 25, control_name: Submit). The command will tap the center of the button control at coordinates calculated from its bounding rectangle [300, 800, 400, 100]. No execution result received yet. Remaining steps: (1) await button click completion, (2) observe the screen state change in next screenshot - possible outcomes include form submission success message, navigation to confirmation page, error validation messages, or loading indicator, (3) verify the form submission completed by checking for success indicators or new screen content, and (4) mark task as FINISH with summary of submission outcome (e.g., "Form submitted successfully" or "Submission failed with error: [error message]").
+
+example4:
+ Request: |-
+ Scroll down on the current screen to see more content.
+ Response:
+ observation: |-
+ The screen shows a scrollable content area with text and images. To scroll down, I need to perform a swipe gesture from bottom to top.
+ thought: |-
+ I'll use the swipe function to perform a downward scroll. I need to start from a point near the bottom of the screen and swipe upward. Assuming screen height is around 1920 pixels, I'll swipe from y=1500 to y=500.
+ action:
+ function: |-
+ swipe
+ arguments:
+ {"start_x": 540, "start_y": 1500, "end_x": 540, "end_y": 500, "duration": 300}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Execute the swipe gesture to scroll down the content.
+ - (2) Observe the next screenshot to verify new content is visible.
+ - (3) Mark as FINISH once scrolling is complete and new content is displayed.
+ result: |-
+ Request in CONTINUE state. Successfully initiated swipe command to scroll down the screen. The swipe gesture will move from coordinates (540, 1500) upward to (540, 500) over 300 milliseconds, simulating a natural scroll-down motion. No execution result received yet. Remaining steps: (1) await swipe gesture completion, (2) observe the next screenshot to verify screen content has scrolled - new items should now be visible that were previously off-screen below, (3) compare visible content with previous screenshot to confirm scroll distance, and (4) mark task as FINISH with confirmation that scrolling completed and new content is now visible (e.g., "Successfully scrolled down, new content items [item names] are now visible").
+
+example5:
+ Request: |-
+ Go back to the previous screen.
+ Response:
+ observation: |-
+ The device is currently showing a detail page or sub-screen. The user wants to navigate back to the previous screen.
+ thought: |-
+ I should use the press_key function with KEYCODE_BACK to simulate the back button press, which is the standard Android navigation pattern.
+ action:
+ function: |-
+ press_key
+ arguments:
+ {"key_code": "KEYCODE_BACK"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Press the back button using KEYCODE_BACK.
+ - (2) Observe the next screenshot to verify navigation to the previous screen.
+ - (3) Mark as FINISH once the previous screen is displayed.
+ result: |-
+ Request in CONTINUE state. Successfully initiated press_key command with KEYCODE_BACK to navigate to the previous screen. The hardware back button press will trigger Android's standard back navigation behavior. No execution result received yet. Remaining steps: (1) await key press completion, (2) observe the next screenshot to verify the app has navigated back - the previous screen should now be visible, (3) identify the previous screen by comparing with navigation history or recognizing familiar UI elements, and (4) mark task as FINISH with confirmation of successful back navigation (e.g., "Successfully navigated back to [previous screen name]" or "Returned to previous screen showing [description of visible content]").
+
+example6:
+ Request: |-
+ Find and click the Settings icon on the home screen.
+ Response:
+ observation: |-
+ The device shows the home screen with multiple app icons. The current_controls list includes multiple clickable controls. I need to identify the Settings icon by looking for a control with "Settings" in its name or type.
+ thought: |-
+ From the current_controls list, I can see there's a control with ID '8', name 'Settings', type 'ImageView' at coordinates [100, 300, 150, 150]. This is likely the Settings app icon.
+ action:
+ function: |-
+ click_control
+ arguments:
+ {"control_id": "8", "control_name": "Settings"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Click the Settings icon using its control ID.
+ - (2) Verify the Settings app opens in the next screenshot.
+ - (3) Mark as FINISH once Settings screen is visible.
+ result: |-
+ Request in CONTINUE state. Successfully initiated click_control command for the Settings icon (control_id: 8, control_name: Settings). The command will tap the center of the Settings icon at coordinates calculated from its bounding rectangle [100, 300, 150, 150]. No execution result received yet. Remaining steps: (1) await icon click completion, (2) observe next screenshot to verify Settings app has opened - expect to see Settings screen with options like WiFi, Bluetooth, Display, Sound, etc., (3) confirm Settings app is active by identifying characteristic Settings UI elements and menu items, and (4) mark task as FINISH with confirmation (e.g., "Successfully opened Settings app, now displaying main Settings screen with options: [list of visible settings categories]").
+
+example7:
+ Request: |-
+ Check if WhatsApp is installed on the device.
+ Response:
+ observation: |-
+ I have access to the installed_apps list which contains all user-installed applications on the device. I need to search through this list to find WhatsApp.
+ thought: |-
+ I should examine the installed_apps data structure to look for WhatsApp. The package name for WhatsApp is typically 'com.whatsapp'. I don't need to execute any action since I can answer this from the installed apps data.
+ action:
+ function: |-
+
+ arguments:
+ {}
+ status: |-
+ FINISH
+ plan:
+ - (1) Search through installed_apps for WhatsApp package.
+ - (2) Report whether WhatsApp is found or not.
+ result: |-
+ Task FINISHED. Checked installed applications list for WhatsApp. Result: [If found] WhatsApp is installed on the device (Package: com.whatsapp, App ID: [id], Name: [name]). [If not found] WhatsApp is NOT installed on the device. The installed_apps list was searched for package names containing 'whatsapp' and app names matching 'WhatsApp', but no match was found among the [number] user-installed applications.
+
+example8:
+ Request: |-
+ Launch the Photos app and wait for it to fully load before proceeding.
+ Response:
+ observation: |-
+ The device shows the home screen. The installed_apps list includes Photos app with package name 'com.google.android.apps.photos' and ID '7'. Apps often need time to initialize after launch.
+ thought: |-
+ I need to launch the Photos app first, then wait for it to load completely before marking the task as FINISH. I'll use launch_app now, followed by a wait action in the next step, to ensure the app has time to initialize and display its content.
+ action:
+ function: |-
+ launch_app
+ arguments:
+ {"package_name": "com.google.android.apps.photos", "id": "7"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Launch the Photos app using launch_app.
+ - (2) Wait 2 seconds for the app to fully load and display its interface.
+ - (3) Verify the Photos app is open and displaying content in the next screenshot.
+ - (4) Mark as FINISH once the app is confirmed to be fully loaded.
+ result: |-
+ Request in CONTINUE state. Successfully initiated launch_app command for Photos app (com.google.android.apps.photos, ID: 7). The app launch has been triggered. Next step will be to wait for the app to fully initialize. Remaining steps: (1) use wait action to pause for app loading (2 seconds recommended for app launches), (2) observe the next screenshot to verify Photos app has fully loaded - expect to see photo grid or albums view, (3) confirm the app is responsive and displaying content, not just a splash screen or loading indicator, and (4) mark task as FINISH with confirmation (e.g., "Successfully launched Photos app, now displaying [photo library/albums view] with [number] items visible").
+
+example9:
+ Request: |-
+ After clicking the Submit button, wait for the confirmation message to appear.
+ Response:
+ observation: |-
+ The screen shows a form with a Submit button (control_id: 15, control_name: Submit). After submission, the server needs time to process the request and display a confirmation message.
+ thought: |-
+ I should click the Submit button first, then wait for the server response. Network operations and UI transitions typically need 1-2 seconds. I'll wait 1.5 seconds to ensure the confirmation message has time to appear.
+ action:
+ function: |-
+ click_control
+ arguments:
+ {"control_id": "15", "control_name": "Submit"}
+ status: |-
+ CONTINUE
+ plan:
+ - (1) Click the Submit button to trigger form submission.
+ - (2) Wait 1.5 seconds for server processing and confirmation message to appear.
+ - (3) Check the next screenshot for the confirmation message.
+ - (4) Mark as FINISH once confirmation is visible.
+ result: |-
+ Request in CONTINUE state. Successfully initiated click_control command for the Submit button (control_id: 15, control_name: Submit). The form submission has been triggered. Next step will be to wait for server response. Remaining steps: (1) use wait action to pause for server processing and UI update (1.5 seconds for network operations), (2) observe the next screenshot to look for confirmation message - common indicators include success toast, confirmation dialog, or navigation to success page, (3) verify the confirmation message content to ensure submission was successful, and (4) mark task as FINISH with confirmation details (e.g., "Form submitted successfully. Confirmation message: '[message text]'" or "Submission completed, now showing [result screen description]").
+
diff --git a/ufo/server/app.py b/ufo/server/app.py
index ed962feeb..440514132 100644
--- a/ufo/server/app.py
+++ b/ufo/server/app.py
@@ -15,8 +15,8 @@ def parse_args():
"--platform",
type=str,
default=None,
- choices=["windows", "linux"],
- help="Platform override (auto-detected if not specified)",
+ choices=["windows", "linux", "mobile"],
+ help="Platform override (windows, linux, or mobile). Auto-detected if not specified.",
)
parser.add_argument(
"--log-level",
diff --git a/ufo/server/services/session_manager.py b/ufo/server/services/session_manager.py
index 5c7cc846a..94583e6c4 100644
--- a/ufo/server/services/session_manager.py
+++ b/ufo/server/services/session_manager.py
@@ -20,14 +20,14 @@
class SessionManager:
"""
This class manages active sessions for the UFO service.
- Supports both Windows and Linux platforms using SessionFactory.
+ Supports Windows, Linux, and Mobile (Android) platforms using SessionFactory.
"""
def __init__(self, platform_override: Optional[str] = None):
"""
Initialize the SessionManager.
This class manages active sessions for the UFO service.
- :param platform_override: Override platform detection ('windows' or 'linux').
+ :param platform_override: Override platform detection ('windows', 'linux', or 'mobile').
If None, platform is auto-detected.
"""
self.sessions: Dict[str, BaseSession] = {}
@@ -67,9 +67,9 @@ def get_or_create_session(
:param task_name: The name of the task.
:param request: Optional request text to initialize the session.
:param task_protocol: Optional AIP TaskExecutionProtocol instance.
- :param platform_override: Override platform detection ('windows' or 'linux').
+ :param platform_override: Override platform detection ('windows', 'linux', or 'mobile').
:param local: Whether the session is running in local mode with the client.
- :return: The BaseSession object for the session (Windows or Linux).
+ :return: The BaseSession object for the session (Windows, Linux, or Mobile).
"""
with self.lock:
if session_id not in self.sessions:
@@ -172,7 +172,7 @@ async def execute_task_async(
:param task_name: Task name
:param request: User request
:param task_protocol: AIP TaskExecutionProtocol instance
- :param platform_override: Platform type ('windows' or 'linux')
+ :param platform_override: Platform type ('windows', 'linux', or 'mobile')
:param callback: Optional async callback(session_id, ServerMessage) when task completes
:return: session_id
"""
diff --git a/ufo/server/ws/handler.py b/ufo/server/ws/handler.py
index 4d9a4f1e8..756f574a6 100644
--- a/ufo/server/ws/handler.py
+++ b/ufo/server/ws/handler.py
@@ -249,7 +249,7 @@ async def handler(self, websocket: WebSocket) -> None:
asyncio.create_task(self.handle_message(msg))
except WebSocketDisconnect as e:
self.logger.warning(
- f"[WS] {client_id} disconnected �?code={e.code}, reason={e.reason}"
+ f"[WS] {client_id} disconnected - code={e.code}, reason={e.reason}"
)
if client_id:
await self.disconnect(client_id)
@@ -419,7 +419,7 @@ async def send_result(sid: str, result_msg: ServerMessage):
error=result_msg.error,
response_id=result_msg.response_id,
)
- self.logger.info(f"[WS] �?Sent to client {client_id} successfully")
+ self.logger.info(f"[WS] ✅ Sent to client {client_id} successfully")
# If constellation client, also notify the target device
if client_type == ClientType.CONSTELLATION and target_device_id:
@@ -440,7 +440,7 @@ async def send_result(sid: str, result_msg: ServerMessage):
response_id=result_msg.response_id,
)
self.logger.info(
- f"[WS] �?Sent to target device {target_device_id} successfully"
+ f"[WS] ✅ Sent to target device {target_device_id} successfully"
)
except (ConnectionError, IOError) as target_error:
self.logger.warning(
@@ -451,7 +451,7 @@ async def send_result(sid: str, result_msg: ServerMessage):
f"[WS] ⚠️ Target device {target_device_id} disconnected, skipping send"
)
- self.logger.info(f"[WS] �?All results sent for session {sid}")
+ self.logger.info(f"[WS] ✅ All results sent for session {sid}")
except (ConnectionError, IOError) as e:
self.logger.warning(
f"[WS] ⚠️ Connection error sending result for {sid}: {e}"
@@ -460,7 +460,7 @@ async def send_result(sid: str, result_msg: ServerMessage):
import traceback
self.logger.error(
- f"[WS] �?Failed to send result for {sid}: {e}\n{traceback.format_exc()}"
+ f"[WS] ❌ Failed to send result for {sid}: {e}\n{traceback.format_exc()}"
)
self.logger.info(
|