Export Meta AI's Segment Anything 3 (SAM3) model to ONNX, then build a TensorRT engine for real-time segmentation. This repo includes a CUDA inference library and demo apps for semantic and instance segmentation.
- Project Overview
- Benchmarks
- Demos
- Repo Layout
- Quickstart
- Extensions
- Troubleshooting
- Development guide
- Disclaimer
## Project Overview

- Python tooling to export SAM3 to a clean ONNX graph.
- TensorRT-ready workflows for building optimized engines.
- A C++/CUDA library for high-performance inference with demo apps.
- Support for promptable concept segmentation (PCS), the latest feature in SAM3.
- Zero-copy support on unified-memory platforms (Jetson, DGX Spark), great for robotics and real-time interaction (see the sketch after this list).
- Everything runs inside a reproducible Docker environment (x86, Jetson, Spark).
- MIT license for the love of everything nice :)
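For context, here is a minimal sketch of the zero-copy idea using the standard CUDA runtime API: on platforms where CPU and GPU share DRAM, a pinned host buffer can be mapped into the device address space so inference reads input frames without an explicit copy. Buffer names and sizes are illustrative, not this repo's API.

```cpp
// Zero-copy sketch for unified-memory platforms (illustrative only, not this
// repo's API). A pinned, mapped host buffer gives the GPU a direct alias of
// the same physical memory, so no cudaMemcpy is needed before inference.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Allow mapped pinned allocations (must be set before context creation).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t frameBytes = 3840 * 2160 * 3;  // hypothetical 4K RGB frame

    // Pinned host allocation, mapped into the device address space.
    unsigned char* hostFrame = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&hostFrame), frameBytes,
                  cudaHostAllocMapped);

    // Device-side pointer to the same memory; a camera frame written to
    // hostFrame is immediately visible to the GPU on Jetson-class hardware.
    unsigned char* devFrame = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&devFrame), hostFrame, 0);

    std::printf("host=%p device=%p\n", static_cast<void*>(hostFrame),
                static_cast<void*>(devFrame));
    cudaFreeHost(hostFrame);
    return 0;
}
```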
## Benchmarks

The numbers below show end-to-end processing latency per image (4K resolution) in milliseconds, excluding image load/save time.
| Hardware | HF+PyTorch | TensorRT+CUDA | Speedup | Notes |
|---|---|---|---|---|
| Jetson Orin NX | 6600 ms | 950 ms | 6.95x | Uses zero-copy |
| Jetson Thor | Please contribute | | | |
| DGX Spark | Please contribute | | | |
| RTX 3090 | 438 ms | 75 ms | 5.82x | |
| A10 | 545.3 ms | 161.1 ms | 3.38x | GPU hits 100% utilization |
| A100 | 314.1 ms | 48.8 ms | 6.43x | 40GB SXM4 variant |
| H100 | 265.3 ms | 34.6 ms | 7.66x | PCIe variant |
| H100 | 213.2 ms | 24.9 ms | 8.56x | SXM5 variant |
| GH200 | 142.3 ms | 23.3 ms | 6.11x | arm64+H100 iGPU, without zero-copy |
| GH200 | 142.3 ms | 26.4 ms | 5.39x | using zero-copy |
| B200 | 160.0 ms | 17.7 ms | 9.03x | SXM6 variant |
Note: the HF+PyTorch path is GPU-backed too, so these numbers compare two GPU implementations rather than CPU vs. GPU.

Please contribute your results and I will be happy to add them here. See "Running the benchmarks" under the Development guide below to run them yourself.
## Demos

Semantic segmentation produced by the C++ demo app (prompt='dog')

Instance segmentation results (prompt='box')
## Repo Layout

- `python/`: ONNX export and visualization scripts.
- `cpp/`: C++/CUDA library and apps (TensorRT inference).
- `docker/`: Container setup (`Dockerfile.x86` and `Dockerfile.aarch64`).
- `demo/`: Example outputs from the C++ demo app.
## Quickstart

- **Request access to the gated model**
  - Visit https://huggingface.co/facebook/sam3 and request access.
  - Ensure your `HF_TOKEN` has permission.
  - Set `HF_TOKEN` as an environment variable on the host; Docker will pick it up from there.
- **Build the Docker container for your platform** (all commands below run inside it)

```bash
docker build -t sam3-trt -f docker/Dockerfile.x86 .
```

For aarch64 platforms with shared CPU/GPU memory, the C++ library in this repo supports zero-copy inference paths. Build and run the aarch64 container:

```bash
docker build -t sam3-trt-aarch64 -f docker/Dockerfile.aarch64 .
```

- **Export `HF_TOKEN` and run the Docker container**

```bash
export HF_TOKEN=<YOUR TOKEN>
docker run -it --rm \
--network=host \
--gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--runtime=nvidia \
--env HF_TOKEN \
-v "$PWD":/workspace \
-w /workspace \
sam3-trt bash
```

- **Export to ONNX**

```bash
python python/onnxexport.py
```

This produces `onnx_weights/sam3_static.onnx` plus external weight shards.
- **Build a TensorRT engine**

```bash
trtexec --onnx=onnx_weights/sam3_static.onnx --saveEngine=sam3_fp16.plan --fp16 --verbose
```

- **Build the C++/CUDA library and sample app**

```bash
mkdir cpp/build && cd cpp/build
cmake ..
make- Run the demo app
```

- **Run the demo app**

```bash
./sam3_pcs_app <image_dir> <engine_path.engine>
```

Results are written to a `results/` folder.
## Extensions

This is a deliberately raw project: it provides the crucial TensorRT/CUDA backend pieces that most applications will need. From here, please feel free to fan out into any application you like. Pull requests are very welcome! Here are some ideas I can think of:
- ROS2 wrapper for real-time robotics pipelines.
- Interactive voice-based segmentation app. Have someone speak into a microphone, use a speech-to-text model to transcribe the prompt, and feed it into the engine, which then produces the segmentation mask live. I don't have the time to build it but I hope you can.
- Live camera input and overlays. You will need a beefy GPU; SAM3 doesn't run in real time on a Jetson Nano.
## Troubleshooting

- Access errors: make sure your `HF_TOKEN` has access to `facebook/sam3`.
- ONNX export fails: install `transformers` from source if SAM3 is missing.
- TensorRT parse errors: ensure the full `onnx_weights/` directory is copied (external data is required).
- C++ build errors: confirm CUDA, TensorRT, and OpenCV are installed and discoverable via `pkg-config`.
## Development guide

- The shared library target is `sam3_trt`.
- Demo app: `sam3_pcs_app` (semantic/instance visualization modes).
- Outputs include semantic segmentation and instance segmentation mask logits. If you choose `SAM3_VISUALIZATION::VIS_NONE` in your application, you need to apply sigmoid yourself (see the sketch after this list).
- The library does not support building engines; use `trtexec` instead.
### Running the benchmarks

Use the same image directory and prompt for all runs. Both paths time the model pipeline and exclude image load/save.

Hugging Face + PyTorch:

```bash
python python/basic_script.py <image_dir>
```

TensorRT + CUDA (benchmark mode disables output writes):

```bash
./sam3_pcs_app <image_dir> <engine_path.engine> 1
```

### ONNX export notes

- The default export runs on CPU for compatibility (switch `device` to `cuda` if desired).
- SAM3 is large and exports with external weight shards; keep the entire `onnx_weights/` directory together.
### TensorRT notes

- Use `trtexec` for quick engine builds and benchmarking; a sketch of loading the resulting engine from C++ follows below.
- FP16 is the usual starting point; INT8/FP8/INT4 require calibration or compatible tooling.
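For reference, loading a `trtexec`-built plan from your own C++ code follows the standard TensorRT 8+ pattern. This is a generic sketch, not this repo's `sam3_trt` wrapper; the file name matches the engine built in the Quickstart.

```cpp
// Generic TensorRT 8+ engine-loading sketch (not this repo's sam3_trt API).
#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <memory>
#include <vector>

// TensorRT requires a logger; this one just prints warnings and errors.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main() {
    Logger logger;

    // Read the serialized engine produced by trtexec in the Quickstart.
    std::ifstream file("sam3_fp16.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    // Deserialize and create an execution context (TensorRT 8+ supports
    // plain `delete`, so unique_ptr with the default deleter is fine).
    std::unique_ptr<nvinfer1::IRuntime> runtime(
        nvinfer1::createInferRuntime(logger));
    std::unique_ptr<nvinfer1::ICudaEngine> engine(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    std::unique_ptr<nvinfer1::IExecutionContext> context(
        engine->createExecutionContext());

    // From here: set tensor addresses and enqueue inference on a CUDA stream.
    return 0;
}
```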
## License

MIT (see `LICENSE`).
If this saved you time, drop a ⭐ so others can find it and ship SAM3 faster.
## Disclaimer

All views expressed here are my own. This project is not affiliated with my employer.


