Conversation

@rich7420
Contributor

Purpose of PR

Currently the PyTorch import path makes multiple copies: torch.Tensor -> tolist() -> Vec.
This PR replaces the tolist() approach with a zero-copy conversion through the PyO3 NumPy interface:

  • Convert PyTorch tensor to NumPy view via tensor.detach().numpy() (zero-copy when C-contiguous)
  • Extract &[f64] slice directly using PyReadonlyArrayDyn::as_slice()
  • Eliminates intermediate Python list and Rust Vec allocations
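The zero-copy claim above can be checked from Python: `detach().numpy()` returns a NumPy view over the tensor's existing buffer rather than a copy, which is what lets the Rust side borrow the data as a slice. A minimal sketch (this is illustration, not the PR's Rust code; it assumes a CPU, C-contiguous float64 tensor):

```python
import numpy as np
import torch

t = torch.arange(8, dtype=torch.float64)
a = t.detach().numpy()  # NumPy view over the tensor's buffer: no copy

# Both objects point at the same memory...
assert a.__array_interface__["data"][0] == t.data_ptr()

# ...so a mutation of the tensor is visible through the array.
t[0] = 42.0
assert a[0] == 42.0

# A transposed view is not C-contiguous; PyO3's as_slice() rejects such
# arrays, so those inputs would still need a .contiguous() copy first.
nc = torch.arange(8, dtype=torch.float64).reshape(2, 4).t()
assert not nc.is_contiguous()
```

Note that `as_slice()` on the Rust side only succeeds for contiguous arrays, so the zero-copy path applies exactly when the tensor is already C-contiguous.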

In addition, this PR provides a benchmark that measures PyTorch input latency.
We can remove the benchmark if you think it's unnecessary.
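The cost being eliminated can be illustrated without the Rust extension at all: converting a 16-qubit state vector (65536 doubles, matching the benchmark below) via `tolist()` materializes one Python float object per element, while taking a view touches no data. A hedged NumPy-only sketch (hypothetical timings, not the PR's benchmark script):

```python
import time
import numpy as np

a = np.random.rand(2 ** 16)  # one 16-qubit vector: 65536 doubles

t0 = time.perf_counter()
for _ in range(50):
    a.tolist()               # old path: builds 65536 Python floats per call
tolist_s = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(50):
    memoryview(a)            # zero-copy: just a view over the existing buffer
view_s = time.perf_counter() - t0

print(f"tolist: {tolist_s:.4f}s  view: {view_s:.6f}s")
```

On any machine the view path should be orders of magnitude cheaper per call, which is the overhead the PyO3 `as_slice()` route avoids.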

Related Issues or PRs

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@rich7420
Contributor Author

rich7420 commented Jan 13, 2026

before

uv run python benchmark_latency_pytorch.py --qubits 16 --batches 100 --batch-size 32 2>&1
Uninstalled 1 package in 0.40ms
Installed 1 package in 2ms
PyTorch Tensor Encoding Benchmark: 16 Qubits, 3200 Samples
  Batch size   : 32
  Vector length: 65536
  Batches      : 100
  Prefetch     : 16

======================================================================
PYTORCH TENSOR LATENCY BENCHMARK: 16 Qubits, 3200 Samples
======================================================================

[Mahout-PyTorch] PyTorch Tensor Input (Zero-Copy Optimization)...
  Total Time: 2.8140 s (0.879 ms/vector)

[Mahout-NumPy] NumPy Array Input (Baseline)...
  Total Time: 2.3389 s (0.731 ms/vector)

======================================================================
LATENCY COMPARISON (Lower is Better)
Samples: 3200, Qubits: 16
======================================================================
PyTorch Tensor          0.879 ms/vector
NumPy Array             0.731 ms/vector
----------------------------------------------------------------------
Speedup: 0.83x
Improvement: -20.3%

@rich7420
Contributor Author

after

uv run python benchmark_latency_pytorch.py --qubits 16 --batches 100 --batch-size 32 2>&1
Uninstalled 1 package in 0.40ms
Installed 1 package in 4ms
PyTorch Tensor Encoding Benchmark: 16 Qubits, 3200 Samples
  Batch size   : 32
  Vector length: 65536
  Batches      : 100
  Prefetch     : 16

======================================================================
PYTORCH TENSOR LATENCY BENCHMARK: 16 Qubits, 3200 Samples
======================================================================

[Mahout-PyTorch] PyTorch Tensor Input (Zero-Copy Optimization)...
  Total Time: 2.3464 s (0.733 ms/vector)

[Mahout-NumPy] NumPy Array Input (Baseline)...
  Total Time: 2.3175 s (0.724 ms/vector)

======================================================================
LATENCY COMPARISON (Lower is Better)
Samples: 3200, Qubits: 16
======================================================================
PyTorch Tensor          0.733 ms/vector
NumPy Array             0.724 ms/vector
----------------------------------------------------------------------
Speedup: 0.99x
Improvement: -1.2%

Member

@guan404ming left a comment

LGTM

@guan404ming
Member

PyTorch Tensor Encoding Benchmark: 16 Qubits, 3200 Samples            
Batch size   : 32
Vector length: 65536
Batches      : 100
Prefetch     : 16

======================================================================
PYTORCH TENSOR LATENCY BENCHMARK: 16 Qubits, 3200 Samples
======================================================================

[Mahout-PyTorch] PyTorch Tensor Input (Zero-Copy Optimization)...
Total Time: 1.4448 s (0.451 ms/vector)

[Mahout-NumPy] NumPy Array Input (Baseline)...
Total Time: 1.4269 s (0.446 ms/vector)

======================================================================
LATENCY COMPARISON (Lower is Better)
Samples: 3200, Qubits: 16
======================================================================
PyTorch Tensor          0.451 ms/vector
NumPy Array             0.446 ms/vector
----------------------------------------------------------------------
Speedup: 0.99x
Improvement: -1.3%

@guan404ming guan404ming merged commit 011e851 into apache:main Jan 13, 2026
4 checks passed