
Conversation

@Waqar-ukaea (Collaborator) commented Nov 12, 2025

This PR adds a set of methods to the XDG API that allow batches of rays to be launched in a single call. The main intention is to plug into GPRT to perform GPU ray tracing at scale across the RT pipeline.

At this stage, I have a working set of overloads for GPRTRayTracer::point_in_volume() and GPRTRayTracer::ray_fire() which can be used to launch rays in large batches.
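As a rough illustration of the shape these batch overloads take (the Vec3 stand-in, parameter lists and output types below are assumptions for illustration, not the actual XDG signatures):

#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };  // stand-in for XDG's position/direction types

class GPRTRayTracerBatchSketch {
public:
  // Fire N rays in one call: one origin/direction pair per ray in, one hit distance per ray out.
  void ray_fire(const std::vector<Vec3>& origins,
                const std::vector<Vec3>& directions,
                std::vector<double>& hit_distances);

  // Batch point-in-volume: one containment result per query point, with optional per-point directions.
  void point_in_volume(const std::vector<Vec3>& points,
                       std::vector<int>& results,
                       const std::vector<Vec3>* directions = nullptr);
};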

The two unit tests test_point_in_volume and test_ray_fire have been extended to also test the batch variants of the methods, but have been gated to only do so when GPRT is enabled (until I get a working implementation for Embree too).

Three new miniapps have been added to xdg/tools. These are:

  • batch-ray-fire - Provide a set of origins and directions and fire rays using the batch version of xdg::ray_fire()
  • batch-point-in-volume - Provide a set of points and (optionally) directions to perform point in volume checks using the batch version of xdg::point_in_volume()
  • ray-benchmark - Provide a .h5m model and an origin; performs a ray tracing throughput benchmark against that model, sampling ray directions uniformly from the unit sphere around the provided origin (a sketch of this sampling is shown after the list)
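A minimal sketch of the direction sampling ray-benchmark relies on, i.e. directions drawn uniformly from the unit sphere around the provided origin (the RNG and container types here are assumptions, not the miniapp's actual code):

#include <array>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

std::vector<std::array<double, 3>> sample_unit_sphere(std::size_t n, unsigned seed = 42)
{
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> u01(0.0, 1.0);
  const double pi = std::acos(-1.0);
  std::vector<std::array<double, 3>> dirs(n);
  for (auto& d : dirs) {
    const double cos_theta = 2.0 * u01(rng) - 1.0;                  // uniform in [-1, 1]
    const double sin_theta = std::sqrt(1.0 - cos_theta * cos_theta);
    const double phi = 2.0 * pi * u01(rng);                         // uniform in [0, 2*pi)
    d = {sin_theta * std::cos(phi), sin_theta * std::sin(phi), cos_theta};
  }
  return dirs;
}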

The most interesting of the new miniapps is by far ray-benchmark, as it allows for a direct performance comparison between our Embree and GPRT ray tracing implementations. On the jezebel.h5m model, with 40 million rays launched from the origin and intersecting the volume, we see the following metrics.

Embree: (screenshot of benchmark output)
GPRT: (screenshot of benchmark output)

This is a promising sign that we are already seeing a performance increase. Hopefully coming PRs will widen this gap further.

@Waqar-ukaea (Collaborator, Author) commented Nov 12, 2025

40 million rays

Why 40 million rays? Because that seems to be about the memory limit I can transfer in one go using Vulkan before I run into the following error:

ERROR: [-1649273453][UNASSIGNED-vkAllocateMemory-maxMemoryAllocationSize] : vkAllocateMemory(): pAllocateInfo->allocationSize (4800000000) is larger than maxMemoryAllocationSize (4292870144). While this might work locally on your machine, there are many external factors each platform has that is used to determine this limit. You should receive VK_ERROR_OUT_OF_DEVICE_MEMORY from this call, but even if you do not, it is highly advised from all hardware vendors to not ignore this limit.

So I'm going to have to revisit how we are currently managing ray buffers to better handle this, because 40 million rays only uses about 50% of the 8 GB of VRAM available on my local machine.
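For reference, the numbers in that error work out to 4,800,000,000 B / 40,000,000 rays = 120 B per ray for that single allocation, against a maxMemoryAllocationSize of roughly 4 GiB. One option (a minimal sketch only, not existing XDG code) would be to split a large request into sub-batches whose per-buffer allocations each stay under the limit:

#include <algorithm>
#include <cstddef>

// Splits a large batch into chunks whose per-buffer allocation stays under the
// device's maxMemoryAllocationSize. bytes_per_ray and launch_batch are
// placeholders supplied by the caller.
template <typename LaunchFn>
void launch_in_chunks(std::size_t total_rays, std::size_t bytes_per_ray,
                      std::size_t max_allocation_bytes, LaunchFn launch_batch)
{
  const std::size_t max_rays_per_chunk = max_allocation_bytes / bytes_per_ray;
  for (std::size_t first = 0; first < total_rays; first += max_rays_per_chunk) {
    const std::size_t count = std::min(max_rays_per_chunk, total_rays - first);
    launch_batch(first, count);  // e.g. fill buffers for [first, first + count) and fire
  }
}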

@Waqar-ukaea (Collaborator, Author) commented:

@pshriwise When running OpenMC in event-based mode, does the number of particles in flight ever change? i.e. is there a need to change the number of rays launched in a single batch on the fly?

Right now both GPRTRayTracer::point_in_volume() and GPRTRayTracer::ray_fire() call this internal method check_ray_buffer_capacity():

void GPRTRayTracer::check_ray_buffer_capacity(const size_t N)
{
  if (N <= rayHitBuffers_.capacity) return; // current capacity is sufficient

  // Resize buffers to accommodate N rays
  size_t newCapacity = std::max(N, rayHitBuffers_.capacity * 2); // double the capacity or set to N, whichever is larger

  gprtBufferResize(context_, rayHitBuffers_.ray, newCapacity, false);
  gprtBufferResize(context_, rayHitBuffers_.hit, newCapacity, false);
  rayHitBuffers_.capacity = newCapacity;

  // Since we have resized the ray buffers, we need to update the geom_data->ray pointers in all geometries too
  for (auto const& [surf, geom] : surface_to_geometry_map_) {
    DPTriangleGeomData* geom_data = gprtGeomGetParameters(geom);
    geom_data->ray = gprtBufferGetDevicePointer(rayHitBuffers_.ray); 
  }

  // Update raygen data pointers
  for (auto const& [type, rayGen] : rayGenPrograms_) {
    dblRayGenData* rayGenData = gprtRayGenGetParameters(rayGen);
    rayGenData->ray = gprtBufferGetDevicePointer(rayHitBuffers_.ray);
    rayGenData->hit = gprtBufferGetDevicePointer(rayHitBuffers_.hit);
  }

  gprtBuildShaderBindingTable(context_, static_cast<GPRTBuildSBTFlags>(GPRT_SBT_GEOM | GPRT_SBT_RAYGEN));
}

The method resizes our ray buffers if either ray_fire() or point_in_volume() is requested with an N > rayHitBuffers_.capacity. But I'm curious whether N would ever actually change during an OpenMC simulation?
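For illustration, here is the growth rule above in isolation (standalone sketch, not XDG code): capacity never shrinks, and a request larger than the current capacity either doubles it or jumps straight to N, whichever is larger, so even a fluctuating N only triggers an occasional resize and SBT rebuild.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
  std::size_t capacity = 0;
  const std::vector<std::size_t> requests{1'000'000, 1'500'000, 4'000'000, 3'000'000};
  for (std::size_t n : requests) {
    if (n > capacity)
      capacity = std::max(n, capacity * 2);  // same rule as check_ray_buffer_capacity
    std::cout << "request " << n << " -> capacity " << capacity << '\n';
  }
  // prints: 1000000 -> 1000000, 1500000 -> 2000000, 4000000 -> 4000000, 3000000 -> 4000000
}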

@Waqar-ukaea linked an issue Nov 13, 2025 that may be closed by this pull request
@Waqar-ukaea (Collaborator, Author) commented Nov 19, 2025

I've re-run my ray throughput benchmark miniapp with a larger model, and I've also moved the timing region to be around the actual raygen launch so that it purely times the ray tracing computation, rather than inadvertently also including the time taken to transfer ray buffers to the device and hit buffers back to the host.
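Roughly, the timing now looks like this (a minimal sketch: std::chrono is real, but the callable is a stand-in for whatever the miniapp launches, and buffer uploads/downloads sit outside the timed region):

#include <chrono>

// Times only the callable it is given; in the miniapp this would be the raygen
// launch plus a device synchronisation, with host<->device transfers excluded.
template <typename LaunchAndSync>
double trace_only_seconds(LaunchAndSync launch_and_sync)
{
  const auto t0 = std::chrono::steady_clock::now();
  launch_and_sync();  // raygen launch + wait for completion
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}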

I am running this on my local machine, which contains an NVIDIA RTX 2000 Ada mobile GPU (used by GPRT) and a 13th Gen Intel® Core™ i7-13850HX (used by Embree). Spec screenshots for the RTX 2000 Ada and the i7-13850HX:

(images: chip specification screenshots)

A render of the simple_tokamak model [1] used in these preliminary benchmarks is shown in the image below, along with a depiction of the rays launched (a significantly smaller number of rays is plotted) and the volume queried against highlighted in blue:
(image: ray_benchmark_tokamak_setup)

Benchmark parameters

| Model | Volume | No. of Elements | No. of Rays | Location |
| --- | --- | --- | --- | --- |
| simple_tokamak | 2 | 280K | 50M | (180, 250, -27) |

Ray tracer performance

Note - GPRT results do not take device buffer IO into account, so the comparison to Embree is not entirely like-for-like.

| Ray Tracer backend | Ray tracing Wall Time (sec) | Ray Tracing Throughput (ray/sec) | Speedup vs Embree |
| --- | --- | --- | --- |
| Embree | 10.7318 | 4.65905e+06 | 1× (baseline) |
| GPRT (FP64) | 0.762942 | 6.55358e+07 | ~14.1× |
| GPRT (FP32) + RT cores | 0.0252805 | 1.97781e+09 | ~424.5× |

So we already see a fairly significant speedup in pure ray tracing throughput when moving from Embree to GPRT whilst maintaining the full mixed-precision algorithm. However, if we are able to move to fully single-precision ray tracing, we see a speedup of nearly 425×. And this isn't even a particularly powerful graphics-oriented chip, and it is a few years old now. An RTX 5090 (launched in January 2025) has around 10× the FP32 FLOPS as well as 7× the number of RT cores.

EDIT - With minimal validation layers enabled (for printf), the GPRT (FP32) + RT cores backend ends up with a throughput of ~1.36496e+09 ray/sec. So there is a slight drop-off, but not as much of a difference as I previously thought there might be.

References

[1] Valentine, A., Berry, T., Bradnam, S., Hagues, J., & Hodson, J. (2022). Benchmarking of emergent radiation transport codes for fusion neutronics applications. Fusion Engineering and Design, 180, 113197. https://doi.org/10.1016/j.fusengdes.2022.113197

@Waqar-ukaea added the Ray Tracing label (changes made to the core ray tracing interface affecting both Embree and GPRT implementations) on Nov 20, 2025
@Waqar-ukaea (Collaborator, Author) commented Nov 25, 2025

Currently working on an approach to packing rays (origins + directions) on device that doesn't involve an expensive host-to-device transfer. Right now I'm setting up some GPRT buffers in the ray-benchmark miniapp to essentially act as a "mock" for a downstream application which generates those origins + directions. Getting them to work together is already proving a little difficult, so I'm not entirely sure how it will be done when that downstream application is running with an entirely different GPU runtime.

These are implemented in the methods ray_fire_packed() and pack_external_rays().

@Waqar-ukaea (Collaborator, Author) commented Nov 26, 2025

So I have something which works, but it is significantly slower when running via the pack_external_rays() -> ray_fire_packed() path. The slowdown is happening when resizing our ray/hit buffers and rebuilding the SBT. Perhaps the added cost of having the origins[N] + directions[N] buffers alongside the rays[N] and hits[N] buffers is causing the slowdown when it comes to resizing?

@Waqar-ukaea (Collaborator, Author) commented Nov 27, 2025

Following on from the trouble outlined in the last comment, I have opted for a different route to writing ray data on device, which involves exposing the ray/hit buffers and allowing the "external" application to write to them directly. This is actually the approach I originally wanted to take, but I wasn't sure which types should be exposed in the public-facing API. However, I now have a solution which seems reasonable.

I've defined a new public-facing struct to essentially wrap the internal struct I was using to manage GPRT ray and hit buffers. The public-facing struct exposes device pointers for XDG's dblRay and dblHit types, and can be obtained from XDG by calling xdg->get_device_rayhit_buffers(N):

struct DeviceRayHitBuffers {
  dblRay* rayDevPtr; // device pointer to ray buffers
  dblHit* hitDevPtr; // device pointer to hit buffers
  uint capacity = 0;
};

Once the external application has the device pointers, they can be passed to a compute shader and written directly to.
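As a rough sketch of the intended handoff (everything below is a stand-in re-declaration for illustration; only the DeviceRayHitBuffers layout and get_device_rayhit_buffers() come from the description above, and the ray-writer dispatch is hypothetical):

#include <cassert>
#include <cstddef>

// Stand-in declarations: the real dblRay/dblHit, DeviceRayHitBuffers and
// get_device_rayhit_buffers() live in XDG; the ray-writer kernel dispatch
// belongs to the downstream application.
struct dblRay;
struct dblHit;

struct DeviceRayHitBuffers {
  dblRay* rayDevPtr;   // device pointer to ray buffer
  dblHit* hitDevPtr;   // device pointer to hit buffer
  unsigned capacity = 0;
};

DeviceRayHitBuffers get_device_rayhit_buffers(std::size_t n);  // provided by XDG
void launch_external_ray_writer(dblRay* rays, std::size_t n);  // downstream kernel dispatch (hypothetical)

void fill_rays_on_device(std::size_t n)
{
  DeviceRayHitBuffers bufs = get_device_rayhit_buffers(n);  // buffers grown to hold at least n rays
  assert(bufs.capacity >= n);
  launch_external_ray_writer(bufs.rayDevPtr, n);            // dblRay records written in place on device
  // ...the batched trace is then launched through XDG's usual entry points...
}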

Right now my testing involves the "external" compute shader being a new GPRT compute shader registered to the same GPRTContext (and underlying VkInstance) that is registered to XDG.

Of course, a real-world downstream application won't have this luxury and will likely not even be using Vulkan, so I will need to figure out a way to turn these Vulkan device pointers into something meaningful that can be passed to another GPU API - see Issue #182 for more detail on that.

However, what this does mean is that I can now compare the GPRT ray tracing against Embree more fairly. Since the device IO now makes use of the GPU rather than requiring an expensive host-to-device transfer, I will include it in the next benchmarks I run.

@Waqar-ukaea (Collaborator, Author) commented Nov 28, 2025

Another important change which I hadn't highlighted earlier is the reduction in the memory footprint of the dblRay struct by making more effective use of push constants - b92bd89. A push-constant struct can be passed to the raygen shader, which essentially defines a set of constants for that RT pipeline call. So the practical change was to move the parameters which are constant for every ray into the push constants:

struct dblRay
{
  double3 origin;
  double3 direction;
-  double tMin; // Minimum distance for ray intersection
-  double tMax; // Maximum distance for ray intersection
  int32_t* exclude_primitives; // Optional for excluding primitives
  int32_t exclude_count;           // Number of excluded primitives
-  xdg::HitOrientation hitOrientation;
-  int volume_tree; // TreeID of the volume being queried
-  SurfaceAccelerationStructure volume_accel; // The volume accel 
};

struct dblRayFirePushConstants {
+  double tMax;
+  double tMin;
+  SurfaceAccelerationStructure volume_accel; 
+  int volume_tree;
+  xdg::HitOrientation hitOrientation;
};

This change results in a memory saving of about 32 B per ray, which, scaled up to 50M rays, is a saving of around 1.5 GB.
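A quick back-of-envelope check of that figure (field sizes here are assumptions for a typical 64-bit target: an 8 B handle for SurfaceAccelerationStructure and 4 B each for the enum and int):

#include <cstddef>

// 32 B moved out of every dblRay; at 50M rays that is the ~1.5 GB quoted above.
constexpr std::size_t bytes_moved_per_ray = 8   // tMin
                                          + 8   // tMax
                                          + 8   // volume_accel (assumed 8 B handle)
                                          + 4   // volume_tree
                                          + 4;  // hitOrientation (assumed 4 B enum)
constexpr std::size_t rays = 50'000'000;
static_assert(bytes_moved_per_ray == 32);
static_assert(bytes_moved_per_ray * rays == 1'600'000'000);  // 1.6e9 B ≈ 1.49 GiB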

@Waqar-ukaea (Collaborator, Author) commented Nov 28, 2025

Some new benchmarks after implementing the new approach to generating rays on device, and with more consistent timing usage thanks to PR #148!

Update 1: Now includes results for the new GPRT FP32 ray tracer backend.
Update 2: I have written a script that can automatically drive multiple runs of the benchmark problem
Update 3: Now have some performance results from a Sapphire-Rapids node on an HPC system too
Update 4: Now have results for Nvidia L40 with both FP64 and FP32 + RT cores
Update 5: Added entries to indicate Vulkan issues on AMD MI300X and Intel PVC cards
Update 6: Now have results for Nvidia A100

Benchmark parameters

| Model | Volume | No. of Elements | No. of Rays | Location | No. of Runs |
| --- | --- | --- | --- | --- | --- |
| simple_tokamak | 2 | 280K | 50M | (180, 250, -27) | 100 |

A render of the simple_tokamak model [1] used in these preliminary benchmarks is shown in the image below, along with a depiction of the rays launched (a significantly smaller number of rays is plotted) and the volume queried against highlighted in blue:
(image: ray_benchmark_tokamak_setup)

Ray tracer performance (trace-only)

Baseline = Embree (CPU), 2× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 112 threads

Times and throughput averaged over the 100 runs.

| Ray Tracer backend | Hardware (Threads / Device) | Trace Time (s) | Throughput (ray/s) | Speedup vs 2×8480+ (112 threads) | Peak FP32/FP64 (TFLOPS) + RT cores |
| --- | --- | --- | --- | --- | --- |
| Embree | 13th Gen Intel® Core™ i7-13850HX × 28 threads | 0.749128 | 1.06791e+08 | ~0.25× | N/A |
| Embree | 1× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 56 threads | 0.367881 | 2.17462e+08 | ~0.51× | N/A |
| Embree | 2× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 112 threads | 0.18914 | 4.22967e+08 | 1× (baseline) | N/A |
| GPRT (FP64) | NVIDIA RTX 2000 Ada | 1.11261 | 7.19033e+07 | ~0.17× (baseline faster) | FP32: 12.0, FP64: 0.19, RT cores: 22 |
| GPRT (FP32 + RT cores) | NVIDIA RTX 2000 Ada | 0.0302956 | 2.64065e+09 | ~6.24× | FP32: 12.0, FP64: 0.19, RT cores: 22 |
| GPRT (FP64) | NVIDIA L40 | 0.185930 | 4.319e+08 | ~1.02× | FP32: 90.5, FP64: 1.41, RT cores: 142 |
| GPRT (FP32 + RT cores) | NVIDIA L40 | 0.008051 | 9.954e+09 | ~23.5× | FP32: 90.5, FP64: 1.41, RT cores: 142 |
| GPRT (FP64) | NVIDIA A100 | 0.183721 | 4.506e+08 | ~1.07× | FP32: 19.5, FP64: 9.7, RT cores: N/A |
| GPRT (FP64) | AMD MI300X (no Vulkan ray tracing support) | N/A | N/A | N/A | N/A |
| GPRT (FP64) | Intel Data Center GPU Max 1100 (Ponte Vecchio); Vulkan doesn't recognise PVCs as physical devices | N/A | N/A | N/A | N/A |

Next Steps - It's probably worth coming up with a more computationally intense benchmark problem. For this simple ray-throughput case, it might just have to be increasing the number of rays; however, I am memory-bound on the RTX 2000 Ada, which only has 8 GB, so I'll have to think about what would be more suitable.

Performance seems to cap out at ~4e+08 rays/sec no matter how much more theoretical performance the card has. Increasing the number of rays fired beyond 80M seems to have no positive impact on this metric.


Successfully merging this pull request may close the linked issue: Array-based ray query interface.