
Conversation

@Waqar-ukaea (Collaborator) commented Nov 12, 2025

This PR adds a set of methods to the XDG API that allow batches of rays to be launched in a single call. The main intention is to plug into GPRT to perform GPU ray tracing at scale across the RT pipeline.

At this stage, I have a working set of overloads for GPRTRayTracer::point_in_volume() and GPRTRayTracer::ray_fire() which can be used to launch rays in large batches.
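As a rough illustration of the shape these batch overloads take (the Vec3 stand-in, parameter lists and output types below are assumptions for illustration, not the actual XDG signatures):

#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };  // stand-in for XDG's position/direction types

class GPRTRayTracerBatchSketch {
public:
  // Fire N rays in one call: one origin/direction pair per ray in, one hit distance per ray out.
  void ray_fire(const std::vector<Vec3>& origins,
                const std::vector<Vec3>& directions,
                std::vector<double>& hit_distances);

  // Batch point-in-volume: one containment result per query point, with optional per-point directions.
  void point_in_volume(const std::vector<Vec3>& points,
                       std::vector<int>& results,
                       const std::vector<Vec3>* directions = nullptr);
};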

The two unit tests test_point_in_volume and test_ray_fire have been extended to also test the batch variants of the methods, but have been gated to only do so when GPRT is enabled (until I get a working implementation for Embree too).

Three new miniapps have been added to xdg/tools. These are:

  • batch-ray-fire - Provide a set of origins and directions and fire rays using the batch version of xdg::ray_fire()
  • batch-point-in-volume - Provide a set of points and (optionally) directions to perform point in volume checks using the batch version of xdg::point_in_volume()
  • ray-benchmark - Provide a .h5m model and an origin; performs a ray tracing throughput benchmark against that model, sampling ray directions uniformly from the unit sphere around the provided origin (a sketch of this sampling is shown after the list)
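A minimal sketch of the direction sampling ray-benchmark relies on, i.e. directions drawn uniformly from the unit sphere around the provided origin (the RNG and container types here are assumptions, not the miniapp's actual code):

#include <array>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

std::vector<std::array<double, 3>> sample_unit_sphere(std::size_t n, unsigned seed = 42)
{
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> u01(0.0, 1.0);
  const double pi = std::acos(-1.0);
  std::vector<std::array<double, 3>> dirs(n);
  for (auto& d : dirs) {
    const double cos_theta = 2.0 * u01(rng) - 1.0;                  // uniform in [-1, 1]
    const double sin_theta = std::sqrt(1.0 - cos_theta * cos_theta);
    const double phi = 2.0 * pi * u01(rng);                         // uniform in [0, 2*pi)
    d = {sin_theta * std::cos(phi), sin_theta * std::sin(phi), cos_theta};
  }
  return dirs;
}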

The most interesting of the new miniapps is by far ray-benchmark, as it allows for a direct performance comparison between our Embree and GPRT ray tracing implementations. On the jezebel.h5m model, with 40 million rays launched from the origin and intersecting the volume, we see the following metrics.

Embree: (screenshot of benchmark output)
GPRT: (screenshot of benchmark output)

This is a promising sign that we are already seeing a performance increase. Hopefully coming PRs will widen this gap further.

@Waqar-ukaea (Collaborator, Author) commented Nov 12, 2025

40 million rays

Why 40 million rays? Because that seems to be about the memory limit I can transfer in one go using Vulkan before I run into the following error:

ERROR: [-1649273453][UNASSIGNED-vkAllocateMemory-maxMemoryAllocationSize] : vkAllocateMemory(): pAllocateInfo->allocationSize (4800000000) is larger than maxMemoryAllocationSize (4292870144). While this might work locally on your machine, there are many external factors each platform has that is used to determine this limit. You should receive VK_ERROR_OUT_OF_DEVICE_MEMORY from this call, but even if you do not, it is highly advised from all hardware vendors to not ignore this limit.

So I'm going to have to revisit how we are currently managing ray buffers to better handle this, because 40 million rays only uses about 50% of the 8 GB of VRAM available on my local machine.
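For reference, the numbers in that error work out to 4,800,000,000 B / 40,000,000 rays = 120 B per ray for that single allocation, against a maxMemoryAllocationSize of roughly 4 GiB. One option (a minimal sketch only, not existing XDG code) would be to split a large request into sub-batches whose per-buffer allocations each stay under the limit:

#include <algorithm>
#include <cstddef>

// Splits a large batch into chunks whose per-buffer allocation stays under the
// device's maxMemoryAllocationSize. bytes_per_ray and launch_batch are
// placeholders supplied by the caller.
template <typename LaunchFn>
void launch_in_chunks(std::size_t total_rays, std::size_t bytes_per_ray,
                      std::size_t max_allocation_bytes, LaunchFn launch_batch)
{
  const std::size_t max_rays_per_chunk = max_allocation_bytes / bytes_per_ray;
  for (std::size_t first = 0; first < total_rays; first += max_rays_per_chunk) {
    const std::size_t count = std::min(max_rays_per_chunk, total_rays - first);
    launch_batch(first, count);  // e.g. fill buffers for [first, first + count) and fire
  }
}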

@Waqar-ukaea (Collaborator, Author) commented:

@pshriwise When running OpenMC in event-based mode, does the number of particles in flight ever change? i.e. is there a need to change the number of rays launched in a single batch on the fly?

Right now both GPRTRayTracer::point_in_volume() and GPRTRayTracer::ray_fire() call this internal method check_ray_buffer_capacity():

void GPRTRayTracer::check_ray_buffer_capacity(const size_t N)
{
  if (N <= rayHitBuffers_.capacity) return; // current capacity is sufficient

  // Resize buffers to accommodate N rays
  size_t newCapacity = std::max(N, rayHitBuffers_.capacity * 2); // double the capacity or set to N, whichever is larger

  gprtBufferResize(context_, rayHitBuffers_.ray, newCapacity, false);
  gprtBufferResize(context_, rayHitBuffers_.hit, newCapacity, false);
  rayHitBuffers_.capacity = newCapacity;

  // Since we have resized the ray buffers, we need to update the geom_data->ray pointers in all geometries too
  for (auto const& [surf, geom] : surface_to_geometry_map_) {
    DPTriangleGeomData* geom_data = gprtGeomGetParameters(geom);
    geom_data->ray = gprtBufferGetDevicePointer(rayHitBuffers_.ray); 
  }

  // Update raygen data pointers
  for (auto const& [type, rayGen] : rayGenPrograms_) {
    dblRayGenData* rayGenData = gprtRayGenGetParameters(rayGen);
    rayGenData->ray = gprtBufferGetDevicePointer(rayHitBuffers_.ray);
    rayGenData->hit = gprtBufferGetDevicePointer(rayHitBuffers_.hit);
  }

  gprtBuildShaderBindingTable(context_, static_cast<GPRTBuildSBTFlags>(GPRT_SBT_GEOM | GPRT_SBT_RAYGEN));
}

The method resizes our ray buffers if either ray_fire() or point_in_volume() is requested with an N > rayHitBuffers_.capacity. But I'm curious whether N would ever actually change during an OpenMC simulation?
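For illustration, here is the growth rule above in isolation (standalone sketch, not XDG code): capacity never shrinks, and a request larger than the current capacity either doubles it or jumps straight to N, whichever is larger, so even a fluctuating N only triggers an occasional resize and SBT rebuild.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
  std::size_t capacity = 0;
  const std::vector<std::size_t> requests{1'000'000, 1'500'000, 4'000'000, 3'000'000};
  for (std::size_t n : requests) {
    if (n > capacity)
      capacity = std::max(n, capacity * 2);  // same rule as check_ray_buffer_capacity
    std::cout << "request " << n << " -> capacity " << capacity << '\n';
  }
  // prints: 1000000 -> 1000000, 1500000 -> 2000000, 4000000 -> 4000000, 3000000 -> 4000000
}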

@Waqar-ukaea linked an issue Nov 13, 2025 that may be closed by this pull request
@Waqar-ukaea (Collaborator, Author) commented Nov 19, 2025

I've re-run my ray throughput benchmark miniapp with a larger model, and I've also moved the timing region to be around the actual raygen launch so that it purely times the ray tracing computation, rather than inadvertently also including the time taken to transfer ray buffers to the device and hit buffers back to the host.
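Roughly, the timing now looks like this (a minimal sketch: std::chrono is real, but the callable is a stand-in for whatever the miniapp launches, and buffer uploads/downloads sit outside the timed region):

#include <chrono>

// Times only the callable it is given; in the miniapp this would be the raygen
// launch plus a device synchronisation, with host<->device transfers excluded.
template <typename LaunchAndSync>
double trace_only_seconds(LaunchAndSync launch_and_sync)
{
  const auto t0 = std::chrono::steady_clock::now();
  launch_and_sync();  // raygen launch + wait for completion
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}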

I am running this on my local machine, which contains an NVIDIA RTX 2000 Ada mobile GPU (used by GPRT) and a 13th Gen Intel® Core™ i7-13850HX (used by Embree). Spec screenshots for the RTX 2000 Ada and the i7-13850HX:

(images: chip specification screenshots)

A render of the simple_tokamak model [1] used in these preliminary benchmarks is shown in the image below, along with a depiction of the rays launched (a significantly smaller number of rays is plotted) and the volume queried against highlighted in blue:
(image: ray_benchmark_tokamak_setup)

Benchmark parameters

| Model | Volume | No. of Elements | No. of Rays | Location |
| --- | --- | --- | --- | --- |
| simple_tokamak | 2 | 280K | 50M | (180, 250, -27) |

Ray tracer performance

Note - GPRT results do not take device buffer IO into account, so the comparison to Embree is not entirely like-for-like.

| Ray Tracer backend | Ray tracing Wall Time (sec) | Ray Tracing Throughput (ray/sec) | Speedup vs Embree |
| --- | --- | --- | --- |
| Embree | 10.7318 | 4.65905e+06 | 1× (baseline) |
| GPRT (FP64) | 0.762942 | 6.55358e+07 | ~14.1× |
| GPRT (FP32) + RT cores | 0.0252805 | 1.97781e+09 | ~424.5× |

So we already see a fairly significant speedup in pure ray tracing throughput when moving from Embree to GPRT whilst maintaining the full mixed-precision algorithm. However, if we are able to move to fully single-precision ray tracing, we see a speedup of nearly 425×. And this isn't even a particularly powerful graphics-oriented chip, and it is a few years old now. An RTX 5090 (launched in January 2025) has around 10× the FP32 FLOPS as well as 7× the number of RT cores.

EDIT - With minimal validation layers enabled (for printf), the GPRT (FP32) + RT cores backend ends up with a throughput of ~1.36496e+09 ray/sec. So there is a slight drop-off, but not as much of a difference as I previously thought there might be.

References

[1] Valentine, A., Berry, T., Bradnam, S., Hagues, J., & Hodson, J. (2022). Benchmarking of emergent radiation transport codes for fusion neutronics applications. Fusion Engineering and Design, 180, 113197. https://doi.org/10.1016/j.fusengdes.2022.113197

@Waqar-ukaea added the Ray Tracing label (changes made to the core ray tracing interface affecting both Embree and GPRT implementations) on Nov 20, 2025
@Waqar-ukaea (Collaborator, Author) commented Nov 25, 2025

Currently working on an approach to packing rays (origins + directions) on device that doesn't involve an expensive host-to-device transfer. Right now I'm setting up some GPRT buffers in the ray-benchmark miniapp to essentially act as a "mock" for a downstream application which generates those origins + directions. Getting them to work together is already proving a little difficult, so I'm not entirely sure how it will be done when that downstream application is running with an entirely different GPU runtime.

These are implemented in the methods ray_fire_packed() and pack_external_rays().

@Waqar-ukaea (Collaborator, Author) commented Nov 26, 2025

So I have something which works, but it is significantly slower when running via the pack_external_rays() -> ray_fire_packed() path. The slowdown is happening when resizing our ray/hit buffers and rebuilding the SBT. Perhaps the added cost of having the origins[N] + directions[N] buffers alongside the rays[N] and hits[N] buffers is causing the slowdown when it comes to resizing?

@Waqar-ukaea (Collaborator, Author) commented Nov 27, 2025

Following on from the trouble outlined in the last comment, I have opted for a different route to writing ray data on device, which involves exposing the ray/hit buffers and allowing the "external" application to write to them directly. This is actually the approach I originally wanted to take, but I wasn't sure which types should be exposed in the public-facing API. However, I now have a solution which seems reasonable.

I've defined a new public-facing struct to essentially wrap the internal struct I was using to manage GPRT ray and hit buffers. The public-facing struct exposes device pointers for XDG's dblRay and dblHit types, and can be obtained from XDG by calling xdg->get_device_rayhit_buffers(N):

struct DeviceRayHitBuffers {
  dblRay* rayDevPtr; // device pointer to ray buffers
  dblHit* hitDevPtr; // device pointer to hit buffers
  uint capacity = 0;
};

Once the external application has the device pointers, they can be passed to a compute shader and written directly to.
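As a rough sketch of the intended handoff (everything below is a stand-in re-declaration for illustration; only the DeviceRayHitBuffers layout and get_device_rayhit_buffers() come from the description above, and the ray-writer dispatch is hypothetical):

#include <cassert>
#include <cstddef>

// Stand-in declarations: the real dblRay/dblHit, DeviceRayHitBuffers and
// get_device_rayhit_buffers() live in XDG; the ray-writer kernel dispatch
// belongs to the downstream application.
struct dblRay;
struct dblHit;

struct DeviceRayHitBuffers {
  dblRay* rayDevPtr;   // device pointer to ray buffer
  dblHit* hitDevPtr;   // device pointer to hit buffer
  unsigned capacity = 0;
};

DeviceRayHitBuffers get_device_rayhit_buffers(std::size_t n);  // provided by XDG
void launch_external_ray_writer(dblRay* rays, std::size_t n);  // downstream kernel dispatch (hypothetical)

void fill_rays_on_device(std::size_t n)
{
  DeviceRayHitBuffers bufs = get_device_rayhit_buffers(n);  // buffers grown to hold at least n rays
  assert(bufs.capacity >= n);
  launch_external_ray_writer(bufs.rayDevPtr, n);            // dblRay records written in place on device
  // ...the batched trace is then launched through XDG's usual entry points...
}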

Right now my testing involves the "external" compute shader being a new GPRT compute shader registered to the same GPRTContext (and underlying VkInstance) that is registered to XDG.

Of course, a real-world downstream application won't have this luxury and will likely not even be using Vulkan, so I will need to figure out a way to turn these Vulkan device pointers into something meaningful that can be passed to another GPU API - see Issue #182 for more detail on that.

However, what this does mean is that I can now compare the GPRT ray tracing against Embree more fairly. Since the device IO now makes use of the GPU rather than requiring an expensive host-to-device transfer, I will include it in the next benchmarks I run.

@Waqar-ukaea (Collaborator, Author) commented Nov 28, 2025

Another important change which I hadn't highlighted earlier is the reduction in the memory footprint of the dblRay struct by making more effective use of push constants - b92bd89. A push-constant struct can be passed to the raygen shader, which essentially defines a set of constants for that RT pipeline call. So the practical change was to move the parameters which are constant for every ray into the push constants:

struct dblRay
{
  double3 origin;
  double3 direction;
-  double tMin; // Minimum distance for ray intersection
-  double tMax; // Maximum distance for ray intersection
  int32_t* exclude_primitives; // Optional for excluding primitives
  int32_t exclude_count;           // Number of excluded primitives
-  xdg::HitOrientation hitOrientation;
-  int volume_tree; // TreeID of the volume being queried
-  SurfaceAccelerationStructure volume_accel; // The volume accel 
};

struct dblRayFirePushConstants {
+  double tMax;
+  double tMin;
+  SurfaceAccelerationStructure volume_accel; 
+  int volume_tree;
+  xdg::HitOrientation hitOrientation;
};

This change results in a memory saving of about 32 B per ray, which, scaled up to 50M rays, is a saving of around 1.5 GB.
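A quick back-of-envelope check of that figure (field sizes here are assumptions for a typical 64-bit target: an 8 B handle for SurfaceAccelerationStructure and 4 B each for the enum and int):

#include <cstddef>

// 32 B moved out of every dblRay; at 50M rays that is the ~1.5 GB quoted above.
constexpr std::size_t bytes_moved_per_ray = 8   // tMin
                                          + 8   // tMax
                                          + 8   // volume_accel (assumed 8 B handle)
                                          + 4   // volume_tree
                                          + 4;  // hitOrientation (assumed 4 B enum)
constexpr std::size_t rays = 50'000'000;
static_assert(bytes_moved_per_ray == 32);
static_assert(bytes_moved_per_ray * rays == 1'600'000'000);  // 1.6e9 B ≈ 1.49 GiB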

@Waqar-ukaea (Collaborator, Author) commented Nov 28, 2025

Some new benchmarks after implementing the new approach to generating rays on device, and with more consistent timing usage thanks to PR #148!

Update 1: Now includes results for the new GPRT FP32 ray tracer backend.
Update 2: I have written a script that can automatically drive multiple runs of the benchmark problem
Update 3: Now have some performance results from a Sapphire-Rapids node on an HPC system too
Update 4: Now have results for Nvidia L40 with both FP64 and FP32 + RT cores
Update 5: Added entries to indicate Vulkan issues on AMD MI300X and Intel PVC cards
Update 6: Now have results for Nvidia A100

Benchmark parameters

| Model | Volume | No. of Elements | No. of Rays | Location | No. of Runs |
| --- | --- | --- | --- | --- | --- |
| simple_tokamak | 2 | 280K | 50M | (180, 250, -27) | 100 |

A render of the simple_tokamak model [1] used in these preliminary benchmarks is shown in the image below, along with a depiction of the rays launched (a significantly smaller number of rays is plotted) and the volume queried against highlighted in blue:
(image: ray_benchmark_tokamak_setup)

Ray tracer performance (trace-only)

Baseline = Embree (CPU), 2× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 112 threads

Times and throughput averaged over the 100 runs.

| Ray Tracer backend | Hardware (Threads / Device) | Trace Time (s) | Throughput (ray/s) | Speedup vs 2×8480+ (112 threads) | Peak FP32/FP64 (TFLOPS) + RT cores |
| --- | --- | --- | --- | --- | --- |
| Embree | 13th Gen Intel® Core™ i7-13850HX × 28 threads | 0.749128 | 1.06791e+08 | ~0.25× | N/A |
| Embree | 1× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 56 threads | 0.367881 | 2.17462e+08 | ~0.51× | N/A |
| Embree | 2× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 112 threads | 0.18914 | 4.22967e+08 | 1× (baseline) | N/A |
| GPRT (FP64) | NVIDIA RTX 2000 Ada | 1.11261 | 7.19033e+07 | ~0.17× (baseline faster) | FP32: 12.0, FP64: 0.19, RT cores: 22 |
| GPRT (FP32 + RT cores) | NVIDIA RTX 2000 Ada | 0.0302956 | 2.64065e+09 | ~6.24× | FP32: 12.0, FP64: 0.19, RT cores: 22 |
| GPRT (FP64) | NVIDIA L40 | 0.185930 | 4.319e+08 | ~1.02× | FP32: 90.5, FP64: 1.41, RT cores: 142 |
| GPRT (FP32 + RT cores) | NVIDIA L40 | 0.008051 | 9.954e+09 | ~23.5× | FP32: 90.5, FP64: 1.41, RT cores: 142 |
| GPRT (FP64) | NVIDIA A100 | 0.183721 | 4.506e+08 | ~1.07× | FP32: 19.5, FP64: 9.7, RT cores: N/A |
| GPRT (FP64) | AMD MI300X (no Vulkan ray tracing support) | N/A | N/A | N/A | N/A |
| GPRT (FP64) | Intel Data Center GPU Max 1100 (Ponte Vecchio); Vulkan doesn't recognise PVCs as physical devices | N/A | N/A | N/A | N/A |

Next Steps - It's probably worth coming up with a more computationally intense benchmark problem. For this simple ray-throughput case, it might just have to be increasing the number of rays; however, I am memory-bound on the RTX 2000 Ada, which only has 8 GB, so I'll have to think about what would be more suitable.

Performance seems to cap out at ~4e+08 rays/sec no matter how much more theoretical performance the card has. Increasing the number of rays fired beyond 80M seems to have no positive impact on this metric.


Successfully merging this pull request may close the linked issue: Array-based ray query interface.