Adding a ray batch query API #178
Conversation
Why 40 million rays? Because that seems to be about the memory limit I can transfer in one go using Vulkan before I run into the following Vulkan error:

So I'm going to have to revisit how we are currently managing ray buffers to better handle this, because 40 million rays only uses about 50% of the 8 GB of VRAM available on my local machine.
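As a back-of-envelope check of that "~50%" figure (the per-ray size here is my estimate from the `dblRay` layout shown later in this thread, not a measured number):

```cpp
#include <cstdio>

int main() {
  // Assumed pre-slimming sizeof(dblRay): two double3s (48 B), tMin/tMax (16 B),
  // exclude pointer + count (~12 B), orientation/tree id/accel handle (~16 B),
  // plus alignment padding -> call it ~96 B. An estimate, not measured.
  const double ray_bytes = 96.0;
  const double n_rays = 40e6;
  std::printf("ray buffer alone: %.2f GiB\n",
              n_rays * ray_bytes / (1024.0 * 1024.0 * 1024.0));
  // Prints ~3.58 GiB; with the matching hit buffer on top, that lines up
  // with the "about 50% of 8 GB" observation above.
  return 0;
}
```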
@pshriwise When running OpenMC in event-based mode, does the number of particles in flight ever change? i.e. is there a need to change the number of rays launched in a single batch on the fly? Right now both the ray and hit buffers are grown on demand by `GPRTRayTracer::check_ray_buffer_capacity()`:

```cpp
void GPRTRayTracer::check_ray_buffer_capacity(const size_t N)
{
  if (N <= rayHitBuffers_.capacity) return; // current capacity is sufficient

  // Resize buffers to accommodate N rays: double the capacity or set it to N,
  // whichever is larger
  size_t newCapacity = std::max(N, rayHitBuffers_.capacity * 2);
  gprtBufferResize(context_, rayHitBuffers_.ray, newCapacity, false);
  gprtBufferResize(context_, rayHitBuffers_.hit, newCapacity, false);
  rayHitBuffers_.capacity = newCapacity;

  // Since we have resized the ray buffers, we need to update the
  // geom_data->ray pointers in all geometries too
  for (auto const& [surf, geom] : surface_to_geometry_map_) {
    DPTriangleGeomData* geom_data = gprtGeomGetParameters(geom);
    geom_data->ray = gprtBufferGetDevicePointer(rayHitBuffers_.ray);
  }

  // Update raygen data pointers
  for (auto const& [type, rayGen] : rayGenPrograms_) {
    dblRayGenData* rayGenData = gprtRayGenGetParameters(rayGen);
    rayGenData->ray = gprtBufferGetDevicePointer(rayHitBuffers_.ray);
    rayGenData->hit = gprtBufferGetDevicePointer(rayHitBuffers_.hit);
  }

  // Rebuild the SBT so the refreshed pointers take effect
  gprtBuildShaderBindingTable(context_, static_cast<GPRTBuildSBTFlags>(GPRT_SBT_GEOM | GPRT_SBT_RAYGEN));
}
```

The method resizes our ray buffers if either buffer's current capacity is insufficient for the requested ray count `N`.
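For context, here is a minimal sketch of how a batch entry point might use this guard; the upload step and the raygen lookup key are my assumptions about XDG's internals, not verbatim code:

```cpp
#include <vector>

// Hypothetical batch entry point (names are illustrative).
void GPRTRayTracer::ray_fire_batch(const std::vector<dblRay>& rays)
{
  // Grow the device-side ray/hit buffers and re-patch the SBT if needed.
  check_ray_buffer_capacity(rays.size());

  // ... stage `rays` into rayHitBuffers_.ray here (a host-to-device transfer,
  // or the on-device packing discussed later in this thread) ...

  // One raygen thread per ray; the map key RayFireType::RAY_FIRE is assumed.
  gprtRayGenLaunch1D(context_, rayGenPrograms_.at(RayFireType::RAY_FIRE), rays.size());
}
```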
I've re-run my ray throughput benchmark miniapp with a larger model, and I've also moved the timing region to wrap the actual raygen launch so that it purely times the ray tracing computation, rather than inadvertently also including the time taken to transfer ray buffers to the device and hit buffers back to the host.

I am running this on my local machine, which contains an NVIDIA RTX 2000 Ada mobile GPU (used by GPRT) and a 13th Gen Intel® Core™ i7-13850HX (used by Embree). Specs of the RTX 2000 Ada and the i7-13850HX:

A render of the simple_tokamak model [1] used in these preliminary benchmarks is shown in the image below, along with a depiction of the rays launched (a significantly smaller number of rays is plotted) and the volume queried against highlighted in blue.

**Benchmark parameters**

**Ray tracer performance**

Note - GPRT results do not take into account device buffer IO, so the comparison to Embree is a little disingenuous.

So, a fairly significant speedup is already seen in pure ray tracing throughput in moving from Embree to GPRT whilst maintaining the full mixed-precision algorithm. However, if there is a possibility that we can move to full single-precision ray tracing, we could see a speedup of nearly 425x. And this isn't even a particularly powerful graphics-oriented chip, and it is a few years old now. An RTX 5090 (launched in early January 2025) has around 10x the FP32 FLOPS as well as 7x the number of RT cores.

EDIT - With minimal validation layers enabled (for printf), the GPRT (FP32) + RT cores backend ends up with a throughput of ~1.36496e+09 rays/sec. So a slight drop-off, but not as much of a difference as I previously thought there might be.

References

[1] Valentine, A., Berry, T., Bradnam, S., Hagues, J., & Hodson, J. (2022). Benchmarking of emergent radiation transport codes for fusion neutronics applications. Fusion Engineering and Design, 180, 113197. https://doi.org/10.1016/j.fusengdes.2022.113197
Currently working on an approach to packing rays (origins + directions) on device that doesn't involve an expensive host-to-device transfer. Right now I'm setting up some GPRT buffers in the ray-benchmark miniapp to essentially act as a "mock" for a downstream application which generates those origins + directions. Getting them to work together is already proving a little difficult, so I'm not entirely sure how it will be done when that downstream application is running with an entirely different GPU runtime. These are in the methods:
So I have something which works, but it is significantly slower when running via the
Following on from the trouble I was having outlined in the last comment, I have opted for a different route to writing ray data on device, which involves exposing the ray/hit buffers and allowing the "external" application to write to them directly. This is actually the original approach I wanted to take, but I wasn't sure which types should be exposed to the public-facing API. However, I have a solution now which seems somewhat reasonable.

I've defined a new public-facing struct to essentially wrap the internal struct I was using to manage GPRT ray and hit buffers. The public-facing struct exposes device pointers for XDG's ray and hit records:

```cpp
struct DeviceRayHitBuffers {
  dblRay* rayDevPtr; // device pointer to ray buffer
  dblHit* hitDevPtr; // device pointer to hit buffer
  uint capacity = 0; // current buffer capacity (number of rays)
};
```

Once the external application has the device pointers, they can be passed to a compute shader and written to directly. Right now my testing involves the "external" compute shader being a new GPRT compute shader registered to the same context. Of course, a real-world downstream application won't have this luxury and will likely not even be using Vulkan, so I will need to figure out a way to get these Vulkan device pointers into something meaningful that can be passed to another GPU API - see issue #182 for more detail on that.

However, what this does mean is that I can more fairly compare the GPRT ray tracing against Embree now. Since the device IO now makes use of the GPU rather than requiring an expensive host-to-device transfer, I will include it in the next benchmarks I run.
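As a concrete (hypothetical) picture of the hand-off described above: the accessor name and shader wiring below are stand-ins, not the actual XDG API.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for how a downstream application obtains and fills the exposed
// buffers; `expose_ray_hit_buffers()` is an assumed accessor name.
void fill_rays_on_device(GPRTRayTracer& ray_tracer, size_t n_rays)
{
  DeviceRayHitBuffers bufs = ray_tracer.expose_ray_hit_buffers();
  assert(n_rays <= bufs.capacity); // respect the exposed capacity

  // The *external* compute shader is given bufs.rayDevPtr as a parameter and
  // writes dblRay records in place, one thread per ray:
  //   rays[tid].origin    = ...;
  //   rays[tid].direction = ...;
  // In the current testing that shader is a GPRT compute program sharing
  // XDG's context; a real downstream code would instead import the pointer
  // into its own GPU runtime (see #182).
}
```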
Another important change which I hadn't highlighted earlier is the reduction in the memory footprint of the `dblRay` struct: fields that are constant across all rays in a launch have been moved out of the per-ray record and into a push constants struct.

```diff
 struct dblRay
 {
   double3 origin;
   double3 direction;
-  double tMin; // Minimum distance for ray intersection
-  double tMax; // Maximum distance for ray intersection
   int32_t* exclude_primitives; // Optional for excluding primitives
   int32_t exclude_count; // Number of excluded primitives
-  xdg::HitOrientation hitOrientation;
-  int volume_tree; // TreeID of the volume being queried
-  SurfaceAccelerationStructure volume_accel; // The volume accel
 };

 struct dblRayFirePushConstants {
+  double tMax;
+  double tMin;
+  SurfaceAccelerationStructure volume_accel;
+  int volume_tree;
+  xdg::HitOrientation hitOrientation;
 };
```

This change results in a memory saving of about
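For illustration, supplying the hoisted fields at launch time might look roughly like this; the launch-with-push-constants pattern mirrors GPRT's samples, and the exact wiring inside XDG is my assumption:

```cpp
#include <limits>

// Sketch: launch a batch with the hoisted per-launch constants (field names
// taken from the diff above; everything else here is assumed).
void launch_ray_fire(GPRTContext context, GPRTRayGen rayGen, size_t n_rays,
                     SurfaceAccelerationStructure volume_accel, int volume_tree_id)
{
  dblRayFirePushConstants pc;
  pc.tMin = 0.0;
  pc.tMax = std::numeric_limits<double>::max();
  pc.volume_accel = volume_accel;                   // accel for the queried volume
  pc.volume_tree = volume_tree_id;                  // TreeID of that volume
  pc.hitOrientation = xdg::HitOrientation::EXITING; // assumed enum value

  // Every raygen invocation reads the same pc, so these values are no longer
  // duplicated across the N-ray buffer.
  gprtRayGenLaunch1D(context, rayGen, n_rays, pc);
}
```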
Some new benchmarks after implementing the new approach to generating rays on device, and some more consistent timing usage thanks to PR #148!

Update 1: Now includes results for the new GPRT FP32 ray tracer backend.

**Benchmark parameters**

A render of the simple_tokamak model [1] used in these preliminary benchmarks is shown in the image below, along with a depiction of the rays launched (a significantly smaller number of rays is plotted) and the volume queried against highlighted in blue.

**Ray tracer performance (trace-only)**

Baseline = Embree (CPU), 2× Intel® Xeon® Platinum 8480+ (Sapphire Rapids) × 112 threads. Times and throughput averaged over the 100 runs.

Next Steps - It's probably worth coming up with a more computationally intense benchmark problem. For this simple ray-throughput case, that might just mean increasing the number of rays; however, I am memory-bound on the RTX 2000 Ada, which only has 8 GB, so I'll have to think about what could be more suitable. Performance seems to cap out at ~4e+08 rays/sec no matter how much more theoretical performance the card has; increasing the number of rays fired beyond 80M seems to have no positive impact on this metric.



This PR adds a set of methods to the XDG API that allow for launching batches of rays in one call. The main intention here is to plug into GPRT to perform GPU ray tracing at scale across the RT pipeline.
At this stage, I have a working set of overloads for `GPRTRayTracer::point_in_volume()` and `GPRTRayTracer::ray_fire()` which can be used to fire rays in large batches (a rough sketch of the intended call pattern follows the miniapp list below).

The two unit tests `test_point_in_volume` and `test_ray_fire` have been extended to also test the batch variants of the methods, but have been gated to only do so when GPRT is enabled (until I get a working implementation for Embree too).

Three new miniapps have been added to `xdg/tools`. These are:

- `batch-ray-fire` - Provide a set of origins and directions and fire rays using the batch version of `xdg::ray_fire()`
- `batch-point-in-volume` - Provide a set of points and (optionally) directions to perform point-in-volume checks using the batch version of `xdg::point_in_volume()`
- `ray-benchmark` - Provide a `.h5m` model to perform a ray tracing throughput benchmark against that model (directions are sampled from a unit sphere around the origin provided)
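For a rough sense of the intended call pattern (these signatures are my sketch of the batch overloads, not necessarily what landed in this PR; `sample_unit_sphere` is an assumed helper mirroring what the ray-benchmark miniapp does):

```cpp
#include <memory>
#include <vector>

// Hypothetical batch call pattern against the new overloads.
void run_batch(std::shared_ptr<xdg::XDG> xdg, xdg::MeshID volume, size_t n_rays)
{
  std::vector<xdg::Position> origins(n_rays, {0.0, 0.0, 0.0}); // all from the origin
  std::vector<xdg::Direction> directions = sample_unit_sphere(n_rays);
  std::vector<double> distances;
  std::vector<xdg::MeshID> hit_surfaces;

  // One call fires the whole batch (dispatched to GPRT when enabled).
  xdg->ray_fire(volume, origins, directions, distances, hit_surfaces);
}
```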
The most interesting of the new miniapps is by far `ray-benchmark`, as it allows for a direct performance comparison between our Embree and GPRT ray tracing implementations. On the `jezebel.h5m` model with 40 million rays launched from the origin and intersecting the volume, we see the following metrics.

Embree:

GPRT:

Which is a promising sign that we already have some performance increase. Hopefully upcoming PRs will widen this gap further.