Conversation

@tomcur (Member) commented Dec 31, 2025

This is a ~10% reduction in flattening time on x86, I haven't measured AArch64.

Flattening was already dispatched to have access to the SIMD witness, but it did not yet unambiguously make use of target features for codegen as the functions weren't forced to be inlined.
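
To illustrate the pattern (a minimal sketch based on the `Simd::vectorize` API used in this PR; `flatten_one` and its body are hypothetical stand-ins for the real flattening code):

```rust
use fearless_simd::Simd;

// Hypothetical SIMD-generic helper. Without #[inline(always)], the compiler
// may emit this as a standalone function compiled with only the baseline
// target features, even when it is reached through a `vectorize` call.
#[inline(always)]
fn flatten_one<S: Simd>(_simd: S, xs: &[f32]) -> f32 {
    // Placeholder body; the real code performs curve flattening.
    xs.iter().copied().sum()
}

// `vectorize` runs the closure inside a #[target_feature]-annotated function,
// so everything that inlines into the closure is codegen'd with the dispatched
// features. Forcing inlining makes that unambiguous.
fn flatten<S: Simd>(simd: S, xs: &[f32]) -> f32 {
    simd.vectorize(
        #[inline(always)]
        || flatten_one(simd, xs),
    )
}
```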

```
flatten/Ghostscript_Tiger
                        time:   [208.76 µs 209.06 µs 209.42 µs]
                        change: [-12.177% -11.979% -11.768%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
flatten/paris-30k       time:   [13.157 ms 13.202 ms 13.253 ms]
                        change: [-10.728% -10.307% -9.8772%] (p = 0.00 < 0.05)
                        Performance has improved.
```

Comment on lines -114 to +120
```diff
-let max = simd.vectorize(
-    #[inline(always)]
-    || {
-        flatten_cubic_simd(
-            simd,
-            c,
-            flatten_ctx,
-            tolerance as f32,
-            &mut flattened_cubics,
-        )
-    },
+let max = flatten_cubic_simd(
+    simd,
+    c,
+    flatten_ctx,
+    tolerance as f32,
+    &mut flattened_cubics,
```
@tomcur (Member Author)
This `vectorize` is no longer necessary, as the `flatten_cubic_simd` call gets inlined into `flatten`, which itself gets vectorized.

@LaurenzV (Collaborator) left a comment

No change on ARM in my benchmarks.

Comment on lines +98 to +101
```rust
let iter = path.into_iter().map(
    #[inline(always)]
    |el| affine * el,
);
```
Member

I'm curious whether this has any effect? I would expect this to be inlined basically always, given how small the closure is?

@tomcur (Member Author)

> I'm curious whether this has any effect?

This one more than likely has no effect, as it indeed very likely gets inlined without the attribute too, but the attribute makes the intent as unambiguous as Rust allows. This follows the suggestion in https://docs.rs/fearless_simd/0.3.0/fearless_simd/#inlining.

@tomcur (Member Author) commented Dec 31, 2025

> No change on ARM in my benchmarks.

Thanks for checking; the compiler made better inlining decisions on ARM, then!

@tomcur added this pull request to the merge queue Dec 31, 2025
Merged via the queue into linebender:main with commit a108895 Dec 31, 2025
17 checks passed
@tomcur deleted the codegen-flatten branch December 31, 2025 11:24
@DJMcNab (Member) commented Dec 31, 2025

> Thanks for checking; the compiler made better inlining decisions on ARM, then!

Unfortunately, it's even stupider than that :)

Essentially, all of the relevant aarch64 targets already unconditionally enable the `neon` feature, so the inlining actually isn't necessary at all on those targets...
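
For reference (a quick sanity check, not project code), `neon` being in the baseline feature set means a plain build already reports it as enabled:

```rust
fn main() {
    // On e.g. aarch64-unknown-linux-gnu or aarch64-apple-darwin this prints
    // `neon enabled: true` without any -C target-feature flags, because neon
    // is part of the baseline target features on those targets.
    println!("neon enabled: {}", cfg!(target_feature = "neon"));
}
```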
