Conversation

@tomcur (Member) commented Dec 31, 2025

This is a ~10% reduction in flattening time on x86, I haven't measured AArch64.

Flattening was already dispatched to have access to the SIMD witness, but it did not yet unambiguously make use of target features for codegen as the functions weren't forced to be inlined.
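
To illustrate the pattern (a minimal sketch based on the `Simd::vectorize` API used in this PR; `flatten_one` and its body are hypothetical stand-ins for the real flattening code):

```rust
use fearless_simd::Simd;

// Hypothetical SIMD-generic helper. Without #[inline(always)], the compiler
// may emit this as a standalone function compiled with only the baseline
// target features, even when it is reached through a `vectorize` call.
#[inline(always)]
fn flatten_one<S: Simd>(_simd: S, xs: &[f32]) -> f32 {
    // Placeholder body; the real code performs curve flattening.
    xs.iter().copied().sum()
}

// `vectorize` runs the closure inside a #[target_feature]-annotated function,
// so everything that inlines into the closure is codegen'd with the dispatched
// features. Forcing inlining makes that unambiguous.
fn flatten<S: Simd>(simd: S, xs: &[f32]) -> f32 {
    simd.vectorize(
        #[inline(always)]
        || flatten_one(simd, xs),
    )
}
```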

```
flatten/Ghostscript_Tiger
                        time:   [208.76 µs 209.06 µs 209.42 µs]
                        change: [-12.177% -11.979% -11.768%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
flatten/paris-30k       time:   [13.157 ms 13.202 ms 13.253 ms]
                        change: [-10.728% -10.307% -9.8772%] (p = 0.00 < 0.05)
                        Performance has improved.
```

Comment on lines -114 to +120
```diff
-let max = simd.vectorize(
-    #[inline(always)]
-    || {
-        flatten_cubic_simd(
-            simd,
-            c,
-            flatten_ctx,
-            tolerance as f32,
-            &mut flattened_cubics,
-        )
-    },
+let max = flatten_cubic_simd(
+    simd,
+    c,
+    flatten_ctx,
+    tolerance as f32,
+    &mut flattened_cubics,
```
@tomcur (Member Author)
This `vectorize` is no longer necessary, as the `flatten_cubic_simd` call gets inlined into `flatten`, which itself gets vectorized.

@LaurenzV (Collaborator) left a comment

No change on ARM in my benchmarks.

Comment on lines +98 to +101
```rust
let iter = path.into_iter().map(
    #[inline(always)]
    |el| affine * el,
);
```
Member

I'm curious whether this has any effect? I would expect this to be inlined basically always, given how small the closure is?

@tomcur (Member Author)

> I'm curious whether this has any effect?

This one more than likely has no effect, as it indeed very likely gets inlined without the attribute too, but the attribute makes the intent as unambiguous as Rust allows. This follows the suggestion in https://docs.rs/fearless_simd/0.3.0/fearless_simd/#inlining.

@tomcur (Member Author) commented Dec 31, 2025

> No change on ARM in my benchmarks.

Thanks for checking; the compiler made better inlining decisions on ARM, then!

@tomcur added this pull request to the merge queue Dec 31, 2025
Merged via the queue into linebender:main with commit a108895 Dec 31, 2025
17 checks passed
@tomcur deleted the codegen-flatten branch December 31, 2025 11:24
@DJMcNab (Member) commented Dec 31, 2025

> Thanks for checking; the compiler made better inlining decisions on ARM, then!

Unfortunately, it's even stupider than that :)

Essentially, all of the relevant aarch64 targets already unconditionally enable the `neon` feature, so the inlining actually isn't necessary at all on those targets...
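
For reference (a quick sanity check, not project code), `neon` being in the baseline feature set means a plain build already reports it as enabled:

```rust
fn main() {
    // On e.g. aarch64-unknown-linux-gnu or aarch64-apple-darwin this prints
    // `neon enabled: true` without any -C target-feature flags, because neon
    // is part of the baseline target features on those targets.
    println!("neon enabled: {}", cfg!(target_feature = "neon"));
}
```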
