Contributor

@benraha benraha commented Oct 9, 2024

Many of our solutions select a small, fixed number of rows per group, ordered by a column. DataFu currently offers two options for this: dedupTopN and dedupWithCombiner. dedupTopN uses a window function, which is inefficient because it orders all of the rows in each group, and it is very susceptible to skew. dedupWithCombiner is limited to one record per grouping, so it can't return more than one row.

This PR introduces a solution: a class that implements DeclarativeAggregate, which avoids declaring the schemas explicitly, uses the combiner to avoid skew, and takes advantage of Codegen.
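
As a rough illustration of why a combiner-style aggregate helps here (this is not the code in this PR, and it does not use Spark), the sketch below keeps a buffer of at most `n` rows per partition and then merges the small partial buffers. The names `update`, `merge`, `top_n`, and the `(ordering value, payload)` row shape are assumptions made for the example.

```python
import heapq
from typing import Iterable, List, Tuple

Row = Tuple[int, str]  # (ordering value, payload); hypothetical row shape


def update(buffer: List[Row], row: Row, n: int) -> List[Row]:
    """Fold one row into a buffer that never holds more than n rows."""
    return heapq.nsmallest(n, buffer + [row])


def merge(left: List[Row], right: List[Row], n: int) -> List[Row]:
    """Combine two partial buffers, again keeping only the n smallest rows."""
    return heapq.nsmallest(n, left + right)


def top_n(partitions: Iterable[Iterable[Row]], n: int) -> List[Row]:
    """Aggregate each partition locally, then merge the small partial results."""
    result: List[Row] = []
    for partition in partitions:
        partial: List[Row] = []
        for row in partition:
            partial = update(partial, row, n)
        result = merge(result, partial, n)
    return result


# Rows of a single group spread over three partitions; the first is skewed.
partitions = [
    [(9, "a"), (4, "b"), (7, "c"), (1, "d"), (8, "e")],
    [(3, "f")],
    [(6, "g"), (2, "h")],
]
result = top_n(partitions, 2)
print(result)  # [(1, 'd'), (2, 'h')]
```

In this sketch no buffer ever grows beyond `n` rows, so a very large group contributes at most `n` rows per partition to the final merge, whereas a window function has to order every row of the group.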

@benraha benraha changed the title from "Added the first version of collectNumberOrderedElements" to "Added collectNumberOrderedElements" Oct 9, 2024
Contributor

eyala commented Dec 9, 2024

Did you specify that DataFu build with Spark 3.3 or 3.4? I think your PR assumes a newer interface of DeclarativeAggregate than what we have currently, and that's why the build is failing in our CI.

I'm planning on pushing code that will upgrade us to these versions, so that will probably make your PR pass tests.

Contributor

eyala commented Jan 23, 2025

Looks like after the last commit our CI passes, so now this works for Spark 3.0.x - 3.4.x. I'll merge it in.

Great job!

@eyala eyala closed this Jan 23, 2025
@eyala eyala reopened this Jan 23, 2025
@eyala eyala merged commit a8264f7 into apache:main Jan 23, 2025
3 checks passed