Added collectNumberOrderedElements #45

benraha · 2024-10-09T14:06:05Z

Many of our solutions select only a small, fixed number of rows per group, ordered by a column. DataFu offers two existing options: dedupTopN, which uses a window function, and dedupWithCombiner, which can take only one record per group. The window function in dedupTopN is inefficient because it orders all of the rows in each group, and it is very susceptible to skew. dedupWithCombiner does not let us take more than one row.
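To make the cost difference concrete, here is a minimal sketch in plain Python (not Spark, and not the code in this PR). It contrasts sorting every row of a group with keeping only a bounded buffer of N values. The function names, the sample rows, and the choice of descending order are assumptions made for illustration.

```python
import heapq
from collections import defaultdict

def top_n_full_sort(rows, n):
    """Window-style approach: collect and sort every row of each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Sorting costs O(m log m) per group of size m, and a skewed group
    # must be held and sorted in full.
    return {k: sorted(vs, reverse=True)[:n] for k, vs in groups.items()}

def top_n_bounded(rows, n):
    """Bounded approach: never keep more than n values per group."""
    heaps = defaultdict(list)
    for key, value in rows:
        heap = heaps[key]  # min-heap holding the n largest values seen so far
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # The new value beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, value)
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}

# Hypothetical sample data: (group key, ordering column).
rows = [("a", 5), ("a", 1), ("a", 9), ("a", 7), ("b", 2), ("b", 8)]
full = top_n_full_sort(rows, 2)
bounded = top_n_bounded(rows, 2)
```

Both functions return the same rows, but the bounded version holds at most N values per group at any time, which is the property that matters when one group is much larger than the rest.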

This PR introduces a solution: a class that implements DeclarativeAggregate, so that schemas do not have to be declared explicitly, and that takes advantage of the combiner (to avoid skew) and of Codegen.
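The combiner idea can be sketched as an aggregate with three steps: update a partial buffer per partition, merge partial buffers, and evaluate the final buffer. The plain-Python sketch below only illustrates that shape; it is not the DeclarativeAggregate implementation from this PR, and the function names, sample partitions, and descending order are assumptions.

```python
from functools import reduce

def update(buffer, value, n):
    """Fold one input value into a partial buffer, keeping at most n values."""
    return sorted(buffer + [value], reverse=True)[:n]

def merge(left, right, n):
    """Combine two partial buffers; the result again holds at most n values."""
    return sorted(left + right, reverse=True)[:n]

def evaluate(buffer):
    """Produce the final, ordered output from the merged buffer."""
    return sorted(buffer, reverse=True)

def aggregate_partition(values, n):
    """Map-side (combiner) step: reduce one partition to a buffer of size <= n."""
    return reduce(lambda buf, v: update(buf, v, n), values, [])

# Hypothetical data for one group, spread over three partitions.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
n = 3

partials = [aggregate_partition(p, n) for p in partitions]
result = evaluate(reduce(lambda a, b: merge(a, b, n), partials, []))
```

Because each partition ships at most N values to the merge step, a heavily skewed group no longer forces all of its rows onto a single task.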

eyala · 2024-12-09T11:33:41Z

Did you set DataFu to build with Spark 3.3 or 3.4? I think your PR assumes a newer DeclarativeAggregate interface than the one we currently have, which is why the build is failing in our CI.

I'm planning on pushing code that will upgrade us to these versions, so that will probably make your PR pass tests.

eyala · 2025-01-23T08:16:08Z

Looks like after the last commit our CI passes, so now this works for Spark 3.0.x - 3.4.x. I'll merge it in.

Great job!

Added the first version of collectNumberOrderedElements

17e59dd

benraha changed the title ~~Added the first version of collectNumberOrderedElements~~ Added collectNumberOrderedElements Oct 9, 2024

Rahamim, Ben added 3 commits January 8, 2025 09:31

Added another sort to the final output of the udaf, and a test

e50f32b

Forgot to add the sorting

1b07649

Removing the override modifier from withNewChildrenInternal

a3bf2c4

eyala closed this Jan 23, 2025

eyala reopened this Jan 23, 2025

eyala merged commit a8264f7 into apache:main Jan 23, 2025
3 checks passed
