
Improving dataframe join performance #110

Open
mannharleen opened this issue Jan 6, 2020 · 2 comments

@mannharleen

A couple of things:

  • add a benchmark test for dataframe joins
  • the join currently only supports a nested loop strategy; implement other algorithms, e.g. a hash join or maybe even a merge join? (a rough hash-join sketch follows below)

Looking for inputs here.
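
For illustration, here is a minimal sketch of what a hash join over plain Go slices could look like. The `row` type and `hashJoin` helper are hypothetical and not part of gota's API; this is only meant to show the build/probe structure and why it scales better than a nested loop.

```go
package main

import "fmt"

// row is a hypothetical representation of a dataframe row, keyed by column name.
type row map[string]interface{}

// hashJoin performs an inner join of left and right on the given key column.
// Build phase: index the right-hand rows by join key (O(m)).
// Probe phase: look up each left-hand row once (O(n)).
// Overall O(n+m), versus O(n*m) for a nested loop join.
func hashJoin(left, right []row, key string) []row {
	index := make(map[interface{}][]row, len(right))
	for _, r := range right {
		index[r[key]] = append(index[r[key]], r)
	}

	var out []row
	for _, l := range left {
		for _, r := range index[l[key]] {
			joined := make(row, len(l)+len(r))
			for c, v := range r {
				joined[c] = v
			}
			for c, v := range l {
				joined[c] = v
			}
			out = append(out, joined)
		}
	}
	return out
}

func main() {
	left := []row{{"id": 1, "a": "x"}, {"id": 2, "a": "y"}}
	right := []row{{"id": 2, "b": "z"}}
	fmt.Println(hashJoin(left, right, "id")) // [map[a:y b:z id:2]]
}
```

A merge join would instead sort both inputs on the key and advance two cursors in lockstep, which pays off when the inputs are already sorted.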

mannharleen changed the title from "improving dataframe join performance" to "Improving dataframe join performance" on Jan 6, 2020

JKOK005 commented Jun 15, 2021

This seems like a valid problem.

I ran a brief benchmark joining two dataframes of 43K rows each. The join columns contain unique values, meaning there is at most a single match between a row in df_A and a row in df_B.

The inner join in go-gota took 37.68s.

In contrast, the same join executed with pandas in Python took barely 1s.

From the looks of the present implementation, go-gota indeed uses a nested loop join, which is O(n·m) and can be inefficient for large datasets.

Can I check whether there is a roadmap to address this issue? If not, would it be possible for me to try and submit a PR implementing hash join and merge join? I believe those would speed up join performance considerably.
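
For reference, here is roughly how such a benchmark could be written with Go's testing package. The gota calls (dataframe.LoadRecords, DataFrame.InnerJoin) reflect my reading of the current API, so treat the exact signatures as assumptions.

```go
package dataframe_test

import (
	"strconv"
	"testing"

	"github.com/go-gota/gota/dataframe"
)

// buildRecords creates n rows with a unique "id" column and one payload column,
// mirroring the unique-key setup of the benchmark described above.
func buildRecords(n int, payload string) [][]string {
	records := [][]string{{"id", payload}}
	for i := 0; i < n; i++ {
		records = append(records, []string{strconv.Itoa(i), payload + strconv.Itoa(i)})
	}
	return records
}

func BenchmarkInnerJoin(b *testing.B) {
	dfA := dataframe.LoadRecords(buildRecords(43000, "a"))
	dfB := dataframe.LoadRecords(buildRecords(43000, "b"))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = dfA.InnerJoin(dfB, "id")
	}
}
```

Running it with `go test -bench InnerJoin -benchtime 1x` would mirror the single-shot measurement quoted above.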

Thanks

chrmang (Contributor) commented Jun 19, 2021

Hi,

feel free to open a PR. Thank you for contributing.

Chris
