
Improving dataframe join performance #110

Open
mannharleen opened this issue Jan 6, 2020 · 2 comments

@mannharleen

A couple of things:

  • add a benchmark test for dataframe joins
  • the join currently only supports a nested loop strategy; implement other algorithms, e.g. a hash join or maybe even a merge join? (a rough hash-join sketch follows below)

Looking for inputs here.
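
For illustration, here is a minimal sketch of what a hash join over plain Go slices could look like. The `row` type and `hashJoin` helper are hypothetical and not part of gota's API; this is only meant to show the build/probe structure and why it scales better than a nested loop.

```go
package main

import "fmt"

// row is a hypothetical representation of a dataframe row, keyed by column name.
type row map[string]interface{}

// hashJoin performs an inner join of left and right on the given key column.
// Build phase: index the right-hand rows by join key (O(m)).
// Probe phase: look up each left-hand row once (O(n)).
// Overall O(n+m), versus O(n*m) for a nested loop join.
func hashJoin(left, right []row, key string) []row {
	index := make(map[interface{}][]row, len(right))
	for _, r := range right {
		index[r[key]] = append(index[r[key]], r)
	}

	var out []row
	for _, l := range left {
		for _, r := range index[l[key]] {
			joined := make(row, len(l)+len(r))
			for c, v := range r {
				joined[c] = v
			}
			for c, v := range l {
				joined[c] = v
			}
			out = append(out, joined)
		}
	}
	return out
}

func main() {
	left := []row{{"id": 1, "a": "x"}, {"id": 2, "a": "y"}}
	right := []row{{"id": 2, "b": "z"}}
	fmt.Println(hashJoin(left, right, "id")) // [map[a:y b:z id:2]]
}
```

A merge join would instead sort both inputs on the key and advance two cursors in lockstep, which pays off when the inputs are already sorted.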

mannharleen changed the title from "improving dataframe join performance" to "Improving dataframe join performance" on Jan 6, 2020

JKOK005 commented Jun 15, 2021

This seems like a valid problem.

I ran a brief benchmark joining two dataframes of 43K rows each. The join columns contain unique values, meaning there is at most a single match between a row in df_A and a row in df_B.

The inner join in go-gota took 37.68s.

In contrast, the same join executed with pandas in Python took barely 1s.

From the looks of the present implementation, go-gota indeed uses a nested loop join, which is O(n·m) and can be inefficient for large datasets.

Can I check whether there is a roadmap to address this issue? If not, would it be possible for me to try and submit a PR implementing hash join and merge join? I believe those would speed up join performance considerably.
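
For reference, here is roughly how such a benchmark could be written with Go's testing package. The gota calls (dataframe.LoadRecords, DataFrame.InnerJoin) reflect my reading of the current API, so treat the exact signatures as assumptions.

```go
package dataframe_test

import (
	"strconv"
	"testing"

	"github.com/go-gota/gota/dataframe"
)

// buildRecords creates n rows with a unique "id" column and one payload column,
// mirroring the unique-key setup of the benchmark described above.
func buildRecords(n int, payload string) [][]string {
	records := [][]string{{"id", payload}}
	for i := 0; i < n; i++ {
		records = append(records, []string{strconv.Itoa(i), payload + strconv.Itoa(i)})
	}
	return records
}

func BenchmarkInnerJoin(b *testing.B) {
	dfA := dataframe.LoadRecords(buildRecords(43000, "a"))
	dfB := dataframe.LoadRecords(buildRecords(43000, "b"))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = dfA.InnerJoin(dfB, "id")
	}
}
```

Running it with `go test -bench InnerJoin -benchtime 1x` would mirror the single-shot measurement quoted above.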

Thanks

chrmang (Contributor) commented Jun 19, 2021

Hi,

feel free to open a PR. Thank you for contributing.

Chris
