between_time api for koalas dataframe #1968

Closed
wants to merge 9 commits into from

shril
Contributor

@shril shril commented Dec 12, 2020

Implement Koalas Missing APIs #1929

@codecov-io

codecov-io commented Dec 12, 2020

Codecov Report

Merging #1968 (b81afcc) into master (b65891d) will increase coverage by 0.00%.
The diff coverage is 96.87%.

@@           Coverage Diff           @@
##           master    #1968   +/-   ##
=======================================
  Coverage   94.60%   94.60%           
=======================================
  Files          49       50    +1     
  Lines       10890    10905   +15     
=======================================
+ Hits        10302    10317   +15     
  Misses        588      588           
Impacted Files                                         Coverage Δ
databricks/koalas/config.py                            99.00% <ø> (ø)
databricks/koalas/plot/plotly.py                       94.73% <94.73%> (ø)
databricks/koalas/plot/core.py                         91.72% <95.23%> (-1.05%) ⬇️
databricks/koalas/frame.py                             96.79% <100.00%> (+0.04%) ⬆️
databricks/koalas/series.py                            96.92% <100.00%> (+0.01%) ⬆️
...bricks/koalas/tests/plot/test_frame_plot_plotly.py  100.00% <100.00%> (ø)
...ricks/koalas/tests/plot/test_series_plot_plotly.py  96.61% <100.00%> (+0.31%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b65891d...8bae109. Read the comment docs.

@shril
Contributor Author

shril commented Dec 12, 2020

@ueshin, @HyukjinKwon can you take a look at this PR?

-------
values_between_time : array of integers
"""
return self.index.to_pandas().indexer_between_time(
Member

@shril can we avoid calling to_pandas()? It brings all the data from the other nodes onto the single client node, which can easily cause an OOM.
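
For illustration, the time-of-day filter behind indexer_between_time() can be written as a per-row predicate, which is the kind of logic that could be evaluated where the data lives instead of on a collected index. The following is a minimal stdlib-only sketch of those semantics. The function name is_between_time and its parameters are hypothetical and are not part of the Koalas or pandas API.

```python
from datetime import datetime, time


def is_between_time(ts, start, end, include_start=True, include_end=True):
    """Hypothetical helper: True if the time of day of `ts` lies in [start, end]."""
    t = ts.time()
    after_start = start <= t if include_start else start < t
    before_end = t <= end if include_end else t < end
    if start <= end:
        # Ordinary window, e.g. 09:00 to 17:00: both bounds must hold.
        return after_start and before_end
    # Window that wraps past midnight, e.g. 22:00 to 02:00: either bound suffices.
    return after_start or before_end


timestamps = [
    datetime(2020, 12, 12, 0, 30),
    datetime(2020, 12, 12, 9, 0),
    datetime(2020, 12, 12, 13, 15),
    datetime(2020, 12, 12, 23, 45),
]

# Positions whose time of day falls within 09:00 to 14:00 (bounds included).
daytime = [
    i for i, ts in enumerate(timestamps)
    if is_between_time(ts, time(9, 0), time(14, 0))
]

# Positions within a window that crosses midnight, 23:00 to 01:00.
overnight = [
    i for i, ts in enumerate(timestamps)
    if is_between_time(ts, time(23, 0), time(1, 0))
]
```

Because the predicate only looks at one timestamp at a time, it needs no global view of the index; how it would be mapped onto Spark columns is left open here.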

Contributor Author

@HyukjinKwon, my idea here was to convert only the index column to pandas, since pandas has API support for DatetimeIndex, including the indexer_between_time() function.

If you want, I can start on DatetimeIndex support in Koalas first and then come back to this PR.

Contributor

@itholic itholic Dec 17, 2020

Thanks for the contribution, @shril. 😄

Yeah, I'd say we need DatetimeIndex support in Koalas rather than relying on pandas', because converting even just the index column to pandas can also easily cause an OOM.

Contributor Author

@itholic, I'll start a new PR for DatetimeIndex.
@HyukjinKwon can we leave this PR open until I finish the other PR?
Thanks. :)

@vmdhhh vmdhhh Dec 18, 2020

I am fairly new to this, so I am sorry if this sounds naive, but doesn't the is_all_dates function in Koalas already provide support for DatetimeIndex? @itholic @shril

Contributor

@itholic itholic Dec 18, 2020

@shril,
Sure, we can leave this PR as it is and revisit it after DatetimeIndex is finished! :D

@vmdhhh,
Thanks for your interest in Koalas!!
What looks like a DatetimeIndex in Koalas is currently not a DatetimeIndex, but just an Index.
It is shown as DatetimeIndex only when calling repr, as below.

>>> from datetime import datetime
>>> import databricks.koalas as ks
>>> idx = ks.Index([datetime(2019, 1, 1, 0, 0, 0), datetime(2019, 2, 3, 0, 0, 0)])
>>> idx  # Its repr is `DatetimeIndex` because we internally convert it to pandas and use pandas' repr.
DatetimeIndex(['2019-01-01', '2019-02-03'], dtype='datetime64[ns]', freq=None)

>>> type(idx)  # So it is actually an instance of `Index`, not `DatetimeIndex`.
<class 'databricks.koalas.indexes.Index'>
>>> type(idx.to_pandas())
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

is_all_dates only checks whether the data type of the Index is Spark's TimestampType:

return isinstance(self.spark.data_type, TimestampType)

Thank you @itholic for clearing this up.

@xinrong-meng
Contributor

between_time is now implemented: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.between_time.html. May I close this pull request?

6 participants