
fix(hf): use proper source when we create a file entry #555

Merged: 7 commits into main from fix-554 on Nov 1, 2024

Conversation

@shcheklein (Member) commented Oct 30, 2024

Fixes #554

Need to understand the reason why we didn't have source set properly in the first place. @dberenbaum might know better (please share if you remember anything about this).

TODO:

  • Add / improve / review tests

@shcheklein self-assigned this Oct 30, 2024
@shcheklein added the bug (Something isn't working) label Oct 30, 2024

codecov bot commented Oct 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.87%. Comparing base (a516c94) to head (725c4d0).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #555   +/-   ##
=======================================
  Coverage   87.87%   87.87%           
=======================================
  Files          96       96           
  Lines        9873     9873           
  Branches     1349     1349           
=======================================
  Hits         8676     8676           
  Misses        857      857           
  Partials      340      340           
Flag Coverage Δ
datachain 87.81% <100.00%> (ø)

Flags with carried forward coverage won't be shown.


@dberenbaum (Contributor) left a comment

I can't recall if there was a specific reason for excluding the field here, but at the time it was inferred anyway, so it wasn't needed. I'm not sure what has changed since that caused the failure. As noted here and in the original PR, it would be useful to have a test for this client to guard against these regressions, but otherwise this fix LGTM.


cloudflare-workers-and-pages bot commented Oct 31, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 725c4d0
Status: ✅  Deploy successful!
Preview URL: https://59a8060b.datachain-documentation.pages.dev
Branch Preview URL: https://fix-554.datachain-documentation.pages.dev


@shcheklein (Member Author)

@dberenbaum thanks!

@lhoestq added an example that I'll then wrap as a test - TAL. It analyzes your datasets using OpenAI and returns Pydantic objects that can then be saved back to HF (CSV is not ideal - it loses structure; I'll try Parquet tomorrow since it can preserve the nested Pydantic structure that is "native" for DataChain). That said, the CSV can also be read back (with flat, normalized fields) and manipulated further.
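
For context, a rough sketch of that flow (the dataset id, column names, prompt, and the exact DataChain/OpenAI entry points below are my own assumptions, not the example file from this PR):

```python
from openai import OpenAI
from pydantic import BaseModel

from datachain import DataChain

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


class Review(BaseModel):
    sentiment: str
    summary: str


def analyze(text: str) -> Review:
    # Ask the model for a structured judgement about a single row.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": f"Classify and summarize:\n\n{text}"}],
        response_format=Review,
    )
    return completion.choices[0].message.parsed


(
    DataChain.from_hf("some-org/some-dataset", split="train")  # hypothetical dataset
    .map(review=analyze, params=["text"], output=Review)       # Pydantic signal column
    .to_parquet("analyzed.parquet")                            # Parquet keeps the nested schema
)
```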

@lhoestq (Contributor) left a comment

Thanks for the quick fix! The example is great as well :)
Let me know once the fix is released! You should share the demo online as well, I'd be happy to re-share/amplify.

btw I added some comments in the example but feel free to ignore

@shcheklein force-pushed the fix-554 branch 2 times, most recently from 8c34b87 to 74486e0, October 31, 2024 23:47
@shcheklein (Member Author)

@lhoestq it's simpler and HF end-to-end now. Any other ideas on how we can make it more interesting / powerful? :)

(I still need to wrap this example as a test - it will take a bit of time - so I'm keeping it as a draft)

@shcheklein removed the request for review from lhoestq, November 1, 2024 17:38
@shcheklein marked this pull request as ready for review, November 1, 2024 17:38
@shcheklein merged commit 7ba4477 into main, Nov 1, 2024
38 checks passed
@shcheklein deleted the fix-554 branch, November 1, 2024 17:38
@@ -0,0 +1,59 @@
from huggingface_hub import InferenceClient
@mattseddon (Contributor) commented Nov 3, 2024

@shcheklein (Member Author)

yep, I'm looking into this ... probably because we run it in parallel and that's a bit too much for a real e2e test ... I'll see if we can run it on a single machine and a single Python process

@lhoestq (Contributor) commented Nov 4, 2024

Great! I can share the example on social media tomorrow if it sounds good to you :)

What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning? Let me know if I can help with some ideas or existing datasets / use cases for those

@shcheklein (Member Author)

Great! I can share the example on social media tomorrow if it sounds good to you

Yes, absolutely. That would be amazing! Let me know when you do this - we'll promote it through our channels as well.

What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning? Let me know if I can help with some ideas or existing datasets / use cases for those

Yes. I was looking into Datasets a bit more to better understand how the tools can complement each other.

Thoughts so far:

  • HF datasets (and especially the Hub) serve as an excellent collaboration layer for DataChain! Publishing DC datasets to HF as Parquet files and reading them back (while perfectly preserving the Pydantic structure, etc.) gives versioning, nice visualization, the ability for someone else to read them, etc. (see the sketch after this list).
  • AFAIU datasets is Arrow-based and doesn't have its own engine (I mean SQL in this case), vs DataChain (open source). Does that mean it doesn't have out-of-core merge / group by? Can we find some interesting scenarios where we use DataChain to merge a few datasets, for example?
  • Another differentiator, AFAIU - DC doesn't embed binary objects; it stores paths to objects in the cloud and can stream them to a UDF when it is executed. I wonder if this can help with some use case? Have you seen datasets that have metadata on HF but objects stored outside?
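
A minimal sketch of the round-trip from the first bullet and the merge question from the second (repo ids, file paths, and the join key are hypothetical; the exact DataChain behavior here is my assumption):

```python
from datachain import DataChain

# Publish a saved DataChain dataset to the Hub as Parquet; nested Pydantic
# fields survive because Parquet preserves the nested schema.
DataChain.from_dataset("my-analysis").to_parquet(
    "hf://datasets/some-org/my-analysis/data.parquet"  # hypothetical repo
)

# A collaborator reads it back, pulls in another dataset, and lets DataChain merge them.
meta = DataChain.from_parquet("hf://datasets/some-org/my-analysis/data.parquet")
scores = DataChain.from_parquet("hf://datasets/some-org/llm-scores/data.parquet")
merged = meta.merge(scores, on="id")  # hypothetical join key
merged.to_parquet("hf://datasets/some-org/merged/data.parquet")
```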

What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning?

any examples off the top of your head btw?

@lhoestq (Contributor) commented Nov 5, 2024

It means datasets doesn't have out-of-core merge / group by? Can we find some interesting scenarios where we use DataChain to merge a few datasets, for example?

Yes indeed, datasets supports larger-than-memory datasets via memory mapping by default, and there is also a streaming option. So far the only option for group_by/merge has been to do ds.to_polars().
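
For context, that escape hatch looks roughly like this (the dataset id and column name are placeholders):

```python
from datasets import load_dataset

# load_dataset memory-maps the Arrow data; to_polars() hands it to Polars,
# which is where group_by / joins currently happen.
ds = load_dataset("some-org/some-dataset", split="train")  # placeholder id
df = ds.to_polars()

counts = df.group_by("label").len()  # e.g. number of rows per label
print(counts)
```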

DC stores paths to objects in the cloud and can stream them to a UDF when it is executed. I wonder if this can help with some use case? Have you seen datasets that have metadata on HF but objects stored outside?

It can help for sure. Most datasets contain the bytes themselves, but some of them have separate metadata and media files (audio/image/video), both on HF. There is also the rare case of storing image URLs.

any examples off the top of your head btw?

Some examples: LLM for text extraction from scraped web pages, LLM as a judge to filter text based on quality, dataset generation using an LLM and personas for diversity
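
As one concrete shape for the "LLM as a judge" idea, a hedged sketch built on huggingface_hub's InferenceClient (the model id, prompt, and threshold are made up):

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # picks up HF_TOKEN from the environment


def quality_score(text: str) -> int:
    """Ask an LLM judge to rate a sample from 1 (junk) to 5 (high quality)."""
    response = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate the quality of the following text from 1 to 5. "
                f"Answer with a single digit only.\n\n{text}"
            ),
        }],
        max_tokens=2,
    )
    reply = response.choices[0].message.content.strip()
    return int(reply[0]) if reply[:1].isdigit() else 1


# Keep only samples the judge considers good enough.
samples = ["first candidate text ...", "second candidate text ..."]
kept = [s for s in samples if quality_score(s) >= 4]
```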

@lhoestq (Contributor) commented Nov 5, 2024

Just shared the example on X and LinkedIn @shcheklein!

@dmpetrov (Member) commented Nov 5, 2024

Thank you, @lhoestq, for sharing and for your valuable feedback!

skshetry pushed a commit that referenced this pull request Nov 14, 2024
* fix(hf): use proper source when we create a file entry

* add more details to the unsupported PyArrow type message

* add example: HF -> OpenAI -> HF -> analyze

* use HF inference endpoint

Co-authored-by: Quentin Lhoest <[email protected]>

* use to_parquet / from_parquet to preserve schema

* add a bit of comments, fix them

* use HF_TOKEN to run e2e HF example

---------

Co-authored-by: Quentin Lhoest <[email protected]>
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

FileNotFoundError when reading from Hugging Face
6 participants