
fix(hf): use proper source when we create a file entry #555

Merged: 7 commits into main from fix-554 on Nov 1, 2024

Conversation

@shcheklein (Member) commented Oct 30, 2024

Fixes #554

Need to understand the reason why we didn't have source set properly in the first place. @dberenbaum might know better (please share if you remember anything about this).

TODO:

  • Add / improve / review tests

@shcheklein self-assigned this Oct 30, 2024
@shcheklein added the bug (Something isn't working) label Oct 30, 2024

codecov bot commented Oct 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.87%. Comparing base (a516c94) to head (725c4d0).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #555   +/-   ##
=======================================
  Coverage   87.87%   87.87%           
=======================================
  Files          96       96           
  Lines        9873     9873           
  Branches     1349     1349           
=======================================
  Hits         8676     8676           
  Misses        857      857           
  Partials      340      340           
Flag Coverage Δ
datachain 87.81% <100.00%> (ø)

Flags with carried forward coverage won't be shown.


@dberenbaum (Contributor) left a comment

I can't recall if there was a specific reason for excluding the field here, but at the time it was inferred anyway, so it wasn't needed. I'm not sure what has changed since that caused the failure. As noted here and in the original PR, it would be useful to have a test for this client to guard against these regressions, but otherwise this fix LGTM.


cloudflare-workers-and-pages bot commented Oct 31, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 725c4d0
Status: ✅  Deploy successful!
Preview URL: https://59a8060b.datachain-documentation.pages.dev
Branch Preview URL: https://fix-554.datachain-documentation.pages.dev


@shcheklein (Member Author)

@dberenbaum thanks!

@lhoestq added an example that I'll then wrap as a test - TAL. It analyzes your datasets using OpenAI and returns Pydantic objects that can then be saved back to HF (CSV is not ideal - it loses structure; I'll try Parquet tomorrow since it can preserve the nested Pydantic structure that is "native" for DataChain). That said, the CSV can also be read back (with flat, normalized fields) and manipulated further.
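
For context, a rough sketch of that flow (the dataset id, column names, prompt, and the exact DataChain/OpenAI entry points below are my own assumptions, not the example file from this PR):

```python
from openai import OpenAI
from pydantic import BaseModel

from datachain import DataChain

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


class Review(BaseModel):
    sentiment: str
    summary: str


def analyze(text: str) -> Review:
    # Ask the model for a structured judgement about a single row.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": f"Classify and summarize:\n\n{text}"}],
        response_format=Review,
    )
    return completion.choices[0].message.parsed


(
    DataChain.from_hf("some-org/some-dataset", split="train")  # hypothetical dataset
    .map(review=analyze, params=["text"], output=Review)       # Pydantic signal column
    .to_parquet("analyzed.parquet")                            # Parquet keeps the nested schema
)
```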

@lhoestq (Contributor) left a comment

Thanks for the quick fix! The example is great as well :)
Let me know once the fix is released! You should share the demo online as well, I'd be happy to re-share/amplify.

btw I added some comments in the example but feel free to ignore

@shcheklein force-pushed the fix-554 branch 2 times, most recently from 8c34b87 to 74486e0, October 31, 2024 23:47
@shcheklein (Member Author)

@lhoestq it's simpler and HF end-to-end now. Any other ideas on how we can make it more interesting / powerful? :)

(I still need to wrap this example as a test - it will take a bit of time - so I'm keeping it as a draft)

@shcheklein removed the request for review from lhoestq, November 1, 2024 17:38
@shcheklein marked this pull request as ready for review, November 1, 2024 17:38
@shcheklein merged commit 7ba4477 into main, Nov 1, 2024
38 checks passed
@shcheklein deleted the fix-554 branch, November 1, 2024 17:38
@@ -0,0 +1,59 @@
from huggingface_hub import InferenceClient
@mattseddon (Contributor) commented Nov 3, 2024

@shcheklein (Member Author)

yep, I'm looking into this ... probably because we run it in parallel and that's a bit too much for a real e2e test ... I'll see if we can run it on a single machine and a single Python process

@lhoestq (Contributor) commented Nov 4, 2024

Great! I can share the example on social media tomorrow if it sounds good to you :)

What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning? Let me know if I can help with some ideas or existing datasets / use cases for those

@shcheklein (Member Author)

Great! I can share the example on social media tomorrow if it sounds good to you

Yes, absolutely. That would be amazing! Let me know when you do this - we'll promote it through our channels as well.

What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning? Let me know if I can help with some ideas or existing datasets / use cases for those

Yes. I was looking into Datasets a bit more to better understand how the tools can complement each other.

Thoughts so far:

  • HF datasets (and especially the Hub) serve as an excellent collaboration layer for DataChain! Publishing DC datasets to HF as Parquet files and reading them back (while perfectly preserving the Pydantic structure, etc.) gives versioning, nice visualization, the ability for someone else to read them, etc. (see the sketch after this list).
  • AFAIU datasets is Arrow-based and doesn't have its own engine (I mean SQL in this case), vs DataChain (open source). Does that mean it doesn't have out-of-core merge / group by? Can we find some interesting scenarios where we use DataChain to merge a few datasets, for example?
  • Another differentiator, AFAIU - DC doesn't embed binary objects; it stores paths to objects in the cloud and can stream them to a UDF when it is executed. I wonder if this can help with some use case? Have you seen datasets that have metadata on HF but objects stored outside?
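
A minimal sketch of the round-trip from the first bullet and the merge question from the second (repo ids, file paths, and the join key are hypothetical; the exact DataChain behavior here is my assumption):

```python
from datachain import DataChain

# Publish a saved DataChain dataset to the Hub as Parquet; nested Pydantic
# fields survive because Parquet preserves the nested schema.
DataChain.from_dataset("my-analysis").to_parquet(
    "hf://datasets/some-org/my-analysis/data.parquet"  # hypothetical repo
)

# A collaborator reads it back, pulls in another dataset, and lets DataChain merge them.
meta = DataChain.from_parquet("hf://datasets/some-org/my-analysis/data.parquet")
scores = DataChain.from_parquet("hf://datasets/some-org/llm-scores/data.parquet")
merged = meta.merge(scores, on="id")  # hypothetical join key
merged.to_parquet("hf://datasets/some-org/merged/data.parquet")
```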

What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning?

any examples off the top of your head btw?

@lhoestq (Contributor) commented Nov 5, 2024

It means datasets doesn't have out-of-core merge / group by? Can we find some interesting scenarios where we use DataChain to merge a few datasets, for example?

Yes indeed, datasets supports larger-than-memory datasets via memory mapping by default, and there is also a streaming option. So far the only option for group_by/merge has been to do ds.to_polars().
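
For context, that escape hatch looks roughly like this (the dataset id and column name are placeholders):

```python
from datasets import load_dataset

# load_dataset memory-maps the Arrow data; to_polars() hands it to Polars,
# which is where group_by / joins currently happen.
ds = load_dataset("some-org/some-dataset", split="train")  # placeholder id
df = ds.to_polars()

counts = df.group_by("label").len()  # e.g. number of rows per label
print(counts)
```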

DC stores paths to objects in the cloud and can stream them to a UDF when it is executed. I wonder if this can help with some use case? Have you seen datasets that have metadata on HF but objects stored outside?

It can help for sure. Most datasets contain the bytes themselves, but some of them have separate metadata and media files (audio/image/video), both on HF. There is also the rare case of storing image URLs.

any examples off the top of your head btw?

Some examples: LLM for text extraction from scraped web pages, LLM as a judge to filter text based on quality, dataset generation using an LLM and personas for diversity
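
As one concrete shape for the "LLM as a judge" idea, a hedged sketch built on huggingface_hub's InferenceClient (the model id, prompt, and threshold are made up):

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # picks up HF_TOKEN from the environment


def quality_score(text: str) -> int:
    """Ask an LLM judge to rate a sample from 1 (junk) to 5 (high quality)."""
    response = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate the quality of the following text from 1 to 5. "
                f"Answer with a single digit only.\n\n{text}"
            ),
        }],
        max_tokens=2,
    )
    reply = response.choices[0].message.content.strip()
    return int(reply[0]) if reply[:1].isdigit() else 1


# Keep only samples the judge considers good enough.
samples = ["first candidate text ...", "second candidate text ..."]
kept = [s for s in samples if quality_score(s) >= 4]
```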

@lhoestq (Contributor) commented Nov 5, 2024

Just shared the example on X and LinkedIn @shcheklein!

@dmpetrov (Member) commented Nov 5, 2024

Thank you, @lhoestq, for sharing and for your valuable feedback!

skshetry pushed a commit that referenced this pull request Nov 14, 2024
* fix(hf): use proper source when we create a file entry

* add more details to the unsupported PyArrow type message

* add example: HF -> OpenAI -> HF -> analyze

* use HF inference endpoint

Co-authored-by: Quentin Lhoest <[email protected]>

* use to_parquet / from_parquet to preserve schema

* add a bit of comments, fix them

* use HF_TOKEN to run e2e HF example

---------

Co-authored-by: Quentin Lhoest <[email protected]>
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

FileNotFoundError when reading from Hugging Face
6 participants