fix(hf): use proper source when we create a file entry #555
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main     #555   +/- ##
=======================================
  Coverage   87.87%   87.87%
=======================================
  Files          96       96
  Lines        9873     9873
  Branches     1349     1349
=======================================
  Hits         8676     8676
  Misses        857      857
  Partials      340      340
=======================================

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
I can't recall if there was a specific reason for excluding the field here, but at the time it was inferred anyway, so it wasn't needed. I'm not sure what has changed since that caused the failure. As noted here and in the original PR, it would be useful to have a test for this client to guard against these regressions, but otherwise this fix LGTM.
Deploying datachain-documentation

Latest commit: 725c4d0
Status: ✅ Deploy successful!
Preview URL: https://59a8060b.datachain-documentation.pages.dev
Branch Preview URL: https://fix-554.datachain-documentation.pages.dev
@dberenbaum thanks! @lhoestq added an example that I'll then wrap as a test - TAL. It analyzes your datasets using OpenAI and returns Pydantic objects, which can then be saved back to HF (CSV is not ideal - it loses structure; I'll try Parquet tomorrow, since it can preserve the nested Pydantic structure that is "native" for DataChain). That said, the CSV can also be read back (with flat, normalized fields) and manipulated further.
Thanks for the quick fix! The example is great as well :)
Let me know once the fix is released! You should share the demo online as well, I'd be happy to re-share/amplify
btw I added some comments in the example but feel free to ignore
Force-pushed from 8c34b87 to 74486e0
@lhoestq simpler and HF end-to-end now. Any other ideas on how we can make it more interesting / powerful? :) (I still need to wrap this example as a test - it will take a bit of time - thus I'm keeping it as a draft)
@@ -0,0 +1,59 @@
from huggingface_hub import InferenceClient
@shcheklein the test for this example looks flaky: https://github.com/iterative/datachain/actions/runs/11655280717/job/32449776121
yep, I'm looking into this ... probably because we run it in parallel and that's a bit too much for a real e2e test ... I'll see if we can run it on a single machine and Python version
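One way the "single machine and Python version" idea could look, sketched as a pytest guard (the helper name and the pinned matrix entry are hypothetical, not from this PR):

```python
import os
import sys

import pytest


def should_run_e2e(py: tuple, runner_os: str) -> bool:
    """Hypothetical guard: run the heavy e2e example on exactly one CI matrix
    entry (here Python 3.12 on Linux) instead of every OS/Python combination,
    so parallel runs don't hammer the real HF endpoint."""
    return tuple(py) == (3, 12) and runner_os == "Linux"


# GitHub Actions sets RUNNER_OS; default to Linux for local runs.
e2e = pytest.mark.skipif(
    not should_run_e2e(sys.version_info[:2], os.getenv("RUNNER_OS", "Linux")),
    reason="e2e HF example runs on a single matrix entry only",
)


@e2e
def test_hf_example():
    ...  # the wrapped example would go here
```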
Great! I can share the example on social media tomorrow if that sounds good to you :) What about notebook demos for LLM-based dataset operations like generation / filtering / cleaning? Let me know if I can help with some ideas or existing datasets / use cases for those
Yes, absolutely. That would be amazing! Let me know when you do this - we'll promote it through our channels as well.
yes. I was looking into Datasets a bit more to better understand how tools can complement each other. Thoughts so far:
any examples off the top of your head btw?
Yes indeed,
It can help for sure. Most datasets contain the bytes themselves, but some of them have separate metadata and media files (audio/image/video), both on HF. There is also the rare case of storing images URLs.
Some examples: LLM for text extraction from scraped web pages, LLM as a judge to filter text based on quality, dataset generation using an LLM and personas for diversity
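The "LLM as a judge" idea above can be sketched in a few lines; everything here (the YES/NO prompt convention, the stub judge) is illustrative, with the real judge being an inference call such as the `InferenceClient` used in the example:

```python
from typing import Callable, List


def parse_verdict(reply: str) -> bool:
    """Interpret the judge's answer; the judge is prompted to reply YES or NO."""
    return reply.strip().upper().startswith("YES")


def filter_by_judge(texts: List[str], judge: Callable[[str], str]) -> List[str]:
    """Keep only the texts the judge rates as high quality.

    `judge` wraps the actual LLM call (e.g. a chat completion that is asked
    "Answer only YES or NO: is this well-formed prose?").
    """
    return [t for t in texts if parse_verdict(judge(t))]


# A stub judge stands in for the model so the sketch runs offline:
stub = lambda text: "YES" if len(text.split()) > 3 else "NO"
kept = filter_by_judge(["A clean, readable paragraph of text.", "asdf"], stub)
# kept == ["A clean, readable paragraph of text."]
```

Keeping the verdict parsing separate from the network call makes the filtering logic unit-testable, which matters given the flaky-e2e discussion above.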
Just shared the example on X and LinkedIn @shcheklein !
thank you, @lhoestq, for sharing and for your valuable feedback!
* fix(hf): use proper source when we create a file entry
* add more details to the unsupported PyArrow type message
* add example: HF -> OpenAI -> HF -> analyze
* use HF inference endpoint
* use to_parquet / from_parquet to preserve schema
* add a bit of comments, fix them
* use HF_TOKEN to run e2e HF example

Co-authored-by: Quentin Lhoest <[email protected]>
Fixes #554
Need to understand why we didn't have `source` set properly in the first place. @dberenbaum might know better (please share if you remember anything about this).

TODO: