Allow using python functions instead of operators (e.g. in pre-processing pipeline) #1845
Conversation
except (OSError, TypeError):
    # If source is not available
    return [f"<function {d.__name__} (source unavailable)>"]
    Move,
    Set,
)
from unitxt.splitters import RenameSplits
from unitxt.struct_data_operators import LoadJson
from unitxt.test_utils.card import test_card
This very dataset is gated; I suggest using an example that is accessible to all:
Dataset 'Salesforce/xlam-function-calling-60k' is a gated dataset on the Hub. You must be authenticated to access it.
@dafnapension it is very easy to access with `huggingface-cli login`; it has almost become a standard these days.
So I easily generated a token for myself and read the dataset, and ran into a schema validation error that I suggest looking into. (This is independent of the logical bug I point out below; I actually intended to demonstrate that bug across the dataset, but hit this error before being able to read through the whole dataset.)
to_field="required",
expression="[[p for p, c in tool['parameters']['properties'].items() if 'optional' not in c['type']] for tool in tools]",
),
extract_required_parameters,
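For context, the `expression` above computes, per tool, the names of the parameters whose `type` string does not contain `'optional'`. A minimal standalone sketch of what it evaluates to (the tool shapes below are made-up illustrations, not taken from the actual dataset):

```python
# Hypothetical tool definitions shaped like the dataset's "tools" field.
tools = [
    {"parameters": {"properties": {
        "query": {"type": "str"},
        "limit": {"type": "int, optional"},
    }}},
    {"parameters": {"properties": {
        "path": {"type": "str"},
    }}},
]

# The same list comprehension used in the `expression` argument above:
# one inner list of required-parameter names per tool.
required = [
    [p for p, c in tool["parameters"]["properties"].items()
     if "optional" not in c["type"]]
    for tool in tools
]
print(required)  # [['query'], ['path']]
```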
This preprocessing is wrong (it was so also before this PR). Here is what happens in the case of two tools, as in the single example I found in https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k.

The step `Set(fields={"tools/*/parameters": {"type": "object"}}, use_deepcopy=True)` applies the deep copy only once, before invoking `dict_utils`. The latter processes the `*` component of the query by, by definition, employing a single value to assign to all paths that satisfy the query. Since this value is a dict, the very same dict is assigned to all matching paths. Thus all the tools are modified such that the subfield `parameters` of each is assigned the very same dict (the very same instance of the dict) `{"type": "object"}`.

Then, in the step `Copy(field="properties", to_field="tools/*/parameters/properties", set_every_value=True)`, `dict_utils` goes down the first tool, tracing the path `tools/0/parameters`, reaches the aforesaid dict, sees that it has no field named `properties`, adds that field (by default, not-exist-ok==True), and assigns the first tool's properties to it; now the shared dict has two fields, `"type"` and `"properties"`. Then, continuing to process the `*` in `to_field`, `dict_utils` goes down the second tool, `tools/1/parameters`, reaches the very same dict, and overwrites its `properties` field with the second tool's properties. The end result is both tools pointing to the same `properties`: the properties of the second tool.

To fix: simply remove the step `Set(fields={"tools/*/parameters": {"type": "object"}}, use_deepcopy=True)`. If you need this `type` field as a sibling of `properties`, add the following step after the above `Copy` step:

`Set(fields={"tools/*/parameters/type": "object"})`
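The aliasing problem described above can be reproduced in plain Python, independently of unitxt. This sketch only mimics the behavior of the `Set`/`Copy` steps (the operator internals are paraphrased, not the actual `dict_utils` code):

```python
import copy

# Two tools whose "parameters" will be overwritten with the *same* dict.
tools = [{"parameters": {}}, {"parameters": {}}]

# Mimic Set(fields={"tools/*/parameters": {"type": "object"}}): the value is
# deep-copied once, then that single resulting dict is assigned to every
# path matching the "*" query.
value = copy.deepcopy({"type": "object"})
for tool in tools:
    tool["parameters"] = value          # every tool shares one dict instance

# Mimic Copy(..., to_field="tools/*/parameters/properties"): writing through
# the second tool clobbers what was written through the first.
tools[0]["parameters"]["properties"] = {"a": 1}   # first tool's properties
tools[1]["parameters"]["properties"] = {"b": 2}   # overwrites the same dict

print(tools[0]["parameters"]["properties"])  # {'b': 2} -- tool 0 lost its data
assert tools[0]["parameters"] is tools[1]["parameters"]
```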
to_field="required",
expression="[[p for p, c in tool['parameters']['properties'].items() if 'optional' not in c['type']] for tool in tools]",
),
extract_required_parameters,
Copy(
    field="required",
Please see detailed comment above
Hi @elronbandel, I think I have now managed to correctly fix the jsonschema issue (rather than translating everything to 'type': 'object', as I erroneously suggested on Friday).
If you cannot find operators that fit your needs, simply use a function that modifies every instance in the data, or a function that modifies the entire stream. Both kinds of functions can be plugged in wherever unitxt requires operators, e.g. in the pre-processing pipeline.
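To illustrate the two shapes of functions described above, here is a plain-Python sketch. The function names, field names, and the exact calling convention unitxt expects are assumptions for illustration only; the point is simply that one function maps instance-to-instance while the other maps stream-to-stream:

```python
from typing import Any, Dict, Iterable, Iterator

def add_prefix(instance: Dict[str, Any]) -> Dict[str, Any]:
    """Instance-level function: modify a single instance dict."""
    instance["text"] = "[doc] " + instance["text"]
    return instance

def drop_empty(stream: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Stream-level function: consume and re-emit the entire stream."""
    for instance in stream:
        if instance["text"]:          # skip instances with empty text
            yield instance

# Toy data standing in for a unitxt stream of instance dicts.
data = [{"text": "hello"}, {"text": ""}]
processed = list(map(add_prefix, drop_empty(data)))
print(processed)  # [{'text': '[doc] hello'}]
```

The same composability is what makes functions attractive here: an instance function is just a `dict -> dict` mapping, while a stream function can also filter, reorder, or aggregate.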