
Allow using python functions instead of operators (e.g. in pre-processing pipeline) #1845


Open · wants to merge 6 commits into base: main

Conversation

elronbandel (Member)

If you cannot find an operator that fits your needs, simply use a function to modify every instance in the data:

        def my_function(instance, stream_name=None):
            instance["x"] += 42
            return instance

Or a function that modifies the entire stream:

        def my_other_function(stream, stream_name=None):
            for instance in stream:
                instance["x"] += 42
                yield instance

Both functions can be plugged in wherever unitxt requires operators, e.g. the pre-processing pipeline.

        except (OSError, TypeError):
            # If the source is not available
            return [f"<function {d.__name__} (source unavailable)>"]

Collaborator

Very nice, this is how the result looks in the catalog:

[image]

            Move,
            Set,
        )
        from unitxt.splitters import RenameSplits
        from unitxt.struct_data_operators import LoadJson
        from unitxt.test_utils.card import test_card


Collaborator

This dataset is gated; I suggest using an example that is accessible to all:

Dataset 'Salesforce/xlam-function-calling-60k' is a gated dataset on the Hub. You must be authenticated to access it.

Member Author

@dafnapension it is very easy to access it with huggingface-login; it has almost become a standard these days.

Collaborator

So I easily generated a token for myself and read the dataset, and ran into a schema validation error that I suggest looking into. (This is independent of the logical bug I point out below; I actually intended to demonstrate that bug on the dataset, but hit this error before being able to read through the whole dataset.)

[image]

        to_field="required",
        expression="[[p for p, c in tool['parameters']['properties'].items() if 'optional' not in c['type']] for tool in tools]",
        ),
        extract_required_parameters,
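The "required" extraction expression in the fragment above can be exercised stand-alone. The `tools` value below is hypothetical but mirrors the xlam-style schema, including a `", optional"` suffix in a type string:

```python
# Stand-alone evaluation of the "required" extraction expression above,
# on a hypothetical xlam-style tools list (not real dataset content).
tools = [
    {"parameters": {"properties": {
        "query": {"type": "str"},
        "limit": {"type": "int, optional"},
    }}},
    {"parameters": {"properties": {
        "path": {"type": "str"},
    }}},
]

# Parameters whose type string does not mention "optional" are required.
required = [
    [p for p, c in tool["parameters"]["properties"].items()
     if "optional" not in c["type"]]
    for tool in tools
]
print(required)  # [['query'], ['path']]
```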
Collaborator

@dafnapension dafnapension Jun 28, 2025

This preprocessing is wrong (it was so also before this PR):
Here is what happens in the case of two tools, as with the single example I found in https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k.
The step Set(fields={"tools/*/parameters": {"type": "object"}}, use_deepcopy=True) applies the deep copy only once, before invoking dict_utils. dict_utils processes the * component of the query by, by definition, assigning a single value to every path that satisfies the *. Since this value is a dict, the very same dict is assigned to all the paths matching the query.
Thus, all the tools are modified such that the subfield parameters of each is assigned the very same dict (the very same instance of the dict): {"type": "object"}.
Then, in the step Copy(field="properties", to_field="tools/*/parameters/properties", set_every_value=True),
dict_utils goes down the first tool, tracing path tools/0/parameters and reaching the aforementioned dict, where it sees that there is no field named properties, so it adds that field (by default, not-exist-ok==True) and assigns the first tool's properties to it; the dict now has two fields, "type" and "properties". Then (continuing to process the * in the to_field) dict_utils goes down the second tool, tools/1/parameters, reaches the very same dict, and overwrites its properties field with the second tool's properties.

The end result is both tools pointing to the same properties: the properties of the second tool.

To fix: simply remove the step Set(fields={"tools/*/parameters": {"type": "object"}}, use_deepcopy=True).
If you need this type field as a sibling of properties, add the following step after the Copy step above:
Set(fields={"tools/*/parameters/type": "object"})
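The aliasing can be reproduced in plain Python, with no unitxt involved. The two loops below are a rough stand-in for the Set and Copy steps described in the comment; the field names mirror that description:

```python
# Plain-Python sketch of the aliasing bug described above (no unitxt;
# the two loops mirror the Set/Copy steps from the comment).
instance = {"tools": [{"name": "a"}, {"name": "b"}]}

# Step 1 (the buggy Set): a single dict object is assigned to every
# path matching tools/*/parameters.
shared = {"type": "object"}
for tool in instance["tools"]:
    tool["parameters"] = shared  # same object each time, not a copy

# Step 2 (the Copy): writing each tool's properties mutates the one
# shared dict, so each write overwrites the previous one.
per_tool_props = [{"x": {}}, {"y": {}}]
for tool, props in zip(instance["tools"], per_tool_props):
    tool["parameters"]["properties"] = props

# Both tools now expose the properties of the *second* tool:
print(instance["tools"][0]["parameters"]["properties"])  # {'y': {}}
print(instance["tools"][0]["parameters"] is instance["tools"][1]["parameters"])  # True
```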

        to_field="required",
        expression="[[p for p, c in tool['parameters']['properties'].items() if 'optional' not in c['type']] for tool in tools]",
        ),
        extract_required_parameters,
        Copy(
            field="required",
Collaborator

Please see the detailed comment above.

@dafnapension dafnapension force-pushed the function-operators branch 14 times, most recently from b477c1a to b673513 on July 20, 2025 09:15
@dafnapension
Collaborator

Hi @elronbandel, I think I have now managed to correctly fix the jsonschema issue (rather than translating everything to 'type': 'object', as I erroneously suggested on Friday).
Here is a notebook that runs through the whole recipe output, emphasizing instance 15, which has the special case of 'str, optional' that failed the runs before.

@dafnapension
Collaborator

[image]

@dafnapension dafnapension force-pushed the function-operators branch 2 times, most recently from 7100197 to 2a9b4cf on July 20, 2025 20:45
@dafnapension
Collaborator

dafnapension commented Jul 20, 2025

bfcl on main:

[image]

And now fixed, with the same new operator that is used for xlam:

[image]

@dafnapension dafnapension force-pushed the function-operators branch 2 times, most recently from c35a903 to ca55da9 on July 22, 2025 19:01