
Allow using python functions instead of operators (e.g. in pre-processing pipeline) #1845


Open · wants to merge 6 commits into base: main

Conversation

elronbandel (Member)

If you cannot find an operator that fits your needs, simply use a function to modify every instance in the data:

        def my_function(instance, stream_name=None):
            instance["x"] += 42
            return instance

Or a function that modifies the entire stream:

        def my_other_function(stream, stream_name=None):
            for instance in stream:
                instance["x"] += 42
                yield instance

Both functions can be plugged in wherever unitxt requires operators, e.g. the pre-processing pipeline.

        except (OSError, TypeError):
            # If the source is not available
            return [f"<function {d.__name__} (source unavailable)>"]

Collaborator

Very nice, this is how the result looks in the catalog:

[image]

            Move,
            Set,
        )
        from unitxt.splitters import RenameSplits
        from unitxt.struct_data_operators import LoadJson
        from unitxt.test_utils.card import test_card


Collaborator

This dataset is gated; I suggest using an example that is accessible to all:

Dataset 'Salesforce/xlam-function-calling-60k' is a gated dataset on the Hub. You must be authenticated to access it.

Member Author

@dafnapension it is very easy to access it with huggingface-login; it has almost become a standard these days.

Collaborator

So I easily generated a token for myself and read the dataset, and ran into a schema validation error that I suggest looking into. (This is independent of the logical bug I point out below; I actually intended to demonstrate that bug on the dataset, but hit this error before being able to read through the whole dataset.)

[image]

        to_field="required",
        expression="[[p for p, c in tool['parameters']['properties'].items() if 'optional' not in c['type']] for tool in tools]",
        ),
        extract_required_parameters,
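The "required" extraction expression in the fragment above can be exercised stand-alone. The `tools` value below is hypothetical but mirrors the xlam-style schema, including a `", optional"` suffix in a type string:

```python
# Stand-alone evaluation of the "required" extraction expression above,
# on a hypothetical xlam-style tools list (not real dataset content).
tools = [
    {"parameters": {"properties": {
        "query": {"type": "str"},
        "limit": {"type": "int, optional"},
    }}},
    {"parameters": {"properties": {
        "path": {"type": "str"},
    }}},
]

# Parameters whose type string does not mention "optional" are required.
required = [
    [p for p, c in tool["parameters"]["properties"].items()
     if "optional" not in c["type"]]
    for tool in tools
]
print(required)  # [['query'], ['path']]
```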
Collaborator

@dafnapension dafnapension Jun 28, 2025

This preprocessing is wrong (it was so also before this PR):
Here is what happens in the case of two tools, as with the single example I found in https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k.
The step Set(fields={"tools/*/parameters": {"type": "object"}}, use_deepcopy=True) applies the deep copy only once, before invoking dict_utils. dict_utils processes the * component of the query by, by definition, assigning a single value to every path that satisfies the *. Since this value is a dict, the very same dict is assigned to all the paths matching the query.
Thus, all the tools are modified such that the subfield parameters of each is assigned the very same dict (the very same instance of the dict): {"type": "object"}.
Then, in the step Copy(field="properties", to_field="tools/*/parameters/properties", set_every_value=True),
dict_utils goes down the first tool, tracing path tools/0/parameters and reaching the aforementioned dict, where it sees that there is no field named properties, so it adds that field (by default, not-exist-ok==True) and assigns the first tool's properties to it; the dict now has two fields, "type" and "properties". Then (continuing to process the * in the to_field) dict_utils goes down the second tool, tools/1/parameters, reaches the very same dict, and overwrites its properties field with the second tool's properties.

The end result is both tools pointing to the same properties: the properties of the second tool.

To fix: simply remove the step Set(fields={"tools/*/parameters": {"type": "object"}}, use_deepcopy=True).
If you need this type field as a sibling of properties, add the following step after the Copy step above:
Set(fields={"tools/*/parameters/type": "object"})
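The aliasing can be reproduced in plain Python, with no unitxt involved. The two loops below are a rough stand-in for the Set and Copy steps described in the comment; the field names mirror that description:

```python
# Plain-Python sketch of the aliasing bug described above (no unitxt;
# the two loops mirror the Set/Copy steps from the comment).
instance = {"tools": [{"name": "a"}, {"name": "b"}]}

# Step 1 (the buggy Set): a single dict object is assigned to every
# path matching tools/*/parameters.
shared = {"type": "object"}
for tool in instance["tools"]:
    tool["parameters"] = shared  # same object each time, not a copy

# Step 2 (the Copy): writing each tool's properties mutates the one
# shared dict, so each write overwrites the previous one.
per_tool_props = [{"x": {}}, {"y": {}}]
for tool, props in zip(instance["tools"], per_tool_props):
    tool["parameters"]["properties"] = props

# Both tools now expose the properties of the *second* tool:
print(instance["tools"][0]["parameters"]["properties"])  # {'y': {}}
print(instance["tools"][0]["parameters"] is instance["tools"][1]["parameters"])  # True
```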

        to_field="required",
        expression="[[p for p, c in tool['parameters']['properties'].items() if 'optional' not in c['type']] for tool in tools]",
        ),
        extract_required_parameters,
        Copy(
            field="required",
Collaborator

Please see the detailed comment above.

@dafnapension dafnapension force-pushed the function-operators branch 14 times, most recently from b477c1a to b673513 on July 20, 2025 09:15
@dafnapension
Collaborator

Hi @elronbandel, I think I have now managed to correctly fix the jsonschema issue (rather than translating everything to 'type': 'object', as I erroneously suggested on Friday).
Here is a notebook that runs through the whole recipe output, emphasizing instance 15, which has the special case of 'str, optional' that failed the runs before.

@dafnapension
Collaborator

[image]

@dafnapension dafnapension force-pushed the function-operators branch 2 times, most recently from 7100197 to 2a9b4cf on July 20, 2025 20:45
@dafnapension
Collaborator

dafnapension commented Jul 20, 2025

bfcl on main:

[image]

And now fixed, with the same new operator that is used for xlam:

[image]

@dafnapension dafnapension force-pushed the function-operators branch 2 times, most recently from c35a903 to ca55da9 on July 22, 2025 19:01