
Commit 77d5ede

feat(api): api update
1 parent 39a676e commit 77d5ede

13 files changed (+443, -145 lines)

.stats.yml

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
  configured_endpoints: 54
- openapi_spec_hash: 57e29e33aec4bbc20171ec3128594e75
+ openapi_spec_hash: 49989625bf633c5fdb3e11140f788f2d
  config_hash: 930284cfa37f835d949c8a1b124f4807

src/codex/resources/projects/projects.py

Lines changed: 42 additions & 32 deletions
@@ -460,6 +460,7 @@ def validate(
  quality_preset: Literal["best", "high", "medium", "low", "base"] | NotGiven = NOT_GIVEN,
  rewritten_question: Optional[str] | NotGiven = NOT_GIVEN,
  task: Optional[str] | NotGiven = NOT_GIVEN,
+ tools: Optional[Iterable[project_validate_params.Tool]] | NotGiven = NOT_GIVEN,
  x_client_library_version: str | NotGiven = NOT_GIVEN,
  x_integration_type: str | NotGiven = NOT_GIVEN,
  x_source: str | NotGiven = NOT_GIVEN,
@@ -504,17 +505,16 @@ def validate(

  The default values corresponding to each quality preset are:

- - **best:** `num_candidate_responses` = 6, `num_consistency_samples` = 8,
-   `use_self_reflection` = True. This preset improves LLM responses.
- - **high:** `num_candidate_responses` = 4, `num_consistency_samples` = 8,
-   `use_self_reflection` = True. This preset improves LLM responses.
- - **medium:** `num_candidate_responses` = 1, `num_consistency_samples` = 8,
-   `use_self_reflection` = True.
- - **low:** `num_candidate_responses` = 1, `num_consistency_samples` = 4,
-   `use_self_reflection` = True.
- - **base:** `num_candidate_responses` = 1, `num_consistency_samples` = 0,
-   `use_self_reflection` = False. When using `get_trustworthiness_score()` on
-   "base" preset, a faster self-reflection is employed.
+ - **best:** `num_consistency_samples` = 8, `num_self_reflections` = 3,
+   `reasoning_effort` = `"high"`.
+ - **high:** `num_consistency_samples` = 4, `num_self_reflections` = 3,
+   `reasoning_effort` = `"high"`.
+ - **medium:** `num_consistency_samples` = 0, `num_self_reflections` = 3,
+   `reasoning_effort` = `"high"`.
+ - **low:** `num_consistency_samples` = 0, `num_self_reflections` = 3,
+   `reasoning_effort` = `"none"`.
+ - **base:** `num_consistency_samples` = 0, `num_self_reflections` = 1,
+   `reasoning_effort` = `"none"`.

  By default, TLM uses the: "medium" `quality_preset`, "gpt-4.1-mini" base
  `model`, and `max_tokens` is set to 512. You can set custom values for these
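For illustration, a minimal sketch of selecting one of these presets when calling validate(). Only `quality_preset` and the new `tools` parameter appear in this diff; the client import, project ID, and the query/response/context argument names are assumptions about the surrounding SDK, not part of this commit.

# Minimal sketch, assuming a Codex client and argument names not shown in this diff.
from codex import Codex

client = Codex()  # assumes the API key is picked up from the environment

result = client.projects.validate(
    project_id="proj_123",                               # hypothetical project ID
    query="What is the return window?",                  # assumed argument name
    response="Returns are accepted within 30 days.",     # assumed argument name
    context="Policy: returns accepted within 30 days.",  # assumed argument name
    quality_preset="high",  # per this docstring: 4 consistency samples, 3 self-reflections, "high" reasoning effort
)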
@@ -550,12 +550,11 @@ def validate(
  strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response.
  TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.

- use_self_reflection (bool, default = `True`): whether the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
- Setting this False disables reflection and will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
- Reflection helps quantify aleatoric uncertainty associated with challenging prompts
- and catches responses that are noticeably incorrect/bad upon further analysis.
+ num_self_reflections(int, default = 3): the number of self-reflections to perform where the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
+ The maximum number of self-reflections currently supported is 3. Lower values will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
+ Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.

- similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "semantic"): how the
+ similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "discrepancy"): how the
  trustworthiness scoring's consistency algorithm measures similarity between alternative responses considered plausible by the model.
  Supported similarity measures include - "semantic" (based on natural language inference),
  "embedding" (based on vector embedding similarity), "embedding_large" (based on a larger embedding model),
@@ -574,6 +573,8 @@ def validate(
  - name: Name of the evaluation criteria.
  - criteria: Instructions specifying the evaluation criteria.

+ use_self_reflection (bool, default = `True`): deprecated. Use `num_self_reflections` instead.
+
  prompt: The prompt to use for the TLM call. If not provided, the prompt will be
  generated from the messages.

@@ -582,6 +583,9 @@ def validate(
  rewritten_question: The re-written query if it was provided by the client to Codex from a user to be
  used instead of the original query.

+ tools: Tools to use for the LLM call. If not provided, it is assumed no tools were
+ provided to the LLM.
+
  extra_headers: Send extra headers

  extra_query: Add additional query parameters to the request
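A sketch of how a caller might pass the new `tools` argument. An OpenAI-style function-tool dict is assumed here; the exact shape accepted by `project_validate_params.Tool` is defined in the SDK, not in this diff, and the other argument names remain assumptions.

# Minimal sketch, assuming an OpenAI-style function-tool schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",   # hypothetical tool name
            "description": "Look up an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

result = client.projects.validate(
    project_id="proj_123",                     # hypothetical; other required arguments omitted
    query="Where is my order 123?",            # assumed argument name
    response="Your order shipped yesterday.",  # assumed argument name
    tools=tools,  # new parameter in this commit; omitting it means no tools were provided to the LLM
)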
@@ -620,6 +624,7 @@ def validate(
  "quality_preset": quality_preset,
  "rewritten_question": rewritten_question,
  "task": task,
+ "tools": tools,
  },
  project_validate_params.ProjectValidateParams,
  ),
@@ -1028,6 +1033,7 @@ async def validate(
  quality_preset: Literal["best", "high", "medium", "low", "base"] | NotGiven = NOT_GIVEN,
  rewritten_question: Optional[str] | NotGiven = NOT_GIVEN,
  task: Optional[str] | NotGiven = NOT_GIVEN,
+ tools: Optional[Iterable[project_validate_params.Tool]] | NotGiven = NOT_GIVEN,
  x_client_library_version: str | NotGiven = NOT_GIVEN,
  x_integration_type: str | NotGiven = NOT_GIVEN,
  x_source: str | NotGiven = NOT_GIVEN,
@@ -1072,17 +1078,16 @@ async def validate(

  The default values corresponding to each quality preset are:

- - **best:** `num_candidate_responses` = 6, `num_consistency_samples` = 8,
-   `use_self_reflection` = True. This preset improves LLM responses.
- - **high:** `num_candidate_responses` = 4, `num_consistency_samples` = 8,
-   `use_self_reflection` = True. This preset improves LLM responses.
- - **medium:** `num_candidate_responses` = 1, `num_consistency_samples` = 8,
-   `use_self_reflection` = True.
- - **low:** `num_candidate_responses` = 1, `num_consistency_samples` = 4,
-   `use_self_reflection` = True.
- - **base:** `num_candidate_responses` = 1, `num_consistency_samples` = 0,
-   `use_self_reflection` = False. When using `get_trustworthiness_score()` on
-   "base" preset, a faster self-reflection is employed.
+ - **best:** `num_consistency_samples` = 8, `num_self_reflections` = 3,
+   `reasoning_effort` = `"high"`.
+ - **high:** `num_consistency_samples` = 4, `num_self_reflections` = 3,
+   `reasoning_effort` = `"high"`.
+ - **medium:** `num_consistency_samples` = 0, `num_self_reflections` = 3,
+   `reasoning_effort` = `"high"`.
+ - **low:** `num_consistency_samples` = 0, `num_self_reflections` = 3,
+   `reasoning_effort` = `"none"`.
+ - **base:** `num_consistency_samples` = 0, `num_self_reflections` = 1,
+   `reasoning_effort` = `"none"`.

  By default, TLM uses the: "medium" `quality_preset`, "gpt-4.1-mini" base
  `model`, and `max_tokens` is set to 512. You can set custom values for these
@@ -1118,12 +1123,11 @@ async def validate(
  strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response.
  TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.

- use_self_reflection (bool, default = `True`): whether the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
- Setting this False disables reflection and will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
- Reflection helps quantify aleatoric uncertainty associated with challenging prompts
- and catches responses that are noticeably incorrect/bad upon further analysis.
+ num_self_reflections(int, default = 3): the number of self-reflections to perform where the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
+ The maximum number of self-reflections currently supported is 3. Lower values will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
+ Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.

- similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "semantic"): how the
+ similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "discrepancy"): how the
  trustworthiness scoring's consistency algorithm measures similarity between alternative responses considered plausible by the model.
  Supported similarity measures include - "semantic" (based on natural language inference),
  "embedding" (based on vector embedding similarity), "embedding_large" (based on a larger embedding model),
@@ -1142,6 +1146,8 @@ async def validate(
  - name: Name of the evaluation criteria.
  - criteria: Instructions specifying the evaluation criteria.

+ use_self_reflection (bool, default = `True`): deprecated. Use `num_self_reflections` instead.
+
  prompt: The prompt to use for the TLM call. If not provided, the prompt will be
  generated from the messages.

@@ -1150,6 +1156,9 @@ async def validate(
  rewritten_question: The re-written query if it was provided by the client to Codex from a user to be
  used instead of the original query.

+ tools: Tools to use for the LLM call. If not provided, it is assumed no tools were
+ provided to the LLM.
+
  extra_headers: Send extra headers

  extra_query: Add additional query parameters to the request
@@ -1188,6 +1197,7 @@ async def validate(
  "quality_preset": quality_preset,
  "rewritten_question": rewritten_question,
  "task": task,
+ "tools": tools,
  },
  project_validate_params.ProjectValidateParams,
  ),

0 commit comments