BigCodeBench v0.2.1.post3
What's Changed
- Fix `calibration` setting in the code evaluation.
- Add `--no_execute` argument for code evaluation.
- Support concurrent API inference for `o1` and `deepseek-chat`.
- Fix API inference for Google Gemini.
- Add `--instruction_prefix` and `--response_prefix` arguments for code generation.
- Change `--id_range` input type.
- Add `--revision` argument for code generation.
Evaluated LLMs (144 models)
- Qwen2.5-Coder-32B-Instruct
- grok-beta
- claude-3-5-haiku-20241022
 
Full Changelog: v0.2.0...v0.2.1.post2