I followed the instructions strictly but got the following results, which are noticeably lower than those reported in the paper. Any idea why this happens? Thanks.
ViSpeak_real_stats.csv
task_type,total,correct,accuracy
Clips Summarize,317,240,0.7570977917981072
total,2500,1667,0.6668
Object Recognition,367,279,0.7602179836512262
Attribute Recognition,306,227,0.7418300653594772
Prospective Reasoning,108,65,0.6018518518518519
Action Recognition,353,239,0.6770538243626062
Spatial Understanding,246,147,0.5975609756097561
Event Understanding,161,119,0.7391304347826086
Counting,193,45,0.23316062176165803
Text-Rich Understanding,321,223,0.6947040498442367
Causal Reasoning,128,83,0.6484375
ViSpeak_sqa_stats.csv
task_type,total,correct,accuracy
Sequential Question Answering,250,96,0.384
ViSpeak_proactive_stats.csv
task_type,total,time_correct,time_accuracy,answer_correct,answer_accuracy
Proactive Output,250,115,0.46,109,0.436
ViSpeak_omni_stats.csv
task_type,total,correct,accuracy
Misleading Context Understanding,250,84,0.336
total,1500,790,0.5266666666666666
Source Discrimination,250,141,0.564
Emotion Recognition,250,112,0.448
Anomaly Context Understanding,250,114,0.456
Scene Understanding,250,143,0.572
Multimodal Alignment,250,196,0.784
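
As a sanity check that the numbers above are internally consistent, the `total` row can be recomputed from the per-task rows. This is only a minimal sketch, assuming the stats files have the column layout shown above; `check_stats` is a hypothetical helper written for this post, not part of the ViSpeak evaluation code.

```python
import csv
import io


def check_stats(csv_text):
    """Recompute per-row accuracy and the aggregate from the task rows.

    Hypothetical helper: assumes columns task_type,total,correct,accuracy.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Every row's accuracy should equal correct / total.
    for r in rows:
        recomputed = int(r["correct"]) / int(r["total"])
        assert abs(recomputed - float(r["accuracy"])) < 1e-9, r["task_type"]
    # Sum the per-task rows, skipping the aggregate 'total' row.
    tasks = [r for r in rows if r["task_type"] != "total"]
    n = sum(int(r["total"]) for r in tasks)
    correct = sum(int(r["correct"]) for r in tasks)
    return n, correct, correct / n


# Contents of ViSpeak_omni_stats.csv as pasted above.
omni_csv = """task_type,total,correct,accuracy
Misleading Context Understanding,250,84,0.336
total,1500,790,0.5266666666666666
Source Discrimination,250,141,0.564
Emotion Recognition,250,112,0.448
Anomaly Context Understanding,250,114,0.456
Scene Understanding,250,143,0.572
Multimodal Alignment,250,196,0.784
"""

n, correct, acc = check_stats(omni_csv)
print(n, correct, round(acc, 4))
```

The recomputed counts match the `total` row, so the gap to the paper does not seem to come from the aggregation step itself.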