Scripts and graphics used for ongoing testing of IMAGE
Scripts are used to iterate through the graphics, get output from the server, and compare outputs.
testset.py iterates through the collection of graphics and builds a test set based on the specified flags, sends a POST request for each of these graphics, and writes the output to a timestamped JSON file. The default server is unicorn if none is specified.
To iterate through the entire test set on pegasus:
./testset.py -s pegasus
To iterate through graphics that are larger than 1000000 bytes and are tagged as "outdoor":
./testset.py --minBytes 1000000 -t outdoor
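For orientation, the core of this loop looks roughly like the sketch below; the graphics location, request payload, and result file name used here are illustrative assumptions, not the script's actual values.

```python
# Simplified sketch of a testset-style run: iterate graphics, POST each one to
# the server, and save all responses in a timestamped JSON file.
# The graphics path, payload, and server URL are illustrative assumptions.
import glob
import json
import time

import requests

SERVER_URL = "https://unicorn.example.org/render"  # hypothetical endpoint for the "unicorn" server
timestamp = time.strftime("%m_%d_%Y_%H_%M_%S")     # matches the 08_06_2022_00_00_00 style used below
results = {}

for path in sorted(glob.glob("graphics/*.json")):  # assumed location of the test graphics
    with open(path) as f:
        graphic = json.load(f)
    response = requests.post(SERVER_URL, json=graphic, timeout=120)
    results[path] = response.json()

with open(f"results_{timestamp}.json", "w") as f:
    json.dump(results, f, indent=2)
```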
testdiff.py compares preprocessor output for any two JSON files.
To compare the output from August 6, 2022 at midnight and August 7, 2022 at midnight for all graphics that were run at those times:
./testdiff.py -t 08_06_2022_00_00_00 08_07_2022_00_00_00
The output will be a list of all graphics that have both timestamps, plus any differences.
To compare the output from August 6, 2022 at midnight and August 7, 2022 at midnight for graphic 35:
./testdiff.py -t 08_06_2022_00_00_00 08_07_2022_00_00_00 -n 35
To compare the grouping, sorting, and semantic segmentation output from August 6, 2022 at midnight and August 7, 2022 at midnight for graphic 35:
./testdiff.py -t 08_06_2022_00_00_00 08_07_2022_00_00_00 -n 35 --preprocessor grouping sorting semanticSegmentation
To get a list of objects found by object detection, insert two timestamps and then use the --od flag to specify which timestamp corresponds to which model. For example, if Azure was run on August 6 at 12:00 AM and YOLO was run on August 6 at 12:01 AM for graphic 35:
./testdiff.py -t 08_06_2022_00_00_00 08_06_2022_00_00_01 -n 35 --od Azure YOLO
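Conceptually, a comparison like the ones above boils down to loading two timestamped result files and diffing the preprocessor entries per graphic. The sketch below assumes a results_<timestamp>.json naming scheme and a flat graphic-to-preprocessor layout, neither of which is guaranteed to match testdiff.py exactly.

```python
# Minimal sketch of a preprocessor diff between two timestamped result files.
# The results_<timestamp>.json naming and the graphic -> preprocessor -> output
# layout are assumptions, not testdiff.py's actual structure.
import json

def diff_outputs(ts_a, ts_b, preprocessors=None):
    with open(f"results_{ts_a}.json") as f:
        run_a = json.load(f)
    with open(f"results_{ts_b}.json") as f:
        run_b = json.load(f)
    for graphic in sorted(set(run_a) & set(run_b)):          # graphics present in both runs
        keys = preprocessors or set(run_a[graphic]) | set(run_b[graphic])
        for key in keys:
            if run_a[graphic].get(key) != run_b[graphic].get(key):
                print(f"{graphic}: {key} differs")

diff_outputs("08_06_2022_00_00_00", "08_07_2022_00_00_00",
             preprocessors=["grouping", "sorting", "semanticSegmentation"])
```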
The -d flag on testset.py will run a testdiff on the JSON that was just created against the next most recent JSON created for the graphic(s), if one exists.
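Note that these timestamps are month-first, so they do not sort chronologically as plain strings; finding the next most recent file requires parsing them, roughly as in this sketch (the results_*.json naming is again an assumption).

```python
# Sketch of finding the next most recent result file for a -d comparison.
# Assumes results_<timestamp>.json file names; parse the month-first timestamps
# before comparing, since they are not in lexicographic date order.
import glob
from datetime import datetime

FMT = "%m_%d_%Y_%H_%M_%S"

def previous_timestamp(current):
    stamps = [p[len("results_"):-len(".json")] for p in glob.glob("results_*.json")]
    older = [s for s in stamps if datetime.strptime(s, FMT) < datetime.strptime(current, FMT)]
    return max(older, key=lambda s: datetime.strptime(s, FMT)) if older else None
```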
azure.sh switches the docker compose for object detection from YOLO (the default) to Azure; yolo.sh switches back by restoring the unstable configuration.
To compare YOLO and Azure outputs for all indoor graphics:
./testset.py -t indoor
./azure.sh
./testset.py -t indoor -d
./yolo.sh
llm-caption-test.py is an automated testing script for evaluating multimodal LLM descriptions of images from the IMAGE-test-graphics repository.
- Python 3.7+
- Ollama running locally (ollama serve)
- Internet connection (for GitHub API access)

Install the dependencies:
pip install requests pandas pillow

Run the script:
python llm-caption-test.py

Edit the MODELS list in the script with model names and temperature settings:
MODELS = [
    ("gemma3:12b", 0.0),
    ("gemma3:12b", 1.0),
    ("llama3.2-vision:latest", 0.0),
    ("llama3.2-vision:latest", 1.0)
]

- Prompt: The PROMPT parameter is applied to all models
- Image size: Modify max_size in image_to_base64() (default: 2048x2048)
- API endpoint: Update url in run_ollama_model() if not using localhost:11434 (see the sketch below)
- Image formats: Add to the IMAGE_EXTENSIONS set for other formats
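For reference, sending a resized image to a vision model through Ollama's HTTP API looks roughly like the sketch below. The function names mirror the ones mentioned above, but the bodies are simplified illustrations rather than the script's actual code.

```python
# Simplified sketch of resizing an image and sending it to the local Ollama API.
# image_to_base64 and run_ollama_model mirror the names above, but these bodies
# are illustrative, not the script's actual code.
import base64
import io

import requests
from PIL import Image

def image_to_base64(path, max_size=(2048, 2048)):
    img = Image.open(path).convert("RGB")
    img.thumbnail(max_size)                      # shrink in place, preserving aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def run_ollama_model(model, temperature, prompt, image_b64):
    url = "http://localhost:11434/api/generate"  # default Ollama endpoint
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,
        "options": {"temperature": temperature},
    }
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]

# Example call with an illustrative image file:
print(run_ollama_model("gemma3:12b", 0.0, "Describe this image.", image_to_base64("example.jpg")))
```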
- llm_test_results.csv - Main results with columns:
  - folder: Folder number (0000-0067)
  - filename: Original image filename
  - image: HTML-embedded thumbnail (see the export sketch below)
  - Model description columns (e.g., "llama3.2-vision:latest (t=0.0)")
- llm_test_results.html - Formatted HTML view with embedded images
- intermediate_results.csv - Auto-saved every 5 images (backup)
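The embedded thumbnails work by placing an <img> tag with a base64 data URI in the image column and exporting with HTML escaping disabled. A minimal sketch, with illustrative row contents:

```python
# Sketch of writing one results row with an embedded thumbnail; column names
# follow the list above, but the row contents here are illustrative.
import base64

import pandas as pd

with open("example.jpg", "rb") as f:             # illustrative filename
    thumb = base64.b64encode(f.read()).decode("utf-8")

rows = [{
    "folder": "0000",
    "filename": "example.jpg",
    "image": f'<img src="data:image/jpeg;base64,{thumb}" width="200">',
    "gemma3:12b (t=0.0)": "description text returned by the model...",
}]

df = pd.DataFrame(rows)
df.to_csv("llm_test_results.csv", index=False)
df.to_html("llm_test_results.html", escape=False, index=False)  # escape=False keeps the <img> tags as HTML
```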
- Processes 68 folders (0000-0067) from Shared-Reality-Lab/IMAGE-test-graphics via the GitHub API (see the sketch below)
- Images are resized to max 2048x2048
- 1-second delay between model calls to avoid overload
- Handles various image formats (JPG, JPEG, PNG, GIF, BMP, WEBP)
- Error handling for missing images or API failures
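Image listing and download go through the GitHub contents API; the sketch below shows that access pattern, assuming the numbered folders sit at the repository root (the actual layout may differ).

```python
# Sketch of listing and downloading images for each folder via the GitHub
# contents API. Assumes the numbered folders sit at the repository root.
import time

import requests

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}
API = "https://api.github.com/repos/Shared-Reality-Lab/IMAGE-test-graphics/contents"

for folder in range(68):                          # folders 0000-0067
    resp = requests.get(f"{API}/{folder:04d}", timeout=30)
    if resp.status_code != 200:                   # missing folder or rate limit: skip and move on
        continue
    for entry in resp.json():
        name = entry["name"]
        if any(name.lower().endswith(ext) for ext in IMAGE_EXTENSIONS):
            image_bytes = requests.get(entry["download_url"], timeout=30).content
            # ...run each model on the image here, then pause to avoid overloading Ollama
            time.sleep(1)
```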