A computational pipeline for harvesting, processing, and analyzing academic literature about visual art using OpenAlex, Wikidata, and multimodal embeddings.
ArtContext automates the collection and analysis of scholarly articles about painters and their works. The pipeline:
- Harvests painter metadata from Wikidata
- Queries OpenAlex for academic papers about each painter
- Downloads available PDFs
- Converts PDFs to Markdown for text processing
- Extracts sentences and generates multimodal embeddings using CLIP
The pipeline consists of batch scripts that should be run in the following order:

python batch_harvest_wikidata.py
- Queries Wikidata for painter information
- Populates artists.json with painter metadata
- Creates initial painter entries in paintings.xlsx
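
The Wikidata harvest can be pictured as a SPARQL query against the public query service. The sketch below is an illustration only, not the code in batch_harvest_wikidata.py: the query shape, the `limit` parameter, the User-Agent string, and all function names are assumptions. It relies on Wikidata's property P106 (occupation) and item Q1028181 (painter).

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"


def build_painter_query(limit=100):
    """Return a SPARQL query for entities whose occupation (P106) is painter (Q1028181)."""
    return (
        "SELECT ?painter ?painterLabel WHERE {\n"
        "  ?painter wdt:P106 wd:Q1028181 .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def parse_bindings(response):
    """Flatten a SPARQL JSON response into a list of {qid, name} dicts."""
    rows = []
    for binding in response["results"]["bindings"]:
        uri = binding["painter"]["value"]
        rows.append({
            "qid": uri.rsplit("/", 1)[-1],  # ".../entity/Q5582" -> "Q5582"
            "name": binding["painterLabel"]["value"],
        })
    return rows


def fetch_painters(limit=100, user_agent="ArtContext-example/0.1 (you@example.org)"):
    """Run the query against the public endpoint (needs network access)."""
    url = WIKIDATA_SPARQL_URL + "?" + urllib.parse.urlencode(
        {"query": build_painter_query(limit), "format": "json"}
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_bindings(json.load(resp))
```

Keeping query construction and response parsing separate from the network call makes both easy to check offline.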

python batch_query_open_alex.py
- Reads painter names from painters.xlsx
- Queries OpenAlex API for academic works mentioning each painter
- Saves results to Artist-JSONs/<painter_name>.json
- Uses helper: query_open_alex_with.py
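
As a rough sketch of what one OpenAlex request might look like, the snippet below builds a URL for the `/works` endpoint using its `search`, `per-page`, `cursor`, and `mailto` query parameters. The function names and the file-naming rule (spaces replaced by underscores) are assumptions for illustration; the actual helper may build its queries differently.

```python
import urllib.parse

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_url(painter_name, email, per_page=50, cursor="*"):
    """Build a full-text search URL for works mentioning `painter_name`."""
    params = {
        "search": painter_name,
        "per-page": per_page,  # OpenAlex page size
        "cursor": cursor,      # "*" starts cursor paging
        "mailto": email,       # contact email sent with each request
    }
    return OPENALEX_WORKS_URL + "?" + urllib.parse.urlencode(params)


def output_path(painter_name, root="Artist-JSONs"):
    """Hypothetical naming rule for Artist-JSONs/<painter_name>.json."""
    safe = painter_name.strip().replace(" ", "_")
    return f"{root}/{safe}.json"
```

For example, `build_works_url("Claude Monet", "you@example.org")` yields a URL whose response could be saved under `output_path("Claude Monet")`.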

python batch_download_works.py
- Processes all artist JSON files in Artist-JSONs/
- Downloads available PDFs to PDF_Bucket/
- Updates works.json with download metadata
- Uses helpers: download_works_on.py, download_single_work.py
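
The download step has to decide which works have a usable PDF link and which files are already on disk. The sketch below shows one way to do that from OpenAlex work records, using their `best_oa_location`, `locations`, and `pdf_url` fields. The function names and the flat `PDF_Bucket/<work_id>.pdf` layout are simplifying assumptions; the real scripts organize PDFs by artist and may select URLs differently. The download itself is left out.

```python
from pathlib import Path


def pick_pdf_url(work):
    """Return the first PDF URL found in an OpenAlex work record, or None."""
    candidates = [work.get("best_oa_location")] + list(work.get("locations") or [])
    for location in candidates:
        if location and location.get("pdf_url"):
            return location["pdf_url"]
    return None


def target_path(work, bucket="PDF_Bucket"):
    """Map a work to a local file name based on its OpenAlex ID (assumed layout)."""
    work_id = work["id"].rsplit("/", 1)[-1]  # "https://openalex.org/W123" -> "W123"
    return Path(bucket) / f"{work_id}.pdf"


def plan_downloads(works, bucket="PDF_Bucket"):
    """Return (url, path) pairs for works that have a PDF URL and no file on disk yet."""
    plan = []
    for work in works:
        url = pick_pdf_url(work)
        path = target_path(work, bucket)
        if url and not path.exists():
            plan.append((url, path))
    return plan
```

Skipping paths that already exist is what makes a rerun cheap: only missing files end up in the plan.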

python batch_pdf_to_markdown.py
- Converts downloaded PDFs to Markdown format
- Outputs to Marker_Output/<work_id>/
- Uses helper: single_pdf_to_markdown.py

python batch_markdown_file_to_english_sentences.py
- Extracts English sentences from Markdown files
- Updates sentences.json with extracted sentences
- Updates works.json with sentence counts
- Uses helper: markdown_file_to_english_sentences.py
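
To make the sentence-extraction step concrete, here is a deliberately simple sketch: strip Markdown syntax, split on sentence-ending punctuation, and keep sentences that contain enough common English function words. Every function name, the word list, and the 0.15 threshold are assumptions chosen for illustration; the actual helper may use a proper sentence tokenizer and language detector. The regex splitter will also mis-split after abbreviations such as "e.g.".

```python
import re

COMMON_ENGLISH = {"the", "of", "and", "in", "to", "a", "is", "was",
                  "that", "for", "with", "as", "by", "on"}


def strip_markdown(text):
    """Remove common Markdown syntax and collapse whitespace."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)             # fenced code
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)                    # images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)                 # links -> link text
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t].*$", " ", text, flags=re.M)   # heading lines
    text = re.sub(r"^[ \t]{0,3}(>|[-*+]|\d+\.)[ \t]+", "", text, flags=re.M)  # list/quote markers
    text = re.sub(r"[*_`]+", "", text)                                   # emphasis, inline code
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split after ., ! or ? when the next chunk starts with a capital, quote, or paren."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"(])', text)
    return [p.strip() for p in parts if p.strip()]


def looks_english(sentence, min_words=4, min_ratio=0.15):
    """Crude language check: share of very common English words."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    if len(words) < min_words:
        return False
    hits = sum(word in COMMON_ENGLISH for word in words)
    return hits / len(words) >= min_ratio


def markdown_to_english_sentences(markdown):
    """Full chain: Markdown text in, list of English-looking sentences out."""
    return [s for s in split_sentences(strip_markdown(markdown)) if looks_english(s)]
```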

python batch_embed_sentences.py
- Generates CLIP embeddings for all sentences
- Generates PaintingCLIP embeddings (fine-tuned model)
- Saves embeddings to CLIP_Embeddings/ and PaintingCLIP_Embeddings/
- Uses helper: embed_sentence_with_clip.py
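
A minimal sketch of text embedding with a stock CLIP model follows, assuming the Hugging Face `transformers` (4.x) and `torch` packages. The checkpoint name `openai/clip-vit-base-patch32`, the batch size, and the function names are assumptions; the source does not say which base checkpoint the project uses. Loading the PaintingCLIP LoRA adapter on top of the base model (for example with the `peft` library) is not shown.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_sentences(sentences, model_name="openai/clip-vit-base-patch32", batch_size=32):
    """Return one L2-normalised CLIP text embedding (list of floats) per sentence."""
    # Heavy imports stay local so `batched` can be used without torch installed.
    import torch
    from transformers import CLIPModel, CLIPTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    model = CLIPModel.from_pretrained(model_name).to(device).eval()

    vectors = []
    with torch.no_grad():
        for batch in batched(sentences, batch_size):
            # CLIP's text encoder accepts at most 77 tokens; longer input is truncated.
            inputs = tokenizer(batch, padding=True, truncation=True,
                               max_length=77, return_tensors="pt").to(device)
            features = model.get_text_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
            vectors.extend(features.cpu().tolist())
    return vectors
```

Normalising each vector to unit length means a plain dot product between two embeddings equals their cosine similarity.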

python build_topics_json.py
- Creates reverse index of OpenAlex topics to works
- Generates topics.json from works.json data
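
The reverse index can be sketched in a few lines. The example assumes that works.json maps each work ID to a record with a `topics` list of topic names; that schema, like the function names, is an assumption made for illustration and may differ from the real file.

```python
import json
from collections import defaultdict


def build_topic_index(works):
    """Invert {work_id: {"topics": [...]}} into {topic: [work_id, ...]}."""
    index = defaultdict(set)
    for work_id, work in works.items():
        for topic in work.get("topics") or []:
            index[topic].add(work_id)
    # Sort topics and IDs so the output file is stable between runs.
    return {topic: sorted(ids) for topic, ids in sorted(index.items())}


def write_topics_json(works_path="works.json", topics_path="topics.json"):
    """Read works.json, build the index, and write topics.json."""
    with open(works_path, encoding="utf-8") as f:
        works = json.load(f)
    with open(topics_path, "w", encoding="utf-8") as f:
        json.dump(build_topic_index(works), f, indent=2, ensure_ascii=False)
```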

python generate_painter_list.py
- Generates/updates the painter list for processing

Data directories:
- Artist-JSONs/ - OpenAlex query results per artist
- PDF_Bucket/ - Downloaded PDF files organized by artist
- Marker_Output/ - Converted Markdown files from PDFs
- CLIP_Embeddings/ - Standard CLIP text embeddings
- PaintingCLIP_Embeddings/ - Fine-tuned CLIP embeddings
- Excel-Files/ - Excel outputs and working files
- logs/ - Execution logs for debugging

Other directories:
- PaintingCLIP/ - Fine-tuned CLIP adapter (LoRA weights)
- Helper Scripts/ - Utility scripts for data cleaning and analysis
- Archive/ - Previous versions and documentation
- Scripts from Project/ - Legacy scripts (excluded from quality checks)

Key data files:
- artists.json - Artist metadata and associated work IDs
- works.json - Work metadata including URLs, sentences, topics
- sentences.json - Extracted sentences with embedding status
- topics.json - Topic to work ID mapping
- painters.xlsx - Master list of painters to process
- paintings.xlsx - Painting metadata

See requirements.txt for Python dependencies. Key requirements:
- Python 3.8+
- OpenAlex API access (free, requires email)
- Wikidata SPARQL access
- GPU recommended for embedding generation

Configuration:
- Set your email in scripts that query OpenAlex (required by their API)
- Adjust batch sizes and concurrency settings based on your system
- Configure logging levels in individual scripts

Notes:
- The pipeline is designed to be resumable: it checks for existing data before reprocessing
- Most scripts support filtering by artist name or work ID for selective processing
- Embedding generation is computationally intensive, so consider running it on a GPU
- The OpenAlex API has rate limits; the scripts include automatic retry logic
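
Retry logic for a rate-limited API is often written as exponential backoff. The following is a generic sketch of that pattern, not the retry code used by the scripts: `RateLimited`, `with_retries`, and the delay values are hypothetical, and the caller is assumed to raise `RateLimited` when the API answers with HTTP 429.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller's request function when the API answers HTTP 429."""


def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call()`, retrying on RateLimited with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Roughly 1 s, 2 s, 4 s, ... plus a little jitter to spread out retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Passing `sleep` as a parameter keeps the function easy to exercise without actually waiting.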
The pipeline produces structured JSON files that can be used for downstream analysis:
- Sentence-level embeddings for semantic search
- Work metadata for bibliometric analysis
- Topic associations for thematic studies
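
As an illustration of the semantic-search use case, the sketch below ranks stored sentence embeddings by cosine similarity to a query embedding. It assumes the embeddings have already been loaded into a dict of sentence ID to vector; the function names and that in-memory layout are assumptions, and a real collection of any size would normally use NumPy or a vector index instead of pure Python.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, sentence_vecs, k=3):
    """Return the k (sentence_id, score) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), sid) for sid, vec in sentence_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(sid, score) for score, sid in scored[:k]]
```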