ArtContext

A computational pipeline for harvesting, processing, and analyzing academic literature about visual art using OpenAlex, Wikidata, and multimodal embeddings.

Overview

ArtContext automates the collection and analysis of scholarly articles about painters and their works. The pipeline:

Harvests painter metadata from Wikidata
Queries OpenAlex for academic papers about each painter
Downloads available PDFs
Converts PDFs to Markdown for text processing
Extracts sentences and generates multimodal embeddings using CLIP

Pipeline Workflow

The pipeline consists of batch scripts that should be run in the following order:

1. Wikidata Harvest

python batch_harvest_wikidata.py

Queries Wikidata for painter information
Populates artists.json with painter metadata
Creates initial painter entries in paintings.xlsx
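
As an illustration of the kind of query this step could send, here is a minimal sketch that builds a SPARQL query for painters. It is not the code in batch_harvest_wikidata.py. The function names, the selected fields, and the choice of filtering by occupation (P106) = painter (Q1028181) are assumptions made for the example.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_painter_query(limit=100):
    """Return a SPARQL query selecting humans whose occupation is painter.

    Hypothetical helper: the real harvest script may select other fields.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"""
SELECT ?painter ?painterLabel ?birth WHERE {{
  ?painter wdt:P31 wd:Q5 ;          # instance of: human
           wdt:P106 wd:Q1028181 .   # occupation: painter
  OPTIONAL {{ ?painter wdt:P569 ?birth . }}  # date of birth, if known
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def build_request_url(query):
    """Encode the query as a GET URL that asks for JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

The resulting URL can be fetched with any HTTP client. Wikidata asks clients to send a descriptive User-Agent header.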

2. Query OpenAlex

python batch_query_open_alex.py

Reads painter names from painters.xlsx
Queries OpenAlex API for academic works mentioning each painter
Saves results to Artist-JSONs/<painter_name>.json
Uses helper: query_open_alex_with.py
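
The request behind this step can be sketched as follows. This is an assumption about how a painter name might be turned into an OpenAlex works query with cursor paging, not a copy of query_open_alex_with.py. The function names are hypothetical. The search, mailto, per-page, and cursor parameters are standard OpenAlex query parameters.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_url(painter_name, email, cursor="*", per_page=200):
    """Build a works search URL for one painter (hypothetical helper).

    cursor="*" starts cursor paging; per_page=200 is the API maximum.
    """
    params = {
        "search": painter_name,
        "mailto": email,       # contact address sent with every request
        "per-page": per_page,
        "cursor": cursor,
    }
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def next_cursor(response_json):
    """Return the cursor for the next page, or None when paging is done."""
    return response_json.get("meta", {}).get("next_cursor")
```

A caller would fetch the URL, collect the "results" list from each page, and repeat with next_cursor() until it returns None.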

3. Download Works

python batch_download_works.py

Processes all artist JSON files in Artist-JSONs/
Downloads available PDFs to PDF_Bucket/
Updates works.json with download metadata
Uses helpers: download_works_on.py, download_single_work.py
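
To make the download step concrete, the sketch below picks a PDF URL from an OpenAlex work record and derives a target path. It assumes the records keep OpenAlex's best_oa_location, primary_location, and locations fields. The helper names and the PDF_Bucket/<artist>/<work_id>.pdf layout are assumptions for illustration, not taken from download_single_work.py.

```python
from pathlib import Path


def pick_pdf_url(work):
    """Return the first PDF URL found in an OpenAlex work record, or None."""
    candidates = [work.get("best_oa_location"), work.get("primary_location")]
    candidates.extend(work.get("locations") or [])
    for location in candidates:
        if location and location.get("pdf_url"):
            return location["pdf_url"]
    return None


def pdf_path_for(work_id, artist, bucket="PDF_Bucket"):
    """Map a work ID (full URL or bare ID) to a file path under the bucket.

    Assumed layout: one subfolder per artist, one file per work.
    """
    short_id = work_id.rstrip("/").rsplit("/", 1)[-1]
    return Path(bucket) / artist / f"{short_id}.pdf"
```

Works for which pick_pdf_url() returns None have no open-access PDF listed and can be skipped.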

4. PDF to Markdown

python batch_pdf_to_markdown.py

Converts downloaded PDFs to Markdown format
Outputs to Marker_Output/<work_id>/
Uses helper: single_pdf_to_markdown.py
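
One way to decide which PDFs still need conversion is sketched below. It assumes that each PDF file name is the work ID and that a finished conversion leaves at least one .md file in Marker_Output/<work_id>/. Both are assumptions for the example, and the function name is hypothetical. The conversion itself is left to single_pdf_to_markdown.py.

```python
from pathlib import Path


def pending_conversions(pdf_dir, output_dir):
    """List (pdf_path, target_dir) pairs for PDFs with no Markdown output yet."""
    pdf_dir, output_dir = Path(pdf_dir), Path(output_dir)
    pending = []
    for pdf_path in sorted(pdf_dir.rglob("*.pdf")):
        work_id = pdf_path.stem  # assumption: the file name is the work ID
        target = output_dir / work_id
        already_done = target.is_dir() and any(target.glob("*.md"))
        if not already_done:
            pending.append((pdf_path, target))
    return pending
```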

5. Extract Sentences

python batch_markdown_file_to_english_sentences.py

Extracts English sentences from Markdown files
Updates sentences.json with extracted sentences
Updates works.json with sentence counts
Uses helper: markdown_file_to_english_sentences.py
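
The idea of this step can be illustrated with a deliberately simple sketch. It splits Markdown into sentences with a regular expression and keeps those that look English according to a small stop-word heuristic. The real helper may well use a proper sentence tokenizer and language detector. Every name, the stop-word list, and the threshold below are assumptions.

```python
import re

_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")                 # [text](url), ![alt](url)
_LIST_MARK = re.compile(r"^\s{0,3}([-*+]\s+|>\s?|\d+\.\s+)", re.MULTILINE)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")
_STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "that",
              "was", "for", "with", "as", "by", "on"}


def strip_markdown(text):
    """Remove links, list/quote markers, and emphasis characters."""
    text = _LINK.sub(r"\1", text)
    text = _LIST_MARK.sub("", text)
    return re.sub(r"[*_`]+", "", text)


def looks_english(sentence, min_ratio=0.15):
    """Crude language check: share of common English stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if len(words) < 4:
        return False
    hits = sum(word in _STOPWORDS for word in words)
    return hits / len(words) >= min_ratio


def extract_english_sentences(markdown_text):
    """Return English-looking sentences, skipping heading blocks."""
    sentences = []
    for block in re.split(r"\n\s*\n", markdown_text):
        if block.lstrip().startswith("#"):
            continue  # headings are not sentences
        plain = " ".join(strip_markdown(block).split())
        for candidate in _SENTENCE_END.split(plain):
            if looks_english(candidate):
                sentences.append(candidate.strip())
    return sentences
```

The heuristic is cheap but coarse: short sentences are dropped, and text in languages that share stop words with English can slip through.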

6. Generate Embeddings

python batch_embed_sentences.py

Generates CLIP embeddings for all sentences
Generates PaintingCLIP embeddings (fine-tuned model)
Saves embeddings to CLIP_Embeddings/ and PaintingCLIP_Embeddings/
Uses helper: embed_sentence_with_clip.py
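
The batching logic around the encoder can be sketched without committing to a particular model API. In the example below, encode_batch stands in for whatever function wraps the CLIP or PaintingCLIP text encoder. That callable, the function names, the dict-based sentence format, and the L2 normalization are assumptions, not a description of embed_sentence_with_clip.py.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def embed_in_batches(sentences, encode_batch, batch_size=32, done_ids=()):
    """Embed sentences that have no embedding yet.

    sentences:    dict mapping sentence ID -> text
    encode_batch: callable taking a list of strings and returning one
                  vector (list of floats) per string; hypothetical stand-in
                  for the text encoder
    done_ids:     IDs to skip, so an interrupted run can resume
    """
    done = set(done_ids)
    todo = [(sid, text) for sid, text in sentences.items() if sid not in done]
    embeddings = {}
    for start in range(0, len(todo), batch_size):
        chunk = todo[start:start + batch_size]
        vectors = encode_batch([text for _, text in chunk])
        for (sid, _), vector in zip(chunk, vectors):
            embeddings[sid] = l2_normalize(vector)
    return embeddings
```

Larger batches make better use of a GPU but need more memory. Batch size is the main setting to tune here.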

Additional Scripts

Topic Analysis

python build_topics_json.py

Creates reverse index of OpenAlex topics to works
Generates topics.json from works.json data
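
A reverse index of this kind can be sketched in a few lines. The example assumes works.json maps each work ID to a record with a "topics" list whose entries are either plain strings or OpenAlex-style objects with a display_name. That structure and the function name are assumptions.

```python
from collections import defaultdict


def build_topic_index(works):
    """Invert {work_id: {"topics": [...]}} into {topic: [work_id, ...]}."""
    index = defaultdict(set)
    for work_id, work in works.items():
        for topic in work.get("topics") or []:
            # Accept both OpenAlex-style dicts and bare topic names.
            name = topic.get("display_name") if isinstance(topic, dict) else topic
            if name:
                index[name].add(work_id)
    # Sort topics and IDs so the output file is stable between runs.
    return {topic: sorted(ids) for topic, ids in sorted(index.items())}
```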

Painter List Generation

python generate_painter_list.py

Generates/updates the painter list for processing

Directory Structure

Data Directories

Artist-JSONs/ - OpenAlex query results per artist
PDF_Bucket/ - Downloaded PDF files organized by artist
Marker_Output/ - Converted Markdown files from PDFs
CLIP_Embeddings/ - Standard CLIP text embeddings
PaintingCLIP_Embeddings/ - Fine-tuned CLIP embeddings
Excel-Files/ - Excel outputs and working files
logs/ - Execution logs for debugging

Configuration & Models

PaintingCLIP/ - Fine-tuned CLIP adapter (LoRA weights)
Helper Scripts/ - Utility scripts for data cleaning and analysis
Archive/ - Previous versions and documentation
Scripts from Project/ - Legacy scripts (excluded from quality checks)

Data Files

artists.json - Artist metadata and associated work IDs
works.json - Work metadata including URLs, sentences, topics
sentences.json - Extracted sentences with embedding status
topics.json - Topic to work ID mapping
painters.xlsx - Master list of painters to process
paintings.xlsx - Painting metadata

Requirements

See requirements.txt for Python dependencies. Key requirements:

Python 3.8+
OpenAlex API access (free, requires email)
Wikidata SPARQL access
GPU recommended for embedding generation

Configuration

Set your email in scripts that query OpenAlex (required by their API)
Adjust batch sizes and concurrency settings based on your system
Configure logging levels in individual scripts
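
The scripts are configured by editing them directly. If you prefer to keep these settings in one place, a small configuration object is one option. The sketch below is purely illustrative: the ARTCONTEXT_* environment variable names, the PipelineConfig class, and the default values are hypothetical and do not exist in the repository.

```python
import logging
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    email: str               # contact address sent to OpenAlex
    batch_size: int = 32     # items processed per batch
    max_workers: int = 4     # concurrent downloads
    log_level: str = "INFO"


def load_config(env=None):
    """Read settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    email = env.get("ARTCONTEXT_EMAIL", "").strip()
    if "@" not in email:
        raise ValueError("Set ARTCONTEXT_EMAIL to a contact address for OpenAlex requests")
    return PipelineConfig(
        email=email,
        batch_size=int(env.get("ARTCONTEXT_BATCH_SIZE", 32)),
        max_workers=int(env.get("ARTCONTEXT_MAX_WORKERS", 4)),
        log_level=env.get("ARTCONTEXT_LOG_LEVEL", "INFO").upper(),
    )


def configure_logging(config):
    """Apply the configured level, falling back to INFO for unknown names."""
    logging.basicConfig(
        level=getattr(logging, config.log_level, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
```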

Usage Notes

The pipeline is designed to be resumable: it checks for existing data before reprocessing
Most scripts support filtering by artist name or work ID for selective processing
Embedding generation is computationally intensive, so consider running it on a GPU
The OpenAlex API has rate limits; the scripts include automatic retry logic
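
Retry logic of the kind mentioned above is commonly written as exponential backoff. The sketch below shows the pattern under stated assumptions: the RateLimited exception, the with_retries name, and the delay schedule are hypothetical, and the scripts in this repository may implement retries differently.

```python
import time


class RateLimited(Exception):
    """Raised by a request function when the API answers with HTTP 429."""


def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on RateLimited with exponential backoff.

    Delays are base_delay * 2**attempt: 1s, 2s, 4s, ... by default.
    The last failure is re-raised so the caller can log and move on.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing sleep as a parameter keeps the function easy to test without real waiting.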

Output Format

The pipeline produces structured JSON files that can be used for downstream analysis:

Sentence-level embeddings for semantic search
Work metadata for bibliometric analysis
Topic associations for thematic studies
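
As an example of downstream use, semantic search over sentence embeddings reduces to ranking by cosine similarity. The sketch below assumes the embeddings have been loaded into a dict mapping sentence ID to a list of floats. That in-memory format and the function names are assumptions, and for large collections a vectorized or indexed approach would be faster.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, embeddings, k=5):
    """Return the k (sentence_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), sid)
              for sid, vector in embeddings.items()]
    # Highest score first; ties broken by sentence ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sid, score) for score, sid in scored[:k]]
```

The query vector must come from the same encoder as the stored embeddings, otherwise the scores are not comparable.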
