DemocratizeESG is an open-source set of tools designed to evaluate the true environmental performance of corporations. By extracting environmental key performance indicators (eKPIs) from company reports, we aim to objectively evaluate and compare the environmental impact of large companies.
Our core contributions to the open-source community include:
- Report Collection: A dataset of 1,350 annual company reports (link).
- eKPI Definitions: 50 industry-agnostic eKPIs and over 70 industry-specific eKPIs (link).
- Extraction Pipeline: An automated, LLM-based information extraction pipeline to parse eKPIs from the reports (contained in this repository).
- Training/Test Data: 62 manually annotated reports used to train and test the pipeline (SQL file).
- Extracted Database: A complete database of all the information extracted from the report corpus (SQL file).
Climate Action 100+ is an investor-led initiative to ensure the world’s largest corporate greenhouse gas emitters take necessary action on climate change to mitigate financial risk and maximize the long-term value of assets.
We manually collected 1,350 annual reports from 170 CA100+ companies spanning the years 2020-2024. This collection is publicly available on our Google Drive. (Note: Reports were collected in mid-2024 and may not reflect the current composition of CA100+.)
To determine what information to extract, we synthesized the most commonly used metrics across the three pillars of ESG from:
- Governmental and NGO frameworks: TCFD, GRI, SASB, ESRS
- ESG rating agencies: CDP, LSEG, EcoVadis
- Scientific Literature: Colesanti Senni et al. (2024); Zou et al. (2025)
The resulting list of 49 industry-agnostic and 70+ industry-specific eKPIs can be found in this Google Sheet. We specifically focused on outcome- and impact-oriented metrics that are straightforward to visualize.
The pipeline loads the target documents and the desired eKPI definitions. Each report is loaded entirely into the context window of an LLM (currently Gemini 2.5 Flash), where carefully engineered prompts are used to extract the target data. The extracted information is then stored in a local MySQL database for further analysis and visualization.
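As a minimal illustration of the full-context idea, the sketch below builds one prompt over an entire report and parses a JSON answer into rows for the database. The prompt wording, JSON schema, and function names are assumptions; the repository's actual prompts are far more engineered and are submitted through the Gemini Batch API.

```python
import json

def build_prompt(report_text: str, ekpi_names: list[str]) -> str:
    """Assemble one full-context prompt (illustrative skeleton only)."""
    kpi_list = "\n".join(f"- {k}" for k in ekpi_names)
    return (
        "Extract the following eKPIs from the report below. "
        'Answer as JSON: {"<eKPI name>": {"value": ..., "unit": ..., "year": ...}}.\n'
        f"eKPIs:\n{kpi_list}\n\nReport:\n{report_text}"
    )

def parse_response(raw: str) -> list[tuple]:
    """Turn the model's JSON answer into (name, value, unit, year)
    rows suitable for a MySQL INSERT."""
    data = json.loads(raw)
    return [(name, rec["value"], rec["unit"], rec["year"])
            for name, rec in data.items()]

rows = parse_response(
    '{"Scope 1 GHG emissions": {"value": 1200, "unit": "ktCO2e", "year": 2023}}'
)
```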
🚀 Current Performance: The pipeline achieves an F1-Score of 93% on our test set.
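For reference, F1 is the harmonic mean of precision and recall over the extracted fields. The helper below shows the arithmetic; the counts in the example are hypothetical, chosen only to reproduce a 93% score, and do not reflect the actual test-set tallies.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts that would yield the reported score:
print(round(f1_score(93, 7, 7), 2))  # 0.93
```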
- Install Python 3.12.6
- Create and store a Gemini API key, following the Gemini documentation
- Install MySQL 8.0.43 and use `democratize_esg.sql` to generate all required tables
- Run `createGoogleAccessToken.py` to create a Google access token, enabling script access to the required Google Drive
To run the full extraction process, follow this workflow:
- Prepare Prompts: Run `Fullcontext_main.py`. This contains the most effective pipeline version ("Divide and Conquer"). Reports are downloaded from the Drive, uploaded to the Google Files API, and prompts are built and saved to a large `.jsonl` file.
- Submit Batch Job: Open `gemini_batch_management.ipynb` to submit the `.jsonl` file to the Gemini 2.5 Flash Batch API. You can query the status of the job here (expected runtime is ~20-30 minutes).
- Parse Results: Once the batch job finishes, use `gemini_batch_output_parsing.ipynb` to download the results and store them in your local MySQL database.
- Clean Data: Run `ConsolidateBatchResults.py` to access the local database and resolve any conflicting extracted information.
- Standardize Units: Finally, run `UnitConversion.py` to convert numerical values to a common format and standard units, enabling historical and cross-company comparisons.
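Unit standardization of the kind the last step performs can be sketched as a lookup of conversion factors into canonical units. The factors, unit names, and function below are illustrative assumptions, not the actual tables used by `UnitConversion.py`.

```python
# Conversion factors into assumed canonical units; these entries are
# illustrative, not the repository's actual conversion tables.
TO_CANONICAL = {
    ("ktCO2e", "tCO2e"): 1_000.0,
    ("MtCO2e", "tCO2e"): 1_000_000.0,
    ("GWh", "MWh"): 1_000.0,
}

def to_standard(value: float, unit: str, target: str) -> float:
    """Convert a value into the target (canonical) unit."""
    if unit == target:
        return value
    return value * TO_CANONICAL[(unit, target)]
```

For example, `to_standard(2, "ktCO2e", "tCO2e")` returns `2000.0`, so emissions reported in kilotonnes and tonnes become directly comparable across years and companies.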
The resulting database is now ready for data analysis! Check out the `Data_Analysis` folder for usage examples.
We are currently working on releasing the pipeline and database as an interactive website. The site will feature easy access to data visualization methods for deeper insights. Users will also be able to upload new reports into the pipeline, and the resulting extracted information will be continuously added to the database!

