DemocratizeESG is an open-source set of tools designed to evaluate the true environmental performance of corporations. By extracting environmental key performance indicators (eKPIs) from company reports, we aim to objectively evaluate and compare the environmental impact of large companies.
Our core contributions to the open-source community include:
- Report Collection: A dataset of 1,350 annual company reports (link).
- eKPI Definitions: 50 industry-agnostic eKPIs and over 70 industry-specific eKPIs (link).
- Extraction Pipeline: An automated, LLM-based information extraction pipeline to parse eKPIs from the reports (contained in this repository).
- Training/Test Data: 62 manually annotated reports used to train and test the pipeline (SQL file).
- Extracted Database: A complete database of all the information extracted from the report corpus (SQL file).
Climate Action 100+ is an investor-led initiative to ensure the world’s largest corporate greenhouse gas emitters take necessary action on climate change to mitigate financial risk and maximize the long-term value of assets.
We manually collected 1,350 annual reports from 170 CA100+ companies spanning the years 2020-2024. This collection is publicly available on our Google Drive. (Note: Reports were collected in mid-2024 and may not reflect the current composition of CA100+.)
To determine what information to extract, we synthesized the most commonly used metrics across the three pillars of ESG from:
- Governmental and NGO frameworks: TCFD, GRI, SASB, ESRS
- ESG rating agencies: CDP, LSEG, EcoVadis
- Scientific Literature: Colesanti Senni et al. (2024); Zou et al. (2025)
The resulting list of 49 industry-agnostic and 70+ industry-specific eKPIs can be found in this Google Sheet. We specifically focused on outcome- and impact-oriented metrics that are straightforward to visualize.
The pipeline loads the target documents and the desired eKPI definitions. Each report is loaded entirely into the context window of an LLM (currently Gemini 2.5 Flash), where carefully engineered prompts are used to extract the target data. The extracted information is then stored in a local MySQL database for further analysis and visualization.
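As a minimal illustration of the full-context idea, the sketch below builds one prompt over an entire report and parses a JSON answer into rows for the database. The prompt wording, JSON schema, and function names are assumptions; the repository's actual prompts are far more engineered and are submitted through the Gemini Batch API.

```python
import json

def build_prompt(report_text: str, ekpi_names: list[str]) -> str:
    """Assemble one full-context prompt (illustrative skeleton only)."""
    kpi_list = "\n".join(f"- {k}" for k in ekpi_names)
    return (
        "Extract the following eKPIs from the report below. "
        'Answer as JSON: {"<eKPI name>": {"value": ..., "unit": ..., "year": ...}}.\n'
        f"eKPIs:\n{kpi_list}\n\nReport:\n{report_text}"
    )

def parse_response(raw: str) -> list[tuple]:
    """Turn the model's JSON answer into (name, value, unit, year)
    rows suitable for a MySQL INSERT."""
    data = json.loads(raw)
    return [(name, rec["value"], rec["unit"], rec["year"])
            for name, rec in data.items()]

rows = parse_response(
    '{"Scope 1 GHG emissions": {"value": 1200, "unit": "ktCO2e", "year": 2023}}'
)
```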
🚀 Current Performance: The pipeline achieves an F1-Score of 93% on our test set.
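For reference, F1 is the harmonic mean of precision and recall over the extracted fields. The helper below shows the arithmetic; the counts in the example are hypothetical, chosen only to reproduce a 93% score, and do not reflect the actual test-set tallies.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts that would yield the reported score:
print(round(f1_score(93, 7, 7), 2))  # 0.93
```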
- Install Python 3.12.6
- Create and store a Gemini API key, following the Gemini documentation
- Install MySQL 8.0.43 and use `democratize_esg.sql` to generate all required tables
- Run `createGoogleAccessToken.py` to create a Google access token, enabling script access to the required Google Drive
To run the full extraction process, follow this workflow:
- Prepare Prompts: Run `Fullcontext_main.py`. This contains the most effective pipeline version ("Divide and Conquer"). Reports are downloaded from the Drive, uploaded to the Google Files API, and prompts are built and saved to a large `.jsonl` file.
- Submit Batch Job: Open `gemini_batch_management.ipynb` to submit the `.jsonl` file to the Gemini 2.5 Flash Batch API. You can query the status of the job here (expected runtime is ~20-30 minutes).
- Parse Results: Once the batch job finishes, use `gemini_batch_output_parsing.ipynb` to download the results and store them in your local MySQL database.
- Clean Data: Run `ConsolidateBatchResults.py` to access the local database and resolve any conflicting extracted information.
- Standardize Units: Finally, run `UnitConversion.py` to convert numerical values to a common format and standard units, enabling historical and cross-company comparisons.
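Unit standardization of the kind the last step performs can be sketched as a lookup of conversion factors into canonical units. The factors, unit names, and function below are illustrative assumptions, not the actual tables used by `UnitConversion.py`.

```python
# Conversion factors into assumed canonical units; these entries are
# illustrative, not the repository's actual conversion tables.
TO_CANONICAL = {
    ("ktCO2e", "tCO2e"): 1_000.0,
    ("MtCO2e", "tCO2e"): 1_000_000.0,
    ("GWh", "MWh"): 1_000.0,
}

def to_standard(value: float, unit: str, target: str) -> float:
    """Convert a value into the target (canonical) unit."""
    if unit == target:
        return value
    return value * TO_CANONICAL[(unit, target)]
```

For example, `to_standard(2, "ktCO2e", "tCO2e")` returns `2000.0`, so emissions reported in kilotonnes and tonnes become directly comparable across years and companies.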
The resulting database is now ready for data analysis! Check out the `Data_Analysis` folder for usage examples.
We are currently working on releasing the pipeline and database as an interactive website. The site will feature easy access to data visualization methods for deeper insights. Users will also be able to upload new reports into the pipeline, and the resulting extracted information will be continuously added to the database!

