This repository is inspired by the EDGAR CRAWLER repository, which works with 10-K filings. It has been modified to handle 8-K and 10-Q forms as well.
- Crawl and download financial reports for each publicly traded company, for specified years, through the `edgar_crawler.py` module.
- Extract and clean specific text sections, such as Risk Factors, MD&A, and others, through the `extract_items.py` module.
- Set up a virtual environment using `pipenv shell` or `python3 -m venv`
- Install all dependencies using `pip install -r requirements.txt`
- CIK numbers are available in `CIK.csv`
- `edgar_crawler.py` downloads all the filings for the given CIK numbers and duration
- `extract_items_8k.py` extracts items from all the 8-K forms
- `extract_items_10k.py` extracts items from all the 10-K forms
- `extract_items_10q.py` extracts items from all the 10-Q forms
- `INDICES` contains the EDGAR index files for the specified years
- Edit the `config.json` file: in `"filing_types": []`, specify the form type as `8-K`, `10-K`, or `10-Q`, depending on the type of form you wish to extract.
- Run `edgar_crawler.py`. This creates a CSV named `FILINGS_METADATA.csv` along with a folder named `RAW_FILINGS`.
- Run the corresponding extraction script, `extract_items_{type mentioned in the config.json file}.py`.
- The script creates a folder in the datasets directory corresponding to the form type.
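
The steps above can be sketched in Python as a small helper that writes the chosen form type into `config.json` and reports which extraction script to run next. This is only an illustrative sketch: the helper names (`extractor_script_for`, `set_filing_type`) are hypothetical, and it assumes that `"filing_types"` sits under a top-level `"edgar_crawler"` section of `config.json`, which may not match the actual layout of the file in this repository.

```python
import json
from pathlib import Path

# Script names as listed in this README, keyed by form type.
EXTRACTOR_SCRIPTS = {
    "8-K": "extract_items_8k.py",
    "10-K": "extract_items_10k.py",
    "10-Q": "extract_items_10q.py",
}


def extractor_script_for(filing_type: str) -> str:
    """Return the extraction script that matches a form type."""
    try:
        return EXTRACTOR_SCRIPTS[filing_type.upper()]
    except KeyError:
        raise ValueError(f"Unsupported filing type: {filing_type!r}") from None


def set_filing_type(config_path, filing_type, section="edgar_crawler"):
    """Write one form type into config.json and return the script to run.

    Assumption: "filing_types" lives under the top-level key given by
    `section`; change `section` if your config.json is laid out differently.
    """
    script = extractor_script_for(filing_type)  # also validates the type
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.setdefault(section, {})["filing_types"] = [filing_type.upper()]
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return script
```

For example, `set_filing_type("config.json", "10-Q")` would leave `["10-Q"]` in the config and return `extract_items_10q.py`, the script to run after `edgar_crawler.py` has finished.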
- Before running any script, you can edit the `config.json` file.
- Arguments for `edgar_crawler.py`, the module to download financial reports:
  - `--start_year XXXX`: the first year of the range to download.
  - `--end_year YYYY`: the last year of the range to download.
  - `--quarters`: the quarters that you want to download filings from (list). Default value is `[1, 2, 3, 4]`.
  - `--filing_types`: list of filing types to download. Default value is `['10-K', '10-Q', '8-K']`.
  - `--cik_tickers`: list of CIKs or tickers, or path of a file containing them, e.g. `[789019, "1018724", "AAPL", "TWTR"]`. In the case of a file, provide each CIK or ticker on a separate line. If this argument is not provided, the toolkit will download annual reports for all U.S. publicly traded companies.
  - `--user_agent`: the User-Agent that will be declared to SEC EDGAR.
  - `--raw_filings_folder`: the name of the folder where downloaded filings will be stored. Default value is `'RAW_FILINGS'`.
  - `--indices_folder`: the name of the folder where EDGAR TSV files will be stored. These are used to locate the annual reports. Default value is `'INDICES'`.
  - `--filings_metadata_file`: CSV filename to save metadata from the reports.
  - `--skip_present_indices`: whether to skip already downloaded EDGAR indices or download them nonetheless. Default value is `True`.
  - `remove_tables`: whether to remove tables containing mostly numerical (financial) data. This option mainly serves NLP research, where numerical tables are often not useful.
- Arguments for