civic-scraper

civic-scraper is a Python package that helps identify and download agendas, minutes and other file assets related to government meetings from local public agencies around the U.S.
License: Apache License, Version 2.0
This package currently contains library code that can be used in Python scripts for downloading metadata and files. In the future, we plan to add a command-line tool to simplify data acquisition for common use cases.
To use civic-scraper in your own Python projects, follow the steps below.

Note: At present, civic-scraper supports only websites using CivicPlus's Agenda Center, but in the future it will support several types of websites where local governments post information about public meetings.
Install the package from GitHub, for example using pip:
pip install git+https://github.com/biglocalnews/civic-scraper.git
Create an instance of CivicPlusSite by passing it the URL for an agency's CivicPlus Agenda Center site. Then call the scrape method.
# Example for Menlo Park, CA
from civic_scraper.scrapers import CivicPlusSite
url = 'https://ca-menlopark.civicplus.com/AgendaCenter'
site = CivicPlusSite(url)
assets = site.scrape()
By default, scrape returns metadata about meeting minutes, agendas and video recordings posted on the current day. Note, however, that it does not automatically download the assets. See below for details on how to download files and export metadata.
Scraping can be modified to capture assets from varying time ranges by calling scrape() with the optional start_date and/or end_date arguments. Their values must be strings of the form YYYY-MM-DD.
# Scrape info from January 1-30, 2020
assets = site.scrape(start_date='2020-01-01', end_date='2020-01-30')
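Since the date arguments are plain strings, the standard library's datetime module is a convenient way to build them. The snippet below is an optional convenience of our own, not part of civic-scraper (the site object is assumed from the earlier example):

```python
from datetime import date, timedelta

# Build YYYY-MM-DD strings covering the last 30 days.
end = date.today()
start = end - timedelta(days=30)

start_date = start.strftime('%Y-%m-%d')
end_date = end.strftime('%Y-%m-%d')

# assets = site.scrape(start_date=start_date, end_date=end_date)
```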
There are two ways to download all file assets discovered by a scrape.
# Scrape metadata first,
# then call download on the response
assets = site.scrape()
assets.download()
# or...
# Scrape and download simultaneously
site.scrape(download=True)
By default, the above steps will download all of the assets to the current working directory.
You can exercise more fine-grained control over the download directory and other variables as demonstrated below:
# Download meeting minutes that are 20MB
# or less in size to /tmp/assets
assets = site.scrape()
assets.download(
    target_dir='/tmp/assets',
    file_size=20,
    asset_list=['minutes']
)

# OR
site.scrape(
    download=True,
    target_dir='/tmp/assets',
    file_size=20,
    asset_list=['minutes']
)
Here are more details on the parameters mentioned above:

- target_dir (str) - Target download directory. The default is the current working directory.
- file_size (int) - Limit downloads to files of at most this size, in megabytes. The default is None, in which case all files will be downloaded.
- asset_list (list of str) - Limit downloads to one or more asset types. The default is to download all types of documents. Valid options:
  - agenda
  - minutes
  - audio
  - video
  - agenda_packet - The exhibits and ancillary documents attached to a meeting agenda.
  - captions - The transcript of a meeting recording.
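If you build asset_list values dynamically, a small sanity check against the valid options listed above can catch typos before they silently filter out every download. This helper is our own illustration, not part of the civic-scraper API:

```python
# Valid asset types, as documented above.
VALID_ASSET_TYPES = {
    'agenda', 'minutes', 'audio', 'video', 'agenda_packet', 'captions',
}

def check_asset_list(asset_list):
    """Raise ValueError if any entry is not a known asset type."""
    unknown = set(asset_list) - VALID_ASSET_TYPES
    if unknown:
        raise ValueError(f"Unknown asset type(s): {sorted(unknown)}")
    return list(asset_list)

# e.g. check_asset_list(['minutes', 'agenda']) passes;
# check_asset_list(['mintues']) raises ValueError.
```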
To bypass downloading of assets and instead generate a CSV of links to the assets and metadata about them:
output_file = '/tmp/asset_metadata.csv'
# Scrape, then write
assets = site.scrape()
assets.to_csv(output_file)
# OR
# Scrape and write all at once
site.scrape(csv_export=output_file)
You can choose to append to a pre-existing metadata file by using the append argument:
output_file = '/tmp/asset_metadata.csv'
# Scrape, then write
assets = site.scrape()
assets.to_csv(output_file, append=True)
# OR
# Scrape and write all at once
site.scrape(csv_export=output_file, append=True)
Here are more details on the above arguments:

- target_path (str) - Full path to the output file (required).
- append (bool) - Append to a pre-existing file if set to True. By default, a pre-existing file will be overwritten.
The generated file contains the following information:

- url (str) - The URL for an asset. Ex: https://ca-menlopark.civicplus.com/AgendaCenter/ViewFile/Agenda/_12082020-3549
- asset_name (str) - The title of an asset. Ex: City Council Regular Meeting
- committee_name (str) - The name of the committee that generated the asset. Ex: City Council
- place (str) - The name of the place associated with the asset, in lowercase with spaces and punctuation removed. Ex: menlopark
- state_or_province (str) - The lowercase two-letter abbreviation for the state or province associated with an asset. Ex: ca
- asset_type (str) - One of the following strings: agenda, minutes, audio, video, agenda_packet, captions
- meeting_date (str) - Date of the meeting in YYYY-MM-DD format, or blank if no meeting date is given.
- meeting_time (str) - Time of the meeting, or blank if no time is given.
- meeting_id (str) - The platform name, state_or_province and place, followed by a unique meeting ID assigned by the platform. Ex: civicplus_ca_menlopark_12082020-3549
- scraped_by (str) - The module and version that produced the asset, using semantic versioning. Ex: civicplus.py_1.0.0
- content_type (str) - The MIME type of the asset. Ex: application/pdf
- content_length (str) - The size of the asset in bytes.
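Once exported, the metadata file is an ordinary CSV and can be processed with the standard library. The sketch below parses a sample row using the column names listed above; the sample data is invented for illustration:

```python
import csv
import io

# A one-row sample mimicking the export format described above.
sample = io.StringIO(
    "url,asset_name,committee_name,place,state_or_province,asset_type,"
    "meeting_date,meeting_time,meeting_id,scraped_by,content_type,content_length\n"
    "https://ca-menlopark.civicplus.com/AgendaCenter/ViewFile/Agenda/_12082020-3549,"
    "City Council Regular Meeting,City Council,menlopark,ca,agenda,"
    "2020-12-08,,civicplus_ca_menlopark_12082020-3549,civicplus.py_1.0.0,"
    "application/pdf,1048576\n"
)

# Collect agenda URLs, keyed by meeting_id.
agendas = {
    row['meeting_id']: row['url']
    for row in csv.DictReader(sample)
    if row['asset_type'] == 'agenda'
}
```

The same pattern works for filtering on any other column, such as place or meeting_date.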
We welcome contributions of all varieties, from documentation to new scrapers to help with the underlying framework.
A great way to get started is to check out our Issues tracker to find work that needs doing. Or file an Issue proposing a feature you'd like to work on.
That way we can coordinate your efforts with that of other contributors and the core team.
Once you've found a way to contribute and are ready to dive in:
- Fork this repo.
- Write some snazzy code.
- Test. We have minimal test coverage, but we're always looking to add more. For code contributions (as opposed to documentation), we welcome added test coverage, although we'll accept code without tests. However, please at least run the existing tests and verify there are no regressions before sending us a Pull Request.
- Pull Request. Send us a Pull Request once your changes are ready for us to review and merge.
TK