6. Data (OR Configuring Input Data for New Study Regions?)
If you are only generating a policy checklist, refer to the Policy Checklist Data section.
When configuring new study region(s), some input data will be required. These data are stored in sub-folders within the process/data folder. Examples of required data include a population grid, an excerpt from OpenStreetMap, and a study region boundary (either an administrative boundary or an urban region from the Global Human Settlements Urban Centres Database). Other data are optional: for example, custom boundaries for aggregation (see example), GTFS transit feed data, or a completed policy checklist.
The kinds of data that can be configured are summarised in the table below. We have provided examples of each for Las Palmas, with the exception, at the time of writing, of a completed policy checklist.
Usage note | Data sub-folder | Purpose |
---|---|---|
Required | OpenStreetMap | an OpenStreetMap .pbf file with coverage of the region (and time) of interest; this could be an historical planet file, or a region-specific excerpt |
Required | population_grids | Population distribution raster grid or vector data with coverage of the urban region of interest. The GHS population grid (R2023) is recommended: for example, the 2020 Mollweide 100m grid tiles corresponding to your area of interest, saved and extracted to a folder like process/data/GHS/R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0, which may be specified in process/configuration/datasets.yml. Take care to select the correct Epoch for your analysis before downloading! |
Conditional | region_boundaries | Vector boundary for identifying study region (e.g. geopackage, geojson or shp). If a geopackage is used, a specific layer can optionally be specified and queried, as per this example |
Conditional | urban_regions | Global Human Settlements Layer Urban Centres database and/or administrative boundary for urban region of interest |
Optional | policy_review | Summarising results of a policy review analysis |
Optional | transit_feeds | Collections of zipped GTFS feeds to represent public transport service frequency |
Optional | other_custom_data | Other custom data, such as points of interest |
Specific paths to data are configurable in the configuration files. However, in general, it is recommended that input data be stored within subfolders of the data folder as per the provided example.
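As a convenience, the sub-folders listed in the table above could be created in one step. The following is a minimal Python sketch, not part of the software itself; the function name `create_data_subfolders` is hypothetical, and it assumes you pass the path to your project's data folder (for example process/data).

```python
from pathlib import Path

# Sub-folder names taken from the table above.
DATA_SUBFOLDERS = [
    "OpenStreetMap",
    "population_grids",
    "region_boundaries",
    "urban_regions",
    "policy_review",
    "transit_feeds",
    "other_custom_data",
]


def create_data_subfolders(data_root):
    """Create any missing data sub-folders and return their paths.

    Hypothetical helper for illustration; existing folders are left untouched.
    """
    root = Path(data_root)
    created = []
    for name in DATA_SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For example, `create_data_subfolders("process/data")` run from the project root would ensure each of the folders exists before data are downloaded into them.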
OpenStreetMap data are used to represent features of interest within regions of interest.
This can be retrieved in .pbf (recommended; smaller file size) or .osm format from sites including:
- https://download.geofabrik.de/
- http://download.openstreetmap.fr/extracts/
- https://planet.openstreetmap.org/pbf/ (whole of planet data archives; very large file sizes, but can be useful for retrieving historical data)
It is recommended that a suffix indicating the publication date be added to the name of downloaded files, as a record of the time point at which the excerpt is considered representative of the region of interest, e.g. "las_palmas-latest.osm_20230210.pbf" or "oceania_yyyymmdd.pbf", where yyyy is the year, mm is the 2-digit numerical month, and dd is the 2-digit numerical day.
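The naming convention can be applied consistently with a small helper. This is an illustrative Python sketch only; the function name `add_date_suffix` is hypothetical, and the publication date is assumed to be known to you (it is not read from the file).

```python
from datetime import date
from pathlib import Path


def add_date_suffix(filename, published):
    """Insert a _yyyymmdd suffix before the final extension of a file name."""
    path = Path(filename)
    # path.stem keeps everything before the last extension, e.g. "las_palmas-latest.osm"
    return f"{path.stem}_{published:%Y%m%d}{path.suffix}"


print(add_date_suffix("las_palmas-latest.osm.pbf", date(2023, 2, 10)))
# las_palmas-latest.osm_20230210.pbf
```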
The main considerations are that the excerpt is as small as possible while ensuring that it has complete coverage of the region(s) of interest (and about 1600 metres additional beyond the boundary, or as otherwise configured). Using a smaller rather than a larger file will speed up processing.
For example, for a project considering multiple cities in Spain, the researchers could download an excerpt for the whole of Spain, or for specific sub-regions like Catalunya, as required to ensure that the region of interest is encompassed within the extract. Spain also contains regions outside Europe, and excerpts from Geofabrik (the first example link above) are grouped by continent, so some care should be taken. For example, Las Palmas de Gran Canaria is found under the Canary Islands in Africa on Geofabrik, or a smaller PBF specifically for Las Palmas can be retrieved from download.openstreetmap.fr under 'africa/spain/canarias'. Our software provides example data for Las Palmas; to reduce file size, we prepared a clipped portion of the data around this city using only the minimum area required.
Population grid or vector data are used to represent the population distribution within regions of interest.
We recommend configuring the R2023A or a more recent release of the population data from the Global Human Settlements Layer (GHSL) project, with a time point of 2020 (or as otherwise appropriate for your project needs, bearing in mind the limitations of the GHSL population model) and using the Mollweide (equal area, 100m) projection. Note: take care when retrieving the GHSL population data to select the correct Epoch! The default at the time of writing is a population projection for 2030; however, the most relevant dataset is currently the 2020 estimates, so make sure to select 2020.
A whole-of-planet population dataset for 2020 can be downloaded from the following link: https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2022A/GHS_POP_E2020_GLOBE_R2022A_54009_100/V1-0/GHS_POP_E2020_GLOBE_R2022A_54009_100_V1_0.zip
This file is approximately 5 GB in size, so if you are only analysing one or a few cities, it may be worth identifying the specific tiles that relate to your study region(s) using the interactive map downloader provided by GHSL and storing them as suggested above.
The report describing this data is located here.
The citation for the data is: Schiavina M., Freire S., MacManus K. (2022): GHS-POP R2022A - GHS population grid multitemporal (1975-2030). European Commission, Joint Research Centre (JRC). https://doi.org/10.2905/D6D86A90-4351-4508-99C1-CB074B022C4A
Optionally, official population data can be used if this is judged to be more accurate or meaningful for your city's analysis and for interpretation of results by your intended audience. For example, to use the official 1km Australian population grid for 2021 (instead of the 100m grid estimates from the GHS-POP R2022A dataset for 2020 or 2025), this could be specified as per the following code block:
population:
  name: "Australian Population Grid 2021 (ABS, 2023)"
  data_dir: population_grids/Australian_Population_Grid_2021_in_TIFF_format
  data_type: raster:Int64
  resolution: 1km
  raster_band: 1
  raster_nodata:
  pop_min_threshold: 5
  crs_name: GDA94 / Australian Albers
  crs_standard: EPSG
  crs_srid: 3577
  source_url: https://www.abs.gov.au/statistics/people/population/regional-population/2021/GEOTIFF.zip
  year_published: 2022
  year_target: 2021
  date_acquired: 20230302
  licence: CC BY 4.0
  citation: "Australian Bureau of Statistics (2022), Australian Population Grid 2021. https://www.abs.gov.au/statistics/people/population/regional-population/2021"
Optionally, population estimates from a vector data source can also be configured, for example:
population:
  alias: catalunya_2021
  name: "Població de Catalunya georeferenciada a 1 de gener de 2021"
  data_dir: population_grids/gridpoblacio01012021/gridpoblacio_01012021.shp
  vector_population_data_field: TOTAL
  population_denominator: TOTAL
  crs_name: ETRS89 / UTM zone 31N
  crs_standard: ESRI
  crs_srid: 25831
  source_url: https://www.idescat.cat/serveis/biblioteca/docs/bib/publicacions/gridpoblacio01012021.zip
  provider: Institut d’Estadística de Catalunya
  year_published: 2023
  year_target: 2021
  date_acquired: 20230608
  licence: CC BY 4.0
  licence_url: https://creativecommons.org/licenses/by/4.0/deed.ast
  data_type: vector
  pop_min_threshold: 1
  # urban sample points intersecting grid cells with an estimated population less than this will be excluded from the analysis
  citation: "Població De Catalunya Georeferenciada a 1 De Gener De .. Barcelona: Generalitat de Catalunya. Institut d'Estadística de Catalunya, 2016. https://biblio.idescat.cat/publicacions/Record/21104"
The vector data in the above example contain multiple fields with estimates by age group, which opens up additional possibilities for analysing indicators for specific population sub-groups (in this case, by age). To do this, one could change vector_population_data_field to the field containing the population of interest, e.g.
vector_population_data_field: P_15_64
In this way, age-specific indicators could be calculated and compared. Indicators based on overall population density use the estimates from the population_denominator field; hence, if you are interested in the overall population for your indicators, vector_population_data_field and population_denominator should share the same value.
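To illustrate why the choice of population field matters, the following Python sketch compares a population-weighted average of an indicator using the total population and using an age-specific field. It is a simplified illustration under stated assumptions, not the software's own calculation: the cell values and the `indicator` key are made up, and `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(cells, value_field, weight_field):
    """Return the mean of `value_field` weighted by `weight_field`, or None if all weights are zero."""
    total_weight = sum(cell[weight_field] for cell in cells)
    if total_weight == 0:
        return None
    weighted_sum = sum(cell[value_field] * cell[weight_field] for cell in cells)
    return weighted_sum / total_weight


# Made-up grid cells: TOTAL and P_15_64 mirror the field names in the example above.
cells = [
    {"TOTAL": 100, "P_15_64": 60, "indicator": 1.0},
    {"TOTAL": 300, "P_15_64": 240, "indicator": 0.5},
]

overall = weighted_mean(cells, "indicator", "TOTAL")
working_age = weighted_mean(cells, "indicator", "P_15_64")
print(overall, working_age)
```

Because the two population fields are distributed differently across cells, the same indicator surface yields different summary values for each sub-group.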
Choice of the population data grid is an important methodological choice for analysts as it determines the resolution at which results are output (for example a choice between 100m estimated population, or 1000m official population statistic, or administrative areas of varying dimensions). As such, it is good to know that this is an option available to ensure your indicator results serve their intended purpose and audience.
Region boundary data are used to identify the study region(s) of interest. These may be retrievable from a government open data portal, or the UN OCHA Humanitarian Data Exchange.
Additional custom region boundaries can be used for custom aggregations, as per the example:
###########
## Optional custom aggregation to additional areas of interest (e.g. neighbourhoods, suburbs, specific developments):
custom_aggregations:
  ## Name for this aggregation layer
  ## The name is followed by a colon, indicating that a list of details follows
  school_districts_grid_pop:
    ## path to data relative to the project data folder
    data: "region_boundaries/Example/Las Palmas excerpt- gobcan_educacion_areainfluenciacentrosecundaria.geojson"
    ## The field used as a unique identifier
    id: 'Codigo'
    ## A list of column field names to be retained
    keep_columns: Denominaci, cod_postal
    ## The indicator layer to be aggregated ("point" or "grid")
    ## Aggregation is based on the average of intersecting results
    ## unless the agg_distance parameter is defined (see alternative example below)
    aggregation_source: grid
    ## The variable used for weighting (e.g. 'pop_est' for population when using the grid; leave blank or "false" if using sample points)
    weight: pop_est
    note: "Example of aggregating indicators for high school catchment districts within Las Palmas, using the intersection with the population grid and taking the population weighted average of indicators. Boundary data was derived from data sourced from the open data portal of the Gobierno de Canarias under CC BY 4.0 licence terms: https://opendata.sitcan.es/dataset/centros-educativos/resource/ea650255-c6ea-48c1-84e8-547735624017 (last updated 31 May 2023)."
  ## an example for aggregating for buildings represented in OpenStreetMap
  buildings_osm_30m:
    data: "OSM:building is not NULL"
    keep_columns: building
    ## Distance within metres to use for taking average when aggregating
    ## (see note)
    aggregate_within_distance: 30
    aggregation_source: point
    note: "Example of aggregating using buildings extracted from the configured OpenStreetMap data, taking the average of sample point estimates taken along the pedestrian network within 30m. This has been done because the point indicators were sampled along the pedestrian network and are therefore unlikely to intersect with buildings. By taking the average of points within some reasonable distance, the result is like a moving window average that should provide a reasonable representation of the immediate neighbourhood milieu surrounding the building."
###########
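The distance-based aggregation described in the buildings example can be pictured as a moving window average. The Python sketch below is a simplified illustration, not the software's implementation: it assumes each building is reduced to a single (x, y) location in a projected coordinate system with units of metres, whereas the software works with the building features themselves, and the function name is hypothetical.

```python
from math import hypot


def mean_within_distance(target, points, distance):
    """Average the values of sample points within `distance` metres of a target location.

    `target` is an (x, y) tuple and `points` is a list of (x, y, value) tuples.
    Returns None when no sample point is close enough.
    """
    nearby = [
        value
        for x, y, value in points
        if hypot(x - target[0], y - target[1]) <= distance
    ]
    if not nearby:
        return None
    return sum(nearby) / len(nearby)


# Made-up sample points along a street network: (x, y, indicator value)
sample_points = [(0, 10, 0.2), (20, 0, 0.4), (100, 100, 0.9)]
print(mean_within_distance((0, 0), sample_points, 30))
```

Only the first two sample points fall within 30 m of the target, so the distant third point does not influence the result.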
Urban region data are used to optionally identify and restrict the analysis to the urban portion of a study region, and optionally can also be used to identify an urban region of interest (e.g. an urban agglomeration surrounding a major city). Additionally, city-level urban covariates contained within the database may be linked with the final estimates (e.g. air pollution estimates).
This is done using the Global Human Settlements Urban Centres Database, for example GHS_STAT_UCDB2015MT_GLOBE_R2019A_V1_2.gpkg (or a more recent release, if available), which may be retrieved from https://ghsl.jrc.ec.europa.eu/download.php?ds=ucdb.
The dataset linked above represents urban centres in 2015 and may be cited as: Florczyk A., Corbane C., Schiavina M., Pesaresi M., Maffenini L., Melchiorri, M., Politis P., Sabo F., Freire S., Ehrlich D., Kemper T., Tommasi P., Airaghi D., Zanchetta L. (2019) GHS Urban Centre Database 2015, multitemporal and multidimensional attributes, R2019A. European Commission, Joint Research Centre (JRC). PID: https://data.jrc.ec.europa.eu/dataset/53473144-b88c-44bc-b4a3-4583ed1f547e
The global indicators software has been designed to report on both policy and spatial indicator results. Once the policy checklist has been completed for a city, save it in the project data folder (e.g. within data/policy_review). Using the GHSCI web app interface, you can then select this completed file to view, query and generate a PDF summary report.
Collections of General Transit Feed Specification (GTFS) feeds are used to represent public transport service frequency. See the GTFS reference for details on what is required for GTFS feeds. More than one transit operator may operate in the region of interest, so it is important to aim to have full coverage of both the region and the operators, without overlap of services, to represent the service frequency of public transport stops in a city.
Transit feed data may be available from Government open data portals, or otherwise from aggregator sites such as https://transitfeeds.com/ and https://www.transit.land/.
We recommend that zipped feeds be stored in a region-specific GTFS parent folder, e.g. "Spain - Valencia", that contains one or more zipped GTFS feeds, e.g. gtfs_spain_valencia_EMT_20190403.zip and gtfs_spain_valencia_MetroValencia_20190403.zip. The parent folder and specific feeds are configured in the region configuration file. For example,
## Optional set-up for General Transit Feed Specification (GTFS) transit data.
## GTFS feed data is used to evaluate access to public transport stops with regular weekday daytime service
## For cities with no GTFS feeds identified, this may be left commented out.
gtfs_feeds:
  ## City-specific parent folder in the 'process/data/transit_feeds' directory
  folder: Example
  ## list of zipped GTFS feeds saved in the above folder
  gtfs_es_las_palmas_de_gran_canaria_guaguas_20230222.zip:
    ## Name of agency that published this data
    gtfs_provider: Guaguas
    ## Year the data was published
    gtfs_year: 2023
    ## Source URL for the data
    gtfs_url: http://www.guaguas.com/transit/google_transit.zip
    ## The start date of a representative period for analysis
    ## (outside school holidays and extreme weather events), e.g. Spring/Summer
    ## for Northern Hemisphere: 20230405
    ## for Southern Hemisphere: 20231008
    start_date_mmdd: 20230405
    ## The end date of a representative period for analysis
    ## (outside school holidays and extreme weather events), e.g. Spring/Summer
    ## for Northern Hemisphere: 20230605
    ## for Southern Hemisphere: 20231205
    end_date_mmdd: 20230605
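Before running an analysis, it can be worth checking that the configured start and end dates are valid yyyymmdd values and in the right order. The following Python sketch shows one way to do this; the helper names are hypothetical and the check is not part of the software.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Parse a yyyymmdd value (int or str) into a date, raising ValueError if invalid."""
    text = str(value)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected an 8-digit yyyymmdd value, got {value!r}")
    return datetime.strptime(text, "%Y%m%d").date()


def check_analysis_period(start, end):
    """Validate the analysis period and return its length in days."""
    start_date = parse_yyyymmdd(start)
    end_date = parse_yyyymmdd(end)
    if end_date < start_date:
        raise ValueError(f"end date {end} precedes start date {start}")
    return (end_date - start_date).days


# Dates from the example configuration above
print(check_analysis_period(20230405, 20230605))  # 61
```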
Some GTFS feeds don't adhere to the reference standard, and some customisation is possible to accommodate the mapping of alternate route_types and agency_id values. The defaults for these fields can be seen in the datasets.yml file here. These can be overridden for each mode. For example, the feed below describes service schedules for a semi-Metro that used a route type of 400; this was mapped as a value for the 'Tram' mode to allow analysis to progress:
gtfs_feeds:
  folder: Marc issues/Malaga
  20230519_130136_Metro_Malaga:
    gtfs_provider: Metro de Málaga
    gtfs_year: 2023
    start_date_mmdd: 20230519
    end_date_mmdd: 20230831
    modes:
      Tram: {'route_types': [400]}
Multiple values can also be specified, e.g.,
    modes:
      Bus: {'route_types': [700,712,714]}
However, in most cases, the route_type values for a feed (found in the routes.txt file) should correspond to the defaults of the specification, in which case the modes parameter can be omitted as per the example configuration. This will draw upon the default settings.
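The idea of default mode mappings with feed-specific overrides can be sketched in Python as follows. This is an assumption-laden illustration: the defaults shown are only the four basic GTFS route_type codes (0 tram, 1 metro, 2 rail, 3 bus) rather than the software's actual defaults in datasets.yml, the function names are hypothetical, and it assumes that an override replaces (rather than extends) a mode's default route types.

```python
# Illustrative defaults only, based on the basic GTFS route_type codes;
# the defaults actually used by the software are defined in datasets.yml.
DEFAULT_MODES = {
    "Tram": [0],
    "Metro": [1],
    "Rail": [2],
    "Bus": [3],
}


def resolve_modes(overrides=None):
    """Return a mode -> route_types mapping with feed-specific overrides applied."""
    modes = {mode: list(types) for mode, types in DEFAULT_MODES.items()}
    for mode, settings in (overrides or {}).items():
        # Assumption: an override replaces the default list for that mode
        modes[mode] = list(settings["route_types"])
    return modes


def mode_for_route_type(route_type, modes):
    """Return the first mode whose route_types include `route_type`, else None."""
    for mode, route_types in modes.items():
        if route_type in route_types:
            return mode
    return None


# Override mirroring the semi-Metro example above
malaga_modes = resolve_modes({"Tram": {"route_types": [400]}})
print(mode_for_route_type(400, malaga_modes))  # Tram
```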
Generated resources are saved to study region-specific codename sub-folders within the data/_study_region_outputs directory.
The following provides an indicative list of the contents of this folder (italics indicate a word that will vary depending on configuration):
Item | Type | Description |
---|---|---|
_web reports | Folder | Contains generated policy and spatial indicator reports, optionally in multiple languages |
figures | Folder | Contains generated maps and figures |
__region name__codename_processing_log.txt | text file | A text file that is progressively appended to with the screen outputs for analyses that are not otherwise displayed. This contains a record of processing, and is useful when debugging if something has gone awry with a particular configuration or supplied data |
_parameters.yml | text file | A text file containing records of the most recent configuration analysed (on detection of a new version, older versions will be retained with a suffix indicating the date at which it was current) |
analysis_report_yyyy-mm-dd_hhmm.pdf | PDF file | A PDF report summarising approach taken for analysis of the core set of spatial indicators and generation of associated resources |
compare_reference codename_comparison_codename_yyyy-mm-dd_hhmm.csv | CSV file | A CSV spreadsheet containing a comparison of summary indicator results for a reference study region and a comparison study region (generated as a result of running the compare process) |
codename_1600m_buffer.gpkg | Geopackage file | A geopackage containing derived study region features of interest used in analyses, and including grid and overall summary results for this region |
codenameindicators{resolution}_yyyy.csv | CSV file | Grid or small area summary results of indicator analysis for this region |
codename_indicators_region.csv | CSV file | Overall summary results of indicator analysis for this region |
codename_metadata.xml | XML file | XML metadata (ISO19115) |
codename_metadata.yml | YML file | YML metadata |
output_data_dictionary.csv | CSV file | CSV data dictionary |
output_data_dictionary.xlsx | XLSX file | Formatted Excel data dictionary |
codename_osm_yyyymmdd.pbf | PBF file | An excerpt from OpenStreetMap for this buffered study region as configured |
poly_codename.poly | text file | A polygon boundary file; this is generated for the buffered urban region of interest as per configuration in regions/codename.yml, and is used to excerpt a portion of OpenStreetMap for this region from the configured input data |
population_resolution_codename_project epsg code.tif | TIF file | A population raster for this buffered study region, excerpted from the input data, in the project's coordinate reference system |
population_resolution_codename_source crs.tif | TIF file | A population raster for this buffered study region, excerpted from the input data, in the coordinate reference system of the input population data (e.g. Mollweide, in the case of the recommended GHS-POP data) |