Commit f9ceb3f

Updates to documentation
1 parent d177056 commit f9ceb3f

7 files changed

+36342
-15081
lines changed

01a_extract-sql copy.ipynb

+91
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Extracting the data locally\n",
8+
"\n",
9+
"Statistically analysing the entire EUBUCCO database is too data-intensive for most local machines.\n",
10+
"\n",
11+
"To make the analysis possible on a local machine, this notebook downloads a semi-random sample of the database (currently 10%), restricted to a reduced set of columns, in chunks and writes them to parquet files. Note that the dbmods module must be updated with the server's current IP address, which changes often."
12+
]
13+
},
14+
{
15+
"cell_type": "code",
16+
"execution_count": 3,
17+
"metadata": {},
18+
"outputs": [],
19+
"source": [
20+
"from src import dbmods"
21+
]
22+
},
23+
{
24+
"cell_type": "code",
25+
"execution_count": 4,
26+
"metadata": {},
27+
"outputs": [
28+
{
29+
"name": "stdout",
30+
"output_type": "stream",
31+
"text": [
32+
"Server connected.\n",
33+
"{'database': 'eubucco', 'user': 'readonly', 'password': 'readonly', 'host': 'localhost', 'port': 49690, 'connect_timeout': 10}\n"
34+
]
35+
},
36+
{
37+
"name": "stderr",
38+
"output_type": "stream",
39+
"text": [
40+
"2023-11-12 19:28:33,046| ERROR | Could not establish connection from local ('127.0.0.1', 49690) to remote ('192.168.48.2', 5432) side of the tunnel: open new channel ssh error: Timeout opening channel.\n"
41+
]
42+
},
43+
{
44+
"ename": "OperationalError",
45+
"evalue": "connection to server at \"localhost\" (::1), port 49690 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (127.0.0.1), port 49690 failed: server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n",
46+
"output_type": "error",
47+
"traceback": [
48+
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
49+
"\u001b[0;31mOperationalError\u001b[0m Traceback (most recent call last)",
50+
"\u001b[1;32m/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql copy.ipynb Cell 3\u001b[0m line \u001b[0;36m8\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=0'>1</a>\u001b[0m query \u001b[39m=\u001b[39m \u001b[39m'''\u001b[39m\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=1'>2</a>\u001b[0m \u001b[39m select db.id, db.height, db.age, db.geometry\u001b[39m\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=2'>3</a>\u001b[0m \u001b[39m from data_building db\u001b[39m\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=3'>4</a>\u001b[0m \u001b[39m tablesample system (10)\u001b[39m\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=4'>5</a>\u001b[0m \u001b[39m repeatable (22);\u001b[39m\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=5'>6</a>\u001b[0m \u001b[39m \u001b[39m\u001b[39m'''\u001b[39m\n\u001b[0;32m----> <a href='vscode-notebook-cell:/Users/chrishedemann/coding/eubucco_data_quality/01a_extract-sql%20copy.ipynb#W2sZmlsZQ%3D%3D?line=7'>8</a>\u001b[0m dbmods\u001b[39m.\u001b[39;49mSQL_to_parquet(query,fpath\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39m./data/parquet/sample_10percent_\u001b[39;49m\u001b[39m\"\u001b[39;49m)\n",
51+
"File \u001b[0;32m~/coding/eubucco_data_quality/src/dbmods.py:44\u001b[0m, in \u001b[0;36mSQL_to_parquet\u001b[0;34m(query, chunksize, fpath, chunk_per_file)\u001b[0m\n\u001b[1;32m 41\u001b[0m \u001b[39mprint\u001b[39m(params)\n\u001b[1;32m 43\u001b[0m \u001b[39m# 2. connect to database and start query iterator\u001b[39;00m\n\u001b[0;32m---> 44\u001b[0m conn \u001b[39m=\u001b[39m psycopg2\u001b[39m.\u001b[39;49mconnect(\u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mparams)\n\u001b[1;32m 45\u001b[0m conn\u001b[39m.\u001b[39mset_session(readonly\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m)\n\u001b[1;32m 47\u001b[0m data \u001b[39m=\u001b[39m gpd\u001b[39m.\u001b[39mread_postgis(\n\u001b[1;32m 48\u001b[0m sql\u001b[39m=\u001b[39mquery,\n\u001b[1;32m 49\u001b[0m con\u001b[39m=\u001b[39mconn,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 53\u001b[0m chunksize\u001b[39m=\u001b[39mchunksize,\n\u001b[1;32m 54\u001b[0m )\n",
52+
"File \u001b[0;32m~/coding/eubucco_data_quality/.venv/lib/python3.11/site-packages/psycopg2/__init__.py:122\u001b[0m, in \u001b[0;36mconnect\u001b[0;34m(dsn, connection_factory, cursor_factory, **kwargs)\u001b[0m\n\u001b[1;32m 119\u001b[0m kwasync[\u001b[39m'\u001b[39m\u001b[39masync_\u001b[39m\u001b[39m'\u001b[39m] \u001b[39m=\u001b[39m kwargs\u001b[39m.\u001b[39mpop(\u001b[39m'\u001b[39m\u001b[39masync_\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m 121\u001b[0m dsn \u001b[39m=\u001b[39m _ext\u001b[39m.\u001b[39mmake_dsn(dsn, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[0;32m--> 122\u001b[0m conn \u001b[39m=\u001b[39m _connect(dsn, connection_factory\u001b[39m=\u001b[39;49mconnection_factory, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwasync)\n\u001b[1;32m 123\u001b[0m \u001b[39mif\u001b[39;00m cursor_factory \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m 124\u001b[0m conn\u001b[39m.\u001b[39mcursor_factory \u001b[39m=\u001b[39m cursor_factory\n",
53+
"\u001b[0;31mOperationalError\u001b[0m: connection to server at \"localhost\" (::1), port 49690 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (127.0.0.1), port 49690 failed: server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n"
54+
]
55+
}
56+
],
57+
"source": [
58+
"\n",
59+
"query = '''\n",
60+
" select db.id, db.height, db.age, db.geometry\n",
61+
" from data_building db\n",
62+
" tablesample system (10)\n",
63+
" repeatable (22);\n",
64+
" '''\n",
65+
"\n",
66+
"dbmods.SQL_to_parquet(query,fpath=\"./data/parquet/sample_10percent_\")\n"
67+
]
68+
}
69+
],
70+
"metadata": {
71+
"kernelspec": {
72+
"display_name": "Python 3 (ipykernel)",
73+
"language": "python",
74+
"name": "python3"
75+
},
76+
"language_info": {
77+
"codemirror_mode": {
78+
"name": "ipython",
79+
"version": 3
80+
},
81+
"file_extension": ".py",
82+
"mimetype": "text/x-python",
83+
"name": "python",
84+
"nbconvert_exporter": "python",
85+
"pygments_lexer": "ipython3",
86+
"version": "3.11.3"
87+
}
88+
},
89+
"nbformat": 4,
90+
"nbformat_minor": 4
91+
}
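
The first notebook notes that `dbmods` has to be edited whenever the server's IP address changes. One way to avoid touching the module might be to read the connection parameters from environment variables, which `python-dotenv` (added to `requirements.txt` in this commit) could populate from a `.env` file. The following is a minimal sketch under that assumption, not the actual `dbmods` code: the function name `connection_params` and the `EUBUCCO_DB_*` variable names are hypothetical, and the keys only mirror the parameter dictionary printed in the notebook output.

```python
import os


def connection_params(env=None):
    """Build a psycopg2-style parameter dict from environment variables.

    Hypothetical helper: the EUBUCCO_DB_* names are assumptions made for
    this sketch and are not defined anywhere in the repository.
    """
    env = os.environ if env is None else env
    return {
        "database": env.get("EUBUCCO_DB_NAME", "eubucco"),
        "user": env.get("EUBUCCO_DB_USER", "readonly"),
        "password": env.get("EUBUCCO_DB_PASSWORD", ""),
        # The host is the value that changes often, so it is read at call time.
        "host": env.get("EUBUCCO_DB_HOST", "localhost"),
        "port": int(env.get("EUBUCCO_DB_PORT", "5432")),
        "connect_timeout": int(env.get("EUBUCCO_DB_TIMEOUT", "10")),
    }
```

The resulting dictionary has the same shape as the `params` passed to `psycopg2.connect(**params)` in the traceback above, so only the environment would need updating when the address changes.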

01b_repartition-parquet.ipynb

+13-2
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,23 @@
11
{
22
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"## Repartition the parquet data\n",
8+
"\n",
9+
"The data is supposed to be downloaded from the database in equal-sized chunks, but this doesn't always work out as expected. This notebook repartitions the data, aiming for around 300 MB per parquet file.\n",
10+
"\n",
11+
"NB: Afterwards, manually replace the old parquet folder with the repartitioned folder."
12+
]
13+
},
314
{
415
"cell_type": "code",
516
"execution_count": 6,
617
"metadata": {},
718
"outputs": [],
819
"source": [
9-
"import dask_geopandas as dgp\n"
20+
"import dask_geopandas as dgp"
1021
]
1122
},
1223
{
@@ -25,7 +36,7 @@
2536
"metadata": {},
2637
"outputs": [],
2738
"source": [
28-
"dgp.to_parquet(df, \"./data/parquet2/\")"
39+
"dgp.to_parquet(df, \"./data/parquet_repartitioned/\")"
2940
]
3041
}
3142
],
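
The repartition notebook aims for roughly 300 MB per parquet file. A minimal sketch of how a partition count could be derived from the size of the downloaded files is shown below. The helper name `target_partitions` is hypothetical, on-disk parquet size is only a rough proxy for in-memory size, and the notebook's own calculation is not visible in this diff.

```python
import math
from pathlib import Path

TARGET_BYTES = 300 * 1024 ** 2  # ~300 MB per partition, as stated in the notebook


def target_partitions(folder, target_bytes=TARGET_BYTES):
    """Return how many partitions of about target_bytes cover all parquet files in folder."""
    # Sum the on-disk size of every parquet file directly inside the folder.
    total = sum(f.stat().st_size for f in Path(folder).glob("*.parquet"))
    # Always return at least one partition, even for an empty folder.
    return max(1, math.ceil(total / target_bytes))
```

Under these assumptions, the result could be passed to the dask `repartition(npartitions=...)` method on the loaded frame before it is written out with `dgp.to_parquet`.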

02_error-analysis-HEIGHT.ipynb

+36,087-15,025
Large diffs are not rendered by default.

03_error-analysis-ELONGATION-AREA.ipynb

+111-51
Large diffs are not rendered by default.

04_bretagne-case-study.ipynb

+15-1
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,15 @@
11
{
22
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Bretagne case study\n",
8+
"\n",
9+
"This notebook investigates whether building height, calculated from the eaves to the ground, can be replaced with another height estimate from parameters in the raw data.\n",
10+
"\n"
11+
]
12+
},
313
{
414
"cell_type": "code",
515
"execution_count": 13,
@@ -15,7 +25,11 @@
1525
"\n",
1626
"import dask_geopandas as dgp\n",
1727
"\n",
18-
"plt.style.use('../styles/matplotlib-stylesheets/pitayasmoothie-dark.mplstyle')"
28+
"# Plot style, thanks to https://github.com/dhaitz\n",
29+
"plt.style.use('../styles/matplotlib-stylesheets/pitayasmoothie-dark.mplstyle')\n",
30+
"\n",
31+
"# Alternative\n",
32+
"# plt.style.use('seaborn-v0_8-pastel')"
1933
]
2034
},
2135
{

README.md

+16-2
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,16 @@
1-
# eubucco_data_quality
2-
Some investigations into existing data quality in the EUBUCCO database, including height, area and elongation factors.
1+
# EUBUCCO height-data quality
2+
3+
## Overview
4+
This repo investigates height data within the EUBUCCO database, with the aim of removing buildings with bad heights to improve energy models derived from the data. The following table describes the notebooks and the steps taken in the analysis.
5+
6+
7+
| Notebook | Purpose |
8+
|---- |---- |
9+
| [01a_extract-sql.ipynb](./01a_extract-sql.ipynb)| Extracts a semi-random 10% sample of the EUBUCCO database in chunks and stores the chunks in partitioned parquet files. <br /> This is helpful for performing statistical analysis on a local machine with limited computing resources, by using a largish subset of the data.|
10+
| [01b_repartition-parquet.ipynb](./01b_repartition-parquet.ipynb)| The download from the database doesn't optimize the chunk size for further analysis with dask. <br /> This notebook repartitions the downloaded parquet files (into partitions of around 300 MB). |
11+
| [02_error-analysis-HEIGHT.ipynb](02_error-analysis-HEIGHT.ipynb)| A general analysis of missing numbers, and of height data in different categories. The frequency and relative frequency of low heights and invalid/missing heights are visualised per country and region. Low heights are defined as below 2.5 m. |
12+
| [03_error-analysis-ELONGATION-AREA.ipynb](03_error-analysis-ELONGATION-AREA.ipynb)| Not all low buildings are necessarily misidentified structures. Here, the criteria of *area* and *elongation* are considered to help exclude undesirable and misidentified structures from the energy model. |
13+
14+
15+
## TO DOs
16+
* Plot the elongation distribution for Switzerland (see notebook 03). Switzerland may have a significant number of elongated structures.
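
The README table above defines low heights as below 2.5 m and counts invalid or missing heights separately. A minimal sketch of that classification is given below. The function name `classify_height` and the category labels are hypothetical, and treating `None` and NaN as missing and non-positive values as invalid is an assumption; the notebooks' actual rules are not shown in this diff.

```python
import math

LOW_HEIGHT_M = 2.5  # threshold stated in the README


def classify_height(height):
    """Label a building height as 'missing', 'invalid', 'low', or 'ok'."""
    if height is None or math.isnan(height):
        return "missing"
    if height <= 0:
        # Assumption: zero or negative heights cannot be real measurements.
        return "invalid"
    if height < LOW_HEIGHT_M:
        return "low"
    return "ok"
```

Counting these labels per country or region would give the frequencies that notebook 02 is described as visualising.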

requirements.txt

+9
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,11 @@ packaging
77
geopandas>=0.13.0
88
dask
99
dask-geopandas
10+
python-dotenv
11+
pyarrow
12+
momepy
13+
ipywidgets
14+
shapely==2.0.1
1015

1116
# geodatabase access
1217
psycopg2-binary>=2.8.0
@@ -20,6 +25,10 @@ geopy
2025
matplotlib>=3.3.4
2126
mapclassify
2227
seaborn
28+
plotly
29+
ipykernel>=6.26.0
30+
nbformat>=5.9.2
31+
contextily
2332

2433
# PostGIS writing
2534
GeoAlchemy2
