In the early days (March 2008), WSPR spots numbered in the hundreds of thousands per month. Today, that number has grown to over 75 million per month and shows no sign of abating. By any reasonable definition, it is safe to say that WSPR has entered the realm of Big Data.
- Full Tool Chain Installation and Environment Setup Guide(s)
- Tutorials, experiments, and tests on large data sets
- Exposure to leading-edge technologies in the realm of Big Data Processing
- Hints, tips and tricks for keeping your Linux distro running smoothly
- Eventually, useful datasets for the greater Amateur Radio community to consume
The focus of this project is to provide a set of tools to download, manage, transform and query WSPR DataSets using modern Big Data frameworks.
The setup section in the documentation below walks users through everything they need to set up their system for big data processing. The guide has been well tested on three Linux distributions: Ubuntu-20.04, Arch Linux, and Alpine.
For newer Linux users, I'd highly recommend Ubuntu-20.04, as Arch and Alpine can be difficult if you are not accustomed to their installation methods.
Each folder will contain a series of README files that explain the content and, where warranted,
usage. Additionally, the project documentation website provides a more extensive
explanation of the project and its contents.
- WSPR Analytics Docs - bookmark this location for future reference
Several frameworks are used in this repository. The following matrix provides a short description of each, and their intended purpose.
| Folder | Frameworks | Description | 
|---|---|---|
| docs | Python, MkDocs | General repository documentation | 
| golang | Golang | General purpose command line apps and utilities | 
| java | Java, Maven, SBT | Java apps for RDD and Avro examples | 
| notebooks | Jupyter Notebooks | Notebooks for basic test and visualization | 
| pyspark | Python, PyArrow | Scripts that interact with CSV and Parquet files | 
| spark | Scala | Scala programs to perform ETL tasks | 
| wsprdaemon | Python, Scala, Psql | Utilities related to the WSPR Daemon project | 
| wsprana | Python | (soon to be retired) | 
You must have Python, Java, PySpark / Spark (Scala) and SBT available from the command line.
- Java OpenJDK version 1.8.0_275 or later
- Python 3.7 or 3.8 (PyArrow has issues with 3.9 at present)
- PySpark from PyPI
- Apache Arrow 2.0+
- Scala 2.12.12 - patch versions 10, 11, 12, and 13 also work with Spark 3.0.1 / 3.1.1
- Spark 3.0.1
- PostgreSQL Database (local, remote, Docker, Vagrant, etc)
- Optional ClickHouse High Performance Database
IMPORTANT: The Spark / Scala combinations are version sensitive. Check the Spark download page for recommended version combinations if you deviate from what is listed here. As of this writing, Spark 3.0.1 and above are built with Scala 2.12.10. For the least amount of frustration, stick with what's known to work (any of the 2.12.xx series).
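The requirements above can be sanity-checked before going further. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the executable names (`java`, `python3`, `pyspark`, `spark-shell`, `sbt`) are assumptions about how the tools appear on your `PATH`.

```python
import shutil
import sys

# Assumption: these are the command names the tools install under.
REQUIRED_TOOLS = ["java", "python3", "pyspark", "spark-shell", "sbt"]

# Python versions listed as supported in the requirements above.
SUPPORTED_PYTHON = {(3, 7), (3, 8)}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_supported(version=None):
    """Check a (major, minor, ...) version tuple against the supported set."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def scala_in_212_series(scala_version):
    """True when a version string such as '2.12.12' is in the 2.12.xx series."""
    return scala_version.split(".")[:2] == ["2", "12"]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    print("Python version supported:", python_supported())
```

This only confirms that the commands resolve and that the interpreter version matches. It does not verify the Java, Spark, or Arrow versions themselves.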
The main data source will be the monthly WSPRNet Archives. At present, there is no plan to pull nightly updates. That could change if a reasonable API is identified. WSPR Daemon
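As a rough illustration of working with the monthly archives, the sketch below builds the archive names for a range of months. The base URL and the `wsprspots-YYYY-MM.csv.gz` naming pattern are assumptions about how WSPRNet publishes its archives, so verify both against the WSPRNet download page. The function names are hypothetical and not taken from this repository.

```python
import urllib.request
from datetime import date
from pathlib import Path

# Assumption: base location of the monthly archives.
ARCHIVE_BASE_URL = "https://wsprnet.org/archive"


def archive_name(year, month):
    """Build the assumed archive file name for one month."""
    return f"wsprspots-{year:04d}-{month:02d}.csv.gz"


def monthly_archives(start, end):
    """Yield (file_name, url) for every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        name = archive_name(year, month)
        yield name, f"{ARCHIVE_BASE_URL}/{name}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def download_archives(start, end, dest_dir="csv"):
    """Download each monthly archive that is not already present locally."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name, url in monthly_archives(start, end):
        target = dest / name
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))


if __name__ == "__main__":
    for name, url in monthly_archives(date(2008, 11, 1), date(2009, 2, 1)):
        print(name, url)
```

Skipping files that already exist keeps repeated runs cheap, which matters because the recent monthly archives are large.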
The tools (apps/scripts) convert the raw CSV files into Parquet, a format better suited to parallel processing. Read speeds, storage footprints, and ingestion improve dramatically with this storage format. The drawback is that a binary file cannot simply be viewed the way a raw text file can. The original CSV files will remain in place, but all bulk processing will read from Parquet or a high-performance database such as ClickHouse. These transformations are where PyArrow, PySpark, or Spark will earn their keep.
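To make the CSV-to-Parquet step concrete, here is a minimal PyArrow sketch. It is not the repository's own script. The column names are assumptions based on the field list WSPRNet documents for its archives (which ship without a header row), and forcing `version` to a string is a guess at a column that may mix numeric and text values.

```python
from pathlib import Path

# Assumption: column order of the header-less WSPRNet archive CSV.
WSPR_COLUMNS = [
    "spot_id", "timestamp", "reporter", "reporter_grid", "snr",
    "frequency", "call_sign", "grid", "power", "drift",
    "distance", "azimuth", "band", "version", "code",
]


def parquet_path_for(csv_path):
    """Map 'x.csv' or 'x.csv.gz' to 'x.parquet' in the same folder."""
    path = Path(csv_path)
    name = path.name
    for suffix in (".csv.gz", ".csv"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return path.with_name(name + ".parquet")


def csv_to_parquet(csv_path):
    """Read one archive CSV (optionally gzipped) and write a Parquet file."""
    # Imported here so the path helper works without PyArrow installed.
    import pyarrow as pa
    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    table = pacsv.read_csv(
        str(csv_path),
        read_options=pacsv.ReadOptions(column_names=WSPR_COLUMNS),
        convert_options=pacsv.ConvertOptions(
            column_types={"version": pa.string()}
        ),
    )
    out_path = parquet_path_for(csv_path)
    pq.write_table(table, str(out_path), compression="snappy")
    return out_path
```

Calling `csv_to_parquet("csv/wsprspots-2020-02.csv.gz")` would write `csv/wsprspots-2020-02.parquet` next to the original, which stays untouched. This version loads the whole file into memory, so the larger months may need a streaming or Spark-based approach instead.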
A PostgreSQL database server will be needed. There are many ways to perform this installation (local, remote, Dockerize PostgreSQL, PostgreSQL with Vagrant, etc).
While PostgreSQL is a highly capable RDBMS, ClickHouse, a database better suited to big data and extremely fast queries, will also be used.
It is column-oriented and allows analytical reports to be generated in real time using SQL queries.
- Blazingly fast
- Linearly scalable
- Feature rich
- Hardware efficient
- Fault-tolerant
- Highly reliable