Project uses GitHub archives and a local copy of all git repositories. The possible alternatives are:
- BigQuery:
- You can query any data you want, but the structure is quite flat and entire GitHub event payloads are stored as a single column containing JSON text.
- This limits usage because that JSON must be parsed inside DB queries.
- BigQuery is a commercial, paid service and is quite expensive.
- It does not use standard SQL.
- GitHub API:
- You can get the current state of objects, but you cannot get the state of repos, PRs or issues in the past (for example summary fields, etc.).
- It is limited by GitHub API usage per hour, which makes local development harder.
- API limits are very aggressive for unauthenticated access, and even with authenticated access you are limited to 5000 API calls/hour. With this limit, it would take more than 2 months to fetch all Kubernetes GitHub events (estimate).
- It is much slower than processing GitHub archives or BigQuery.
- You must query it via the API, and each call returns a single result.
- You can use GitHub hook callbacks, but they only fire for current events.
- GitHub archives:
- All GitHub events are packed into multi-JSON gzipped files each hour and made available from GitHub Archive. To use this data, you need to extract all hours (since the Kubernetes project started) and filter out all data except events from the 4 Kubernetes organizations (kubernetes, kubernetes-incubator, kubernetes-client, kubernetes-csi). A minimal fetch-and-filter sketch follows this list.
- This is a lot of data to process, but you get all possible GitHub events from the past. Processing more than 3 years of this data takes about 2-2.5 hours, but this only has to be done once, and then the processed results are available for others to use.
- You have a lot of data in a single file that can be processed/filtered in memory.
- You get all possible events, and all of them include the current state of PRs, issues and repos at the given point in time.
- Processing of GitHub archives is free, so local development is easy.
- GitHub archives format changed on 2015-01-01: the older (pre-2015) format is used before that date and the newer one after it. For details please see USAGE, especially the `GHA2DB_OLDFMT` environment variable.
- I have 1.2M events in my Psql database, and each event has quite a complex structure. I would estimate about 3-6 GitHub API calls would be needed to fetch that data, which means about 7M API calls.
- 7.2M / 5K (API limit per hour) gives 1440 hours, which is 2 months - and that is while hitting the GitHub API limit all the time. Processing ALL GitHub events takes about 2 hours without ANY limit.
- You can optionally save the downloaded JSONs to avoid network traffic on subsequent runs (also useful for local development mode).
- There is an already implemented version in Go; please see its usage here: USAGE.
- Dashboards can be displayed here: link.
- Project also uses `git` to store a local copy of all projects' repositories to allow file-related analysis (like the list of files changed in a given commit, file sizes, etc.).
- All projects' repositories are cloned into a local directory.
- All projects' repositories are updated every hour using `git` (as part of the standard cron workflow).
- For all commits retrieved from GitHub archives we store the list of modified (modified, added, deleted) files and their sizes (at commit time).
- This allows file-name analysis, assigning files from repositories to repository groups (file-level granularity) and file-size analysis (for example file size growth over time).
- cncf/gitdm uses the git-to-GitHub connection from git pushes to connect committers with their GitHub accounts. This is used to determine committers' company affiliations.
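To illustrate the hourly processing described above, here is a minimal sketch in Go (not the devstats importer itself): it downloads one example GH Archive hour, gunzips it, decodes the stream of JSON events and counts those belonging to an assumed list of Kubernetes orgs. The URL, struct fields and counting are illustrative assumptions.

```go
// Minimal sketch (not the devstats code): fetch one GH Archive hour and
// keep only events from selected orgs (post-2015 event format assumed).
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Event mirrors only the fields this sketch needs.
type Event struct {
	Type string `json:"type"`
	Org  struct {
		Login string `json:"login"`
	} `json:"org"`
}

func main() {
	orgs := map[string]bool{
		"kubernetes": true, "kubernetes-incubator": true,
		"kubernetes-client": true, "kubernetes-csi": true,
	}
	// One example hour of archive data; GH Archive serves gzipped JSON lines per hour.
	resp, err := http.Get("https://data.gharchive.org/2017-08-01-12.json.gz")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		panic(err)
	}
	defer gz.Close()

	dec := json.NewDecoder(gz)
	kept := 0
	for {
		var ev Event
		if err := dec.Decode(&ev); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		if orgs[ev.Org.Login] {
			kept++ // a real importer would write the full payload to Postgres here
		}
	}
	fmt.Println("events kept for Kubernetes orgs:", kept)
}
```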
We're getting all possible GitHub data for all objects, including each object's historical state (not discarding any data). We also keep a copy of all git repositories used in all projects and update it every hour:
structure (manages database structure, summaries, views)
- structure
- It is used to create the database structure and indexes, and to update database summary tables, views, etc.
- Postgres advantages over MySQL include:
- Postgres supports hash joins, which allow multi-million-row table joins in less than 1s, while MySQL requires more than 3 minutes. MySQL had to duplicate data in multiple tables to create fast metrics.
- Postgres has built-in fast REGEXP extract & match, while MySQL only has slow REGEXP match and no REGEXP extract, requiring an external library like `lib_mysql_pcre` to be installed.
- Postgres supports materialized views - complex metrics can be stored in such views, and we only need to refresh them when syncing data (see the sketch after this list). MySQL would require creating an additional table and managing it manually.
- MySQL has utf8-related issues; the workaround I finally found requires using `utf8mb4` and some additional `mysqld` configuration.
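To illustrate the materialized-view point above, here is a minimal, hypothetical sketch (not the devstats schema or code): it defines a simple weekly metric as a materialized view and refreshes it during a sync using Go's `database/sql` with the `lib/pq` driver. The DSN, table and view names are assumptions.

```go
// Minimal sketch, not the devstats schema: store a simple "PRs opened per week"
// metric in a materialized view and refresh it when syncing.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "dbname=gha sslmode=disable") // assumed DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Complex metric logic lives in the view definition (table/columns assumed)...
	_, err = db.Exec(`CREATE MATERIALIZED VIEW IF NOT EXISTS prs_opened_weekly AS
		SELECT date_trunc('week', created_at) AS week, count(*) AS prs
		FROM gha_pull_requests
		GROUP BY 1`)
	if err != nil {
		log.Fatal(err)
	}

	// ...and syncing only needs a refresh, no extra bookkeeping table as in MySQL.
	if _, err = db.Exec(`REFRESH MATERIALIZED VIEW prs_opened_weekly`); err != nil {
		log.Fatal(err)
	}
}
```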
gha2db (imports GitHub archives to database and eventually JSON files)
- gha2db
- Reads from GitHub archive and writes to Postgres
- It saves ALL data from GitHub archives, so we have all GitHub structures fully populated. See Database structure.
- We have all historical data from all possible GitHub events and summary values for repositories at given points in time.
- The idea is to divide all data into two categories: `const` and `variable`. Const data does not change over time; variable data changes over time, so `event_id` is added as part of its primary key.
- The table structure and the `const`/`variable` split are described in USAGE.
- The program can be parallelized very easily (events from different hours are distinct, so each hour can be processed by a separate CPU); it uses 48 CPUs on our test machine. A minimal sketch of this per-hour parallelism follows this list.
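The per-hour parallelism mentioned above could look roughly like the following sketch; this is a simplified, hypothetical worker pool, not the actual `gha2db` code.

```go
// Minimal sketch of per-hour parallelism (not the actual gha2db code):
// every archive hour is independent, so hours are fanned out to workers.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// processHour stands in for downloading, parsing and importing one archive hour.
func processHour(h time.Time) {
	fmt.Println("processing", h.Format("2006-01-02-15"))
}

func main() {
	from := time.Date(2017, 8, 1, 0, 0, 0, 0, time.UTC)
	to := from.Add(24 * time.Hour)

	hours := make(chan time.Time)
	var wg sync.WaitGroup
	// One worker per CPU; the test machine mentioned above has 48 of them.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for h := range hours {
				processHour(h)
			}
		}()
	}
	for h := from; h.Before(to); h = h.Add(time.Hour) {
		hours <- h
	}
	close(hours)
	wg.Wait()
}
```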
calc_metric (computes metrics given as SQL files to be run on Postgres and saves the output as time series)
- calc_metric
- This separates the complex logic of metrics into SQL files; `calc_metric` executes parameterized SQL files and writes the final data as time series (also in Postgres).
- Parameters are `'{{from}}'` and `'{{to}}'`, to allow computing the given metric for any date period (see the sketch after this list).
- For histogram metrics there is a single parameter `'{{period}}'` instead. To run `calc_metric` in histogram mode, add "h" as the last parameter after all other params; `gha2db_sync` already handles this.
- This means that time-series tables will only hold multiple time series (very simple data).
- Grafana will read from the Postgres time series.
- Adding a new metric means adding a Postgres SQL file that computes this metric.
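A minimal sketch of how such a parameterized SQL file could be executed (not the actual `calc_metric` code); the file path, DSN and the assumption that the metric returns (name, value) rows are illustrative.

```go
// Minimal sketch (not calc_metric itself): read an SQL metric file, substitute
// the '{{from}}' and '{{to}}' placeholders and run it against Postgres.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"
	"strings"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	raw, err := os.ReadFile("metrics/kubernetes/example_metric.sql") // assumed path
	if err != nil {
		log.Fatal(err)
	}
	query := strings.NewReplacer(
		"{{from}}", "2017-08-01",
		"{{to}}", "2017-08-08",
	).Replace(string(raw))

	db, err := sql.Open("postgres", "dbname=gha sslmode=disable") // assumed DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(query)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	// Assuming the metric returns (name, value) rows; a real run would write
	// each row as a point in a time-series table.
	for rows.Next() {
		var name string
		var value float64
		if err := rows.Scan(&name, &value); err != nil {
			log.Fatal(err)
		}
		fmt.Println(name, value)
	}
}
```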
gha2db_sync (synchronizes data sources and Postgres databases)
- gha2db_sync
- This program figures out the most recent data in the Postgres database, then queries the GitHub archive from that date to the current date (a minimal sketch of this resume logic follows this list).
- It will add data to the Postgres database (since the last run).
- It will update summary tables and/or (materialized) views in the Postgres DB.
- It will update the new commits' files list using the `get_repos` program.
- Then it will call `calc_metric` for all defined SQL metrics and update the time-series database as well.
- You need to set `GHA2DB_PROJECT=project_name`; currently it can be kubernetes, prometheus or opentracing. Projects are defined in the `projects.yaml` file.
- It reads the list of metrics from a YAML file: `metrics/{{project}}/metrics.yaml`. Some metrics require filling gaps in their data; those metrics are defined in another YAML file: `metrics/{{project}}/gaps.yaml`. Please try to use Grafana's "nulls as zero" instead of gap filling.
- This tool also supports the initial computation of all time-series data (instead of the default update since the last run).
- It can be called by a cron job at 1:10, 2:10, ... and so on - GitHub archive publishes a new file every hour, so we're off by at most 1 hour.
- It can also be called automatically by the `devstats` tool.
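The resume logic could be sketched as follows; this is a hypothetical simplification, not the real `gha2db_sync` implementation, and the table/column used to find the newest imported event is an assumption.

```go
// Minimal sketch (not gha2db_sync itself): find the newest event already in
// Postgres and enumerate the archive hours that still need to be imported.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "dbname=gha sslmode=disable") // assumed DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed table/column: gha_events(created_at) holds every imported event.
	var last time.Time
	if err := db.QueryRow(`SELECT COALESCE(MAX(created_at), '2014-06-01') FROM gha_events`).Scan(&last); err != nil {
		log.Fatal(err)
	}

	now := time.Now().UTC().Truncate(time.Hour)
	for h := last.UTC().Truncate(time.Hour); h.Before(now); h = h.Add(time.Hour) {
		// Each of these hours would be fetched from GH Archive and imported,
		// then summary tables/views refreshed and calc_metric run.
		fmt.Println("hour to sync:", h.Format("2006-01-02-15"))
	}
}
```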
devstats (calls `gha2db_sync` for all defined projects)
- devstats
- This program reads `projects.yaml`, calls `get_repos` to update all projects' git repos, and then calls `gha2db_sync` for all defined projects that are not disabled by `disabled: true`.
- It uses its own database, a Postgres database called "devstats", just to store logs from running the project syncers.
- It creates the PID file `/tmp/devstats.pid` while it is running, so it is safe when instances overlap (a minimal sketch follows this list).
- It is called by a cron job at 1:10, 2:10, ... and so on - GitHub archive publishes a new file every hour, so we're off by at most 1 hour.
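A minimal sketch of the PID-file guard (simplified; not the actual devstats code):

```go
// Minimal sketch (not the actual devstats code): refuse to start if another
// instance already holds the PID file, otherwise create it and clean up on exit.
package main

import (
	"fmt"
	"log"
	"os"
)

const pidFile = "/tmp/devstats.pid"

func main() {
	// O_EXCL makes creation fail if the file already exists, so overlapping
	// cron invocations simply exit instead of running twice.
	f, err := os.OpenFile(pidFile, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644)
	if err != nil {
		log.Printf("another instance appears to be running (%s exists), exiting", pidFile)
		return
	}
	fmt.Fprintf(f, "%d\n", os.Getpid())
	f.Close()
	defer os.Remove(pidFile)

	log.Println("sync would run here for all enabled projects")
}
```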
get_repos: it can update the list of all projects' repositories (clone and/or pull as needed), update each commit's files list, and display all repos and orgs data needed by `cncf/gitdm`.
- get_repos
- `get_repos` is used to clone or pull all repos used in all `devstats` projects into a location given by the `GHA2DB_REPOS_DIR` environment variable, or by default into "~/devstats_repos/".
- Those repos are used later to search for commit SHAs using `git log` to determine the files modified by particular commits and other objects (a minimal sketch follows this list).
- It can also be used to return the list of all distinct repos and their locations - this can be used by `cncf/gitdm` to create a concatenated `git.log` from all repositories for affiliations analysis.
- This tool is also used to create/update the mapping between commits and the list of files that a given commit refers to; it also keeps file size info at commit time.
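Listing the files touched by a given commit can be sketched with a plain `git` invocation; the repository path and SHA below are placeholders, and this is not the actual `get_repos` implementation.

```go
// Minimal sketch (not get_repos itself): list the files modified by one commit
// in an already-cloned repository using git.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func commitFiles(repoDir, sha string) ([]string, error) {
	// --name-only prints just the paths; -r recurses into subtrees.
	cmd := exec.Command("git", "diff-tree", "--no-commit-id", "--name-only", "-r", sha)
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return nil, err
	}
	var files []string
	for _, line := range strings.Split(string(out), "\n") {
		if line != "" {
			files = append(files, line)
		}
	}
	return files, nil
}

func main() {
	// Placeholder repo path and SHA; a real run would iterate over commits
	// found in GitHub archives and also record file sizes at commit time.
	files, err := commitFiles("/root/devstats_repos/kubernetes/kubernetes", "deadbeef")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(files)
}
```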
ghapi2db: it uses the GitHub API to get labels and milestones information for all open issues and PRs from the last 2 hours.
- ghapi2db
- Issues/PRs contain labels/milestones information from the last GitHub event on that issue/PR. This is the state as of the last issue comment.
- Sometimes labels and/or milestone information changes after that last event. The new issue labels/milestone will only be visible after the next issue comment.
- This tool queries all open issues/PRs from the last 2 hours to check their label set and milestone. If it detects a difference, it creates artificial events with the new state.
- This is used by the 'Open issues/PRs by milestone' dashboard to make sure that we have correct information.
- GitHub API calls are limited to 5000/hour; use the `GHA2DB_GITHUB_OAUTH` env variable to set the GitHub OAuth token path (the default is `/etc/github/oauth`). You can set it to "-" to force public access, but you will be limited to 60 API calls/hour. A minimal sketch of such an authenticated call follows.
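Here is a minimal sketch of an authenticated GitHub API call using a token read from the configured path; it uses plain `net/http` for illustration and is not the actual `ghapi2db` code. The repository queried is just an example.

```go
// Minimal sketch (not ghapi2db itself): read the OAuth token from the default
// path and list open issues of one repo via the GitHub REST API.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

func main() {
	tok, err := os.ReadFile("/etc/github/oauth") // default token path
	if err != nil {
		log.Fatal(err)
	}
	req, err := http.NewRequest("GET",
		"https://api.github.com/repos/kubernetes/kubernetes/issues?state=open", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Authenticated requests get 5000 calls/hour instead of 60.
	req.Header.Set("Authorization", "token "+strings.TrimSpace(string(tok)))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes of JSON with labels/milestones")
}
```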
- Additional stuff, most important being the `runq` and `import_affs` tools.
- runq
- `runq` takes an SQL file name and parameter values and allows running a metric manually from the command line (this is for local development).
- import_affs
- `import_affs` takes one parameter - a JSON file name (this is a file from cncf/gitdm: github_users.json).
- This tool imports GitHub usernames (in addition to logins from GHA) and creates developers - companies affiliations (that can be used by the Companies stats metric).
- annotations
- `annotations` is used to add annotations on charts. It uses the GitHub API to fetch tags from the project's main repository defined in `projects.yaml`; it only includes tags matching the annotation regexp, also defined in `projects.yaml`.
- tags
- `tags` is used to add tags. Those tags are used to populate Grafana template drop-down values and names. This is used to auto-populate the Repository groups drop-down, so when somebody adds a new repository group, it will automatically appear in the drop-down. `tags` uses the tags.yaml file to configure tag generation.
uses tags.yaml file to configure tags generation.- columns
columns
is used to specify which columns are mandatory on which time series tables (because missing column is an error in Postgres). You can define table9s) by regexp and then specify which columns are mandatory by specifying tags table and column.columns
uses columns.yaml file to configure mandatory columns.- You can use all defined environments variables, but add
_SRC
suffic for source database and_DST
suffix for destination database. - webhook
webhook
is used to react to Travis CI webhooks and trigger deploy if status, branch and type match defined values, more details here.- Add
[no deploy]
to the commit message, to skip deploying. - Add
[ci skip]
to skip testing (will not spawn Travis CI build). - Add
[deploy]
to do a full deploy using./devel/deploy_all.sh
script, this needs more environment variables to be set, see here. - There are few shell scripts for example: running sync every N seconds, setup time series data etc.
- merge_dbs
- `merge_dbs` is used to generate a Postgres database that contains data from multiple other databases.
- You can use `merge_dbs` to add new projects to an existing database, but please consider running `./devel/remove_db_dups.sh` afterwards, or use the `./all/add_project.sh` script.
- replacer
- `replacer` is used to mass-replace data in text files. It has regexp modes, string modes, terminate on no match, etc.
- It supports the MODE, FROM, TO, NREPLACES and REPLACEFROM environment variables; see the `devel/replace.sh` script for examples.
- vars
- `vars` is used to add special variables (tags) to the database, see here for more info.
- sqlitedb
- `sqlitedb` is used to manipulate Grafana's SQLite database, see here for more info.
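As mentioned in the `webhook` item above, the commit-message markers can be sketched as a simple check; this is a hypothetical simplification, not the real webhook payload handling.

```go
// Minimal sketch (not the real webhook tool): decide what to do based on
// commit-message markers like [no deploy], [ci skip] and [deploy].
package main

import (
	"fmt"
	"strings"
)

func decide(commitMsg string) string {
	switch {
	case strings.Contains(commitMsg, "[ci skip]"):
		return "skip testing (no Travis CI build)"
	case strings.Contains(commitMsg, "[no deploy]"):
		return "build, but skip deploying"
	case strings.Contains(commitMsg, "[deploy]"):
		return "full deploy via ./devel/deploy_all.sh"
	default:
		return "default: build and deploy if status/branch/type match"
	}
}

func main() {
	fmt.Println(decide("metrics: new dashboard [no deploy]"))
}
```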
The main idea is that we divide tables into 2 groups:
- const: meaning that data in this table does not change over time (it is saved once).
- variable: meaning that data in those tables can change between GH events, and the GH `event_id` is part of those tables' primary key (a minimal DDL sketch follows this list).
- There are also "compute" tables that are auto-updated by the `gha2db_sync`/`structure` tools, and the affiliations table that is filled by the `import_affs` tool.
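A hypothetical, simplified DDL sketch of the two groups follows (not the real devstats schema; see USAGE for the actual tables); the table and column names are invented for illustration.

```go
// Minimal sketch (not the real devstats schema): a "const" table keyed by the
// object id alone, and a "variable" table whose primary key includes event_id
// so every GH event can carry its own version of the row.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "dbname=gha sslmode=disable") // assumed DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ddl := []string{
		// const: written once, never changes between events.
		`CREATE TABLE IF NOT EXISTS example_actors (
			id    BIGINT PRIMARY KEY,
			login TEXT NOT NULL
		)`,
		// variable: changes between events, so event_id joins the primary key.
		`CREATE TABLE IF NOT EXISTS example_issues (
			id       BIGINT NOT NULL,
			event_id BIGINT NOT NULL,
			state    TEXT NOT NULL,
			PRIMARY KEY (id, event_id)
		)`,
	}
	for _, q := range ddl {
		if _, err := db.Exec(q); err != nil {
			log.Fatal(err)
		}
	}
}
```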
Please see USAGE for detailed list of database tables.
The Ruby version was dropped, but you can see benchmarks of Ruby using MySQL, Ruby using Postgres and the current Go version using Postgres here:
In summary: the Go version can import all GitHub archives data (not discarding anything) for all Kubernetes orgs/repos, from the beginning on GitHub (2014-06-01), in about 2-2.5 hours! Cloning all projects' repositories takes about 15 minutes when done for the first time and uses about 12 GB of disk space.