Commit d2102d1

Add daily workflow to export GitHub release download count (#34)
* wip
* add prerelease, postrelease, devrelease
* fix name
* fix name
* cleanup
* Update README.md
* Update metrics.py
* fix unit test
1 parent 538f270 commit d2102d1

12 files changed: +353 -34 lines changed

.github/workflows/daily_collection.yaml

Lines changed: 11 additions & 1 deletion
@@ -27,6 +27,9 @@ jobs:
     timeout-minutes: 25
     steps:
       - uses: actions/checkout@v4
+        with:
+          repository: sdv-dev/PyMetrics
+          token: ${{ secrets.GH_TOKEN }}
       - name: Install uv
         uses: astral-sh/setup-uv@v6
         with:
@@ -56,6 +59,13 @@ jobs:
         env:
           PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
           ANACONDA_OUTPUT_FOLDER: ${{ secrets.ANACONDA_OUTPUT_FOLDER }}
+      - name: Collect GitHub Downloads
+        run: |
+          uv run pymetrics collect-github \
+            --output-folder ${{ secrets.GH_OUTPUT_FOLDER }}
+        env:
+          PYDRIVE_CREDENTIALS: ${{ secrets.PYDRIVE_CREDENTIALS }}
+          GH_OUTPUT_FOLDER: ${{ secrets.GH_OUTPUT_FOLDER }}
   alert:
     needs: [collect]
     runs-on: ubuntu-latest
@@ -77,4 +87,4 @@ jobs:
           -c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }} \
           -m 'Daily Collection PyMetrics failed :fire: :dumpster-fire: :fire:'
       env:
-        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
+        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
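For local testing, the new collection step can be reproduced outside CI. A minimal sketch, assuming the `pymetrics` entry point is installed in the active environment and using a local folder in place of the `GH_OUTPUT_FOLDER` secret (the folder name here is illustrative):

import subprocess

# Run the same subcommand the workflow step invokes, writing to a local
# folder instead of the gdrive://<folder-id> path held in GH_OUTPUT_FOLDER.
subprocess.run(
    ['pymetrics', 'collect-github', '--output-folder', './github_downloads'],
    check=True,
)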

.github/workflows/daily_summarization.yaml

Lines changed: 5 additions & 2 deletions
@@ -17,6 +17,9 @@ jobs:
     timeout-minutes: 10
     steps:
       - uses: actions/checkout@v4
+        with:
+          repository: sdv-dev/PyMetrics
+          token: ${{ secrets.GH_TOKEN }}
       - name: Install uv
         uses: astral-sh/setup-uv@v6
         with:
@@ -69,6 +72,6 @@ jobs:
         uv run python -m pymetrics.slack_utils \
           -r ${{ github.run_id }} \
           -c ${{ github.event.inputs.slack_channel || 'sdv-alerts' }} \
-          -m 'Daily Summarize PyMetrics failed :fire: :dumpster-fire: :fire:'
+          -m 'Daily Summarization PyMetrics failed :fire: :dumpster-fire: :fire:'
       env:
-        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
+        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}

README.md

Lines changed: 5 additions & 3 deletions
@@ -44,9 +44,8 @@ Currently, the download data is collected from the following distributions:
 - Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
 - For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download recording can be built.
 
-### Future Data Sources
-In the future, we may expand the source distributions to include:
-* [GitHub Releases](https://github.com/): Information about the project downloads from GitHub releases.
+* [GitHub Releases](https://docs.github.com/en/rest/releases): Information about the project downloads from GitHub release assets.
+  See this [GitHub API](https://docs.github.com/en/rest/releases/releases?apiVersion=2022-11-28#get-a-release).
 
 # Install
 Install pymetrics using pip (or uv):
@@ -143,6 +142,9 @@ The aggregation metrics spreasheets contain the following tabs:
 * **By Month and Python Version:** Absolute number of downloads per month and Python version.
 * **By Month and Country Code:** Absolute number of downloads per month and country.
 * **By Month and Installer Name:** Absolute number of downloads per month and Installer.
+* **By Prerelease**: Absolute and relative number of downloads for pre-release versions (alpha, beta, release candidate, and development versions).
+* **By Postrelease**: Absolute and relative number of downloads for post-release versions.
+* **By Devrelease**: Absolute and relative number of downloads for development release versions.
 
 ## Known Issues
 1. The conda package download data for Anaconda does not match the download count shown on the website. This is due to missing download data in the conda package download data. See this: https://github.com/anaconda/anaconda-package-data/issues/45
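For context, the REST endpoint referenced in the README exposes a `download_count` per release asset; the new spreadsheet tabs are derived from those numbers. A minimal unauthenticated sketch (subject to GitHub rate limits; the repository is just an example):

import requests

# List the releases of one repository and sum asset downloads per release.
resp = requests.get('https://api.github.com/repos/sdv-dev/SDV/releases')
resp.raise_for_status()
for release in resp.json():
    total = sum(asset['download_count'] for asset in release.get('assets', []))
    print(release['tag_name'], total)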

github_config.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+projects:
+  sdv-dev:
+    - sdv-dev/SDV
+    - sdv-dev/RDT
+    - sdv-dev/SDMetrics
+    - sdv-dev/SDGym
+    - sdv-dev/Copulas
+    - sdv-dev/CTGAN
+    - sdv-dev/DeepEcho
+  gretel:
+    - gretelai/gretel-python-client
+    - gretelai/trainer
+    - gretelai/gretel-synthetics
+  mostly-ai:
+    - mostly-ai/mostlyai
+    - mostly-ai/mostlyai-mock
+  ydata:
+    - ydataai/ydata-synthetic
+    - ydataai/ydata-quality
+    - ydataai/ydata-fabric-sdk
+  realtabformer:
+    - worldbank/REaLTabFormer
+  synthcity:
+    - vanderschaarlab/synthcity
+  smartnoise-sdk:
+    - opendp/smartnoise-sdk
+  be_great:
+    - kathrinse/be_great
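The config maps an ecosystem name to a list of `org/repo` strings, which the collector later splits into organization and repository. A short sketch of how the file is consumed (reading the file name as committed):

import yaml

# Load the ecosystem -> repositories mapping and split each org/repo entry.
with open('github_config.yml') as f:
    config = yaml.safe_load(f)

for ecosystem_name, repositories in config['projects'].items():
    for org_repo in repositories:
        github_org, repo = org_repo.split('/')
        print(ecosystem_name, github_org, repo)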

pymetrics/__main__.py

Lines changed: 37 additions & 0 deletions
@@ -10,6 +10,7 @@
 import yaml
 
 from pymetrics.anaconda import collect_anaconda_downloads
+from pymetrics.gh_downloads import collect_github_downloads
 from pymetrics.main import collect_pypi_downloads
 from pymetrics.summarize import summarize_downloads
 
@@ -76,6 +77,19 @@ def _collect_anaconda(args):
     )
 
 
+def _collect_github(args):
+    config = _load_config(args.config_file)
+    projects = config['projects']
+    output_folder = args.output_folder
+
+    collect_github_downloads(
+        projects=projects,
+        output_folder=output_folder,
+        dry_run=args.dry_run,
+        verbose=args.verbose,
+    )
+
+
 def _summarize(args):
     config = _load_config(args.config_file)
     projects = config['projects']
@@ -243,6 +257,29 @@ def _get_parser():
         default=90,
         help='Max days of data to pull. Default to last 90 days.',
     )
+
+    # collect GitHub downloads
+    collect_github = action.add_parser(
+        'collect-github', help='Collect download data from GitHub.', parents=[logging_args]
+    )
+    collect_github.set_defaults(action=_collect_github)
+    collect_github.add_argument(
+        '-c',
+        '--config-file',
+        type=str,
+        default='github_config.yaml',
+        help='Path to the configuration file.',
+    )
+    collect_github.add_argument(
+        '-o',
+        '--output-folder',
+        type=str,
+        required=True,
+        help=(
+            'Path to the folder where data will be outputted. It can be a local path or a'
+            ' Google Drive folder path in the format gdrive://<folder-id>'
+        ),
+    )
     return parser
 
 
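End to end, the new subcommand loads the YAML config and delegates to `collect_github_downloads`. A hypothetical direct call for local experimentation, with illustrative output path and flags; note the parser's default config name is `github_config.yaml` while the file committed above is `github_config.yml`, so passing the path explicitly avoids surprises:

import yaml

from pymetrics.gh_downloads import collect_github_downloads

# Mirror _collect_github: load the config, then collect and store the counts.
with open('github_config.yml') as f:
    config = yaml.safe_load(f)

collect_github_downloads(
    projects=config['projects'],
    output_folder='./github_downloads',  # or 'gdrive://<folder-id>'
    dry_run=True,   # skip the upload step
    verbose=True,   # log the tail of the resulting dataframe
)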

pymetrics/anaconda.py

Lines changed: 6 additions & 5 deletions
@@ -2,15 +2,14 @@
 
 import logging
 import os
-from datetime import datetime, timedelta
-from zoneinfo import ZoneInfo
+from datetime import timedelta
 
 import pandas as pd
 import requests
 from tqdm import tqdm
 
 from pymetrics.output import append_row, create_csv, get_path, load_csv
-from pymetrics.time_utils import drop_duplicates_by_date
+from pymetrics.time_utils import drop_duplicates_by_date, get_current_utc
 
 LOGGER = logging.getLogger(__name__)
 dir_path = os.path.dirname(os.path.realpath(__file__))
@@ -89,7 +88,7 @@ def _get_downloads_from_anaconda_org(packages, channel='conda-forge'):
 
     for pkg_name in packages:
         URL = f'https://api.anaconda.org/package/{channel}/{pkg_name}'
-        timestamp = datetime.now(ZoneInfo('UTC'))
+        timestamp = get_current_utc()
         response = requests.get(URL)
         row_info = {'pkg_name': [pkg_name], TIME_COLUMN: [timestamp], 'total_ndownloads': 0}
         data = response.json()
@@ -158,6 +157,8 @@ def collect_anaconda_downloads(
             `start_date` has not been provided. Defaults to 90 days.
         dry_run (bool):
             If `True`, do not upload the results. Defaults to `False`.
+        verbose (bool):
+            If `True`, will output dataframes tails of anaconda data. Defaults to `False`.
     """
     overall_df, version_downloads = _collect_ananconda_downloads_from_website(
         projects, output_folder=output_folder
@@ -166,7 +167,7 @@ def collect_anaconda_downloads(
     previous = _get_previous_anaconda_downloads(output_folder, filename=PREVIOUS_ANACONDA_FILENAME)
     previous = previous.sort_values(TIME_COLUMN)
 
-    end_date = datetime.now(tz=ZoneInfo('UTC')).date()
+    end_date = get_current_utc().date()
     start_date = end_date - timedelta(days=max_days)
     LOGGER.info(f'Getting daily anaconda data for start_date>={start_date} to end_date<{end_date}')
     date_ranges = pd.date_range(start=start_date, end=end_date, freq='D')
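`get_current_utc` itself is not part of this diff; judging from the inline calls it replaces, it is presumably a thin wrapper along these lines (a sketch, not the actual `pymetrics/time_utils.py` source):

from datetime import datetime
from zoneinfo import ZoneInfo


def get_current_utc():
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(ZoneInfo('UTC'))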

pymetrics/gh_downloads.py

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
+"""Functions to get GitHub downloads from GitHub."""
+
+import logging
+import os
+from collections import defaultdict
+
+import pandas as pd
+from tqdm import tqdm
+
+from pymetrics.github import GithubClient
+from pymetrics.output import append_row, create_csv, get_path, load_csv
+from pymetrics.time_utils import drop_duplicates_by_date, get_current_utc
+
+LOGGER = logging.getLogger(__name__)
+dir_path = os.path.dirname(os.path.realpath(__file__))
+TIME_COLUMN = 'timestamp'
+
+GITHUB_DOWNLOAD_COUNT_FILENAME = 'github_download_counts.csv'
+
+
+def get_previous_github_downloads(output_folder, dry_run=False):
+    """Get previous GitHub Downloads."""
+    csv_path = get_path(output_folder, GITHUB_DOWNLOAD_COUNT_FILENAME)
+    read_csv_kwargs = {
+        'parse_dates': [
+            TIME_COLUMN,
+            'created_at',
+        ],
+        'dtype': {
+            'ecosystem_name': pd.CategoricalDtype(),
+            'org_repo': pd.CategoricalDtype(),
+            'tag_name': pd.CategoricalDtype(),
+            'prerelease': pd.BooleanDtype(),
+            'download_count': pd.Int64Dtype(),
+        },
+    }
+    data = load_csv(csv_path, read_csv_kwargs=read_csv_kwargs)
+    return data
+
+
+def collect_github_downloads(
+    projects: dict[str, list[str]], output_folder: str, dry_run: bool = False, verbose: bool = False
+):
+    """Pull data about the downloads of a GitHub project.
+
+    Args:
+        projects (dict[str, list[str]]):
+            List of projects to analyze. Each key is the name of the ecosystem, and
+            each value is a list of github repositories (including organization).
+        output_folder (str):
+            Folder in which project downloads will be stored.
+            It can be passed as a local folder or as a Google Drive path in the format
+            `gdrive://{folder_id}`.
+            The folder must contain 'github_download_counts.csv'
+        dry_run (bool):
+            If `True`, do not upload the results. Defaults to `False`.
+        verbose (bool):
+            If `True`, will output dataframes heads of github download data. Defaults to `False`.
+    """
+    overall_df = get_previous_github_downloads(output_folder=output_folder)
+
+    gh_client = GithubClient()
+    download_counts = defaultdict(int)
+
+    for ecosystem_name, repositories in projects.items():
+        for org_repo in tqdm(repositories, position=1, desc=f'Ecosystem: {ecosystem_name}'):
+            pages_remain = True
+            page = 1
+            per_page = 100
+            download_counts[org_repo] = 0
+
+            github_org = org_repo.split('/')[0]
+            repo = org_repo.split('/')[1]
+
+            while pages_remain is True:
+                response = gh_client.get(
+                    github_org,
+                    repo,
+                    endpoint='releases',
+                    query_params={'per_page': per_page, 'page': page},
+                )
+                release_data = response.json()
+                link_header = response.headers.get('link')
+
+                if response.status_code == 404:
+                    LOGGER.debug(f'Skipping: {org_repo} because org/repo does not exist')
+                    pages_remain = False
+                    break
+
+                # Get download count
+                for release_info in tqdm(
+                    release_data, position=0, desc=f'{repo} releases, page={page}'
+                ):
+                    release_id = release_info.get('id')
+                    tag_name = release_info.get('tag_name')
+                    prerelease = release_info.get('prerelease')
+                    created_at = release_info.get('created_at')
+                    endpoint = f'releases/{release_id}'
+
+                    timestamp = get_current_utc()
+                    response = gh_client.get(github_org, repo, endpoint=endpoint)
+                    data = response.json()
+                    assets = data.get('assets')
+
+                    tag_row = {
+                        'ecosystem_name': [ecosystem_name],
+                        'org_repo': [org_repo],
+                        'timestamp': [timestamp],
+                        'tag_name': [tag_name],
+                        'prerelease': [prerelease],
+                        'created_at': [created_at],
+                        'download_count': 0,
+                    }
+                    if assets and len(assets) > 0:
+                        for asset in assets:
+                            tag_row['download_count'] += asset.get('download_count', 0)
+
+                    overall_df = append_row(overall_df, tag_row)
+
+                # Check pagination
+                if link_header and 'rel="next"' in link_header:
+                    page += 1
+                else:
+                    break
+    overall_df = drop_duplicates_by_date(
+        overall_df,
+        time_column=TIME_COLUMN,
+        group_by_columns=['ecosystem_name', 'org_repo', 'tag_name'],
+    )
+    if verbose:
+        LOGGER.info(f'{GITHUB_DOWNLOAD_COUNT_FILENAME} tail')
+        LOGGER.info(overall_df.tail(5).to_string())
+
+    overall_df.to_csv('github_download_counts.csv', index=False)
+
+    if not dry_run:
+        gfolder_path = f'{output_folder}/{GITHUB_DOWNLOAD_COUNT_FILENAME}'
+        create_csv(output_path=gfolder_path, data=overall_df)
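The paging logic above relies on GitHub's `Link` response header to detect further pages. The same pattern in isolation, using plain `requests` instead of the internal `GithubClient` (the token variable is illustrative; unauthenticated calls work but are rate limited):

import os

import requests


def iter_releases(github_org, repo, per_page=100):
    """Yield every release of a repository, following GitHub's pagination."""
    headers = {}
    token = os.environ.get('GH_TOKEN')  # optional; raises the rate limit
    if token:
        headers['Authorization'] = f'Bearer {token}'

    page = 1
    while True:
        response = requests.get(
            f'https://api.github.com/repos/{github_org}/{repo}/releases',
            params={'per_page': per_page, 'page': page},
            headers=headers,
        )
        if response.status_code == 404:
            return  # org/repo does not exist, mirroring the skip above
        yield from response.json()
        # GitHub advertises further pages via rel="next" in the Link header.
        link_header = response.headers.get('link') or ''
        if 'rel="next"' not in link_header:
            return
        page += 1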
