53 changes: 53 additions & 0 deletions codepile/github_issues/README.md

The primary source of data is the BigQuery [githubarchive](https://www.gharchive.org/) dataset. Raw dumps can also be downloaded directly from https://www.gharchive.org/, but since the data is large (~17 TB in BigQuery), downloading and analyzing the dumps ourselves would be a significant effort. BigQuery is therefore a reasonable choice to begin with.

The githubarchive data is event data, not a snapshot of GitHub. There will be multiple events (create, update, delete, etc.) for the same GitHub resource (issue, repo, etc.).

The BigQuery data has a top-level field called `type` holding the event type, which we can use to filter for the events we are interested in.

The events of interest are `IssueCommentEvent` and `IssuesEvent`. Read more about these events [here](https://docs.github.com/en/developers/webhooks-and-events/events/github-event-types). The documentation says that the `payload.action` field can be "created", "edited" or "deleted", but BigQuery seems to contain data only for the "created" action. It is clarified [here](https://github.com/igrigorik/gharchive.org/issues/183) that edit events are not part of gharchive.

GitHub APIs treat issues and pull requests in a similar manner ([ref](https://docs.github.com/en/rest/issues/issues)): `IssuesEvent` covers both issue and pull request creation/close events, and `IssueCommentEvent` covers comments on both issues and pull requests. So we need to exclude the pull-request-related events; for example, the comment queries in `docs/bigquery-queries.md` keep only events whose payload has no `issue.pull_request` field.

The data format differs between the pre-2015 and later periods: pre-2015 data contains only the issue IDs and comment IDs, while the later data contains the title and body as well. So we need to get the content for the pre-2015 data by some other means.

`WatchEvent` can be used to get the list of repos by number of stars. The query below gets the list of repos and their star counts:
```
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```
Note that the number of stars from this query is only approximate. Read this [SO post](https://stackoverflow.com/questions/42918135/how-to-get-total-number-of-github-stars-for-a-given-repo-in-bigquery) to understand the nuances around the star counts.

Data in BigQuery is organized in three ways (daily, monthly, and yearly tables). Use the daily tables for exploration and testing, since BigQuery pricing is based on the amount of data scanned during query execution.
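
For quick checks, a query can also be run from Python instead of the BigQuery console. A minimal sketch using the `google-cloud-bigquery` client (assuming application-default credentials and a billing project are configured; the exact daily table name is illustrative, following the `githubarchive.day.2*` pattern used above):
```
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project and credentials

# Query a single daily table so that only a small amount of data is scanned.
sql = """
    SELECT type, COUNT(*) AS event_count
    FROM `githubarchive.day.20221001`
    GROUP BY type
    ORDER BY event_count DESC
"""
for row in client.query(sql).result():
    print(row["type"], row["event_count"])
```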

Issue and comment data was extracted from the monthly tables on 27-28 Oct 2022.
The repo list was extracted from the daily tables on 30 Oct 2022.

#### Some stats
* Total repos extracted = ~25.7M
* Repos with >= 100 stars = ~324K
* post-2015 issues = ~85M
* pre-2015 issues = ~9.4M
* post-2015 issue comments = ~156M
* pre-2015 issue comments = ~17.7M
* filtered post-2015 issues = ~29.5M
* filtered pre-2015 issues = ~2.9M
* filtered post-2015 issue comments = ~100M
* filtered pre-2015 issue comments = ~11M


#### Other data sources explored:
The [ghtorrent](https://ghtorrent.org/) project doesn't seem to be active: its data only goes up to 2019, and even then we can only get the issue IDs.
The BigQuery public dataset `github_repos` doesn't have data related to issues and comments; it only contains code.



103 changes: 103 additions & 0 deletions codepile/github_issues/docs/bigquery-queries.md
#### Get information about different tables in the dataset
```
SELECT * from `<data-set-id>.__TABLES__`
```

#### Sample events from a table
To explore the data in a table, use sampling instead of a normal query, since it is more cost-efficient:
```
SELECT * FROM `<table-id>` TABLESAMPLE SYSTEM (.001 percent)
```

#### Get issues
```
SELECT json_payload.issue.id as issue_id, json_payload.issue.number as issue_no, issue_url, payload, repo, org, created_at, id, other FROM
(SELECT
id, SAFE.PARSE_JSON(payload) AS json_payload, JSON_VALUE(payload, '$.action') AS action, JSON_QUERY(payload, '$.issue.url') as issue_url, payload, repo, org, created_at, other
FROM `githubarchive.month.20*`
WHERE _TABLE_SUFFIX BETWEEN '1412' and '2300'
AND type = 'IssuesEvent'
) WHERE action = 'opened' AND issue_url IS NOT NULL

```

#### Get issue comments
```
SELECT issue_id, issue_no, comment_id, comment_url, payload, repo, org, created_at, id, other FROM
(SELECT
id, JSON_VALUE(payload, '$.issue.id') AS issue_id, JSON_VALUE(payload, '$.issue.number') as issue_no, JSON_VALUE(payload, '$.comment.id') as comment_id, JSON_VALUE(payload, '$.comment.url') as comment_url, JSON_QUERY(payload, '$.issue.pull_request') as pull_request,payload, repo, org, created_at, other
FROM `githubarchive.month.20*`
WHERE _TABLE_SUFFIX BETWEEN '1412' and '2300'
AND type = 'IssueCommentEvent'
) WHERE comment_url IS NOT NULL AND pull_request IS NULL

```

#### Get pre-2015 issues
```
SELECT tb1.json_payload.issue as issue_id, tb1.json_payload.number as issue_no, payload, repo, org, created_at, id, other FROM
(
select SAFE.PARSE_JSON(payload) as json_payload, JSON_QUERY(payload, '$.action') as action, JSON_QUERY(payload, '$.issue.url') as issue_url, *
from `githubarchive.year.201*`
WHERE type = 'IssuesEvent'
AND _TABLE_SUFFIX BETWEEN '1' and '5'
) tb1
WHERE tb1.action = '"opened"' AND tb1.issue_url IS NULL
```

#### Get pre-2015 issue comments
```
SELECT issue_id, comment_id, comment_url, payload, repo, org, created_at, id, other FROM
(
select JSON_VALUE(payload, '$.comment_id') as comment_id, JSON_VALUE(payload, '$.issue_id') as issue_id, JSON_VALUE(other, '$.url') as comment_url, payload, repo, org, created_at, id, other
from `githubarchive.month.201*`
WHERE _TABLE_SUFFIX BETWEEN '000' AND '501' AND type = 'IssueCommentEvent'
) tb1
WHERE comment_id IS NOT NULL AND NOT CONTAINS_SUBSTR(tb1.comment_url, '/pull/')
LIMIT 100
```

#### Issues filtered
```
select stars, html_url as repo_url, issue_id, issue_no, title, body FROM
`<dataset-id>.100-star-repos` as t1 INNER JOIN
(select issue_id, issue_no, issue_url, repo, JSON_VALUE(payload, '$.issue.title') as title, JSON_VALUE(payload, '$.issue.body') as body from `<dataset-id>.issues`) as t2 ON SUBSTR(t1.html_url, 20) = t2.repo.name
```

#### Issue comments filtered
```
select stars, html_url as repo_url, issue_id, issue_no, comment_id, title, body, comment, created_at FROM
`<dataset-id>.100-star-repos` as t1 INNER JOIN
(select issue_id, issue_no, comment_id, repo, JSON_VALUE(payload, '$.issue.title') as title, JSON_VALUE(payload, '$.issue.body') as body, JSON_VALUE(payload, '$.comment.body') as comment, created_at from
`<dataset-id>.issue-comments`) as t2
ON SUBSTR(t1.html_url, 20) = t2.repo.name
```

#### Star count per repo
```
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```
Additional post-processing is done on top of the results of this query to unify the repo URLs into a single format, since different events use different URL formats (https://github.com, https://api.github.com, etc.). See `repo-stars.md` for details.

#### pre-2015 issues filtered by 100-star repos
```
select t1.stars as repo_stars, t1.html_url as repo_url, t2.issue_id, t2.issue_no
FROM `<dataset-id>.100-star-repos` as t1
INNER JOIN `<dataset-id>.pre-2015-issues` as t2
ON t1.html_url = t2.repo.url
```

#### pre-2015 issue comments filtered by 100-star repo list
```
select t1.html_url as repo_url, t1.stars as repo_stars, t2.issue_id, t2.comment_id, t2.comment_url
FROM `<dataset-id>.100-star-repos` as t1
INNER JOIN `<dataset-id>.pre-2015-issue-comments` as t2
ON t1.html_url = t2.repo.url
```

16 changes: 16 additions & 0 deletions codepile/github_issues/docs/repo-stars.md
Get number of stars by repo
```
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```

Some events don't contain `repo.id` and some don't have `actor.id`, so `unique_by_actor_login` is the most accurate of the three counts.

`repo.url` takes values of different formats over the time period: some URLs look like https://api.github.com/repos/... while others look like https://api.github.dev/repos/...

The result of the BigQuery query is further processed to bring the repo URLs into a single format and then sum the stars per normalized URL. That list is then filtered down to the repos that have >= 100 stars, as sketched below.
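
A rough, hypothetical sketch of that post-processing (the exported file name and output path are assumptions; the column names come from the query above, and `unique_by_actor_login` is used as the star count since it is the most accurate of the three):
```
import re

import pandas as pd

def normalize_repo_url(url: str) -> str:
    # Strip the varying hosts (https://github.com/, https://api.github.com/repos/,
    # https://api.github.dev/repos/, ...) and keep only "owner/name".
    return re.sub(r"^https://[^/]+/(repos/)?", "", url).strip("/")

# Exported BigQuery result of the star-count query (file name is an assumption).
df = pd.read_csv("watch_event_counts.csv")
df["repo_key"] = df["url"].map(normalize_repo_url)

# Sum stars per normalized repo and keep only repos with >= 100 stars.
stars = (
    df.groupby("repo_key", as_index=False)["unique_by_actor_login"]
    .sum()
    .rename(columns={"unique_by_actor_login": "stars"})
)
stars[stars["stars"] >= 100].to_csv("100-star-repos.csv", index=False)
```
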
7 changes: 7 additions & 0 deletions codepile/github_issues/gh_graphql/README.md
* This Scrapy crawler is not a production-grade implementation and may not follow best practices.
* It uses the GitHub GraphQL endpoint to fetch data. The API lets us get up to 100 issues along with other metadata (labels, comments, author, etc.) in a single request.
* When fetching issues + comments + other metadata, the API returns "secondary rate limit" errors even though we haven't breached the 5k requests/hour limit. It is not entirely clear how to mitigate this; a hedged backoff sketch is included at the end of this README.
* GraphQL requests and responses can be explored here: https://docs.github.com/en/graphql/overview/explorer
* Refer to the standalone `extract-*` scripts to convert the raw GraphQL responses into a flat list of issues/comments.

Run the Scrapy spider with a command like `python run.py 0|1|2...`
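
As noted above, it is not clear how to avoid the secondary rate limits. A minimal, hedged sketch of one possible mitigation (this assumes the limit surfaces as an HTTP 403/429 response whose body mentions "secondary rate limit" and that a `Retry-After` header may be present; the token handling and retry policy here are illustrative, not what the spider currently does):
```
import os
import time

import requests

GRAPHQL_URL = "https://api.github.com/graphql"
TOKEN = os.environ["GITHUB_TOKEN"]  # assumed to be set in the environment

def run_query(query: str, variables: dict, max_retries: int = 5) -> dict:
    headers = {"Authorization": f"bearer {TOKEN}"}
    for attempt in range(max_retries):
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": query, "variables": variables},
            headers=headers,
            timeout=60,
        )
        # Back off if the response looks like a secondary rate limit
        # (assumption: 403/429 status with "secondary rate limit" in the body).
        if resp.status_code in (403, 429) and "secondary rate limit" in resp.text.lower():
            wait = int(resp.headers.get("Retry-After", 2 ** attempt * 10))
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("gave up after repeated secondary rate limit responses")
```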
46 changes: 46 additions & 0 deletions codepile/github_issues/gh_graphql/comments.graphql
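# Fetch one page of issues for a repository, together with up to 100 comments
# (author, body, reactions) per issue, plus rate-limit information.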
query($repo_owner: String!, $repo_name: String!, $page_size: Int!, $after_cursor: String) {
repository(owner: $repo_owner, name: $repo_name) {
issues(first: $page_size, after: $after_cursor) {
pageInfo {
endCursor
hasNextPage
},
edges {
node {
number,
databaseId,
createdAt,
comments(first: 100) {
pageInfo {
hasNextPage,
endCursor
}
nodes {
databaseId
authorAssociation,
author {
login,
avatarUrl,
__typename
}
body
reactionGroups {
content,
reactors {
totalCount
}
}
},
totalCount
}
}
}
}
},
rateLimit {
limit
cost
remaining
resetAt
}
}
39 changes: 39 additions & 0 deletions codepile/github_issues/gh_graphql/extract-comments.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, filter, size, transform

spark_dir = "/data/tmp/"
spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate()
df = spark.read.json("/data/comments-*.jsonl")

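# flatten the raw GraphQL responses: first one row per issue, then one row per comment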
df2 = df.select(["data.repository.issues.pageInfo.hasNextPage", explode("data.repository.issues.edges").alias("issue")])
df3 = df2.select([
col("issue.node.number").alias("issue_no"),
col("issue.node.databaseId").alias("issue_id"),
col("issue.node.createdAt").alias("issue_created_at"),
col("issue.node.comments.pageInfo.hasNextPage").alias("has_more_comments"),
col("issue.node.comments.pageInfo.endCursor").alias("next_comments_cursor"),
explode("issue.node.comments.nodes").alias("comment")
])

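# keep only reaction groups that have at least one reactor (transform_reactions below is currently unused)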
def filter_reactions(x):
return x.reactors.totalCount > 0

def transform_reactions(x):
print(x)
return {x.content: x.reactors.totalCount}

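# select the flat comment fields and de-duplicate on the comment databaseId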
df4 = df3.select([
"issue_no",
"issue_id",
"issue_created_at",
"has_more_comments",
"next_comments_cursor",
"comment.databaseId",
"comment.authorAssociation",
col("comment.author.login").alias("comment_author"),
col("comment.body").alias("comment_body"),
filter("comment.reactionGroups", filter_reactions)
.alias("reaction_groups")
]).dropDuplicates(["databaseId"])

df4.write.parquet("/data/comments")
33 changes: 33 additions & 0 deletions codepile/github_issues/gh_graphql/extract-issues-small.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, filter, size, transform

spark_dir = "/data/tmp/"
spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate()

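# note: these helpers are not used in this lite extraction; only the raw issue fields are kept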
def labels_transformer(x):
return {"name": x.node.name, "description": x.node.description}

def filter_reactions(x):
return x.reactors.totalCount > 0

df = spark.read.json("/data/issues-*.jsonl")

# separate issues into their own rows
df2 = df.select([
col("data.repository.databaseId").alias("repo_id"),
col("data.repository.nameWithOwner").alias("repo_name_with_owner"),
explode("data.repository.issues.edges").alias("issue")
])

# extract and clean issue metadata
df3 = df2.select([
"repo_id",
"repo_name_with_owner",
col("issue.node.number").alias("issue_no"),
col("issue.node.databaseId").alias("issue_id"),
col("issue.node.createdAt").alias("issue_created_at"),
col("issue.node.title").alias("title"),
col("issue.node.body").alias("body")
]).dropDuplicates(["issue_id"])
df3.write.parquet("/data/issues-lite")
print(df3.count())
49 changes: 49 additions & 0 deletions codepile/github_issues/gh_graphql/extract-issues.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, filter, size, transform

spark_dir = "/data/tmp/"
spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate()

def labels_transformer(x):
return {"name": x.node.name, "description": x.node.description}

def filter_reactions(x):
return x.reactors.totalCount > 0

df = spark.read.json("/data/issues.jsonl")

# separate issues into their own rows
df2 = df.select([
col("data.repository.databaseId").alias("repo_id"),
col("data.repository.nameWithOwner").alias("repo_name_with_owner"),
col("data.repository.stargazerCount").alias("star_count"),
col("data.repository.description").alias("repo_description"),
col("data.repository.languages.edges").alias("languages"),
"data.repository.issues.pageInfo.hasNextPage",
col("data.repository.issues.totalCount").alias("issue_count"),
explode("data.repository.issues.edges").alias("issue")
])

# extract and clean issue metadata
df3 = df2.select([
"repo_id",
"repo_name_with_owner",
"star_count",
"repo_description",
"languages",
"issue_count",
col("issue.node.number").alias("issue_no"),
col("issue.node.databaseId").alias("issue_id"),
col("issue.node.createdAt").alias("issue_created_at"),
col("issue.node.title").alias("title"),
col("issue.node.author.login").alias("author"),
col("issue.node.author.avatarUrl").alias("author_avatar"),
col("issue.node.author.__typename").alias("author_type"),
col("issue.node.authorAssociation").alias("author_association"),
col("issue.node.comments.totalCount").alias("comment_count"),
col("issue.node.labels.edges").alias("labels"),
filter("issue.node.reactionGroups", filter_reactions)
.alias("reaction_groups")
]).dropDuplicates(["issue_id"])
df3.write.parquet("/data/issues")
print(df3.count())
Empty file.
12 changes: 12 additions & 0 deletions codepile/github_issues/gh_graphql/gh_graphql/items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GhGraphqlItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass