53 changes: 53 additions & 0 deletions codepile/github_issues/README.md

The primary source of data is the BigQuery [githubarchive](https://www.gharchive.org/) dataset. Raw dumps can also be downloaded directly from https://www.gharchive.org/, but since the data is large (~17 TB in BigQuery), downloading and analyzing the dumps ourselves would be a significant effort. BigQuery is therefore a reasonable choice to begin with.

The githubarchive data is event data, not a snapshot of GitHub. There will be multiple events (create, update, delete, etc.) for the same GitHub resource (issue, repo, etc.).

The BigQuery data has a top-level field called `type` holding the event type, which we can use to filter for the events we are interested in.

The events of interest are `IssueCommentEvent` and `IssuesEvent`. Read more about these events [here](https://docs.github.com/en/developers/webhooks-and-events/events/github-event-types). The documentation says that the `payload.action` field can be "created", "edited" or "deleted", but BigQuery seems to contain data only for the "created" action. It is clarified [here](https://github.com/igrigorik/gharchive.org/issues/183) that edit events are not part of gharchive.

GitHub APIs treat issues and pull requests in a similar manner ([ref](https://docs.github.com/en/rest/issues/issues)): `IssuesEvent` covers both issue and pull request creation/close events, and `IssueCommentEvent` covers comments on both issues and pull requests. So we need to exclude the pull-request-related events; for example, the comment queries in `docs/bigquery-queries.md` keep only events whose payload has no `issue.pull_request` field.

The data format differs between the pre-2015 and later periods: pre-2015 data contains only the issue IDs and comment IDs, while the later data contains the title and body as well. So we need to get the content for the pre-2015 data by some other means.

`WatchEvent` can be used to get the list of repos by number of stars. The query below gets the list of repos and their star counts:
```
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```
Note that the number of stars from this query is only approximate. Read this [SO post](https://stackoverflow.com/questions/42918135/how-to-get-total-number-of-github-stars-for-a-given-repo-in-bigquery) to understand the nuances around the star counts.

Data in BigQuery is organized in three ways (daily, monthly, and yearly tables). Use the daily tables for exploration and testing, since BigQuery pricing is based on the amount of data scanned during query execution.
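
For quick checks, a query can also be run from Python instead of the BigQuery console. A minimal sketch using the `google-cloud-bigquery` client (assuming application-default credentials and a billing project are configured; the exact daily table name is illustrative, following the `githubarchive.day.2*` pattern used above):
```
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project and credentials

# Query a single daily table so that only a small amount of data is scanned.
sql = """
    SELECT type, COUNT(*) AS event_count
    FROM `githubarchive.day.20221001`
    GROUP BY type
    ORDER BY event_count DESC
"""
for row in client.query(sql).result():
    print(row["type"], row["event_count"])
```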

Issue and comment data was extracted from the monthly tables on 27-28 Oct 2022.
The repo list was extracted from the daily tables on 30 Oct 2022.

#### Some stats
* Total repos extracted = ~25.7M
* Repos with >= 100 stars = ~324K
* post-2015 issues = ~85M
* pre-2015 issues = ~9.4M
* post-2015 issue comments = ~156M
* pre-2015 issue comments = ~17.7M
* filtered post-2015 issues = ~29.5M
* filtered pre-2015 issues = ~2.9M
* filtered post-2015 issue comments = ~100M
* filtered pre-2015 issue comments = ~11M


#### Other data sources explored:
The [ghtorrent](https://ghtorrent.org/) project doesn't seem to be active: its data only goes up to 2019, and even then we can only get the issue IDs.
The BigQuery public dataset `github_repos` doesn't have data related to issues and comments; it only contains code.



103 changes: 103 additions & 0 deletions codepile/github_issues/docs/bigquery-queries.md
#### Get information about different tables in the dataset
```
SELECT * from `<data-set-id>.__TABLES__`
```

#### Sample events from a table
To explore the data in a table, use sampling instead of a normal query, since it is more cost-efficient:
```
SELECT * FROM `<table-id>` TABLESAMPLE SYSTEM (.001 percent)
```

#### Get issues
```
SELECT json_payload.issue.id as issue_id, json_payload.issue.number as issue_no, issue_url, payload, repo, org, created_at, id, other FROM
(SELECT
id, SAFE.PARSE_JSON(payload) AS json_payload, JSON_VALUE(payload, '$.action') AS action, JSON_QUERY(payload, '$.issue.url') as issue_url, payload, repo, org, created_at, other
FROM `githubarchive.month.20*`
WHERE _TABLE_SUFFIX BETWEEN '1412' and '2300'
AND type = 'IssuesEvent'
) WHERE action = 'opened' AND issue_url IS NOT NULL

```

#### Get issue comments
```
SELECT issue_id, issue_no, comment_id, comment_url, payload, repo, org, created_at, id, other FROM
(SELECT
id, JSON_VALUE(payload, '$.issue.id') AS issue_id, JSON_VALUE(payload, '$.issue.number') as issue_no, JSON_VALUE(payload, '$.comment.id') as comment_id, JSON_VALUE(payload, '$.comment.url') as comment_url, JSON_QUERY(payload, '$.issue.pull_request') as pull_request,payload, repo, org, created_at, other
FROM `githubarchive.month.20*`
WHERE _TABLE_SUFFIX BETWEEN '1412' and '2300'
AND type = 'IssueCommentEvent'
) WHERE comment_url IS NOT NULL AND pull_request IS NULL

```

#### Get pre-2015 issues
```
SELECT tb1.json_payload.issue as issue_id, tb1.json_payload.number as issue_no, payload, repo, org, created_at, id, other FROM
(
select SAFE.PARSE_JSON(payload) as json_payload, JSON_QUERY(payload, '$.action') as action, JSON_QUERY(payload, '$.issue.url') as issue_url, *
from `githubarchive.year.201*`
WHERE type = 'IssuesEvent'
AND _TABLE_SUFFIX BETWEEN '1' and '5'
) tb1
WHERE tb1.action = '"opened"' AND tb1.issue_url IS NULL
```

#### Get pre-2015 issue comments
```
SELECT issue_id, comment_id, comment_url, payload, repo, org, created_at, id, other FROM
(
select JSON_VALUE(payload, '$.comment_id') as comment_id, JSON_VALUE(payload, '$.issue_id') as issue_id, JSON_VALUE(other, '$.url') as comment_url, payload, repo, org, created_at, id, other
from `githubarchive.month.201*`
WHERE _TABLE_SUFFIX BETWEEN '000' AND '501' AND type = 'IssueCommentEvent'
) tb1
WHERE comment_id IS NOT NULL AND NOT CONTAINS_SUBSTR(tb1.comment_url, '/pull/')
LIMIT 100
```

#### Issues filtered
```
select stars, html_url as repo_url, issue_id, issue_no, title, body FROM
`<dataset-id>.100-star-repos` as t1 INNER JOIN
(select issue_id, issue_no, issue_url, repo, JSON_VALUE(payload, '$.issue.title') as title, JSON_VALUE(payload, '$.issue.body') as body from `<dataset-id>.issues`) as t2 ON SUBSTR(t1.html_url, 20) = t2.repo.name
```

#### Issue comments filtered
```
select stars, html_url as repo_url, issue_id, issue_no, comment_id, title, body, comment, created_at FROM
`<dataset-id>.100-star-repos` as t1 INNER JOIN
(select issue_id, issue_no, comment_id, repo, JSON_VALUE(payload, '$.issue.title') as title, JSON_VALUE(payload, '$.issue.body') as body, JSON_VALUE(payload, '$.comment.body') as comment, created_at from
`<dataset-id>.issue-comments`) as t2
ON SUBSTR(t1.html_url, 20) = t2.repo.name
```

#### Star count per repo
```
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```
Additional post-processing is done on top of the results of this query to unify the repo URLs into a single format, since different events use different URL formats (https://github.com, https://api.github.com, etc.). See `repo-stars.md` for details.

#### pre-2015 issues filtered by 100-star repos
```
select t1.stars as repo_stars, t1.html_url as repo_url, t2.issue_id, t2.issue_no
FROM `<dataset-id>.100-star-repos` as t1
INNER JOIN `<dataset-id>.pre-2015-issues` as t2
ON t1.html_url = t2.repo.url
```

#### pre-2015 issue comments filtered by 100-star repo list
```
select t1.html_url as repo_url, t1.stars as repo_stars, t2.issue_id, t2.comment_id, t2.comment_url
FROM `<dataset-id>.100-star-repos` as t1
INNER JOIN `<dataset-id>.pre-2015-issue-comments` as t2
ON t1.html_url = t2.repo.url
```

16 changes: 16 additions & 0 deletions codepile/github_issues/docs/repo-stars.md
Get number of stars by repo
```
SELECT
COUNT(*) naive_count,
COUNT(DISTINCT actor.id) unique_by_actor_id,
COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```

Some events don't contain `repo.id` and some don't have `actor.id`, so `unique_by_actor_login` is the most accurate of the three counts.

`repo.url` takes values of different formats over the time period: some URLs look like https://api.github.com/repos/... while others look like https://api.github.dev/repos/...

The result of the BigQuery query is further processed to bring the repo URLs into a single format and then sum the stars per normalized URL. That list is then filtered down to the repos that have >= 100 stars, as sketched below.
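
A rough, hypothetical sketch of that post-processing (the exported file name and output path are assumptions; the column names come from the query above, and `unique_by_actor_login` is used as the star count since it is the most accurate of the three):
```
import re

import pandas as pd

def normalize_repo_url(url: str) -> str:
    # Strip the varying hosts (https://github.com/, https://api.github.com/repos/,
    # https://api.github.dev/repos/, ...) and keep only "owner/name".
    return re.sub(r"^https://[^/]+/(repos/)?", "", url).strip("/")

# Exported BigQuery result of the star-count query (file name is an assumption).
df = pd.read_csv("watch_event_counts.csv")
df["repo_key"] = df["url"].map(normalize_repo_url)

# Sum stars per normalized repo and keep only repos with >= 100 stars.
stars = (
    df.groupby("repo_key", as_index=False)["unique_by_actor_login"]
    .sum()
    .rename(columns={"unique_by_actor_login": "stars"})
)
stars[stars["stars"] >= 100].to_csv("100-star-repos.csv", index=False)
```
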
7 changes: 7 additions & 0 deletions codepile/github_issues/gh_graphql/README.md
* This Scrapy crawler is not a production-grade implementation and may not follow best practices.
* It uses the GitHub GraphQL endpoint to fetch data. The API lets us get up to 100 issues along with other metadata (labels, comments, author, etc.) in a single request.
* When fetching issues + comments + other metadata, the API returns "secondary rate limit" errors even though we haven't breached the 5k requests/hour limit. It is not entirely clear how to mitigate this; a hedged backoff sketch is included at the end of this README.
* GraphQL requests and responses can be explored here: https://docs.github.com/en/graphql/overview/explorer
* Refer to the standalone `extract-*` scripts to convert the raw GraphQL responses into a flat list of issues/comments.

Run the Scrapy spider with a command like `python run.py 0|1|2...`
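
As noted above, it is not clear how to avoid the secondary rate limits. A minimal, hedged sketch of one possible mitigation (this assumes the limit surfaces as an HTTP 403/429 response whose body mentions "secondary rate limit" and that a `Retry-After` header may be present; the token handling and retry policy here are illustrative, not what the spider currently does):
```
import os
import time

import requests

GRAPHQL_URL = "https://api.github.com/graphql"
TOKEN = os.environ["GITHUB_TOKEN"]  # assumed to be set in the environment

def run_query(query: str, variables: dict, max_retries: int = 5) -> dict:
    headers = {"Authorization": f"bearer {TOKEN}"}
    for attempt in range(max_retries):
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": query, "variables": variables},
            headers=headers,
            timeout=60,
        )
        # Back off if the response looks like a secondary rate limit
        # (assumption: 403/429 status with "secondary rate limit" in the body).
        if resp.status_code in (403, 429) and "secondary rate limit" in resp.text.lower():
            wait = int(resp.headers.get("Retry-After", 2 ** attempt * 10))
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("gave up after repeated secondary rate limit responses")
```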
46 changes: 46 additions & 0 deletions codepile/github_issues/gh_graphql/comments.graphql
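# Fetch one page of issues for a repository, together with up to 100 comments
# (author, body, reactions) per issue, plus rate-limit information.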
query($repo_owner: String!, $repo_name: String!, $page_size: Int!, $after_cursor: String) {
repository(owner: $repo_owner, name: $repo_name) {
issues(first: $page_size, after: $after_cursor) {
pageInfo {
endCursor
hasNextPage
},
edges {
node {
number,
databaseId,
createdAt,
comments(first: 100) {
pageInfo {
hasNextPage,
endCursor
}
nodes {
databaseId
authorAssociation,
author {
login,
avatarUrl,
__typename
}
body
reactionGroups {
content,
reactors {
totalCount
}
}
},
totalCount
}
}
}
}
},
rateLimit {
limit
cost
remaining
resetAt
}
}
39 changes: 39 additions & 0 deletions codepile/github_issues/gh_graphql/extract-comments.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, filter, size, transform

spark_dir = "/data/tmp/"
spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate()
df = spark.read.json("/data/comments-*.jsonl")

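# flatten the raw GraphQL responses: first one row per issue, then one row per comment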
df2 = df.select(["data.repository.issues.pageInfo.hasNextPage", explode("data.repository.issues.edges").alias("issue")])
df3 = df2.select([
col("issue.node.number").alias("issue_no"),
col("issue.node.databaseId").alias("issue_id"),
col("issue.node.createdAt").alias("issue_created_at"),
col("issue.node.comments.pageInfo.hasNextPage").alias("has_more_comments"),
col("issue.node.comments.pageInfo.endCursor").alias("next_comments_cursor"),
explode("issue.node.comments.nodes").alias("comment")
])

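# keep only reaction groups that have at least one reactor (transform_reactions below is currently unused)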
def filter_reactions(x):
return x.reactors.totalCount > 0

def transform_reactions(x):
print(x)
return {x.content: x.reactors.totalCount}

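# select the flat comment fields and de-duplicate on the comment databaseId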
df4 = df3.select([
"issue_no",
"issue_id",
"issue_created_at",
"has_more_comments",
"next_comments_cursor",
"comment.databaseId",
"comment.authorAssociation",
col("comment.author.login").alias("comment_author"),
col("comment.body").alias("comment_body"),
filter("comment.reactionGroups", filter_reactions)
.alias("reaction_groups")
]).dropDuplicates(["databaseId"])

df4.write.parquet("/data/comments")
33 changes: 33 additions & 0 deletions codepile/github_issues/gh_graphql/extract-issues-small.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, filter, size, transform

spark_dir = "/data/tmp/"
spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate()

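# note: these helpers are not used in this lite extraction; only the raw issue fields are kept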
def labels_transformer(x):
return {"name": x.node.name, "description": x.node.description}

def filter_reactions(x):
return x.reactors.totalCount > 0

df = spark.read.json("/data/issues-*.jsonl")

# separate issues into their own rows
df2 = df.select([
col("data.repository.databaseId").alias("repo_id"),
col("data.repository.nameWithOwner").alias("repo_name_with_owner"),
explode("data.repository.issues.edges").alias("issue")
])

# extract and clean issue metadata
df3 = df2.select([
"repo_id",
"repo_name_with_owner",
col("issue.node.number").alias("issue_no"),
col("issue.node.databaseId").alias("issue_id"),
col("issue.node.createdAt").alias("issue_created_at"),
col("issue.node.title").alias("title"),
col("issue.node.body").alias("body")
]).dropDuplicates(["issue_id"])
df3.write.parquet("/data/issues-lite")
print(df3.count())
49 changes: 49 additions & 0 deletions codepile/github_issues/gh_graphql/extract-issues.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, filter, size, transform

spark_dir = "/data/tmp/"
spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate()

def labels_transformer(x):
return {"name": x.node.name, "description": x.node.description}

def filter_reactions(x):
return x.reactors.totalCount > 0

df = spark.read.json("/data/issues.jsonl")

# separate issues into their own rows
df2 = df.select([
col("data.repository.databaseId").alias("repo_id"),
col("data.repository.nameWithOwner").alias("repo_name_with_owner"),
col("data.repository.stargazerCount").alias("star_count"),
col("data.repository.description").alias("repo_description"),
col("data.repository.languages.edges").alias("languages"),
"data.repository.issues.pageInfo.hasNextPage",
col("data.repository.issues.totalCount").alias("issue_count"),
explode("data.repository.issues.edges").alias("issue")
])

# extract and clean issue metadata
df3 = df2.select([
"repo_id",
"repo_name_with_owner",
"star_count",
"repo_description",
"languages",
"issue_count",
col("issue.node.number").alias("issue_no"),
col("issue.node.databaseId").alias("issue_id"),
col("issue.node.createdAt").alias("issue_created_at"),
col("issue.node.title").alias("title"),
col("issue.node.author.login").alias("author"),
col("issue.node.author.avatarUrl").alias("author_avatar"),
col("issue.node.author.__typename").alias("author_type"),
col("issue.node.authorAssociation").alias("author_association"),
col("issue.node.comments.totalCount").alias("comment_count"),
col("issue.node.labels.edges").alias("labels"),
filter("issue.node.reactionGroups", filter_reactions)
.alias("reaction_groups")
]).dropDuplicates(["issue_id"])
df3.write.parquet("/data/issues")
print(df3.count())
Empty file.
12 changes: 12 additions & 0 deletions codepile/github_issues/gh_graphql/gh_graphql/items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GhGraphqlItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass