Review #1
Conversation
> author: alaws
> ---
> [Binoculars](https://github.com/ahans30/Binoculars) is a zero-shot method of detecting LLM-generated text, meaning it is designed to be able to perform classification without having previously seen any examples of these categories. This has the advantage of allowing it to achieve good classification accuracy, even on previously unseen data.

This doesn't really tell me anything about what I'm going to learn by reading this post or who the target audience is.
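For readers who want the mechanics behind that claim, here is a minimal sketch of a Binoculars-style score: an observer model's perplexity on the text divided by a cross-perplexity term comparing the observer's predictions with a performer model's. The model pair and the exact cross-perplexity formula are simplified assumptions, not a faithful reimplementation of the linked repo.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative observer/performer pair (a base model and its instruct-tuned sibling);
# swap in whatever pair you actually use.
OBSERVER_ID = "tiiuae/falcon-7b"
PERFORMER_ID = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(OBSERVER_ID)
observer = AutoModelForCausalLM.from_pretrained(OBSERVER_ID).eval()
performer = AutoModelForCausalLM.from_pretrained(PERFORMER_ID).eval()

def binoculars_style_score(text: str) -> float:
    """Perplexity of the observer divided by a cross-perplexity term (simplified)."""
    enc = tokenizer(text, return_tensors="pt")
    targets = enc["input_ids"][0, 1:]
    with torch.no_grad():
        obs_logits = observer(**enc).logits[0, :-1]
        perf_logits = performer(**enc).logits[0, :-1]

    # Log-perplexity of the observer on the text (mean next-token cross-entropy).
    log_ppl = F.cross_entropy(obs_logits, targets)

    # Cross-perplexity: how surprised the observer is by the performer's predicted distribution.
    perf_probs = F.softmax(perf_logits, dim=-1)
    obs_log_probs = F.log_softmax(obs_logits, dim=-1)
    log_x_ppl = -(perf_probs * obs_log_probs).sum(dim=-1).mean()

    return (log_ppl / log_x_ppl).item()
```

Lower scores indicate text closer to what an LLM would generate, which is what makes a threshold on this value usable for classification later in the post.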
> First, we provided the pipeline with the URLs of some GitHub repositories and used the GitHub API to scrape the files in the repositories. To ensure that the code was human written, we chose repositories that were archived before the release of Generative AI coding tools like [GitHub Copilot](https://github.com/features/copilot).
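A minimal sketch of that scraping step, assuming the standard GitHub REST endpoints and a hypothetical target repository (the post does not name the repositories it used):

```python
import requests

# Hypothetical example repository; the actual repositories used in the post are not listed here.
OWNER, REPO, BRANCH = "octocat", "Hello-World", "master"

def list_python_files(owner: str, repo: str, branch: str) -> list[str]:
    """List .py paths in a repository via the git trees endpoint (recursive listing)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"
    tree = requests.get(url, timeout=30).json()["tree"]
    return [entry["path"] for entry in tree if entry["path"].endswith(".py")]

def fetch_raw_file(owner: str, repo: str, branch: str, path: str) -> str:
    """Download a single file's raw contents."""
    raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"
    return requests.get(raw_url, timeout=30).text

# Unauthenticated requests are heavily rate-limited; pass an Authorization header for real use.
files = {path: fetch_raw_file(OWNER, REPO, BRANCH, path) for path in list_python_files(OWNER, REPO, BRANCH)}
```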
> If we were using the pipeline to generate functions, we would first use an LLM ([GPT-3.5-turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo)) to identify individual functions from the file and extract them programmatically.

I think it needs a sentence to explain why an LLM was used for this (a low-effort way to get results across a broad range of languages).
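For illustration, such an extraction call might look like the sketch below; the prompt wording is an assumption, not the pipeline's actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def identify_functions(source_code: str) -> list[str]:
    """Ask GPT-3.5-turbo to name the functions defined in a source file.
    The returned names can then be used to slice the functions out programmatically."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You identify function definitions in source code."},
            {
                "role": "user",
                "content": "List the names of the functions defined in this file, one per line:\n\n" + source_code,
            },
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    return [name.strip() for name in lines if name.strip()]
```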
> Because of this difference in scores between human and AI-written text, classification can be performed by selecting a threshold, and categorising text which falls above or below the threshold as human or AI-written respectively. Therefore, our team set out to investigate whether we could use Binoculars to detect AI-written code, and what factors might impact its classification performance.
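In code, the thresholding described here is just a comparison; the threshold value below is a placeholder, and the above/below convention follows the excerpt (higher scores read as human-written).

```python
def classify(score: float, threshold: float = 0.9) -> str:
    """Scores above the threshold are labelled human-written, otherwise AI-written.
    The default threshold is a placeholder, not a value from the post."""
    return "human-written" if score > threshold else "AI-written"

labels = [classify(score) for score in (0.72, 0.95, 1.10)]
```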
> ## Creating a Dataset

Do we want to mention that we were looking for samples with a range of token lengths?
> To investigate this, we tested 3 models, namely [IBM Granite 3B](https://huggingface.co/ibm-granite/granite-3b-code-base), [DeepSeek Coder 1.3B](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) and [CodeLlama 7B](https://huggingface.co/codellama/CodeLlama-7b-hf) using datasets containing Python and JavaScript code.
|  | ||
| _Box plots showing the distribution Binoculars scores calculated using each model_ |
Suggested change:
- _Box plots showing the distribution Binoculars scores calculated using each model_
+ _Box plots showing the distribution of Binoculars scores calculated using each model_
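For completeness, a figure like the one captioned above can be produced with matplotlib's boxplot; the per-model scores below are invented purely to show the call.

```python
import matplotlib.pyplot as plt

# Invented score distributions, one list per model, purely to illustrate the plotting call.
scores_by_model = {
    "IBM Granite 3B": [0.85, 0.91, 1.02, 0.97, 1.08],
    "DeepSeek Coder 1.3B": [0.80, 0.88, 0.95, 1.01, 0.93],
    "CodeLlama 7B": [0.83, 0.90, 0.99, 1.05, 0.87],
}

fig, ax = plt.subplots()
ax.boxplot(list(scores_by_model.values()), labels=list(scores_by_model.keys()))
ax.set_ylabel("Binoculars score")
ax.set_title("Distribution of Binoculars scores per model")
plt.show()
```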
|  |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these need more explanation. It's not obvious what these show relative to any expectation.
> To get an indication of classification, we also plotted our results on a [ROC Curve](<https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=An%20ROC%20curve%20(receiver%20operating,False%20Positive%20Rate)>), which shows the classification performance across all thresholds. The AUC (Area Under the Curve) value is then calculated, which is a single value representing the performance across all thresholds.
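For anyone reproducing this, the ROC curve and AUC are typically computed with scikit-learn; the scores and labels below are made up purely to show the shape of the calculation.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up Binoculars scores and ground-truth labels (1 = human-written, 0 = AI-written).
scores = [0.72, 0.81, 0.95, 1.02, 1.10, 0.88]
labels = [0, 0, 1, 1, 1, 0]

# Human-written text is expected to score higher, so the raw scores can be used directly.
fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)
print(f"AUC = {auc:.3f}")
```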
> The above ROC Curve shows the same findings, with a clear split in classification accuracy when we compare token lengths above and below 300 tokens. This, coupled with the fact that performance was worse than random chance for input lengths of 25 tokens, suggested that to for Binoculars to reliably classify code as human or AI-written, there may be minimum input token length requirement.
Suggested change:
- The above ROC Curve shows the same findings, with a clear split in classification accuracy when we compare token lengths above and below 300 tokens. This, coupled with the fact that performance was worse than random chance for input lengths of 25 tokens, suggested that to for Binoculars to reliably classify code as human or AI-written, there may be minimum input token length requirement.
+ The above ROC Curve shows the same findings, with a clear split in classification accuracy when we compare token lengths above and below 300 tokens. This, coupled with the fact that performance was worse than random chance for input lengths of 25 tokens, suggested that for Binoculars to reliably classify code as human or AI-written, there may be a minimum input token length requirement.
> ## Discovering a Problem
> Although these findings were interesting, they were also surprising, which meant we needed to exhibit caution. We decided to reexamine our process, starting with the data. It could be the case that we were seeing such good classification results because the quality our AI-written code was poor.
Suggested change:
- Although these findings were interesting, they were also surprising, which meant we needed to exhibit caution. We decided to reexamine our process, starting with the data. It could be the case that we were seeing such good classification results because the quality our AI-written code was poor.
+ Although these findings were interesting, they were also surprising, which meant we needed to exhibit caution. We decided to reexamine our process, starting with the data. It could be the case that we were seeing such good classification results because the quality of our AI-written code was poor.
|  | ||
| _Our new pipeline for generating AI code samples_ | ||
> First, we swapped our data source to use the [github-code-clean](https://huggingface.co/datasets/codeparrot/github-code-clean) dataset, in which the code files had been filtered to remove files that are auto-generated, have short line lengths, or a high proportion of non-alphanumeric characters.

I think we need to justify why we thought this was ok despite this being a very likely training dataset.
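For reference, pulling samples from that dataset with the Hugging Face datasets library might look roughly like the sketch below; the field names follow the dataset card, and the language filter and sample count are illustrative rather than the post's actual settings.

```python
from datasets import load_dataset

# Stream rather than download the whole corpus; trust_remote_code is needed for
# datasets that ship a loading script.
ds = load_dataset(
    "codeparrot/github-code-clean",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Collect a small, illustrative number of Python files.
python_samples = []
for record in ds:
    if record.get("language") == "Python":
        python_samples.append(record["code"])
    if len(python_samples) >= 100:
        break
```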
> ## Lessons Learnt
> #### The foundation of good research is good quality data

Adjacent titles always look a bit weird. Maybe add an intro sentence?
> Automation can be both a blessing and a curse, so exhibit caution when you're using it. Automation allowed us to rapidly generate the huge amounts of data we needed to conduct this research, but by relying on automation too much, we failed to spot the issues in our data. In hindsight, we should have dedicated more time to manually checking the outputs of our pipeline, rather than rushing ahead to conduct our investigations using Binoculars.
> Although our data issues were a setback, we had set up our research tasks in such a way that they could be easily rerun, predominantly by using notebooks. Research processes often need refining and repeating, so they should be developed with this in mind.

Maybe add a conclusion heading before this?
I think we should also conclude something about the viability of using Binoculars for this task. Maybe accepting that it is an intrinsically hard task
https://alaws-scottlogic.github.io/blog/2024/08/05/lessons-on-data-quality.html