Replication package for the paper "Analyzing techniques for duplicate detection on Q&A websites for game development"

This repository contains all of the code used for extracting and processing data from the Stack Exchange data dump, and training and evaluating the duplicate question detection techniques presented in the paper.

The data

The data used in our study is available on Zenodo. It includes all of the results, measures, and models trained throughout the methodology. More information about how the data is organized can be found in the README file in the data/ directory.

In this repository we provide mock datasets that simulate the ones used in our study. These datasets contain a small set of questions and duplicate pairs that can be used to test the code we provide.

Benchmark datasets

We provide three datasets that can be used for evaluating other duplicate detection methodologies and comparing with our results. These datasets have been packaged separately and can also be found on Zenodo.

Two of the datasets focus on game development and are based on questions from Stack Overflow and the Game Development Stack Exchange. The third dataset consists of five randomly selected samples of equal size collected from Stack Overflow. All of these datasets were extracted from the June 2021 Stack Exchange Data Dump. More information about the datasets is available in our paper.

Using the code

All of the code used in our paper can be found in the code/ directory. The scripts are written in Python 3.9 and depend on several libraries that must be installed for them to run correctly. The code also uses features that are exclusive to Unix systems, so executing the scripts in other environments may lead to errors. More information about the code can be found in the README file in the code/ directory.
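As a quick sanity check before installing anything, the sketch below tests whether an interpreter version and operating system match the requirements stated above. The function name check_environment is hypothetical and not part of the repository, and treating every POSIX system (Linux and macOS) as "Unix" is an assumption made for illustration.

```python
import os
import sys


def check_environment(version_info=sys.version_info, os_name=os.name):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    # The scripts are written for Python 3.9; other versions are flagged.
    if tuple(version_info[:2]) != (3, 9):
        problems.append(
            f"Python 3.9 expected, found {version_info[0]}.{version_info[1]}"
        )
    # os.name is "posix" on Linux and macOS and "nt" on Windows.
    if os_name != "posix":
        problems.append(f"Unix-like system expected, found os.name={os_name!r}")
    return problems
```

Calling check_environment() with no arguments inspects the running interpreter; the returned messages are only warnings and do not prove that the scripts will fail.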

To help you execute the code and reproduce our results, we have included a requirements.txt file that can be used to install the required packages. We also provide a Dockerfile that can be used to create and run a container with all the requirements for running the code. Please read the following sections for instructions on how to set up a virtual environment or Docker container for running the code.

Note that running the code still has hardware requirements. We recommend using a system with at least 32 GB of RAM and 200 GB of storage space. We also recommend using a CUDA-capable GPU to reduce the time it takes to calculate the embeddings based on deep-learning techniques. If you do not have access to a CUDA-capable GPU, you can use Google Colab for free to compute these embeddings. In the notebooks/ directory we provide an example notebook that can be used on Google Colab for computing embeddings.
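The RAM and storage recommendations above can be checked with the Python standard library. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the RAM lookup relies on sysconf names that are assumed to be available (they are on Linux), and the thresholds simply mirror the 32 GB and 200 GB figures mentioned above.

```python
import os
import shutil

# Decimal gigabytes: slightly lenient for RAM, since the operating system
# reserves part of the installed memory.
GB = 10 ** 9


def free_disk_gb(path="."):
    """Free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GB


def total_ram_gb():
    """Total physical memory in GB, or None if it cannot be determined."""
    try:
        total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. on Windows, where os.sysconf does not exist
    return total / GB if total > 0 else None


def hardware_warnings(ram_gb, disk_gb, min_ram_gb=32, min_disk_gb=200):
    """Compare measured values against the recommended minimums."""
    warnings = []
    if ram_gb is not None and ram_gb < min_ram_gb:
        warnings.append(f"RAM: {ram_gb:.0f} GB found, {min_ram_gb} GB recommended")
    if disk_gb < min_disk_gb:
        warnings.append(f"Disk: {disk_gb:.0f} GB free, {min_disk_gb} GB recommended")
    return warnings
```

For example, hardware_warnings(total_ram_gb(), free_disk_gb("data")) would report on the filesystem holding the data/ directory, assuming the call is made from the repository root. GPU availability is not covered by this sketch.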

Virtual environment

You can use your preferred method for creating a virtual environment on your system. conda and venv are two popular options for this task:

Creating a conda environment

Run the following commands from the repository's root directory to create a conda environment. conda must be installed beforehand.

Create the environment with conda create -n gamedev_dups python=3.9
Activate the environment with conda activate gamedev_dups
Install the required packages with conda install -c conda-forge --file requirements.txt

Creating an environment with venv

Run the following commands from the repository's root directory to create an environment using venv. Python 3.9 must be installed beforehand.

Create the environment with python3 -m venv gamedev_dups
Activate the environment with source gamedev_dups/bin/activate on Unix systems or gamedev_dups\Scripts\activate on Windows.
Install the required packages with pip install -r requirements.txt
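After installing the packages with either method, it can be useful to confirm that everything listed in requirements.txt is actually present in the active environment. The sketch below is one way to do this with the standard library; the function names are hypothetical, and the parser is deliberately simple (it assumes one plain package specifier per line and does not handle URL or VCS requirements).

```python
import re
from importlib import metadata

# A package name starts with a letter or digit and stops at the first
# character that begins a version specifier, extra, or marker.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_names(lines):
    """Extract package names from the lines of a requirements file."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _NAME.match(line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

A typical use would be to open requirements.txt, pass the file object to requirement_names, and print the result of missing_packages. The check only looks at whether a distribution is installed, not whether its version satisfies the specifier.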

Docker container

Please follow these steps to create a Docker container to run the code:

Install Docker on your system if you have not already done so;
On a terminal or command prompt, type the following command to create a Docker image based on the repository: docker build -t dup_questions .
After Docker has finished building the image, use the following command to launch a container based on the image and use it in interactive mode: docker run -dit dup_questions /bin/bash
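After executing the scripts, exit the Docker container using CTRL+D and use the following command to copy the data from the container back to your system: docker cp dup_questions:~/data . Note that docker cp expects a container name or ID before the colon, so if your container is not named dup_questions, use the name or ID listed by docker ps -a.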

BEFORE EXECUTING THE CODE

We highly recommend that you perform a test run of the code using the mock data provided in this repository before attempting to use other data. To do this, type python3 full_pipeline.py while in the code/ directory on your terminal or Docker container. The run should take a few minutes and complete without errors.
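If you prefer to script the test run, for instance to record how long it takes, a small wrapper around subprocess is enough. This is a minimal sketch under stated assumptions: run_script is a hypothetical helper, not part of the repository, and the script path and working directory passed to it should match the command given above.

```python
import subprocess
import time


def run_script(args, cwd=None):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    completed = subprocess.run(args, cwd=cwd)  # output goes to the terminal
    return completed.returncode, time.monotonic() - start
```

For example, run_script([sys.executable, "full_pipeline.py"], cwd="code") (after importing sys) would launch the pipeline with the current interpreter from the repository root. An exit code of 0 indicates that the process finished without raising an unhandled error.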

After performing a successful test run, please change the values in the consts.py file to your preferred ones. We have changed some portions of the file to make the test runs quicker. If you wish to use the same parameters as we did in our study, replace the consts.py file with the one that shows the values used in our study.

Executing the code from scratch

Follow the steps below to execute the code from scratch, i.e., starting from the data in the Stack Exchange Data Dump.

Download the Posts.xml and PostLinks.xml files for Stack Overflow from the data dump;
Replace the existing files in the data/stackoverflow/raw/ directory with the ones that you just downloaded. Do not alter the names of the files (keep them as Posts.xml and PostLinks.xml);
Download the archive for the Game Development Stack Exchange from the data dump;
Replace the existing archive in the data/gamedev_se/raw/ directory with the one that you just downloaded;
Execute the command python3 scripts/full_pipeline.py while in the code/ directory to run the whole pipeline from start to finish, or execute the scripts following the order described in the README file in the code/ directory.
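Because the pipeline is long-running, it may be worth verifying that the raw files are in place before starting it. The sketch below checks for the two Stack Overflow files named in the steps above and for the presence of at least one file in the Game Development archive directory. The function names are hypothetical, and since the archive's file name is not specified here, the second check only tests that the directory is not empty.

```python
from pathlib import Path

# File names that must be kept unchanged, as stated in the steps above.
EXPECTED_RAW_FILES = ("Posts.xml", "PostLinks.xml")


def missing_raw_files(raw_dir, expected=EXPECTED_RAW_FILES):
    """Return the expected file names that are absent from raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


def contains_any_file(directory):
    """Return True if the directory exists and holds at least one file."""
    directory = Path(directory)
    return directory.is_dir() and any(p.is_file() for p in directory.iterdir())
```

From the repository root, missing_raw_files("data/stackoverflow/raw") should return an empty list and contains_any_file("data/gamedev_se/raw") should return True before the pipeline is started.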

Executing the code using our data

To execute the code using our data, download the data package available on Zenodo and unzip it into the data/ directory. You can then execute the scripts or notebooks in the code/ directory in any order you like.
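The extraction step can also be done from Python. The sketch below assumes the downloaded package is a standard .zip archive; the helper name extract_archive is hypothetical, and the archive's actual file name must be supplied by you.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a zip archive into target_dir and return its member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create data/ if needed
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())
```

The returned list of member names can be compared against the layout described in the README file in the data/ directory to confirm that the files landed where the scripts expect them.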
