Writing is a cognitively active task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities.
Scholarly writing is particularly complex as it requires authors to coordinate many pieces of multiform knowledge while meeting high academic standards.
To understand writers' cognitive processes, one must fully decode end-to-end writing data (from the first keystroke to the final manuscript) and unpack the complex cognitive mechanisms behind scientific writing.
We introduce the ScholaWrite dataset, first-of-its-kind keystroke logs of an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke.
Our dataset includes
This branch contains the following folders:

- `scholawrite_system`: The ScholaWrite system, which includes the data collection backend, admin page, and annotation page.
- `scholawrite_finetune`: Fine-tuning scripts for BERT, RoBERTa, Llama-3b-instruct, and Llama-8b-instruct on our dataset.
- `gpt4o`: Scripts for running GPT-4o on iterative writing and intention prediction.
- `meta_inference`: Scripts for running the Llama-3b-instruct and Llama-8b-instruct baseline models on iterative writing and intention prediction.
- `eval_tool`: Webpage for visualizing the iterative writing outputs of Llama-8b-instruct (baseline) and Llama-8b-SW for human evaluation.
- `analysis`: Scripts for computing cosine similarity between seed documents and final outputs of iterative writing, lexical diversity of final outputs from iterative writing, F1 scores for the intention prediction task, and intention diversity/coverage in iterative writing.
- Go to this site to download and install MongoDB on your computer.
- Go to this site to download MongoDB Compass, which provides a user-friendly GUI for viewing, finding, and managing the documents in the database. MongoDB Compass also provides a MongoDB Shell feature.
- Install the Database Tools for your OS so that you can back up and restore your database.
- Run MongoDB on the default port 27017.
- A database named `flask_db`, with a collection `activity` inside it, will be created once you run the ScholaWrite system.
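For example, the Database Tools can back up and restore the `flask_db` database from the command line (a minimal sketch; the dump directory is arbitrary):

```bash
# Back up the flask_db database to ./backup
mongodump --port 27017 --db flask_db --out ./backup

# Restore it later from the same dump
mongorestore --port 27017 --db flask_db ./backup/flask_db
```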
- Make sure you have a Google Cloud project.
- Follow the steps to create an OAuth client for a Desktop app.
- Download the OAuth client file you just created and rename it to `sheet_credential.json`.
- Place it in any folder you want.
- Replace lines 12 and 13 in `/scholawrite_system/docker-compose.yml` with the following:

```yaml
volumes:
  - <folder path you put the sheet_credential.json>/sheet_credential.json:/usr/local/src/scholawrite/flaskapp/sheet_credential.json
  - <folder path you put the token.json>/google_OAuth2/token.json:/usr/local/src/scholawrite/flaskapp/token.json
```
- Make sure you have a Google Sheet.
- Add all Overleaf project IDs you want the system to monitor. The IDs should be on consecutive rows in the same column (e.g., `A1:A9`).
- Go to line 21 of `scholawrite_system/App.py` and replace `SAMPLE_SPREADSHEET_ID` with the ID of your Google Sheet. The ID appears in the sheet's URL: `https://docs.google.com/spreadsheets/d/<SAMPLE_SPREADSHEET_ID>/edit?gid=0#gid=0`.
- Go to line 23 of `scholawrite_system/App.py` and replace `SAMPLE_RANGE_NAME` with the actual range in the Google Sheet where you stored the Overleaf project IDs.
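The two edits can also be scripted (a sketch; it assumes the placeholder strings appear literally on those lines, and the Sheet ID and range below are hypothetical examples):

```bash
# Substitute your own Sheet ID and range into scholawrite_system/App.py
sed -i 's/SAMPLE_SPREADSHEET_ID/1AbCdEfGhIjKlMnOpQrStUv/' scholawrite_system/App.py
sed -i 's/SAMPLE_RANGE_NAME/A1:A9/' scholawrite_system/App.py
```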
- You need either:
- One Ngrok account that supports 3 Static Domains and 3 Secure Tunnel Agents, or
- Three Ngrok accounts that each support 1 Static Domain and 1 Secure Tunnel Agent.
- Go to the Ngrok dashboard and copy your AuthToken.
- Create three configuration files in the `scholawrite_system` folder named `ngrok_admin.yml`, `ngrok_annotation.yml`, and `ngrok_schola.yml`.
- Paste your AuthToken(s) into these files in the following format:

```yaml
version: 2
authtoken: <Your AuthToken>
```
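If all three tunnels share a single account, one quick way to create the files is (a sketch; run it from the `scholawrite_system` folder):

```bash
# Write the same AuthToken into all three ngrok config files
for f in ngrok_admin.yml ngrok_annotation.yml ngrok_schola.yml; do
  printf 'version: 2\nauthtoken: %s\n' '<Your AuthToken>' > "$f"
done
```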
- Create three domains.
- Paste the domains into the `command` lines on lines 36, 63, and 90 in `/scholawrite_system/docker-compose.yml`. For example:

```yaml
command: ["ngrok", "http", "annotation:5100", "--host-header=annotation:5100", "--domain=<your domain>", "--log=stdout", "--log-level=debug"]
```
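A quick way to confirm all three domains made it into the compose file (optional sketch):

```bash
# Each of the three ngrok services should show its own --domain flag
grep -n "domain=" scholawrite_system/docker-compose.yml
```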
Run the following command:

```bash
docker-compose up
```

The data collection backend, admin page, and annotation page will then be running and accessible to the public through Ngrok.
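To confirm everything came up (optional; output details depend on the compose file):

```bash
# All services should report "Up"
docker-compose ps

# Watch the logs to see the Ngrok tunnels come online
docker-compose logs -f
```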
- Copy the domain from the `command` on line 36 in `scholawrite_system/docker-compose.yml`.
- Paste it into:
  - Line 3 of `scholawrite_system/extention/background.js`
  - Line 5 of `scholawrite_system/extention/popup.js`
- Open your browser and go to `chrome://extensions`.
- Enable Developer Mode at the top-right corner of the page.
- Click Load unpacked and navigate to the `scholawrite_system` folder.
- Select the `extention` folder.
- Once the extension is loaded:
  - Open another Chrome tab and go to your Overleaf project page. If it is already open, refresh the page.
  - Click the puzzle icon at the top-right corner of your browser, and the `S` logo will appear.
  - Click the `S` logo to show the extension UI.
  - Log in or register, then toggle on Record writer actions.
**Note:** Due to Overleaf UI updates, the Chrome extension can no longer record writer actions or run the AI paraphrase feature.
- You need one Ngrok account that supports 1 Static Domain and 1 Secure Tunnel Agent.
- Go to the Ngrok dashboard and copy your AuthToken.
- Create a configuration file named `ngrok.yml` in the `eval_tool` folder.
- Paste your AuthToken into the file in the following format:

```yaml
version: 2
authtoken: <Your AuthToken>
```

- Create one domain.
- Paste the domain into the `command` line in `/eval_tool/run_eval_app.sh`:

```bash
tmux new-session -d -s eval_ngrok "ngrok --config ./ngrok.yml http --url=<your domain> 12345"
```
- Navigate to the `eval_tool` folder.
- Run:

```bash
docker-compose up -d
```

- After the container is created, run:

```bash
docker exec -it scholawrite_eval bash
```

- Inside the container, run:

```bash
./run_eval_app.sh
```

- Go to the domain you pasted into `/eval_tool/run_eval_app.sh` using your browser.
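If the page does not load, a quick check of the tunnel from your machine (optional sketch):

```bash
# The Ngrok domain should answer with an HTTP status line
curl -I https://<your domain>
```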
The fine-tuning uses Unsloth and QLoRA. Ensure your GPU has at least 16GB VRAM.
- Navigate to the root folder of `ScholaWrite-Public`.
- Create an `.env` file with the following content:

```
HUGGINGFACE_TOKEN="<Your Hugging Face access token>"
OPEN_AI_API="<Your OpenAI API key>"
```
- Create a Docker container for fine-tuning and inference:

```bash
docker run --name scholawrite_container_2 --gpus all -dt -v ./:/workspace --ipc=host pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel bash
```

- Access the Docker container:

```bash
docker exec -it scholawrite_container_2 bash
```

- Install the required Python packages:

```bash
pip install accelerate python-dotenv huggingface-hub datasets transformers trl unsloth diff_match_patch
```
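Before launching a run, it is worth confirming the container can see the GPU (optional sketch):

```bash
# Both commands should succeed inside the container
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```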
- Inside the Docker container, navigate to the `scholawrite_finetune` folder.
- Select the appropriate folder based on the model you want to fine-tune:
  - `bert_finetune`: For fine-tuning `bert-base-uncased` or `FacebookAI/roberta-base` on a classification task.
  - `llama8b_scholawrite_finetune`: For fine-tuning `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` on classification or iterative writing tasks.
  - `llama3b_scholawrite_finetune`: For fine-tuning `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` on classification or iterative writing tasks.
For the Llama folders (`llama8b_scholawrite_finetune` or `llama3b_scholawrite_finetune`):

- **Iterative Writing:**
  - Open `args.py` and set `PURPOSE = "WRITING"`.
  - Run the fine-tuning script: `python3 train_writing.py`
- **Classification:**
  - Open `args.py` and set `PURPOSE = "CLASS"`.
  - Run the fine-tuning script: `python3 train_classifier.py`

For `bert_finetune`:

- **Classification:**
  - Run the fine-tuning script: `python3 small_model_classifier.py`
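Putting the steps together for one model (a sketch, run inside the container from the repo root; the `sed` line assumes `PURPOSE` is a simple assignment in `args.py`, otherwise edit the file by hand):

```bash
# Fine-tune the 8B model on the iterative writing task
cd scholawrite_finetune/llama8b_scholawrite_finetune
sed -i 's/^PURPOSE = .*/PURPOSE = "WRITING"/' args.py
python3 train_writing.py
```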
After fine-tuning, the fine-tuned model will be stored in the `results` folder in the root directory of `ScholaWrite-Public`.

Ensure you are inside the `scholawrite_container_2` Docker container:
- If it is already running:

```bash
docker exec -it scholawrite_container_2 bash
```

- If it is not running: follow the setup instructions in the Environment Setup section.
Navigate to the appropriate folder inside the Docker container:

- `scholawrite_finetune`:
  - `bert_finetune`: For running fine-tuned `bert-base-uncased` or `FacebookAI/roberta-base`.
  - `llama8b_scholawrite_finetune`: For running fine-tuned `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`.
  - `llama3b_scholawrite_finetune`: For running fine-tuned `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`.
- `meta_inference`:
  - `llama8b_meta_instruction`: For the baseline `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`.
  - `llama3b_meta_instruction`: For the baseline `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`.
- **Iterative Writing:**

  Note: `scholawrite_finetune/llama3b_scholawrite_finetune` does not have an iterative writing script.

  - Ensure the model name on lines 65 and 80 of `iterative_writing.py` matches the path to the model or its name on Hugging Face.
  - On line 15 of `iterative_writing.py`, specify a unique `output_folder_name` to avoid overwriting existing outputs.
  - Run the script: `python3 iterative_writing.py`
  - Outputs will be saved in the `output_folder_name/seed name/generation` and `output_folder_name/seed name/intention` folders under the `ScholaWrite-Public` root directory (see the sketch after this list):
    - `seed name`: filename of a seed document (see `ScholaWrite-Public/seeds`); the possible seed names are seed1, seed2, seed3, and seed4.
    - `output_folder_name/seed name/generation`: Stores the model's writing outputs, one text file per iteration (e.g., 100 iterations produce 100 text files).
    - `output_folder_name/seed name/intention`: Stores the model's intention for each writing output, one text file per iteration (e.g., 100 iterations produce 100 text files).
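For example, with a hypothetical `output_folder_name` of `my_run` and 100 iterations, the outputs can be inspected like this (a sketch; per-iteration filenames depend on the script):

```bash
# From the ScholaWrite-Public root: one text file per iteration in each folder
ls my_run/seed1/generation   # 100 writing outputs
ls my_run/seed1/intention    # 100 matching intention files
```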
- **Classification:**
  - Ensure the model name on line 40 of `classification.py` matches the path to the model or its name on Hugging Face.
  - Run the script: `python3 classification.py`
  - A CSV file with the classification results will be generated in the current directory; the true label is in the `label` column and the predicted label is in the `predicted` column.
- **Classification** (BERT/RoBERTa):
  - Run the script: `python3 small_model_inference.py`
  - A CSV file with the classification results will be generated in the current directory; the true label is in the `label` column and the predicted label is in the `predicted_label` column.
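As a quick sanity check on either results CSV, accuracy can be computed from the label columns (a sketch; the filename is hypothetical, and the field indices must be adjusted to match the actual file):

```bash
# Compare true vs. predicted labels, assuming they are the 2nd and 3rd columns
awk -F',' 'NR > 1 { n++; if ($2 == $3) c++ } END { printf "accuracy: %.3f\n", c / n }' results.csv
```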