A practical guide based on a real implementation on Ubuntu with Python 3.8
- Environment Setup
- Understanding SourcererCC Granularity Levels
- File-Level Clone Detection
- Function-Level Clone Detection
- Interpreting Results
- Troubleshooting
- Ubuntu (or other Linux distribution)
- Git
- Java 17+ installed
- Conda or Python 3.8+
```bash
# Create conda environment with Python 3.8
conda create -n sourcerercc python=3.8

# Activate environment
conda activate sourcerercc
```

Note: Mac users need Python 3.6 due to multiprocessing issues, but Ubuntu/Linux users can use Python 3.8 or 3.9.
```bash
git clone https://github.com/Mondego/SourcererCC.git
cd SourcererCC

# Install Apache Ant (for building Java components)
sudo apt install ant

# Install Java parsing library (for function extraction)
pip install javalang
```

SourcererCC works at multiple levels:
| Level | Detects Clones Between | Use Case |
|---|---|---|
| File-level | Entire files | Finding duplicate files or very similar code files |
| Block-level | Individual functions/methods | Finding duplicate functions within or across files |
| Statement-level | Code blocks/statements | Fine-grained clone detection |
Key Insight: In our testing, two Java files with similar functionality produced NO clones at file-level (the entire files were <80% similar) but 7 clone pairs at function-level (individual functions were clones).
```bash
cd tokenizers/file-level

# Create project list file
echo "project1/sample-input.zip" > project-list.txt
```

Edit `config.ini`:

```ini
[Main]
N_PROCESSES = 1
PROJECTS_BATCH = 1
FILE_projects_list = project-list.txt

[Folders/Files]
PATH_stats_file_folder = files_stats
PATH_bookkeeping_proj_folder = bookkeeping_projs
PATH_tokens_file_folder = files_tokens
PATH_logs = logs

[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java

[Config]
init_file_id = 1
init_proj_id = 1
```

Important Language Settings:

- For Java: `comment_inline = //`, tags: `/* */`
- For Python: `comment_inline = #`, tags: `''' '''`
- For C/C++: same as Java
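The `[Language]` section can also be generated programmatically if you switch target languages often. A sketch using Python's `configparser` to emit a `[Language]` section for Python sources (the output filename is arbitrary; note that `configparser` lowercases key names on write, which is fine as long as the tokenizer reads keys case-insensitively, as Python's own `ConfigParser` does):

```python
import configparser

# '%' appears in the separators string, so disable interpolation.
config = configparser.ConfigParser(interpolation=None)
config["Language"] = {
    "separators": "; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , \" \\ : $ ' ` @",
    "comment_inline": "#",        # Python line comments
    "comment_open_tag": "'''",    # docstring-style block comments
    "comment_close_tag": "'''",
    "File_extensions": ".py",
}

# Write a fragment you can splice into config.ini.
with open("config_language_py.ini", "w") as fh:
    config.write(fh)
```

Reading the fragment back with `configparser` round-trips the values unchanged, so you can verify it before merging it into the tokenizer's `config.ini`.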
```bash
python tokenizer.py zip
```

Output folders created:

- `bookkeeping_projs/` - Project index
- `files_stats/` - File statistics (lines, LOC, SLOC)
- `files_tokens/` - Tokenized files
```bash
cd ../../clone-detector

# Build the Java components
ant cdi
```

This creates `dist/indexbased.SearchManager.jar`.
```bash
# Combine all token files
cat ../tokenizers/file-level/files_tokens/* > input/dataset/blocks.file
```

Edit `sourcerer-cc.properties`:

```
MIN_TOKENS=65
MAX_TOKENS=500000
```

Edit `runnodes.sh` (line 9) for similarity threshold:

```bash
threshold="${3:-8}"  # 8 = 80%, 7 = 70%, etc.
```

Edit `runnodes.sh` (line 20) for JVM memory:

```bash
# For systems with 257GB RAM, use:
-Xms32g -Xmx32g

# For smaller systems:
-Xms6g -Xmx6g
```

Run detection and collect results:

```bash
python controller.py
cat NODE_*/output8.0/query_* > results.pairs
cat results.pairs
```

Expected for file-level: often empty if files are structurally different, even if they contain similar functions.
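Why can file-level detection come up empty while function-level finds clones? SourcererCC's similarity model (as described in its ICSE 2016 paper) treats each file or block as a bag of tokens and reports a clone when the multiset overlap covers the threshold fraction of the larger bag. A minimal Python sketch of that model (the real tool adds a partial token index and filtering optimizations on top):

```python
import math
from collections import Counter

def overlap_similarity(tokens_a, tokens_b):
    """Multiset overlap between two token bags."""
    shared = Counter(tokens_a) & Counter(tokens_b)
    return sum(shared.values())

def is_clone(tokens_a, tokens_b, threshold=0.8):
    """Clone when the overlap covers `threshold` of the larger bag."""
    needed = math.ceil(threshold * max(len(tokens_a), len(tokens_b)))
    return overlap_similarity(tokens_a, tokens_b) >= needed

# Two token bags sharing 4 of 5 tokens -> 80% similar.
a = ["int", "i", "=", "0", ";"]
b = ["int", "j", "=", "0", ";"]
print(is_clone(a, b))        # True at the default 80% threshold
print(is_clone(a, b, 0.9))   # False at 90%
```

A whole file concatenates the tokens of many functions, so a couple of shared functions rarely push the whole-file overlap past 80% — which is exactly the behavior observed above.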
This is where SourcererCC shines for finding duplicate functions.
```bash
cd ../tokenizers/block-level
```

Edit `config.ini`:

```ini
[Main]
N_PROCESSES = 1
PROJECTS_BATCH = 1
FILE_projects_list = project-list.txt

[Folders/Files]
PATH_stats_file_folder = file_block_stats
PATH_bookkeeping_proj_folder = bookkeeping_projs
PATH_tokens_file_folder = blocks_tokens
PATH_logs = logs

[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java

[Config]
init_file_id = 3000000
init_proj_id = 1
proj_id_flag = 1
```

Create the project list and run the tokenizer:

```bash
echo "project1/sample-input.zip" > project-list.txt

# Note the different command - use 'zipblocks' not 'zip'
python tokenizer.py zipblocks
```

For folder-based projects:

```bash
python tokenizer.py folderblocks
```

Output: extracts individual functions from each file. For example:
- 2 Java files → 308 function blocks extracted
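As a rough sanity check on the extraction count, you can approximate the number of methods in a Java file with a regex (this is only an approximation — the tokenizer itself uses `javalang` for proper parsing, and the pattern below misses constructors and unusual modifier layouts):

```python
import re

# Matches "modifiers return-type name(params) {" style declarations.
METHOD_RE = re.compile(
    r'(?:public|protected|private|static|final|synchronized|abstract|\s)+'
    r'[\w<>\[\]]+\s+(\w+)\s*\([^)]*\)\s*(?:throws [\w.,\s]+)?\s*\{'
)

def count_methods(java_source: str) -> int:
    """Rough method count; compare against wc -l on blocks_tokens/*."""
    return len(METHOD_RE.findall(java_source))

sample = """
public class Foo {
    public int add(int a, int b) { return a + b; }
    private void log(String msg) { System.out.println(msg); }
}
"""
print(count_methods(sample))  # 2
```

If this rough count is wildly different from the number of extracted blocks, the tokenizer's language settings are probably wrong.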
```bash
# Count extracted blocks
wc -l blocks_tokens/*

# Preview tokens
head -5 blocks_tokens/*
```

```bash
cd ../../clone-detector

# Clean previous results
bash cleanup.sh

# Prepare block-level input
cat ../tokenizers/block-level/blocks_tokens/* > input/dataset/blocks.file

# Run detection
python controller.py

# View results
cat NODE_*/output8.0/query_*
```

Sample output:
```
11,100403000001,11,100273000000
11,100433000001,11,100113000000
11,100573000001,11,100293000000
```
Each line shows: `proj_id,block_id_1,proj_id,block_id_2`
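For downstream analysis, these pairs are trivial to load in Python. A small helper sketch (hypothetical, not part of SourcererCC):

```python
import csv
from io import StringIO

def parse_pairs(text):
    """Parse clone-pair lines: proj_id,block_id_1,proj_id,block_id_2."""
    pairs = []
    for row in csv.reader(StringIO(text)):
        if len(row) == 4:
            pairs.append(((row[0], row[1]), (row[2], row[3])))
    return pairs

sample = """11,100403000001,11,100273000000
11,100433000001,11,100113000000
"""
pairs = parse_pairs(sample)
print(len(pairs))  # 2
print(pairs[0])    # (('11', '100403000001'), ('11', '100273000000'))
```

In a real run you would read the concatenated `results.pairs` file instead of the inline sample.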
```bash
cd ../tokenizers/block-level

# Find details about a specific block
cat file_block_stats/* | grep 100403000001
```

Output format:

```
b11,100403000001,"304d6caf12650df646b19e7ea057c184",52,43,35,841,892
```
Fields explained:

- `b` = block (vs `f` for file)
- `11` = project ID
- `100403000001` = block ID
- `"304d6c..."` = block hash
- `52` = total lines
- `43` = lines of code (LOC)
- `35` = source lines of code (SLOC)
- `841,892` = start and end line numbers
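Because the hash field is quoted, Python's `csv` module parses these lines cleanly. A sketch (the field names are my own labels for the positions listed above; your tokenizer version may emit additional fields):

```python
import csv
from io import StringIO

FIELDS = ["type_and_proj", "block_id", "hash", "total_lines",
          "loc", "sloc", "start_line", "end_line"]

def parse_stats_line(line):
    """Split one stats line into named fields."""
    row = next(csv.reader(StringIO(line)))
    rec = dict(zip(FIELDS, row))
    rec["type"] = rec["type_and_proj"][0]     # 'b' for block, 'f' for file
    rec["proj_id"] = rec["type_and_proj"][1:]
    return rec

rec = parse_stats_line(
    'b11,100403000001,"304d6caf12650df646b19e7ea057c184",52,43,35,841,892'
)
print(rec["type"], rec["proj_id"], rec["start_line"], rec["end_line"])
# b 11 841 892
```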
```bash
# Find file information
cat file_block_stats/* | grep "^f"
```

This shows which file contains the blocks.
Example: clone pair `11,100403000001,11,100273000000`

```bash
cat file_block_stats/* | grep 100403000001
# Output: b11,100403000001,"304d6c...",52,43,35,841,892

cat file_block_stats/* | grep 100273000000
# Output: b11,100273000000,"304d6c...",52,43,35,896,947
```

Analysis:
- Both have identical hash = exact duplicates (Type-1 clones)
- Same project ID (11) = same file
- Lines 841-892 and 896-947 = consecutive functions
- Conclusion: Copy-pasted function in the same file
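This manual cross-referencing can be scripted. A sketch (a hypothetical helper based only on the field positions shown above) that turns clone pairs plus block stats into readable locations:

```python
import csv
from io import StringIO

def clone_report(pairs_text, stats_text):
    """Map each clone pair to line ranges and flag exact (same-hash) clones."""
    stats = {}
    for row in csv.reader(StringIO(stats_text)):
        if row and row[0].startswith("b"):
            # block_id -> (hash, start_line, end_line)
            stats[row[1]] = (row[2], row[6], row[7])
    report = []
    for row in csv.reader(StringIO(pairs_text)):
        h1, s1, e1 = stats[row[1]]
        h2, s2, e2 = stats[row[3]]
        kind = "exact (same hash)" if h1 == h2 else "near-miss"
        report.append(f"lines {s1}-{e1} ~ lines {s2}-{e2}: {kind}")
    return report

pairs = "11,100403000001,11,100273000000\n"
stats = (
    'b11,100403000001,"304d6c...",52,43,35,841,892\n'
    'b11,100273000000,"304d6c...",52,43,35,896,947\n'
)
for line in clone_report(pairs, stats):
    print(line)
# lines 841-892 ~ lines 896-947: exact (same hash)
```

For a real run, feed it the contents of `results.pairs` and the concatenated `file_block_stats/*` files.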
Problem: `ant` command not found.

Solution:

```bash
sudo apt install ant
```

Problem: `javalang` module missing.

Solution:

```bash
pip install javalang
```

Problem: `dist/indexbased.SearchManager.jar` not found.

Solution:

```bash
cd clone-detector
ant cdi  # Build the JAR file
```

Problem: block tokenizer extracts no functions.

Solution: use the correct command:

```bash
# For zip files
python tokenizer.py zipblocks

# For folders
python tokenizer.py folderblocks
```

Problem: file-level detection finds no clones.

Solution: this is normal! Files as a whole may be different. Switch to block-level detection to find function clones.

Problem: multiprocessing errors on macOS.

Solution: Mac users must use Python 3.6:
```bash
conda create -n sourcerercc python=3.6
```

File-level workflow:

```bash
cd tokenizers/file-level
python tokenizer.py zip
cd ../../clone-detector
cat ../tokenizers/file-level/files_tokens/* > input/dataset/blocks.file
python controller.py
cat NODE_*/output8.0/query_*
```

Block-level workflow:

```bash
cd tokenizers/block-level
python tokenizer.py zipblocks
cd ../../clone-detector
bash cleanup.sh
cat ../tokenizers/block-level/blocks_tokens/* > input/dataset/blocks.file
python controller.py
cat NODE_*/output8.0/query_*
```

Inspecting blocks:

```bash
cd ../tokenizers/block-level
cat file_block_stats/* | grep <block_id>
cat file_block_stats/* | grep "^f"
```

Tokenizer (`config.ini`):
```ini
N_PROCESSES = 8      # Match CPU cores
PROJECTS_BATCH = 100 # Batch size
```

Clone Detector (JVM settings in `runnodes.sh`):

```bash
# For 257GB RAM system
-Xms128g -Xmx128g

# For 64GB RAM system
-Xms32g -Xmx32g

# For 16GB RAM system
-Xms6g -Xmx6g
```

Rule of thumb: use 70-80% of available RAM for large datasets.
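On Linux you can compute a starting heap size from physical RAM with a short sketch (the 75% default reflects the rule of thumb above, not a SourcererCC requirement; `os.sysconf` makes this Linux/Unix-only):

```python
import os

def suggested_heap_gb(fraction=0.75):
    """Suggest a JVM heap size in GB as a fraction of physical RAM."""
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return max(1, int(total_bytes * fraction / 2**30))

gb = suggested_heap_gb()
print(f"-Xms{gb}g -Xmx{gb}g")  # paste into runnodes.sh
```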
- Granularity matters: File-level detection finds duplicate files; block-level finds duplicate functions
- Start with block-level for most practical use cases
- Same hash = exact clone: Blocks with identical hashes are Type-1 (exact) clones
- Line numbers are gold: They tell you exactly where to find the duplicate code
- Build before run: always run `ant cdi` before the first clone detection
- Clean between runs: use `bash cleanup.sh` when switching input datasets
Test case: 2 Java files with similar functionality
File-level results:
- 0 clones found (files too different as a whole)
Function-level results:
- 308 functions extracted
- 7 clone pairs found at 80% similarity
- Included exact duplicates (same hash) in the same file
Conclusion: Block-level detection is essential for finding function-level code duplication, which is the most common type of cloning in real projects.