A practical guide based on a real implementation on Ubuntu with Python 3.8
- Environment Setup
- Understanding SourcererCC Granularity Levels
- File-Level Clone Detection
- Function-Level Clone Detection
- Interpreting Results
- Troubleshooting
- Ubuntu (or other Linux distribution)
- Git
- Java 17+ installed
- Conda or Python 3.8+
```bash
# Create conda environment with Python 3.8
conda create -n sourcerercc python=3.8

# Activate environment
conda activate sourcerercc
```

Note: Mac users need Python 3.6 due to multiprocessing issues, but Ubuntu/Linux users can use Python 3.8 or 3.9.
```bash
git clone https://github.com/Mondego/SourcererCC.git
cd SourcererCC

# Install Apache Ant (for building Java components)
sudo apt install ant

# Install Java parsing library (for function extraction)
pip install javalang
```

SourcererCC works at multiple levels:
| Level | Detects Clones Between | Use Case |
|---|---|---|
| File-level | Entire files | Finding duplicate files or very similar code files |
| Block-level | Individual functions/methods | Finding duplicate functions within or across files |
| Statement-level | Code blocks/statements | Fine-grained clone detection |
Key Insight: In our testing, two Java files with similar functionality produced NO clones at file-level (the entire files were <80% similar) but 7 clone pairs at function-level (individual functions were clones).
```bash
cd tokenizers/file-level

# Create project list file
echo "project1/sample-input.zip" > project-list.txt
```

Edit `config.ini`:

```ini
[Main]
N_PROCESSES = 1
PROJECTS_BATCH = 1
FILE_projects_list = project-list.txt

[Folders/Files]
PATH_stats_file_folder = files_stats
PATH_bookkeeping_proj_folder = bookkeeping_projs
PATH_tokens_file_folder = files_tokens
PATH_logs = logs

[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java

[Config]
init_file_id = 1
init_proj_id = 1
```

Important Language Settings:

- For Java: `comment_inline = //`, tags: `/* */`
- For Python: `comment_inline = #`, tags: `''' '''`
- For C/C++: same as Java
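The `[Language]` section can also be generated programmatically if you switch target languages often. A sketch using Python's `configparser` to emit a `[Language]` section for Python sources (the output filename is arbitrary; note that `configparser` lowercases key names on write, which is fine as long as the tokenizer reads keys case-insensitively, as Python's own `ConfigParser` does):

```python
import configparser

# '%' appears in the separators string, so disable interpolation.
config = configparser.ConfigParser(interpolation=None)
config["Language"] = {
    "separators": "; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , \" \\ : $ ' ` @",
    "comment_inline": "#",        # Python line comments
    "comment_open_tag": "'''",    # docstring-style block comments
    "comment_close_tag": "'''",
    "File_extensions": ".py",
}

# Write a fragment you can splice into config.ini.
with open("config_language_py.ini", "w") as fh:
    config.write(fh)
```

Reading the fragment back with `configparser` round-trips the values unchanged, so you can verify it before merging it into the tokenizer's `config.ini`.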
```bash
python tokenizer.py zip
```

Output folders created:

- `bookkeeping_projs/` - Project index
- `files_stats/` - File statistics (lines, LOC, SLOC)
- `files_tokens/` - Tokenized files
```bash
cd ../../clone-detector

# Build the Java components
ant cdi
```

This creates `dist/indexbased.SearchManager.jar`.
```bash
# Combine all token files
cat ../tokenizers/file-level/files_tokens/* > input/dataset/blocks.file
```

Edit `sourcerer-cc.properties`:

```
MIN_TOKENS=65
MAX_TOKENS=500000
```

Edit `runnodes.sh` (line 9) for similarity threshold:

```bash
threshold="${3:-8}"  # 8 = 80%, 7 = 70%, etc.
```

Edit `runnodes.sh` (line 20) for JVM memory:

```bash
# For systems with 257GB RAM, use:
-Xms32g -Xmx32g

# For smaller systems:
-Xms6g -Xmx6g
```

Run detection and collect results:

```bash
python controller.py
cat NODE_*/output8.0/query_* > results.pairs
cat results.pairs
```

Expected for file-level: often empty if files are structurally different, even if they contain similar functions.
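Why can file-level detection come up empty while function-level finds clones? SourcererCC's similarity model (as described in its ICSE 2016 paper) treats each file or block as a bag of tokens and reports a clone when the multiset overlap covers the threshold fraction of the larger bag. A minimal Python sketch of that model (the real tool adds a partial token index and filtering optimizations on top):

```python
import math
from collections import Counter

def overlap_similarity(tokens_a, tokens_b):
    """Multiset overlap between two token bags."""
    shared = Counter(tokens_a) & Counter(tokens_b)
    return sum(shared.values())

def is_clone(tokens_a, tokens_b, threshold=0.8):
    """Clone when the overlap covers `threshold` of the larger bag."""
    needed = math.ceil(threshold * max(len(tokens_a), len(tokens_b)))
    return overlap_similarity(tokens_a, tokens_b) >= needed

# Two token bags sharing 4 of 5 tokens -> 80% similar.
a = ["int", "i", "=", "0", ";"]
b = ["int", "j", "=", "0", ";"]
print(is_clone(a, b))        # True at the default 80% threshold
print(is_clone(a, b, 0.9))   # False at 90%
```

A whole file concatenates the tokens of many functions, so a couple of shared functions rarely push the whole-file overlap past 80% — which is exactly the behavior observed above.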
This is where SourcererCC shines for finding duplicate functions.
```bash
cd ../tokenizers/block-level
```

Edit `config.ini`:

```ini
[Main]
N_PROCESSES = 1
PROJECTS_BATCH = 1
FILE_projects_list = project-list.txt

[Folders/Files]
PATH_stats_file_folder = file_block_stats
PATH_bookkeeping_proj_folder = bookkeeping_projs
PATH_tokens_file_folder = blocks_tokens
PATH_logs = logs

[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java

[Config]
init_file_id = 3000000
init_proj_id = 1
proj_id_flag = 1
```

Create the project list and run the tokenizer:

```bash
echo "project1/sample-input.zip" > project-list.txt

# Note the different command - use 'zipblocks' not 'zip'
python tokenizer.py zipblocks
```

For folder-based projects:

```bash
python tokenizer.py folderblocks
```

Output: extracts individual functions from each file. For example:
- 2 Java files → 308 function blocks extracted
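As a rough sanity check on the extraction count, you can approximate the number of methods in a Java file with a regex (this is only an approximation — the tokenizer itself uses `javalang` for proper parsing, and the pattern below misses constructors and unusual modifier layouts):

```python
import re

# Matches "modifiers return-type name(params) {" style declarations.
METHOD_RE = re.compile(
    r'(?:public|protected|private|static|final|synchronized|abstract|\s)+'
    r'[\w<>\[\]]+\s+(\w+)\s*\([^)]*\)\s*(?:throws [\w.,\s]+)?\s*\{'
)

def count_methods(java_source: str) -> int:
    """Rough method count; compare against wc -l on blocks_tokens/*."""
    return len(METHOD_RE.findall(java_source))

sample = """
public class Foo {
    public int add(int a, int b) { return a + b; }
    private void log(String msg) { System.out.println(msg); }
}
"""
print(count_methods(sample))  # 2
```

If this rough count is wildly different from the number of extracted blocks, the tokenizer's language settings are probably wrong.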
```bash
# Count extracted blocks
wc -l blocks_tokens/*

# Preview tokens
head -5 blocks_tokens/*
```

```bash
cd ../../clone-detector

# Clean previous results
bash cleanup.sh

# Prepare block-level input
cat ../tokenizers/block-level/blocks_tokens/* > input/dataset/blocks.file

# Run detection
python controller.py

# View results
cat NODE_*/output8.0/query_*
```

Sample output:
```
11,100403000001,11,100273000000
11,100433000001,11,100113000000
11,100573000001,11,100293000000
```
Each line shows: `proj_id,block_id_1,proj_id,block_id_2`
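For downstream analysis, these pairs are trivial to load in Python. A small helper sketch (hypothetical, not part of SourcererCC):

```python
import csv
from io import StringIO

def parse_pairs(text):
    """Parse clone-pair lines: proj_id,block_id_1,proj_id,block_id_2."""
    pairs = []
    for row in csv.reader(StringIO(text)):
        if len(row) == 4:
            pairs.append(((row[0], row[1]), (row[2], row[3])))
    return pairs

sample = """11,100403000001,11,100273000000
11,100433000001,11,100113000000
"""
pairs = parse_pairs(sample)
print(len(pairs))  # 2
print(pairs[0])    # (('11', '100403000001'), ('11', '100273000000'))
```

In a real run you would read the concatenated `results.pairs` file instead of the inline sample.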
```bash
cd ../tokenizers/block-level

# Find details about a specific block
cat file_block_stats/* | grep 100403000001
```

Output format:

```
b11,100403000001,"304d6caf12650df646b19e7ea057c184",52,43,35,841,892
```
Fields explained:

- `b` = block (vs `f` for file)
- `11` = project ID
- `100403000001` = block ID
- `"304d6c..."` = block hash
- `52` = total lines
- `43` = lines of code (LOC)
- `35` = source lines of code (SLOC)
- `841,892` = start and end line numbers
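Because the hash field is quoted, Python's `csv` module parses these lines cleanly. A sketch (the field names are my own labels for the positions listed above; your tokenizer version may emit additional fields):

```python
import csv
from io import StringIO

FIELDS = ["type_and_proj", "block_id", "hash", "total_lines",
          "loc", "sloc", "start_line", "end_line"]

def parse_stats_line(line):
    """Split one stats line into named fields."""
    row = next(csv.reader(StringIO(line)))
    rec = dict(zip(FIELDS, row))
    rec["type"] = rec["type_and_proj"][0]     # 'b' for block, 'f' for file
    rec["proj_id"] = rec["type_and_proj"][1:]
    return rec

rec = parse_stats_line(
    'b11,100403000001,"304d6caf12650df646b19e7ea057c184",52,43,35,841,892'
)
print(rec["type"], rec["proj_id"], rec["start_line"], rec["end_line"])
# b 11 841 892
```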
```bash
# Find file information
cat file_block_stats/* | grep "^f"
```

This shows which file contains the blocks.
Example: clone pair `11,100403000001,11,100273000000`

```bash
cat file_block_stats/* | grep 100403000001
# Output: b11,100403000001,"304d6c...",52,43,35,841,892

cat file_block_stats/* | grep 100273000000
# Output: b11,100273000000,"304d6c...",52,43,35,896,947
```

Analysis:
- Both have identical hash = exact duplicates (Type-1 clones)
- Same project ID (11) = same file
- Lines 841-892 and 896-947 = consecutive functions
- Conclusion: Copy-pasted function in the same file
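This manual cross-referencing can be scripted. A sketch (a hypothetical helper based only on the field positions shown above) that turns clone pairs plus block stats into readable locations:

```python
import csv
from io import StringIO

def clone_report(pairs_text, stats_text):
    """Map each clone pair to line ranges and flag exact (same-hash) clones."""
    stats = {}
    for row in csv.reader(StringIO(stats_text)):
        if row and row[0].startswith("b"):
            # block_id -> (hash, start_line, end_line)
            stats[row[1]] = (row[2], row[6], row[7])
    report = []
    for row in csv.reader(StringIO(pairs_text)):
        h1, s1, e1 = stats[row[1]]
        h2, s2, e2 = stats[row[3]]
        kind = "exact (same hash)" if h1 == h2 else "near-miss"
        report.append(f"lines {s1}-{e1} ~ lines {s2}-{e2}: {kind}")
    return report

pairs = "11,100403000001,11,100273000000\n"
stats = (
    'b11,100403000001,"304d6c...",52,43,35,841,892\n'
    'b11,100273000000,"304d6c...",52,43,35,896,947\n'
)
for line in clone_report(pairs, stats):
    print(line)
# lines 841-892 ~ lines 896-947: exact (same hash)
```

For a real run, feed it the contents of `results.pairs` and the concatenated `file_block_stats/*` files.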
Problem: `ant` command not found.

Solution:

```bash
sudo apt install ant
```

Problem: `javalang` module missing.

Solution:

```bash
pip install javalang
```

Problem: `dist/indexbased.SearchManager.jar` not found.

Solution:

```bash
cd clone-detector
ant cdi  # Build the JAR file
```

Problem: block tokenizer extracts no functions.

Solution: use the correct command:

```bash
# For zip files
python tokenizer.py zipblocks

# For folders
python tokenizer.py folderblocks
```

Problem: file-level detection finds no clones.

Solution: this is normal! Files as a whole may be different. Switch to block-level detection to find function clones.

Problem: multiprocessing errors on macOS.

Solution: Mac users must use Python 3.6:
```bash
conda create -n sourcerercc python=3.6
```

File-level workflow:

```bash
cd tokenizers/file-level
python tokenizer.py zip
cd ../../clone-detector
cat ../tokenizers/file-level/files_tokens/* > input/dataset/blocks.file
python controller.py
cat NODE_*/output8.0/query_*
```

Block-level workflow:

```bash
cd tokenizers/block-level
python tokenizer.py zipblocks
cd ../../clone-detector
bash cleanup.sh
cat ../tokenizers/block-level/blocks_tokens/* > input/dataset/blocks.file
python controller.py
cat NODE_*/output8.0/query_*
```

Inspecting blocks:

```bash
cd ../tokenizers/block-level
cat file_block_stats/* | grep <block_id>
cat file_block_stats/* | grep "^f"
```

Tokenizer (`config.ini`):
```ini
N_PROCESSES = 8      # Match CPU cores
PROJECTS_BATCH = 100 # Batch size
```

Clone Detector (JVM settings in `runnodes.sh`):

```bash
# For 257GB RAM system
-Xms128g -Xmx128g

# For 64GB RAM system
-Xms32g -Xmx32g

# For 16GB RAM system
-Xms6g -Xmx6g
```

Rule of thumb: use 70-80% of available RAM for large datasets.
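On Linux you can compute a starting heap size from physical RAM with a short sketch (the 75% default reflects the rule of thumb above, not a SourcererCC requirement; `os.sysconf` makes this Linux/Unix-only):

```python
import os

def suggested_heap_gb(fraction=0.75):
    """Suggest a JVM heap size in GB as a fraction of physical RAM."""
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return max(1, int(total_bytes * fraction / 2**30))

gb = suggested_heap_gb()
print(f"-Xms{gb}g -Xmx{gb}g")  # paste into runnodes.sh
```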
- Granularity matters: File-level detection finds duplicate files; block-level finds duplicate functions
- Start with block-level for most practical use cases
- Same hash = exact clone: Blocks with identical hashes are Type-1 (exact) clones
- Line numbers are gold: They tell you exactly where to find the duplicate code
- Build before run: always run `ant cdi` before the first clone detection
- Clean between runs: use `bash cleanup.sh` when switching input datasets
Test case: 2 Java files with similar functionality
File-level results:
- 0 clones found (files too different as a whole)
Function-level results:
- 308 functions extracted
- 7 clone pairs found at 80% similarity
- Included exact duplicates (same hash) in the same file
Conclusion: Block-level detection is essential for finding function-level code duplication, which is the most common type of cloning in real projects.