Complete SourcererCC Tutorial

From Installation to Function-Level Clone Detection

A practical guide based on real implementation on Ubuntu with Python 3.8


Table of Contents

  1. Environment Setup
  2. Understanding Granularity
  3. File-Level Clone Detection
  4. Function-Level Clone Detection
  5. Interpreting Results
  6. Troubleshooting
  7. Quick Reference Commands
  8. Performance Tuning
  9. Key Takeaways

Environment Setup

Prerequisites

  • Ubuntu (or other Linux distribution)
  • Git
  • Java 17+ installed
  • Conda or Python 3.8+

Step 1: Create Python Environment

# Create conda environment with Python 3.8
conda create -n sourcerercc python=3.8

# Activate environment
conda activate sourcerercc

Note: Mac users need Python 3.6 due to multiprocessing issues, but Ubuntu/Linux users can use Python 3.8 or 3.9.

Step 2: Clone SourcererCC Repository

git clone https://github.com/Mondego/SourcererCC.git
cd SourcererCC

Step 3: Install Required Tools

# Install Apache Ant (for building Java components)
sudo apt install ant

# Install Java parsing library (for function extraction)
pip install javalang

Understanding Granularity

SourcererCC works at multiple levels:

| Level | Detects Clones Between | Use Case |
| --- | --- | --- |
| File-level | Entire files | Finding duplicate files or very similar code files |
| Block-level | Individual functions/methods | Finding duplicate functions within or across files |
| Statement-level | Code blocks/statements | Fine-grained clone detection |

Key Insight: In our testing, two Java files with similar functionality produced NO clones at file-level (the files as a whole were <80% similar) but 7 clone pairs at function-level (individual functions were clones).


File-Level Clone Detection

Step 1: Prepare Your Projects

cd tokenizers/file-level

# Create project list file
echo "project1/sample-input.zip" > project-list.txt

Step 2: Configure config.ini

[Main]
N_PROCESSES = 1
PROJECTS_BATCH = 1
FILE_projects_list = project-list.txt

[Folders/Files]
PATH_stats_file_folder = files_stats
PATH_bookkeeping_proj_folder = bookkeeping_projs
PATH_tokens_file_folder = files_tokens
PATH_logs = logs

[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java

[Config]
init_file_id = 1
init_proj_id = 1

Important Language Settings:

  • For Java: comment_inline = //, tags: /* */
  • For Python: comment_inline = #, tags: ''' '''
  • For C/C++: Same as Java
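
Pulling those settings together, a hedged example [Language] section for tokenizing a Python project (adapted from the Java config above; whether `#` should stay in the separators list once it also marks inline comments is worth verifying against the tokenizer's behavior):

```ini
[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = #
comment_open_tag = '''
comment_close_tag = '''
File_extensions = .py
```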

Step 3: Run File-Level Tokenizer

python tokenizer.py zip

Output folders created:

  • bookkeeping_projs/ - Project index
  • files_stats/ - File statistics (lines, LOC, SLOC)
  • files_tokens/ - Tokenized files

Step 4: Build Clone Detector

cd ../../clone-detector

# Build the Java components
ant cdi

This creates dist/indexbased.SearchManager.jar

Step 5: Prepare Input for Clone Detection

# Combine all token files
cat ../tokenizers/file-level/files_tokens/* > input/dataset/blocks.file

Step 6: Configure Detection Parameters

Edit sourcerer-cc.properties:

MIN_TOKENS=65
MAX_TOKENS=500000

Edit runnodes.sh (line 9) for similarity threshold:

threshold="${3:-8}"  # 8 = 80%, 7 = 70%, etc.

Edit runnodes.sh (line 20) for JVM memory:

# For systems with 257GB RAM, use:
-Xms32g -Xmx32g

# For smaller systems:
-Xms6g -Xmx6g

Step 7: Run Clone Detection

python controller.py

Step 8: View Results

cat NODE_*/output8.0/query_* > results.pairs
cat results.pairs
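
For a quick sanity check of how many distinct pairs were reported, a minimal Python sketch (assumes each line has the `proj_id,id_1,proj_id,id_2` shape shown later in this tutorial; the sample filename is just for the demo):

```python
# Summarize a SourcererCC results file.
# Each line has the format: proj_id,id_1,proj_id,id_2

def summarize(path):
    pairs = set()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            p1, b1, p2, b2 = line.split(",")
            # Normalize order so (a, b) and (b, a) count once
            pairs.add(tuple(sorted([(p1, b1), (p2, b2)])))
    return len(pairs)

if __name__ == "__main__":
    # Demo with sample data; point summarize() at your results.pairs instead
    with open("results.pairs.sample", "w") as fh:
        fh.write("11,100403000001,11,100273000000\n")
        fh.write("11,100273000000,11,100403000001\n")  # reversed duplicate
    print(summarize("results.pairs.sample"))  # -> 1
```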

Expected for file-level: Often empty if files are structurally different, even if they contain similar functions.


Function-Level Clone Detection

This is where SourcererCC shines for finding duplicate functions.

Step 1: Switch to Block-Level Tokenizer

cd ../tokenizers/block-level

Step 2: Configure for Functions

Edit config.ini:

[Main]
N_PROCESSES = 1
PROJECTS_BATCH = 1
FILE_projects_list = project-list.txt

[Folders/Files]
PATH_stats_file_folder = file_block_stats
PATH_bookkeeping_proj_folder = bookkeeping_projs
PATH_tokens_file_folder = blocks_tokens
PATH_logs = logs

[Language]
separators = ; . [ ] ( ) ~ ! - + & * / % < > ^ | ? { } = # , " \ : $ ' ` @
comment_inline = //
comment_open_tag = /*
comment_close_tag = */
File_extensions = .java

[Config]
init_file_id = 3000000
init_proj_id = 1
proj_id_flag = 1

Step 3: Create Project List

echo "project1/sample-input.zip" > project-list.txt

Step 4: Run Block-Level Tokenizer

# Note the different command - use 'zipblocks' not 'zip'
python tokenizer.py zipblocks

For folder-based projects:

python tokenizer.py folderblocks

Output: Extracts individual functions from each file. For example:

  • 2 Java files → 308 function blocks extracted

Step 5: Verify Block Extraction

# Count extracted blocks
wc -l blocks_tokens/*

# Preview tokens
head -5 blocks_tokens/*

Step 6: Run Clone Detection on Blocks

cd ../../clone-detector

# Clean previous results
bash cleanup.sh

# Prepare block-level input
cat ../tokenizers/block-level/blocks_tokens/* > input/dataset/blocks.file

# Run detection
python controller.py

Step 7: View Clone Results

cat NODE_*/output8.0/query_*

Sample output:

11,100403000001,11,100273000000
11,100433000001,11,100113000000
11,100573000001,11,100293000000

Each line shows: proj_id,block_id_1,proj_id,block_id_2


Interpreting Results

Understanding Block IDs

cd ../tokenizers/block-level

# Find details about a specific block
cat file_block_stats/* | grep 100403000001

Output format:

b11,100403000001,"304d6caf12650df646b19e7ea057c184",52,43,35,841,892

Fields explained:

  • b = block (vs f for file)
  • 11 = project ID
  • 100403000001 = block ID
  • "304d6c..." = block hash
  • 52 = total lines
  • 43 = lines of code (LOC)
  • 35 = source lines of code (SLOC)
  • 841,892 = start and end line numbers
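
The stats line above can also be unpacked programmatically. A minimal sketch (field order and names taken from the list above; Python's csv module handles the quoted hash field):

```python
import csv
import io

# Field order per the block-stats format described above
FIELDS = ["entry_id", "block_id", "hash", "total_lines",
          "loc", "sloc", "start_line", "end_line"]

def parse_stats_line(line):
    # csv.reader strips the quotes around the hash field
    row = next(csv.reader(io.StringIO(line)))
    rec = dict(zip(FIELDS, row))
    # entry_id fuses the b/f marker with the project ID, e.g. "b11"
    rec["kind"] = "block" if rec["entry_id"].startswith("b") else "file"
    rec["project_id"] = rec["entry_id"].lstrip("bf")
    for key in ("total_lines", "loc", "sloc", "start_line", "end_line"):
        rec[key] = int(rec[key])
    return rec

rec = parse_stats_line(
    'b11,100403000001,"304d6caf12650df646b19e7ea057c184",52,43,35,841,892'
)
print(rec["kind"], rec["project_id"], rec["start_line"], rec["end_line"])
# -> block 11 841 892
```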

Finding the Source File

# Find file information
cat file_block_stats/* | grep "^f"

This shows which file contains the blocks.

Analyzing a Clone Pair

Example: Clone pair 11,100403000001,11,100273000000

cat file_block_stats/* | grep 100403000001
# Output: b11,100403000001,"304d6c...",52,43,35,841,892

cat file_block_stats/* | grep 100273000000
# Output: b11,100273000000,"304d6c...",52,43,35,896,947

Analysis:

  • Both have identical hash = exact duplicates (Type-1 clones)
  • Same project ID (11) = same file
  • Lines 841-892 and 896-947 = consecutive functions
  • Conclusion: Copy-pasted function in the same file
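
The same reasoning can be expressed as a small script. This sketch hardcodes the two stats records from the example above; an identical hash means identical token content, i.e. a Type-1 clone, while differing hashes mean the pair merely passed the similarity threshold:

```python
# Classify a clone pair from its stats records (values from the example above)
left = {"block_id": "100403000001",
        "hash": "304d6caf12650df646b19e7ea057c184",
        "start_line": 841, "end_line": 892}
right = {"block_id": "100273000000",
         "hash": "304d6caf12650df646b19e7ea057c184",
         "start_line": 896, "end_line": 947}

def describe(a, b):
    if a["hash"] == b["hash"]:
        kind = "Type-1 (exact) clone"  # identical token hash
    else:
        kind = "near-miss clone (passed the similarity threshold)"
    return (f"{kind}: lines {a['start_line']}-{a['end_line']} "
            f"vs lines {b['start_line']}-{b['end_line']}")

print(describe(left, right))
# -> Type-1 (exact) clone: lines 841-892 vs lines 896-947
```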

Troubleshooting

Issue: "ant: command not found"

Solution:

sudo apt install ant

Issue: "No module named 'javalang'"

Solution:

pip install javalang

Issue: "Unable to access jarfile"

Solution:

cd clone-detector
ant cdi  # Build the JAR file

Issue: "Please insert archive format 'zipblocks' or 'folderblocks'"

Solution: Use correct command:

# For zip files
python tokenizer.py zipblocks

# For folders
python tokenizer.py folderblocks

Issue: Empty results at file-level

Solution: This is normal! Files as a whole may be different. Switch to block-level detection to find function clones.

Issue: Python version on Mac

Solution: Mac users must use Python 3.6:

conda create -n sourcerercc python=3.6

Quick Reference Commands

File-Level Detection

cd tokenizers/file-level
python tokenizer.py zip
cd ../../clone-detector
cat ../tokenizers/file-level/files_tokens/* > input/dataset/blocks.file
python controller.py
cat NODE_*/output8.0/query_*

Function-Level Detection

cd tokenizers/block-level
python tokenizer.py zipblocks
cd ../../clone-detector
bash cleanup.sh
cat ../tokenizers/block-level/blocks_tokens/* > input/dataset/blocks.file
python controller.py
cat NODE_*/output8.0/query_*

Analyzing Results

cd tokenizers/block-level
cat file_block_stats/* | grep <block_id>
cat file_block_stats/* | grep "^f"

Performance Tuning

For Large Datasets

Tokenizer (config.ini):

N_PROCESSES = 8        # Match CPU cores
PROJECTS_BATCH = 100   # Batch size

Clone Detector (JVM settings in runnodes.sh):

# For 257GB RAM system
-Xms128g -Xmx128g

# For 64GB RAM system
-Xms32g -Xmx32g

# For 16GB RAM system
-Xms6g -Xmx6g

Rule of thumb: Use 70-80% of available RAM for large datasets.


Key Takeaways

  1. Granularity matters: File-level detection finds duplicate files; block-level finds duplicate functions
  2. Start with block-level for most practical use cases
  3. Same hash = exact clone: Blocks with identical hashes are Type-1 (exact) clones
  4. Line numbers are gold: They tell you exactly where to find the duplicate code
  5. Build before run: Always run ant cdi before first clone detection
  6. Clean between runs: Use bash cleanup.sh when switching input datasets

Real-World Example Summary

Test case: 2 Java files with similar functionality

File-level results:

  • 0 clones found (files too different as a whole)

Function-level results:

  • 308 functions extracted
  • 7 clone pairs found at 80% similarity
  • Included exact duplicates (same hash) in the same file

Conclusion: Block-level detection is essential for finding function-level code duplication, which is the most common type of cloning in real projects.
