Documentation is updated loosely and may lag behind the code.
```
$ git clone https://github.com/BumpiestDig10/deepfake-detector.git
$ python -m venv venv
$ source venv/bin/activate   # Linux
$ venv/Scripts/activate      # Windows
$ pip install -r allRequirements.txt
```

### To run the Deepfake Training Orchestrator (UI Dashboard with all the tools)
- On the dashboard, parameters for all tools have "Browse Files" and "Browse Folders" buttons; pick the one that matches the input each tool expects (file vs. folder).
```
python -m ui.dashboard
```

### To use the detector
- Input may be a single image, multiple images, a folder containing image(s), or a CSV file containing features (headers must range from "feature_0" to "feature_2047").
- Use the --showOutput flag with caution. It is not very refined and consumes a lot of RAM (scaling with the number of images to display).
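For the CSV input described above, the header row can be sanity-checked before invoking the detector. A minimal sketch using only the standard library (the helper name is hypothetical, not part of the project):

```python
import csv

def check_feature_header(csv_path: str, n_features: int = 2048) -> bool:
    """Return True if the file's header contains feature_0 .. feature_{n-1}."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    expected = {f"feature_{i}" for i in range(n_features)}
    return expected.issubset(header)
```

Extra columns (e.g. a label or file-name column) pass this check; whether the detector itself tolerates them is not specified here.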
```
python -m detectors.2048FeatureDetector --input "path/to/input" --modelPath "path/to/model.joblib" --featureExtractor "(OPTIONAL) ResNet50 OR InceptionV3" --weights "(OPTIONAL) imagenet" --output "(OPTIONAL) path/to/outputDirectory" [--showOutput]
# --modelPath is optional if "results/imageModels/ResNet50_imagenet/32kModel/randomForest/best_random_forest_model.joblib" exists.
```

### To calculate SHA256 hash of a file
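For reference, the digest this utility produces can be cross-checked against Python's standard library. A minimal sketch (not the utility's actual source):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large files never need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```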
```
python -m utils.filehash --input "path/to/inputFile"
```

### To run the InceptionV3 Feature Extractor
```
python -m utils.featureExtractor.InceptionV3_image_feature_extractor --input "relativePath/to/input_directory" --output "(OPTIONAL) relativePath/to/output_file" --weights "(OPTIONAL) imagenet"
```

### To run the ResNet50 Feature Extractor
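Both extractors produce 2048-dimensional pooled feature vectors, matching the `feature_0` … `feature_2047` columns the detector consumes. A sketch of writing rows under that header (the helper is hypothetical; the actual extractors' output format may include extra columns):

```python
import csv

def write_features_csv(path: str, rows: list, n_features: int = 2048) -> None:
    """Write feature rows under the feature_0..feature_{n-1} header the detector expects."""
    header = [f"feature_{i}" for i in range(n_features)]
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
```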
```
python -m utils.featureExtractor.ResNet50_image_feature_extractor --input "relativePath/to/input_directory" --output "(OPTIONAL) relativePath/to/output_file" --weights "(OPTIONAL) imagenet"
```

### To run the Metadata Parser
```
python -m utils.metadata.metadata_parser --input "relativePath/to/input_directory" --output "(OPTIONAL) relativePath/to/output_file"
```

### To download image datasets from Hugging Face
```
python -m utils.preprocessor.hf_to_image --dataset "huggingFace/Dataset" --split "(OPTIONAL) train" --output "(OPTIONAL) relativePath/to/output_directory" --token "(OPTIONAL) huggingFaceAccessToken"
```

### To run the Instagram Profile Downloader (no private accounts)
- Change the username and password in the script if needed. The credentials included belong to a burner account and may or may not work for you.
- Create `instaProfile/usernames.txt` to download sequentially for each username listed. Not required for a single profile.
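One way such a file might be consumed is one username per line, with blanks and `#` comments skipped. A sketch (the helper is hypothetical; the script's actual parsing may differ):

```python
def read_usernames(path: str) -> list:
    """Read one Instagram username per line, ignoring blank lines and # comments."""
    usernames = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            name = line.strip()
            if name and not name.startswith("#"):
                usernames.append(name.lstrip("@"))  # tolerate a leading @
    return usernames
```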
```
python -m utils.preprocessor.insta_profile_download
```

### To run the Reddit Downloader
- This tool works as a Chrome extension to bypass Reddit login issues.
- Downloads everything to the `Downloads/` folder.
- Logs can be checked using Chrome's DevTools console.
- Navigate to chrome://extensions/ using Google Chrome.
- Enable Developer Mode.
- Select "Load Unpacked".
- Select the folder with all files of the extension. Default should be `utils/preprocessor/reddit_downloader`.

### To find and delete duplicate files in a particular directory (Windows-only)
- Update the target folder in `duplicate_finder.ps1`.
- Execute the script in PowerShell.
```
$ ./utils/preprocessor/duplicate_finder.ps1   # from the project root directory
$ ./duplicate_finder.ps1                      # if CWD = utils/preprocessor/
```
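On non-Windows systems, the same idea (group files by content hash, then review each group) can be sketched in Python; this is not a port of the script above, just the concept:

```python
import hashlib
from pathlib import Path

def find_duplicates(folder: str) -> list:
    """Return groups of 2+ files under `folder` with identical SHA-256 content."""
    by_hash = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Deleting everything after the first entry of each group mirrors the intent of the PowerShell script, but inspect the groups before deleting anything.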
### To Train a Random Forest Model
- Make sure you check the parameter grid before running.
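`--test_size` and `--random_state` behave like a conventional shuffled train/test split. Conceptually (a pure-Python sketch, not the trainer's actual code, which may also stratify):

```python
import math
import random

def train_test_split_indices(n: int, test_size: float = 0.2, random_state: int = 420):
    """Deterministically shuffle indices 0..n-1, then split off ceil(n * test_size) for test."""
    indices = list(range(n))
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(n * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)
```

With n = 31,762 and test_size = 0.2 this yields 25,409 train and 6,353 test samples, consistent with the split sizes reported in the results table.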
```
python -m trainers.RandomForestTrainer --input "path/to/features.csv" --output "(OPTIONAL) path/to/outputDirectory" --test_size (OPTIONAL) 0.2 --random_state (OPTIONAL) 420
```

### To Train an XGBoost Model
- Make sure you check the parameter grid before running.
```
python -m trainers.XGBoostTrainer --input "path/to/features.csv" --output "(OPTIONAL) path/to/outputDirectory" --test_size (OPTIONAL) 0.2 --random_state (OPTIONAL) 420
```

### To Train a CatBoost Model
- Make sure you check the parameter grid before running.
- There are some known issues with handling KeyboardInterrupt.
```
python -m trainers.CatBoostTrainer --input "path/to/features.csv" --output "(OPTIONAL) path/to/outputDirectory" --test_size (OPTIONAL) 0.2 --random_state (OPTIONAL) 420
```

### TODO

- Fix batch processing for 2048FeaturesDetector.py
- Add resource consumption checks for different stages of 2048FeaturesDetector.py
- Convert hard-coded or input-based parameters to command-line args (for the UI)
- insta_profile_download
- Target username (single and file)
- Output directory
- Create a pipeline to generate deepfakes.
- Start working on the Detector
- Feature Extractors
- ResNet50
- imagenet
- Open Images Dataset (by Google)
- COCO (Common Objects in Context)
- InceptionV3 (different layers)
- imagenet
- Open Images Dataset (by Google)
- COCO (Common Objects in Context)
- VGG16
- imagenet
- Open Images Dataset (by Google)
- COCO (Common Objects in Context)
- Qwen3-VL (To extract text description of images)
- Detailed image description derived from image
- Analysis of the detailed image description
- ResNet50
- Feature Classifiers (Micro Learning)
- Random Forest
- XGBoost
- CatBoost
- KeyboardInterrupt errors
- Need to check for memory constraints
- LightGBM
- Linear SVM
- Regularized Logistic Regression
- Custom Neural Network
- Feature Clustering (with Principal Component Analysis): doesn't seem very useful, as the curves will be very similar for highly realistic deepfakes
- K Means Clustering
- Gaussian Mixture Models
- Hierarchical Clustering
- DBSCAN
- Image Classifiers
- Feature Extractors
- Add support for scalpel. If embedded files (steganography) are found, it will be used to extract them all.
- Fix all READMEs.
### Note
Args:
- filehash.py: input
- 2048FeaturesDetector.py: input, modelPath, featureExtractor (optional), weights (optional), output (optional), showOutput (a toggle switch; omit this flag if you don't want to view all photos with predictions)
- /utils/featureExtractor/
- InceptionV3_image_feature_extractor.py: input, output (optional), weights (optional)
- ResNet50_image_feature_extractor.py: input, output (optional), weights (optional)
- /utils/metadata/
- metadata_parser.py: input, output (optional)
- /utils/preprocessor/
- csv_mapNmerge.py: base, label
- hf_to_image.py: dataset, split (optional), output (optional), token (optional)
- real_fake_csv_merger.py: real, fake, output (optional)
- /trainers/
- RandomForestTrainer.py: input, output (optional), test_size (optional), random_state (optional)
- XGBoostTrainer.py: input, output (optional), test_size (optional), random_state (optional)
- CatBoostTrainer.py: input, output (optional), test_size (optional), random_state (optional)
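All of the CLIs above follow the same pattern: a required `--input` plus optional flags with defaults. A hypothetical argparse sketch of the trainer interface (not the actual source; defaults beyond test_size and random_state are assumptions):

```python
import argparse

def build_trainer_parser() -> argparse.ArgumentParser:
    """Sketch of the shared trainer CLI: required input, optional output/split flags."""
    parser = argparse.ArgumentParser(description="Train a model on extracted features.")
    parser.add_argument("--input", required=True, help="path/to/features.csv")
    parser.add_argument("--output", default=None, help="output directory (optional)")
    parser.add_argument("--test_size", type=float, default=0.2)
    parser.add_argument("--random_state", type=int, default=420)
    return parser
```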
Labels:
- Real = 1
- Fake = 0
| Model | Random Forest | XGBoost |
|---|---|---|
| Dataset Type | Images | Images |
| Dataset Size | Total: 31,762<br>Real: 15,364<br>Fake: 16,398 | Total: 31,762<br>Real: 15,364<br>Fake: 16,398 |
| Feature Extractor | ResNet50 (imagenet) | ResNet50 (imagenet) |
| Data Split | Train: 80% (25,409 images)<br>Test: 20% (6,353 images) | Train: 80% (25,409 images)<br>Test: 20% (6,353 images) |
| Params | n_estimators: 150<br>max_depth: null<br>min_samples_split: 5<br>min_samples_leaf: 1<br>max_features: 0.2<br>bootstrap: false | n_estimators: 100<br>max_depth: 3<br>learning_rate: 0.01<br>min_child_weight: 1<br>gamma: 0<br>reg_alpha: 0<br>reg_lambda: 1<br>subsample: 0.8<br>colsample_bytree: 0.8<br>scale_pos_weight: 1 |
| Report Path | Classification Report \| Full Results | Classification Report \| Full Results |
| Accuracy | 0.854 | 0.869 |
| Precision | 0.854 | 0.869 |
| F1 Score | 0.854 | 0.869 |
| Matthews Correlation Coefficient | 0.7071 | 0.7390 |
| Cohen's Kappa | 0.7070 | 0.7389 |
| Balanced Accuracy | 0.8536 | 0.8696 |
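The agreement metrics in the table can be recomputed from a 2x2 confusion matrix. A pure-Python sketch (the counts in the test are illustrative, not the models' actual confusion matrices):

```python
import math

def mcc_and_kappa(tp: int, fp: int, fn: int, tn: int):
    """Matthews correlation coefficient and Cohen's kappa from confusion counts."""
    n = tp + fp + fn + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    p_observed = (tp + tn) / n
    # Chance agreement: predicted-positive * actual-positive plus the negative analogue.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return mcc, kappa
```

For a roughly balanced binary problem like this one, MCC and kappa land very close together, which matches the near-identical values reported above.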
Datasets Used:
- JamieWithofs/Deepfake-and-real-images-4
- StyleGan-StyleGan2 Deepfake Face Images
- Fake-Vs-Real-Faces (Hard)
- Images scraped from Instagram and Reddit
### Important

This project is under the MIT license, but the datasets used may be under different licenses. You must comply with all of them when using any of the trained models from this repository.