
BumpiestDig10/deepfake-detector


DEEPFAKE DETECTION

This documentation is only loosely kept up to date.

$ git clone https://github.com/BumpiestDig10/deepfake-detector.git
$ cd deepfake-detector
$ python -m venv venv
$ source venv/bin/activate  # Linux
$ venv\Scripts\activate     # Windows
$ pip install -r allRequirements.txt

To run the Deepfake Training Orchestrator (UI Dashboard with all the tools)

  • On the dashboard, the parameters for every tool have "Browse Files" and "Browse Folders" buttons; pick whichever matches the input that tool actually expects (a file or a folder).
python -m ui.dashboard

To use the detector

  • Input may be a single image, multiple images, a folder containing image(s), or a CSV file of features (column headers must run from "feature_0" to "feature_2047").
  • Use the --showOutput flag with caution. It is not very refined and consumes a lot of RAM (depending on the number of images to display).
python -m detectors.2048FeatureDetector --input "path/to/input" --modelPath "path/to/model.joblib" --featureExtractor "(OPTIONAL) ResNet50 OR InceptionV3" --weights "(OPTIONAL) imagenet" --output "(OPTIONAL) path/to/outputDirectory" [--showOutput]
# --modelPath is optional if "results/imageModels/ResNet50_imagenet/32kModel/randomForest/best_random_forest_model.joblib" exists.
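
If you plan to pass a CSV of features, it can help to check the header row first. The following is a minimal sketch, not code from this repository: the helper names (expected_headers, missing_headers, check_csv) are hypothetical, and it only reports which of the "feature_0" to "feature_2047" columns are absent. Whether the detector tolerates extra columns is not covered here.

```python
import csv
import io

N_FEATURES = 2048  # the detector expects feature_0 .. feature_2047


def expected_headers(n=N_FEATURES):
    """Return the column names feature_0 .. feature_{n-1}."""
    return [f"feature_{i}" for i in range(n)]


def missing_headers(header_row, n=N_FEATURES):
    """List the expected feature columns that are absent from header_row."""
    present = set(header_row)
    return [name for name in expected_headers(n) if name not in present]


def check_csv(text_stream, n=N_FEATURES):
    """Read only the first row of an open CSV stream and report missing columns."""
    header = next(csv.reader(text_stream), [])
    return missing_headers(header, n)


if __name__ == "__main__":
    # Tiny in-memory example with n=3 so the output stays readable.
    sample = io.StringIO("feature_0,feature_1\n0.1,0.2\n")
    print(check_csv(sample, n=3))
```

For a real file, open it with open(path, newline="") and pass the handle to check_csv; an empty list means all 2048 columns are present.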

To calculate SHA256 hash of a file

python -m utils.filehash --input "path/to/inputFile"
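
For reference, a SHA256 file hash can be computed with the standard library alone. This is a sketch under the assumption that you just want the hex digest of a file; it is not the source of utils.filehash, and the function name sha256_of_file and the chunk size are illustrative choices.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for very large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print(sha256_of_file(sys.argv[1]))
```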

To run the InceptionV3 Feature Extractor

python -m utils.featureExtractor.InceptionV3_image_feature_extractor --input "relativePath/to/input_directory" --output "(OPTIONAL) relativePath/to/output_file" --weights "(OPTIONAL) imagenet"

To run the ResNet50 Feature Extractor

python -m utils.featureExtractor.ResNet50_image_feature_extractor --input "relativePath/to/input_directory" --output "(OPTIONAL) relativePath/to/output_file" --weights "(OPTIONAL) imagenet"

To run the Metadata Parser

python -m utils.metadata.metadata_parser --input "relativePath/to/input_directory" --output "(OPTIONAL) relativePath/to/output_file"

To download image datasets from Hugging Face

python -m utils.preprocessor.hf_to_image --dataset "huggingFace/Dataset" --split "(OPTIONAL) train" --output "(OPTIONAL) relativePath/to/output_directory" --token "(OPTIONAL) huggingFaceAccessToken"

To run the Instagram Profile Downloader (no private accounts)

  • Change the username and password in the script if needed. The account included there is a burner and may or may not work for you.
  • To download several profiles one after another, create instaProfile/usernames.txt listing the usernames. This file is not required for a single profile.
python -m utils.preprocessor.insta_profile_download

To run the Reddit Downloader

  • This tool works as a Chrome extension to bypass Reddit login issues.
  • Downloads everything to the Downloads/ folder.
  • Logs can be checked using Chrome's DevTools Console.
- Navigate to chrome://extensions/ using Google Chrome.
- Enable Developer Mode.
- Select "Load Unpacked".
- Select the folder containing the extension's files. By default this is utils/preprocessor/reddit_downloader.

To find and delete duplicate files in a particular directory (Windows-only)

  • Update the target folder in duplicate_finder.ps1.
  • Execute the script in PowerShell.
    $ ./utils/preprocessor/duplicate_finder.ps1 # From project root directory or
    $ ./duplicate_finder.ps1                    # if CWD = utils/preprocessor/
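
On systems without PowerShell, duplicates can be listed with a short Python script. This is a rough sketch, not a port of duplicate_finder.ps1: it assumes that "duplicate" means byte-identical content (compared via SHA256), which may differ from how the PowerShell script matches files. The function names are hypothetical, and the sketch only reports duplicate groups without deleting anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hex SHA256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Return groups (lists of Paths) of files with identical content."""
    by_hash = defaultdict(list)
    # sorted() makes the order within each group deterministic.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_hash[file_digest(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        for group in find_duplicates(sys.argv[1]):
            print("Duplicates:", ", ".join(str(p) for p in group))
```

Review the printed groups before removing anything; keeping the first path of each group and deleting the rest is one possible policy, left out here on purpose.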

To Train a Random Forest Model

  • Make sure you check the parameter grid before running.
python -m trainers.RandomForestTrainer --input "path/to/features.csv" --output "(OPTIONAL) path/to/outputDirectory" --test_size (OPTIONAL) 0.2 --random_state (OPTIONAL) 420

To Train an XGBoost Model

  • Make sure you check the parameter grid before running.
python -m trainers.XGBoostTrainer --input "path/to/features.csv" --output "(OPTIONAL) path/to/outputDirectory" --test_size (OPTIONAL) 0.2 --random_state (OPTIONAL) 420

To Train a CatBoost Model

  • Make sure you check the parameter grid before running.
  • Known issue: interrupting training with Ctrl+C (KeyboardInterrupt) is not handled cleanly.
python -m trainers.CatBoostTrainer --input "path/to/features.csv" --output "(OPTIONAL) path/to/outputDirectory" --test_size (OPTIONAL) 0.2 --random_state (OPTIONAL) 420
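
All three trainers accept --test_size and --random_state. As a worked example of what a test_size fraction means in image counts, the sketch below assumes the trainers split data the way scikit-learn's train_test_split does (test count rounded up, the remainder used for training). That is an assumption about the trainers, not something confirmed here, and split_counts is a hypothetical helper. Under that assumption, a 0.2 split of the 31,762-image dataset gives the 25,409 / 6,353 counts listed under RESULTS.

```python
import math


def split_counts(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Mirrors the rounding convention assumed above: the test set size is
    ceil(test_size * n_samples) and the training set gets the rest.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


if __name__ == "__main__":
    n_train, n_test = split_counts(31762, 0.2)
    print(f"train={n_train}, test={n_test}")
```

Keeping --random_state fixed (for example at 420, as in the commands above) is what makes the same rows land in the same split from run to run.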

TODO: (for images branch)

  • Fix batch processing for 2048FeaturesDetector.py
  • Add resource consumption checks for different stages of 2048FeaturesDetector.py
  • Convert hard-coded or input-based values to command-line args (for the UI)
  • Create a pipeline to generate deepfakes.
  • Start working on the Detector
    • Feature Extractors
      • ResNet50
        • imagenet
        • Open Images Dataset (by Google)
        • COCO (Common Objects in Context)
      • InceptionV3 (different layers)
        • imagenet
        • Open Images Dataset (by Google)
        • COCO (Common Objects in Context)
      • VGG16
        • imagenet
        • Open Images Dataset (by Google)
        • COCO (Common Objects in Context)
      • Qwen3-VL (To extract text description of images)
        • Detailed image description derived from image
        • Analysis of the detailed image description
    • Feature Classifiers (Micro Learning)
      • Random Forest
      • XGBoost
      • CatBoost
        • KeyboardInterrupt errors
        • Need to check for memory constraints
      • LightGBM
      • Linear SVM
      • Regularized Logistic Regression
      • Custom Neural Network
    • Feature Clustering (with Principal Component Analysis) - Doesn't seem very useful - curves will be very similar for highly realistic deepfakes
      • K Means Clustering
      • Gaussian Mixture Models
      • Hierarchical Clustering
      • DBSCAN
    • Image Classifiers
  • Add support for scalpel. If embedded files (steganography) found, this will be used to extract all files.
  • Fix all READMEs.

Note

Args:

Labels:

  • Real = 1
  • Fake = 0
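
Note that this encoding is easy to read backwards, since 1 means Real rather than Fake. A small illustrative helper for turning numeric predictions into names is sketched below; LABEL_NAMES, describe_predictions and count_predictions are hypothetical names, not part of the detector.

```python
from collections import Counter

# Label encoding as documented above: Real = 1, Fake = 0.
LABEL_NAMES = {1: "Real", 0: "Fake"}


def describe_predictions(predictions):
    """Map numeric class predictions (0/1) to their label names."""
    return [LABEL_NAMES[int(p)] for p in predictions]


def count_predictions(predictions):
    """Count how many predictions fall under each label name."""
    return Counter(describe_predictions(predictions))


if __name__ == "__main__":
    preds = [1, 0, 0, 1, 1]
    print(describe_predictions(preds))
    print(dict(count_predictions(preds)))
```

Under this encoding, any metric that singles out a "positive" class with label 1 is describing Real images, not fakes.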

RESULTS

32k Models

Model                             Random Forest                         XGBoost
Dataset Type                      Images                                Images
Dataset Size                      Total: 31,762                         Total: 31,762
                                  Real: 15,364                          Real: 15,364
                                  Fake: 16,398                          Fake: 16,398
Feature Extractor                 ResNet50 (imagenet)                   ResNet50 (imagenet)
Data Split                        Train: 80% (25,409 images)            Train: 80% (25,409 images)
                                  Test: 20% (6,353 images)              Test: 20% (6,353 images)
Parameters                        n_estimators: 150                     n_estimators: 100
                                  max_depth: null                       max_depth: 3
                                  min_samples_split: 5                  learning_rate: 0.01
                                  min_samples_leaf: 1                   min_child_weight: 1
                                  max_features: 0.2                     gamma: 0
                                  bootstrap: false                      reg_alpha: 0
                                                                        reg_lambda: 1
                                                                        subsample: 0.8
                                                                        colsample_bytree: 0.8
                                                                        scale_pos_weight: 1
Report Path                       Classification Report | Full Results  Classification Report | Full Results
Accuracy                          0.854                                 0.869
Precision                         0.854                                 0.869
F1 Score                          0.854                                 0.869
Matthews Correlation Coefficient  0.7070566271285679                    0.7389886538917796
Cohen's Kappa                     0.7070160654228586                    0.7388856764484775
Balanced Accuracy                 0.8536314517473194                    0.8696438988673973
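
For readers unfamiliar with the less common metrics above, here is a sketch of how Matthews Correlation Coefficient, Cohen's Kappa and Balanced Accuracy follow from a binary confusion matrix. It uses the standard textbook formulas in plain Python. The function name binary_metrics and the counts in the example are made up for illustration; they are not the confusion matrices of the models above, and this is not the code the trainers use to produce their reports.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute summary metrics from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n

    # Balanced accuracy: mean of the recall on each class.
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    balanced_accuracy = (recall_pos + recall_neg) / 2

    # Matthews correlation coefficient (defined as 0 when the denominator is 0).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to keep the arithmetic easy to follow.
    for name, value in binary_metrics(tp=40, fp=10, fn=10, tn=40).items():
        print(f"{name}: {value:.3f}")
```

MCC and kappa both range from -1 to 1, with 0 meaning no better than chance, which is why they read lower than the accuracy figures for the same model.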

Datasets Used:


Important

This project is under the MIT license but the datasets used may be under different licenses. You must comply with them all when using any of the trained models from this repository.
