Inductive Representation Learning on Large Graphs
This project implements the GraphSAGE model with four types of aggregation (mean, max, sum, GCN) and compares its performance with DeepWalk across multiple datasets (Citeseer, PPI, and OpenAlex). Evaluation covers F1 score, recall, precision, accuracy, confusion matrix, and classification report, along with embedding visualization.
Implementation of GraphSAGE based on the paper "Inductive Representation Learning on Large Graphs" by Hamilton et al., 2017
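As a rough illustration of how the four aggregation variants differ, here is a minimal, dependency-free sketch. It is not the code in `models.py`: the names `aggregate` and `sage_input` are hypothetical, and the sketch leaves out the learned weight matrices, the nonlinearity, and neighbor sampling. It only shows which vector each variant would hand to the layer's weights.

```python
from typing import Dict, List

Vector = List[float]


def aggregate(neighbors: List[Vector], kind: str) -> Vector:
    """Combine neighbor feature vectors element-wise (hypothetical helper)."""
    if not neighbors:
        raise ValueError("need at least one neighbor vector")
    columns = list(zip(*neighbors))  # one tuple per feature dimension
    if kind == "mean":
        return [sum(col) / len(col) for col in columns]
    if kind == "max":
        return [max(col) for col in columns]
    if kind == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown aggregator: {kind}")


def sage_input(node: int, features: Dict[int, Vector],
               adjacency: Dict[int, List[int]], kind: str) -> Vector:
    """Vector a GraphSAGE layer would pass on to its weights and nonlinearity."""
    neighbor_vectors = [features[n] for n in adjacency[node]]
    if kind == "gcn":
        # GCN variant: average the node together with its neighbors, no concatenation.
        return aggregate(neighbor_vectors + [features[node]], "mean")
    # Other variants: concatenate the node's own vector with the aggregated neighborhood.
    return features[node] + aggregate(neighbor_vectors, kind)


# Toy graph (made-up values): node 0 is connected to nodes 1 and 2.
features = {0: [1.0, 0.0], 1: [3.0, 2.0], 2: [5.0, 4.0]}
adjacency = {0: [1, 2], 1: [0], 2: [0]}
mean_out = sage_input(0, features, adjacency, "mean")  # [1.0, 0.0, 4.0, 3.0]
gcn_out = sage_input(0, features, adjacency, "gcn")    # [3.0, 2.0]
```

Note that the GCN variant produces a vector of the original feature width, while the other three double it through concatenation, so the following weight matrix would need a different input size.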
GraphSage/
├── Figures/          # All generated figures
├── graphvenv/        # Virtual environment to isolate dependencies
├── config.py         # Parameters and hyperparameters (dataset, learning rate, number of layers, etc.)
├── dataloader.py     # Data loading and preprocessing
├── models.py         # Model definitions: GraphSAGE (mean, max, LSTM aggregations), GCN, and DeepWalk
├── train.py          # Training loop with early stopping and logging
├── evaluation.py     # Evaluation and visualization functions (F1 score, recall, precision, confusion matrix, classification report)
├── utils.py          # Utility functions (embedding visualization, model saving, etc.)
├── requirements.txt  # Python dependencies (torch, torch-geometric, etc.)
├── .gitignore        # Files and folders to ignore in Git (e.g., __pycache__, checkpoints, logs, etc.)
├── README.md         # Project documentation (description, installation, usage, etc.)
└── main.py           # Entry point for training and evaluation
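The metrics listed for `evaluation.py` can all be derived from a confusion matrix. The sketch below shows that relationship in plain Python. It is an illustration only, not the project's code: the function names and the toy labels are assumptions, and a real run would more likely rely on a library implementation.

```python
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int],
                     num_classes: int) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix: List[List[int]]) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, f1) per class; 0.0 where a ratio is undefined."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column total
        actual = sum(matrix[c])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


# Toy labels for a 3-class problem (made-up data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
scores = per_class_scores(cm)
accuracy = sum(cm[c][c] for c in range(3)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
```

Macro averaging, used here, weights every class equally; micro averaging would instead pool the counts across classes before computing the ratios.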
- Operating System: Windows (tested)
- Hardware: A CUDA-compatible GPU is recommended for faster, full training; compatible with platforms like SageMaker
- Python 3.12.6+
git clone https://github.com/SamarKri/GraphSage.git
cd GraphSage
python -m venv graphvenv
graphvenv\Scripts\activate
pip install -r requirements.txt
python main.py --model graphsage --dataset citeseer
- Add unit tests for dataloader, model, train, evaluation, and utils
- Test implementation on other datasets like Cora, PubMed, or Reddit
- Add a directory for dataset storage/reference (Citeseer, Cora, PubMed, Reddit, PPI, OpenAlex)
- Optimize hyperparameters using Optuna
- Define additional performance metrics such as AUC-ROC for multi-label problems
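
For the last item, one possible shape of a multi-label AUC-ROC metric is sketched below. This is not part of the project: `binary_auc` and `macro_auc` are hypothetical names, the data is made up, and the pairwise loop is quadratic, so it is meant to show the definition rather than to be used on large graphs. It treats each label as its own binary problem and averages the per-label scores.

```python
from typing import List, Optional


def binary_auc(labels: List[int], scores: List[float]) -> Optional[float]:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # undefined when only one class is present for this label
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Average the per-label AUC over the labels where it is defined."""
    num_labels = len(y_true[0])
    aucs = []
    for j in range(num_labels):
        auc = binary_auc([row[j] for row in y_true],
                         [row[j] for row in y_score])
        if auc is not None:
            aucs.append(auc)
    if not aucs:
        raise ValueError("AUC is undefined for every label")
    return sum(aucs) / len(aucs)


# Four nodes, two labels each (made-up targets and predicted scores).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.4, 0.7], [0.3, 0.6], [0.1, 0.8]]
result = macro_auc(y_true, y_score)  # label 0: 0.75, label 1: 0.5
```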