This project demonstrates misuse detection in container environments using a dataset from Kaggle. The main steps are downloading the dataset, preprocessing the data, training a Random Forest classifier, and evaluating its performance.
The dataset used in this project is the Misuse Detection in Containers Dataset. It contains labeled data indicating whether container misuse has occurred.
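The dataset is downloaded from Kaggle using the following command. The leading ! runs it as a shell command from a notebook cell (Jupyter or Colab); omit it in a regular terminal: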
!curl -L -o misuse-detection-in-containers-dataset.zip https://www.kaggle.com/api/v1/datasets/download/yigitsever/misuse-detection-in-containers-dataset
After downloading, it is extracted using:
!unzip misuse-detection-in-containers-dataset.zip
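Outside a notebook, the extraction step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name extract_archive is hypothetical and not part of this project.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract a zip archive and return the paths of the CSV files it contained."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Listing the CSV members helps confirm the file name to pass to pd.read_csv.
    return [dest / name for name in names if name.lower().endswith(".csv")]
```

Calling extract_archive("misuse-detection-in-containers-dataset.zip") would unpack the archive into the current directory and list the CSV files found, which is a quick way to check the file name before loading it.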
import numpy as np
import pandas as pd
# Load the dataset
data_file_path = 'dataset.csv'
df = pd.read_csv(data_file_path)
# Replace inf values with NaN and drop NaN values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df = df.dropna()
# Drop irrelevant columns
df = df.drop(['Flow ID', 'Src IP', 'Dst IP', 'Timestamp'], axis=1)
# Define features (X) and target (y)
X = df.drop('Label', axis=1)
y = df['Label']
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
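To see what the cleaning and the stratified split do, here is a small sketch on a toy DataFrame. The data and the names toy, clean, X_toy, and y_toy are made up for illustration; only the calls mirror the preprocessing above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: one feature column and an imbalanced label.
toy = pd.DataFrame({
    "feature": np.arange(100, dtype=float),
    "Label": [0] * 80 + [1] * 20,
})
toy.loc[0, "feature"] = np.inf  # e.g. a rate computed with a zero duration
toy.loc[1, "feature"] = np.nan  # a missing value

# Same cleaning as above: turn +/-inf into NaN, then drop incomplete rows.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()
rows_dropped = len(toy) - len(clean)

X_toy = clean.drop("Label", axis=1)
y_toy = clean["Label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("rows dropped:", rows_dropped)
print("train class share:", y_tr.value_counts(normalize=True).round(2).to_dict())
print("test class share:", y_te.value_counts(normalize=True).round(2).to_dict())
```

With stratify set, the class shares in the training and test sets stay close to each other, which matters when some misuse classes are rare.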
from sklearn.ensemble import RandomForestClassifier
# Initialize and train the classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
clf.fit(X_train, y_train)
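Once a Random Forest is fitted, its feature_importances_ attribute gives a rough view of which columns drive the predictions. The sketch below uses synthetic data from make_classification with hypothetical column names; with the real model, clf and X_train.columns would take the place of demo_clf and X_demo.columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are hypothetical, not from the dataset.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

demo_clf = RandomForestClassifier(n_estimators=100, random_state=42)
demo_clf.fit(X_demo, y_demo)

# feature_importances_ holds impurity-based importances that sum to 1.
importances = (
    pd.Series(demo_clf.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importances.head(10))
```

Impurity-based importances can favor high-cardinality features, so they are best read as a first hint rather than a ranking to rely on.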
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Generate classification report
report = classification_report(y_test, y_pred, labels=clf.classes_, target_names=[str(c) for c in clf.classes_])  # Names come from the fitted classes, not a hard-coded count
print(report)
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
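A raw confusion matrix is easier to read with labeled rows and columns, and per-class recall can be derived from it directly. This is a minimal sketch on made-up labels and predictions; with the real model, y_test and y_pred would replace y_true_demo and y_pred_demo.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

labels = np.unique(y_true_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{c}" for c in labels],
    columns=[f"pred_{c}" for c in labels],
)

# Per-class recall: correct predictions (diagonal) over true instances (row sums).
per_class_recall = pd.Series(np.diag(cm) / cm.sum(axis=1), index=labels)

print(cm_df)
print(per_class_recall.round(2))
```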
import joblib
# Save the model
joblib.dump(clf, 'random_forest_model.pkl')
print("Model saved as 'random_forest_model.pkl'")
# Load the model
loaded_model = joblib.load('random_forest_model.pkl')
# Use the loaded model for predictions
y_pred_loaded = loaded_model.predict(X_test)
print("Predictions:", y_pred_loaded)
- Python 3.x
- NumPy
- Pandas
- Scikit-learn
- Joblib
- Clone the repository.
- Download the dataset using the provided command.
- Run the provided script to preprocess data, train the model, and evaluate its performance.
- Use the saved model for predictions on new data.
- The model achieved an accuracy of approximately 99% on the held-out test set (20% of the data).
- The classification report provides detailed precision, recall, and F1-score metrics for each class.
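The per-class metrics mentioned above can also be retrieved as a dictionary, which makes it easy to spot classes that lag behind the overall accuracy. The following sketch uses made-up labels; output_dict is a real parameter of classification_report, while the data and variable names are illustrative only.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical imbalanced labels: eight samples of class 0, two of class 1.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

report_dict = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# Keep the per-class and averaged rows; "accuracy" is a single float, shown separately.
per_class = pd.DataFrame(
    {name: row for name, row in report_dict.items() if isinstance(row, dict)}
).T

print("accuracy:", report_dict["accuracy"])
print(per_class.round(2))
```

In this toy case the overall accuracy is high while recall for the minority class is much lower, which is why the per-class rows are worth checking alongside the headline accuracy.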
This project illustrates the end-to-end process of building a misuse detection system for container environments using machine learning.