
This is the group assignment for WQD7008 - Parallel and Distributed Computing.


Machine Learning Workflow Management via HTCondor

Project Background

Our team is training a loan prediction model to classify which customers are at risk of default. Because the data may grow large in the future, we are preparing to move the model training task to a distributed environment.

Introduction to our Project Architecture

To manage the training of our loan prediction machine learning model, we've implemented a workflow using HTCondor. Before delving into our machine learning workflow, let's first examine the architecture of HTCondor itself.

(Figure: HTCondor architecture)

| Component | Role | Description |
| --- | --- | --- |
| HTCondor Submit Node | Job Submission & Development | Where jobs are submitted via submit files (e.g., for Python scripts); also hosts Jupyter Notebooks. |
| HTCondor Central Manager | Task Allocation & Resource Management | Includes the Negotiator (matches jobs to execution nodes) and the Collector (monitors resources). |
| HTCondor Execution Node | Job Execution | Executes training scripts on worker nodes after jobs are matched. |
| Shared Storage (NFS) | Data Access | Provides a shared location, accessible by all nodes, for data input, output, and file access. |
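
To make the submit node's role concrete, here is a minimal sketch of an HTCondor submit file for one of the Python jobs. The file and script names are hypothetical, and the resource requests are illustrative:

```
# train_logreg.sub -- hypothetical submit file for a Python training job
universe                = vanilla
executable              = /usr/bin/python3
arguments               = train_logistic_regression.py
transfer_input_files    = train_logistic_regression.py
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 1
request_memory          = 2GB
output                  = logs/train_logreg.out
error                   = logs/train_logreg.err
log                     = logs/train_logreg.log
queue
```

From the submit node, `condor_submit train_logreg.sub` queues the job; the Central Manager then matches it to an execution node.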

This setup lets the submission, execution, and management components work together while maintaining shared-resource integrity. The architecture is well suited to scalable computational tasks that require distributed resources, and running Jupyter Notebooks on the submit node provides an intuitive, interactive interface for both research and production use.
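
Because the submit node hosts Jupyter Notebooks, jobs can equally be queued interactively. Here is a minimal sketch using the htcondor Python bindings (this assumes the HTCondor 9+ `Schedd.submit` API; the script name is hypothetical):

```python
import htcondor

# Describe a job that runs a Python script in the vanilla universe.
job = htcondor.Submit({
    "executable": "/usr/bin/python3",
    "arguments": "preprocess.py",       # hypothetical script name
    "request_cpus": "1",
    "request_memory": "2GB",
    "output": "preprocess.out",
    "error": "preprocess.err",
    "log": "preprocess.log",
})

schedd = htcondor.Schedd()      # connect to the local scheduler
result = schedd.submit(job)     # queue one instance of the job
print("Submitted cluster", result.cluster())
```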

Machine Learning Workflow Management via HTCondor

Our final output is 06-ml-pipeline-in-htcondor-executor.ipynb; the other Jupyter notebooks are intermediate notebooks in which we divided the job according to the challenges we expected.

If you're interested in our step-by-step setup guide, see here:

The challenges we encountered during the setup are documented here: Part 5 - Challenges we have met

So, back to our topic:

(Figure: Machine learning workflow for loan prediction in HTCondor)

The workflow for training the loan prediction machine learning model in HTCondor, shown in the figure above, consists of the following essential steps:

  • Data Input & Preprocessing: NFS stores the raw loan data (CSV). HTCondor coordinates several compute nodes to retrieve this data, perform preprocessing (one-hot encoding, feature engineering, cleaning, and handling missing values), and save the processed data back to NFS (see the preprocessing sketch after this list).
  • Data Partition: HTCondor assigns the task of splitting the processed data into training (80%) and testing (20%) sets. Executor nodes retrieve the preprocessed data, perform the split, and save the resulting partitions back to NFS.
  • Training Models in a Distributed Manner: HTCondor assigns one executor node to train the Logistic Regression model, while another executor node trains the Decision Tree (the DAG sketch after this list shows how these jobs can run in parallel).
  • Evaluating Models in a Distributed Manner: Several nodes use the test data to evaluate the trained logistic regression and decision tree models. Metrics such as accuracy and recall are computed, and the results are aggregated into a comprehensive assessment.
  • Model Deployment: The best-performing model, according to the evaluation metrics, is saved to the shared NFS folder and can therefore be used by all nodes.
  • Loan Forecasting: Using the deployed model, new loan data can be scored across several executor nodes to produce predictions (see the forecasting sketch after this list).
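
The sketches below illustrate what each stage might look like; all paths, file names, and column names are hypothetical, and our notebooks remain the source of truth. First, preprocessing and partitioning with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

NFS_DIR = "/mnt/nfs/loan"  # hypothetical shared NFS mount

# Preprocessing: clean, impute missing values, one-hot encode categoricals.
df = pd.read_csv(f"{NFS_DIR}/loan_raw.csv")
df = df.dropna(subset=["loan_status"])        # drop rows missing the label
df = df.fillna(df.median(numeric_only=True))  # impute numeric gaps
df = pd.get_dummies(df, drop_first=True)      # encode categorical features
df.to_csv(f"{NFS_DIR}/loan_processed.csv", index=False)

# Partitioning: 80% train / 20% test, written back to NFS for later jobs.
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv(f"{NFS_DIR}/train.csv", index=False)
test.to_csv(f"{NFS_DIR}/test.csv", index=False)
```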
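
The fan-out/fan-in shape of the workflow (preprocess → split → two parallel training jobs → evaluation) maps naturally onto HTCondor DAGMan. Our notebooks drive the steps directly, but a DAG would be the idiomatic HTCondor way to express the same dependencies; the submit-file names here are hypothetical:

```
# pipeline.dag -- hypothetical DAGMan description of the workflow
JOB preprocess   preprocess.sub
JOB split        split.sub
JOB train_logreg train_logreg.sub
JOB train_dtree  train_dtree.sub
JOB evaluate     evaluate.sub

PARENT preprocess CHILD split
PARENT split CHILD train_logreg train_dtree
PARENT train_logreg train_dtree CHILD evaluate
```

Submitting with `condor_submit_dag pipeline.dag` lets DAGMan enforce the ordering while the Central Manager matches each job to an executor node.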
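
Next, a sketch of the per-node training and evaluation steps. Each training job runs on its own executor node; the decision-tree job is analogous, swapping in `sklearn.tree.DecisionTreeClassifier`:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

NFS_DIR = "/mnt/nfs/loan"  # hypothetical shared NFS mount

# Training (one executor node per model).
train = pd.read_csv(f"{NFS_DIR}/train.csv")
X_train = train.drop(columns="loan_status")
y_train = train["loan_status"]
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
joblib.dump(model, f"{NFS_DIR}/logreg.joblib")  # shared via NFS

# Evaluation (runs after both training jobs have finished).
test = pd.read_csv(f"{NFS_DIR}/test.csv")
X_test = test.drop(columns="loan_status")
y_test = test["loan_status"]
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("recall:  ", recall_score(y_test, pred))
```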
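
Finally, deployment and forecasting. The evaluation step writes the winning model to a shared NFS path, and any executor node can then load it to score new loan data (again, paths and names are hypothetical):

```python
import joblib
import pandas as pd

NFS_DIR = "/mnt/nfs/loan"  # hypothetical shared NFS mount

# Load the deployed model (saved by the evaluation/deployment step).
model = joblib.load(f"{NFS_DIR}/best_model.joblib")

# Score a batch of new loan applications and write predictions back to NFS
# (assumes new_loans has been preprocessed the same way as the training data).
new_loans = pd.read_csv(f"{NFS_DIR}/new_loans.csv")
new_loans["default_risk"] = model.predict(new_loans)
new_loans.to_csv(f"{NFS_DIR}/predictions.csv", index=False)
```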
