GETTING STARTED
The purpose of this guide is to illustrate the main features that ml-dsl provides. It assumes a very basic working knowledge of machine learning practices (data processing, fitting, predicting, etc.).
Please refer to our installation instructions for installing ml-dsl.
As a rule, the standard working process of a data scientist includes steps such as data processing, model training, deployment, and evaluation. Sometimes the resources of a desktop/laptop are enough for execution, but in some cases more resources are needed. In that case the data specialist turns to cloud platforms, which requires a lot of additional knowledge: preparing and deploying code to the cloud, and familiarity with cloud SDK client libraries or command-line tools.
The main idea of ml-dsl is to simplify this process for data specialists. ml-dsl lets you submit jobs and run your code on cloud platforms directly from a Jupyter notebook.
Let’s see an example. We have a movie review dataset and want to build a model for classifying the reviews as positive or negative. We are going to build a simple LSTM network. A text sample of a movie review:
First of all, we need to prepare sequences from the original text. We are going to use the GloVe vector representation. GloVe is an unsupervised learning algorithm for obtaining vector representations of words.
Import necessary modules:
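For instance, a preprocessing script of this kind would typically start with imports like these (a sketch; the exact set depends on your code):

```python
# Imports used by the preprocessing sketches below.
import re

import numpy as np
from pyspark.sql import SparkSession
```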
Read glove.6B.50d.txt using pyspark.
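One possible way to do this (a sketch; each line of glove.6B.50d.txt contains a word followed by its 50 coefficients):

```python
def read_glove(spark, path):
    """Read a GloVe text file into a word -> index mapping and an embedding matrix."""
    rows = spark.read.text(path).rdd.map(lambda r: r.value.split(' ')).collect()
    # index 0 is reserved for padding/unknown words, so real words start at 1
    word_index = {row[0]: i + 1 for i, row in enumerate(rows)}
    vectors = np.vstack([
        np.zeros((1, len(rows[0]) - 1), dtype=np.float32),
        np.array([[float(x) for x in row[1:]] for row in rows], dtype=np.float32),
    ])
    return word_index, vectors

spark = SparkSession.builder.appName('prepare-reviews').getOrCreate()
word_index, glove_vectors = read_glove(spark, 'glove.6B.50d.txt')
```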
A function to read the reviews, tokenize each one, and replace words with their indices in GloVe:
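A minimal sketch of such a function (the regular-expression tokenizer is only illustrative, and reading the review files themselves is done separately with Spark in the next step):

```python
MAX_LEN = 200  # fixed sequence length, chosen arbitrarily for this sketch

def review_to_indices(text, word_index, max_len=MAX_LEN):
    """Tokenize a review and replace each word with its GloVe index (0 if unknown)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    indices = [word_index.get(token, 0) for token in tokens][:max_len]
    # pad with zeros so every review becomes a sequence of the same length
    return indices + [0] * (max_len - len(indices))
```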
Join the positive and negative reviews into train/test datasets and save them for further work.
Use the functions to prepare the train/test datasets and save them:
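Put together, the preparation step could look roughly like this (the input/output paths and the directory layout of the review dataset are placeholders):

```python
def build_dataset(spark, pos_path, neg_path, word_index):
    """Read positive and negative reviews, turn them into index sequences and label them."""
    pos = spark.read.text(pos_path).rdd.map(lambda r: (review_to_indices(r.value, word_index), 1))
    neg = spark.read.text(neg_path).rdd.map(lambda r: (review_to_indices(r.value, word_index), 0))
    return pos.union(neg).toDF(['sequence', 'label'])

# placeholder paths -- adjust them to where your review files actually live
train_df = build_dataset(spark, 'reviews/train/pos', 'reviews/train/neg', word_index)
test_df = build_dataset(spark, 'reviews/test/pos', 'reviews/test/neg', word_index)
train_df.write.mode('overwrite').parquet('output/train')
test_df.write.mode('overwrite').parquet('output/test')
```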
Using ml-dsl, you can put it all together and register your code as a Python script using the cell magic %%py_script.
First, you need to import the magic functions:
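The exact import path depends on how ml-dsl packages its magics, so the line below is only a placeholder; check the installation instructions for the real module name:

```python
# Placeholder import path -- replace it with the module documented in the
# ml-dsl installation instructions; it registers %%py_script, %py_data,
# %py_train, %py_deploy and %py_test in the notebook.
from ml_dsl import *
```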
The necessary arguments here are:
- `--name` (`-n`): name of the Python script.
- `--path` (`-p`): path to the folder where the Python script is saved.
- `--output_path` (`-o`): path to the folder where the results of the job are saved.
If you use the flag --exec (-e), the cell with your code is run locally right away. If you are going to run the cell, you need to add all the arguments for your job. Below is an example (for the sake of concision, the function’s code is not given completely).
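A hypothetical cell of this kind might look as follows (the script name, paths, and job arguments are all illustrative):

```python
%%py_script --name prepare_data.py --path scripts --output_path output/data --exec
# prepare_data.py -- registered by %%py_script and, because of --exec, also run locally.
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_path', required=True)
    args, _ = parser.parse_known_args()
    # ... read GloVe, build the train/test datasets and save them under args.output_path ...

if __name__ == '__main__':
    main()
```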
We have run the script locally and now need to run it on the full dataset, which is located on some cloud storage (GS or S3). Spark jobs typically run on a Dataproc cluster on Google Cloud Platform and on an EMR cluster on Amazon Web Services. As a rule, you submit jobs to these services using an SDK with client libraries or command-line tools. The information you need to run a job includes the name of the cluster and bucket, files and packages for the job, and other details.
ml-dsl offers the Profile abstraction to describe all the information you need to submit jobs to cloud platforms. To describe a Spark job’s profile, import the PySparkJobProfile class and create an instance of it. You also need to set the platform on which the Spark job will run.
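A sketch of what defining a profile could look like; the constructor fields and the way the platform is selected are assumptions, so check the Profile documentation for the real signature:

```python
# Field names below are illustrative, not the confirmed PySparkJobProfile signature.
profile = PySparkJobProfile(
    bucket='my-bucket',          # cloud storage bucket for job artifacts (assumed field)
    cluster='my-cluster',        # Dataproc/EMR cluster name (assumed field)
    region='us-central1',        # cluster region (assumed field)
    job_prefix='prepare_data',   # prefix for generated job names (assumed field)
)
# Selecting the target platform (GCP or AWS) is also required; the exact call is
# library-specific and not shown here.
```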
To start the job, use the magic function %py_data:
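A hypothetical invocation (the argument names mirror those of %%py_script; how the profile is referenced is an assumption):

```python
# Hypothetical arguments -- consult the %py_data documentation for the real ones.
%py_data --name prepare_data.py --path scripts --output_path gs://my-bucket/data --profile profile
```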
The output is similar for both platforms: JSON with some job information, links to the output, etc.
So the train and test datasets are ready. The next step is to create a simple LSTM classifier and start training. We need to define functions for reading the train dataset, defining a model, saving results, etc. As described above for the data processing script, you can write all the functionality you need, register it with %%py_script, and run it locally.
For the sake of concision, only some parts of the training script are shown here (for a full example, please refer to the examples):
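For illustration, the model-building part of such a script might look like this (a Keras sketch; the layer sizes are arbitrary):

```python
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

def build_model(glove_vectors):
    """Simple LSTM classifier on top of frozen GloVe embeddings."""
    vocab_size, embedding_dim = glove_vectors.shape
    model = Sequential([
        Embedding(vocab_size, embedding_dim,
                  embeddings_initializer=Constant(glove_vectors), trainable=False),
        LSTM(64),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```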
In the same way, you define a setup.py file for packaging the code for AI Platform. Example:
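A minimal setup.py of this kind might look as follows (the package name and dependencies are illustrative):

```python
from setuptools import find_packages, setup

setup(
    name='trainer',                    # illustrative package name
    version='0.1',
    packages=find_packages(),
    install_requires=['tensorflow'],   # whatever the training script actually needs
    description='LSTM movie-review classifier training package.',
)
```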
We have built a training script and checked that it works locally. Now let’s start training on AI Platform or SageMaker.
Let’s set the type of platform (in case it wasn’t done before) and a profile for training.
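A sketch of such a training profile; the class name and its fields are assumptions here, so refer to the Profile documentation for the real ones:

```python
# AIProfile and its fields are assumed names, not the confirmed ml-dsl API.
train_profile = AIProfile(
    bucket='my-bucket',        # storage bucket for the packaged code and outputs (assumed)
    region='us-central1',      # training region (assumed)
    job_prefix='train_lstm',   # prefix for generated job names (assumed)
)
```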
Start the training job using the magic function %py_train. It starts with packaging your custom code and uploading it to storage.
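A hypothetical invocation (the argument names are assumed by analogy with %%py_script):

```python
# Hypothetical arguments -- consult the %py_train documentation for the real ones.
%py_train --name train.py --path scripts --output_path gs://my-bucket/model --profile train_profile
```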
The output contains some useful information about the job.
The trained model and additional artifacts are saved to output_path.
Now it’s time to deploy the model. Let’s define a dedicated Profile for this.
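A sketch of a deployment profile; the constructor fields are assumptions:

```python
# Field names are illustrative, not the confirmed DeployAIProfile signature.
deploy_profile = DeployAIProfile(
    bucket='my-bucket',         # bucket where the trained model was saved (assumed)
    model_name='review_lstm',   # name under which the model is deployed (assumed)
    version='v1',               # model version label (assumed)
)
```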
Deployment of models can be done using the %py_deploy magic function.
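A hypothetical invocation:

```python
# Hypothetical arguments -- consult the %py_deploy documentation for the real ones.
%py_deploy --name review_lstm --profile deploy_profile
```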
And finally, it’s time to predict. The DeployAIProfile defined above is suitable for prediction as well.
Test predictions can be run using the %py_test magic function with the argument --test (-t).
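A hypothetical invocation (besides --test, additional arguments such as the profile to use may be required):

```python
# Only the --test (-t) flag is documented above; the rest is assumed.
%py_test --test --profile deploy_profile
```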
We have briefly covered the main possibilities ml-dsl offers. This guide should give you an overview of some of the main features of the library.