
Introduction

MLOps is your one-stop shop for machine learning. We focus on creating business value through data science by enabling quick, KPI-oriented iterations. As you know, the machine learning process from data discovery to deployment is a waterfall process with a, usually huge, iterative cycle midway. In MLOps, we optimize for smoothness, transparency and speed throughout this process.

Create a project

In MLOps, to help you stay focused on the business problem at hand, everything revolves around projects. For each Project you can set one, and only one, KPI. Let's call it the project's star metric. You can of course measure your models by many more metrics during training, but the star metric is how you evaluate models against each other in terms of project achievement, keeping everyone aligned on what matters.

Import data

Head over to Datasources in the console. We define a datasource as any data that already exists in your AWS account. Currently, we only support structured data, in most common formats (Parquet, Avro, ORC, CSV, JSON, XML), basically anything that Spark can ingest. We are, however, very pro-binary, and you will see that some features are only supported if you use Parquet, which is what we recommend for all datasources and datasets.

Note

Currently we define a datasource as any structured dataset with a single, consistent schema. This means that if you, for example, have labels in one file and data in another, you should create two separate datasources.

Note

If your datasource for some reason fails to get classified (no schema, records, classification, etc.), you can still use the data in your Spark scripts by calling the method read_undiscovered() on your initiated SparkProcessor.
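
A minimal sketch of how this might look inside a PySpark processing script. Only import mlops, mlops.spark.SparkPreprocessing() and read_undiscovered() are taken from this page; the argument (an S3 path) and the return type (a Spark DataFrame) are assumptions, so check the SDK reference for the exact signature:

import mlops

# Initiate the PySpark processor as described under "Using the SDK" below.
processor = mlops.spark.SparkPreprocessing()

# Assumed usage: read unclassified data straight from S3 as a Spark DataFrame.
raw_df = processor.read_undiscovered("s3://my-bucket/unclassified/")  # placeholder path
raw_df.printSchema()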

When you create a new Datasource, we will run Glue Crawlers and some extra Lambdas to catalogue your data, calculate statistics, and start versioning it. From this point on, whenever new data arrives it will be automatically processed and available to you in the platform.

Process data

This is where the iterations begin. MLOps strongly favors testing many hypotheses and learning in increments. Currently, we support building datasets in PySpark, with Scikit-learn/Pandas processing for smaller datasets coming soon.
When building datasets, you have the option of joining multiple Datasources or reading directly from S3 resources.

Using the SDK

You can and should use the MLOps SDK for reading and writing data, as this ensures the data is versioned and that partitions etc. are automatically recognized as the dataset grows over time. The SDK also ensures that when you write data to disk, it is ready for ML training at the next stage. The SDK is available by default in all your processing jobs and can be imported with import mlops; for PySpark processing specifically, you want to initiate an instance of mlops.spark.SparkPreprocessing().
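
As an illustration, a PySpark processing job might look roughly like the sketch below. The method names read_datasource() and write_dataset() are assumptions made for the sake of the example; only import mlops and mlops.spark.SparkPreprocessing() are taken from this page, so consult the SDK reference for the actual calls:

import mlops

# Initiate the PySpark processing helper from the SDK.
mp = mlops.spark.SparkPreprocessing()

# Assumed helpers: read two catalogued Datasources by name and join them
# (labels live in their own Datasource, since each Datasource has one schema).
features = mp.read_datasource("clickstream")       # assumed method name
labels = mp.read_datasource("clickstream_labels")  # assumed method name
dataset = features.join(labels, on="session_id", how="inner")

# Assumed write helper: persisting through the SDK keeps the dataset versioned
# and partition-aware, ready for the training stage.
mp.write_dataset(dataset, name="clickstream_training")  # assumed method name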

Building datasets

In machine learning, data is everything. As you know, there are many combinations of features to try with most datasets. It can therefore be beneficial to create datasets using only subsets of the total data until you are confident that the feature and algorithm combination is a winner. In the dataset creation dialog, you can create an arbitrary number of smaller batches of your dataset by entering an array like 0.1, 0.3, 0.8, 1.0, which in this case would create randomly sampled datasets of 10%, 30%, 80% and 100%.
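
Conceptually, this batching boils down to random sampling at the given fractions. A rough PySpark equivalent of the 0.1, 0.3, 0.8, 1.0 example, purely illustrative since the platform handles this for you from the dataset dialog:

# Illustrative only: given any Spark DataFrame `dataset` (e.g. the one built in
# the sketch above), sample it at the fractions entered in the dialog.
fractions = [0.1, 0.3, 0.8, 1.0]
batches = {
    fraction: dataset.sample(withReplacement=False, fraction=fraction, seed=42)
    for fraction in fractions
}
# batches[0.1] is a ~10% random sample; batches[1.0] is (approximately) the full set.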

Train models

Training models in MLOps is easy. Start off by choosing a subset from any of your produced datasets (it can be wise to start with a smaller batch until you know your features are solid), and then select whether to run a hyperparameter tuning job or a one-off training job.
MLOps currently supports TensorFlow training, with Scikit-learn and PyTorch soon to come. As with data preprocessing, you are expected to use the MLOps SDK for reading data, passing hyperparameters, using callbacks, predicting and saving models. This way, we keep everything versioned and readily available for you in the console.

Using the SDK

The SDK comes with some neat features that aim to remove all the hassle around making model training production ready. If you are training your model on a dataset that contains all three dataset splits [train, validation, test], your data will be loaded and ready in mlops.{split}_data and mlops.{split}_labels as NumPy matrices.
Your hyperparameters will all be available under mlops.hyperparameters['my_parameter'], whether or not you are running hyperparameter tuning.
Finally, we expect you to use mlops.callback during your training in order to log metrics back to the console. This way, it's easy for you to observe and compare results at scale together with your colleagues.
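
Putting these pieces together, a TensorFlow training script might look roughly like this. The split attributes, mlops.hyperparameters and mlops.callback are taken from this page; the model architecture, the hyperparameter names and the assumption that mlops.callback is Keras-compatible are illustrative:

import mlops
import tensorflow as tf

# Hyperparameters are exposed the same way for tuning jobs and one-off runs.
# 'learning_rate' and 'epochs' are example names, not prescribed by the platform.
learning_rate = float(mlops.hyperparameters['learning_rate'])
epochs = int(mlops.hyperparameters['epochs'])

# Splits are preloaded as NumPy matrices: mlops.train_data, mlops.train_labels, etc.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(mlops.train_data.shape[1],)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

# mlops.callback logs metrics back to the console (assumed Keras-compatible here).
model.fit(
    mlops.train_data, mlops.train_labels,
    validation_data=(mlops.validation_data, mlops.validation_labels),
    epochs=epochs,
    callbacks=[mlops.callback],
)

# Evaluate on the held-out test split.
model.evaluate(mlops.test_data, mlops.test_labels)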

Deploy

When deploying models you can currently choose from two different options: hosted endpoint or batch inference.

Note

A hosted endpoint stays active until you delete it and incurs costs for the entire time it is active.
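
If the hosted endpoint is backed by an Amazon SageMaker endpoint (an assumption; this page only says "hosted endpoint"), invoking it from your own code could look like the sketch below. The endpoint name and payload format are placeholders:

import json
import boto3

# Assumption: the hosted endpoint is a SageMaker endpoint in your AWS account.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-mlops-endpoint",  # placeholder name
    ContentType="application/json",
    Body=json.dumps({"instances": [[0.3, 1.2, 5.0]]}),  # placeholder payload
)
prediction = json.loads(response["Body"].read())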

Batch inference

When using batch inference, you set a schedule for when you want the pipeline (preprocessing + inference) to run, according to cron standards. At the trigger time, MLOps checks whether there is any new data that has not yet been processed, transforms it, runs inference against the model and then outputs the results to a designated S3 bucket or DynamoDB table.
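
For example, a standard five-field cron expression like the following would trigger the pipeline every day at 03:00 (whether the platform expects this dialect or an AWS-style six-field expression is not specified on this page):

# minute hour day-of-month month day-of-week
0 3 * * *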