Developing preprocessing scripts and models is tedious and highly iterative, even just to get code that runs. The MLOps platform therefore comes with an option to run your code locally before submitting it to the cloud. This way, you don't spend time waiting for jobs that ultimately fail due to a trivial syntax error.
On your laptop or workstation, all you need to do is run:

```
pip3 install mlops-local
```
Once that is done, you control the local testing environment with two commands:
In addition to launching local mode, you also need to create a folder on your computer under
~/mlops where you store the data samples you want to test your scripts on. Currently this is a manual download process, but a Download sample button will soon appear in the UI.
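As a sketch of what setting up that folder might look like (the `datasets/` subfolder name is just an illustrative choice, not something the platform requires):

```python
from pathlib import Path

# ~/mlops is where local mode looks for sample data.
# The "datasets" subfolder is only an example; organize samples as you like.
sample_dir = Path.home() / "mlops" / "datasets"
sample_dir.mkdir(parents=True, exist_ok=True)

print(sample_dir)  # e.g. /home/you/mlops/datasets
```

Place the manually downloaded sample files inside this folder before running your script locally.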
Testing PySpark scripts¶
Alright, so with local mode up and running, we can head over to Datasets in the console and press Create Datasets View. This takes us to the same dataset input parameters as usual, with the addition of an in-browser code editor and log window. At this point, you can start developing your script, assuming you have a data sample stored under, for example,
~/mlops/datasets/my_dataset.parquet. Typically, you want to start by picking the datasources you want to work with and then press Generate Script Template to get a good skeleton. It should look something like this:
In the screenshot above, you can see three input fields at the bottom. These help us read the local data while preserving the code as is, so that it can later be submitted to the cloud without any modifications.
You must set Local path in the input field relative to
~/mlops/ before running the script locally, so that local mode can resolve the path to your dataset.
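Conceptually, the Local path field is just a path joined onto `~/mlops/`. A minimal sketch of that resolution (the helper name here is hypothetical, not part of the platform's API):

```python
from pathlib import Path

def resolve_local_path(local_path: str) -> Path:
    """Hypothetical helper: interpret a Local path value relative to ~/mlops/.

    For example, "datasets/my_dataset.parquet" resolves to
    ~/mlops/datasets/my_dataset.parquet.
    """
    return Path.home() / "mlops" / local_path

print(resolve_local_path("datasets/my_dataset.parquet"))
```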
So once we have written some cool preprocessing transformations and told local mode where the sample data lives, we can alternate between Run and Stop until we reach a script we are happy with. If everything goes as it should, logs from your job will appear on the right at each run, like this:
Deploying to the cloud¶
Once you have a script you are happy with, just leave it in the code editor as is. Fill in the number of workers, the worker type, and the dataset splits and subsets you want, then press Create. If you also want us to run feature analysis on the output DataFrame, check the Calculate column metrics box.
Calculating column metrics is time-consuming and may well double the total execution time of the job. The metrics you get per feature can be seen below.
Splits tells Spark how we want to divide the data between training, evaluation, and testing. You can give between 1 and 3 comma-separated values.
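Comma-separated split values map naturally onto weights for PySpark's `DataFrame.randomSplit`. As a plain-Python sketch of how such a field might be parsed and validated (the function and rules here are assumptions for illustration, not the platform's actual implementation):

```python
def parse_splits(splits: str) -> list:
    """Parse a comma-separated splits string such as "0.8,0.1,0.1".

    Assumed rules: between 1 and 3 values, each a positive fraction.
    The resulting weights could then be passed to
    pyspark's DataFrame.randomSplit(weights, seed).
    """
    weights = [float(v) for v in splits.split(",")]
    if not 1 <= len(weights) <= 3:
        raise ValueError("Splits takes between 1 and 3 comma-separated values")
    if any(w <= 0 for w in weights):
        raise ValueError("Each split weight must be positive")
    return weights

print(parse_splits("0.8,0.1,0.1"))  # [0.8, 0.1, 0.1]
```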
Subsets are used to generate randomly sampled, smaller datasets from the full DataFrame. This can be beneficial both for cost management and for controlling and inspecting information saturation in your ML algorithms.
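To illustrate the idea, here is a plain-Python sketch of random subsampling (not the platform's implementation; with PySpark you would typically reach for `DataFrame.sample(fraction=...)` instead):

```python
import random

def random_subset(rows, fraction, seed=42):
    """Draw a randomly sampled subset containing the given fraction of rows."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)

rows = list(range(1000))
subset = random_subset(rows, 0.1)
print(len(subset))  # 100
```

Fixing the seed makes the subset reproducible, which helps when comparing model runs against the same sampled data.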
The script above ran on the Kickstarter dataset, with Calculate column metrics checked, and here is the result after a run in the cloud with a single worker.