Lab 4: Detective and Corrective Controls

In this lab you will test a remediating detective control deployed by the cloud platform engineering team in Lab 1. The control is designed to detect Amazon SageMaker training jobs created outside of the secure data science VPC and terminate them. To exercise it, you will open the Jupyter notebook 00_SageMaker-SysOps-Workflow and execute the cells up to and including the creation of a training job. As the training job is provisioned and begins to execute, you will notice that it is terminated by the corrective aspect of the control in the environment.


Start a training job

As the data scientist, read through and execute the cells in the Jupyter notebook named 00_SageMaker-SysOps-Workflow.ipynb. These cells will:

  • set variables to reference the data and model Amazon S3 buckets that the data science administrator created for your team
  • copy training data to your local Jupyter notebook
  • preprocess the data locally and with SageMaker Processing
  • push the processed training data to your data S3 bucket (see the sketch after this list)
  • push your training container to Amazon ECR
  • configure a training job for you to execute
  • use SageMaker Experiments to track metadata for your training jobs
  • use the SHAP library to explain feature importance for your trained model
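For example, pushing the processed training data to the data bucket typically looks something like the following minimal sketch. The bucket name, local path, and key prefix here are placeholder values standing in for the variables defined earlier in the notebook:

import sagemaker

# Placeholder values -- in the notebook these come from earlier cells
data_bucket = 'ds-data-bucket-example'       # your team's data S3 bucket
local_path = 'data/processed/train.csv'      # locally processed training data

# Upload the processed data to the data bucket under a training prefix
session = sagemaker.Session()
train_s3_uri = session.upload_data(
    path=local_path,
    bucket=data_bucket,
    key_prefix='training/train')             # hypothetical prefix
print(train_s3_uri)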

When you reach the cell titled Train without a VPC configured, execute the cell and take note of the output. After a few minutes you should notice that the training job has been terminated. The output should resemble the following, which indicates that the training job did not complete its bootstrap:

2019-10-11 13:50:37 Starting - Starting the training job...
2019-10-11 13:51:01 Starting - Launching requested ML instances...
2019-10-11 13:51:35 Starting - Preparing the instances for training......
2019-10-11 13:52:29 Downloading - Downloading input data
2019-10-11 13:52:29 Stopping - Stopping the training job
2019-10-11 13:52:29 Stopped - Training job stopped
..Training seconds: 1
Billable seconds: 1
training completed.

Detective control explained

The training job was terminated by an AWS Lambda function that was executed in response to an Amazon CloudWatch Events rule that was triggered when the training job was created. The Lambda function inspected the training job, determined that it was not attached to a VPC, and stopped it from executing.
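Conceptually, the enforcement logic resembles the following minimal sketch. This is an illustration, not the deployed function's actual code; it assumes the rule forwards the CreateTrainingJob API call recorded by CloudTrail to the handler:

import boto3

sagemaker_client = boto3.client('sagemaker')

def lambda_handler(event, context):
    # CloudTrail-based events carry the API request under 'detail'
    request = event['detail']['requestParameters']
    job_name = request.get('trainingJobName') or request.get('TrainingJobName')

    # A compliant job reports a VpcConfig in its description; if it is
    # missing, the job is running outside the secure VPC -- stop it
    job = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    if 'VpcConfig' not in job:
        sagemaker_client.stop_training_job(TrainingJobName=job_name)
        print(f'Stopped non-compliant training job: {job_name}')
    else:
        print(f'Training job {job_name} is VPC-attached; no action taken')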

Assume the role of the Data Science Administrator and review the code of the AWS Lambda function SagemakerTrainingJobVPCEnforcer. Also review the CloudWatch Events rule SagemakerTrainingJobVPCEnforcementRule and take note of the event pattern that triggers execution of the Lambda function.
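Rules of this kind typically match the CreateTrainingJob API call recorded by AWS CloudTrail. Expressed as a Python dictionary, such an event pattern looks roughly like the following; the values are illustrative of the usual pattern, not copied from the deployed rule:

# Illustrative CloudWatch Events pattern for matching SageMaker
# CreateTrainingJob API calls recorded by AWS CloudTrail
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["sagemaker.amazonaws.com"],
        "eventName": ["CreateTrainingJob"]
    }
}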


Start a compliant training job

To successfully run your training job you will need to configure it to run within your VPC. To do this, pass the collection of subnet IDs and security group IDs that you imported earlier.

The following sample code shows how these can be specified:

TensorFlow(entry_point='predictor.py',
           ...,
           train_instance_count=1,
           train_instance_type=instance_type,
           # attach the training job to the secure data science VPC
           subnets=['subnet-0fc1ed6b334bd4cfd', 'subnet-0f398485e991f8333'],
           security_group_ids=['sg-0da87d40633b8f922'],
           ...
           )

Once you have modified the code, execute the cell. The training job should complete successfully, producing output similar to the following:

2019-10-16 19:57:54 Starting - Starting the training job...
2019-10-16 19:57:56 Starting - Launching requested ML instances......
2019-10-16 19:58:59 Starting - Preparing the instances for training...
2019-10-16 19:59:46 Downloading - Downloading input data...
2019-10-16 20:00:25 Training - Training image download completed. Training in progress..
2019-10-16 20:00:25,711 INFO - root - running container entrypoint
2019-10-16 20:00:25,711 INFO - root - starting train task
2019-10-16 20:00:25,727 INFO - container_support.training - Training starting
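If you want to confirm that the job ran inside your VPC, you can inspect its description. A minimal sketch using boto3 follows; the job name is a placeholder for the name of the training job you just ran:

import boto3

sm = boto3.client('sagemaker')

# Placeholder name -- substitute the name of your training job
response = sm.describe_training_job(TrainingJobName='my-training-job-name')

# A compliant job reports the subnets and security groups it was attached to
print(response.get('VpcConfig'))
print(response['TrainingJobStatus'])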

In this lab, you experienced a remediating detective control deployed by the cloud platform engineering team and reconfigured the SageMaker training job to run attached to your VPC. Waiting several minutes to find out that a training job will be stopped, however, is a slow and painful way to iterate during development.

Finish executing the rest of the cells in the notebook.

Once you are done with this notebook, run through the next notebook, 02_SageMaker-DevOps-Workflow. In this notebook, you will explore a different form of detective control: Model Monitoring.

Model Monitoring is key to ensuring that production endpoints are monitored for data drift or concept drift, where the data or label distribution changes over time and degrades model performance. Run through the cells in the notebook to set up a production endpoint and deploy Model Monitoring against it.
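For orientation, scheduling a monitor against an existing endpoint with the SageMaker Python SDK can look roughly like the following minimal sketch. It assumes data capture is already enabled on the endpoint; the bucket paths and endpoint name are placeholders, and the notebook's own cells may differ:

from sagemaker import get_execution_role
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = get_execution_role()

# Baseline the training data so the monitor knows what "normal" looks like
monitor = DefaultModelMonitor(role=role,
                              instance_count=1,
                              instance_type='ml.m5.xlarge',
                              volume_size_in_gb=20,
                              max_runtime_in_seconds=3600)

monitor.suggest_baseline(
    baseline_dataset='s3://<data-bucket>/train/train.csv',    # placeholder path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri='s3://<data-bucket>/monitoring/baseline')   # placeholder path

# Check captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    endpoint_input='<your-endpoint-name>',                    # placeholder name
    output_s3_uri='s3://<data-bucket>/monitoring/reports',    # placeholder path
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly())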

Once you are done with this, be sure to delete your endpoint and Model Monitoring schedule to avoid incurring unnecessary costs.
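Assuming the monitor and predictor objects from the earlier cells, cleanup can look roughly like this:

# Tear down the monitoring schedule first, then the endpoint behind it
monitor.delete_monitoring_schedule()
predictor.delete_endpoint()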

You may also proceed to the optional Lab on Preventative Controls.