# Lab 4: Detective and Corrective Controls

In this lab you will test a remediating detective control deployed by the cloud platform engineering team in Lab 1. The control detects Amazon SageMaker training jobs created outside of the secure data science VPC and terminates them. To test it, you will open the Jupyter notebook stockmarket_predictor_v5.ipynb and execute the cells up to and including the creation of a training job. As the training job is provisioned and begins to execute, you will see it terminated by the corrective aspect of the control.


## Start a training job

As the data scientist, read through and execute the cells in the Jupyter notebook named stockmarket_predictor_v5.ipynb. These cells will:

  • set variables to reference the data and model Amazon S3 buckets that the data science administrator created for your team
  • copy training data to your local Jupyter notebook
  • push processed training data to your data S3 bucket
  • create a TensorFlow script to train a model
  • configure a training job to be executed by you

Be sure to place the names of your data and model buckets in the `TEAM_DATA_BUCKET` and `TEAM_MODEL_BUCKET` variables in the cell titled **Pre-requisites**. If you don't know the names of the Amazon S3 buckets created earlier by the Data Science Administrator, they follow the pattern `sagemakerworkshop-data-<YOUR-TEAM-NAME>` and `sagemakerworkshop-model-<YOUR-TEAM-NAME>`. To confirm the names, visit the S3 console and find your data and model buckets.
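As a sketch, the **Pre-requisites** cell might derive the bucket names from your team name following the pattern above. The team name here is a placeholder; substitute your own.

```python
# Placeholder team name -- replace with the name your team chose in Lab 1.
TEAM_NAME = "blue-team"

# Bucket names follow the pattern created by the Data Science Administrator.
TEAM_DATA_BUCKET = f"sagemakerworkshop-data-{TEAM_NAME}"
TEAM_MODEL_BUCKET = f"sagemakerworkshop-model-{TEAM_NAME}"

print(TEAM_DATA_BUCKET)   # sagemakerworkshop-data-blue-team
print(TEAM_MODEL_BUCKET)  # sagemakerworkshop-model-blue-team
```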

When you reach the cell titled **Train without a VPC configured**, execute it and take note of the output. After a few minutes you should see that the training job was terminated. The output should resemble the log below, which indicates that the training job did not complete its bootstrap.


```
2019-10-11 13:50:37 Starting - Starting the training job...
2019-10-11 13:51:01 Starting - Launching requested ML instances...
2019-10-11 13:51:35 Starting - Preparing the instances for training......
2019-10-11 13:52:29 Downloading - Downloading input data
2019-10-11 13:52:29 Stopping - Stopping the training job
2019-10-11 13:52:29 Stopped - Training job stopped
..
Training seconds: 1
Billable seconds: 1
```


### Detective control explained

The training job was terminated by an AWS Lambda function invoked in response to a CloudWatch Event that fired when the training job was created. The Lambda function inspected the training job, saw that it was *not* attached to a VPC, and stopped the training job from executing.

Assume the role of the Data Science Administrator and review the code of the [AWS Lambda function SagemakerTrainingJobVPCEnforcer](https://console.aws.amazon.com/lambda/home?#/functions/SagemakerTrainingJobVPCEnforcer?tab=configuration). Also review the [CloudWatch Event rule SagemakerTrainingJobVPCEnforcementRule](https://console.aws.amazon.com/cloudwatch/home?#rules:name=SagemakerTrainingJobVPCEnforcementRule) and take note of the event which triggers execution of the Lambda function.
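As an illustration of the pattern, a sketch of what an enforcement function like SagemakerTrainingJobVPCEnforcer might do is shown below. The event field names follow the CloudTrail event shape delivered by CloudWatch Events (`detail.requestParameters.vpcConfig`, `detail.requestParameters.trainingJobName`); treat both the field names and the overall logic as assumptions to verify against the deployed function's actual code.

```python
def is_vpc_attached(event):
    """Return True if the CreateTrainingJob API call included a VpcConfig
    with at least one subnet and one security group."""
    params = event.get("detail", {}).get("requestParameters", {})
    vpc = params.get("vpcConfig") or {}
    return bool(vpc.get("subnets")) and bool(vpc.get("securityGroupIds"))


def lambda_handler(event, context):
    """Stop any training job that was created without a VPC attachment."""
    if is_vpc_attached(event):
        return {"action": "none"}
    job_name = event["detail"]["requestParameters"]["trainingJobName"]
    # Deferred import so the decision logic above can be exercised offline.
    import boto3
    boto3.client("sagemaker").stop_training_job(TrainingJobName=job_name)
    return {"action": "stopped", "trainingJobName": job_name}
```

The real function may log, tag, or notify in addition to stopping the job; reviewing its code in the Lambda console is the point of this step.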

---

## Start a compliant training job

To successfully run your training job, you will need to configure it to run within your VPC. To do this, you will pass a collection of subnet IDs and security group IDs to the training job using the SageMaker SDK from your notebook.

Visit the [Parameter Store](https://console.aws.amazon.com/systems-manager/parameters) and copy the values stored for `PrivateSubnetAId`, `PrivateSubnetBId`, and `SageMakerSecurityGroupId`.  Then, using the [SageMaker SDK documentation](https://sagemaker.readthedocs.io/en/stable/estimators.html), modify the Python code in the cell titled **Training with VPC attachment** to configure the estimator with the subnet and security group values captured from the Parameter Store.  
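Rather than copying the values by hand, you could also read them programmatically with boto3. This is a sketch, not part of the workshop notebook; it assumes the three parameter names above exist in the Parameter Store and that the notebook's execution role may call `ssm:GetParameter`.

```python
def fetch_vpc_params():
    """Read the subnet and security group IDs from SSM Parameter Store."""
    # Deferred import so the shaping helper below can be tested offline.
    import boto3
    ssm = boto3.client("ssm")
    names = ["PrivateSubnetAId", "PrivateSubnetBId", "SageMakerSecurityGroupId"]
    return {n: ssm.get_parameter(Name=n)["Parameter"]["Value"] for n in names}


def vpc_config(params):
    """Shape the parameter values into the keyword arguments the
    SageMaker estimator expects."""
    return {
        "subnets": [params["PrivateSubnetAId"], params["PrivateSubnetBId"]],
        "security_group_ids": [params["SageMakerSecurityGroupId"]],
    }
```

You could then unpack the result into the estimator, e.g. `TensorFlow(..., **vpc_config(fetch_vpc_params()))`.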

```python
TensorFlow(entry_point='predictor.py',
           ...,
           train_instance_count=1,
           train_instance_type=instance_type,
           subnets=['subnet-0fc1ed6b334bd4cfd', 'subnet-0f398485e991f8333'],
           security_group_ids=['sg-0da87d40633b8f922'],
           ...
           )
```

Once the code has been modified, execute the cell. The training job should complete successfully, producing output similar to the following:

```
2019-10-16 19:57:54 Starting - Starting the training job...
2019-10-16 19:57:56 Starting - Launching requested ML instances......
2019-10-16 19:58:59 Starting - Preparing the instances for training...
2019-10-16 19:59:46 Downloading - Downloading input data...
2019-10-16 20:00:25 Training - Training image download completed. Training in progress..
2019-10-16 20:00:25,711 INFO - root - running container entrypoint
2019-10-16 20:00:25,711 INFO - root - starting train task
2019-10-16 20:00:25,727 INFO - container_support.training - Training starting
```

In this lab, you experienced a remediating detective control deployed by the cloud platform engineering team and reconfigured the SageMaker training job to run inside your VPC. But waiting several minutes to discover that your training job will fail is a slow and painful way to iterate during development.

In the next lab we will implement a preventive control to make such an error immediately evident.