Amazon SageMaker is a powerful enabler and a key component of a data science environment, but it’s only part of what is required to build a complete, secure data science environment. For more robust security you will need other AWS services such as Amazon CloudWatch, Amazon S3, and Amazon VPC. To begin, let’s look at some of the key capabilities you will want in place, working in concert with Amazon SageMaker.
First is your Amazon Virtual Private Cloud (VPC), which will host Amazon SageMaker and the other components of your data science environment. Your VPC provides a familiar set of network-level controls for governing the ingress and egress of data. This workshop begins by creating a VPC with no Internet Gateway (IGW); all of its subnets are therefore private, with no Internet connectivity. Network connectivity to AWS services and to your own shared services will instead be provided through VPC endpoints and AWS PrivateLink. Security groups will control traffic between resources, allowing you to group like resources together and manage their ingress and egress traffic.
In these labs you will create a VPC that:

- has no Internet Gateway, so all of its subnets are private
- reaches AWS services through VPC endpoints powered by AWS PrivateLink
- uses security groups to control traffic between the resources it hosts
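As a rough illustration of that pattern, here is a minimal boto3 sketch; the region, CIDR ranges, and names are placeholder assumptions:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Create the VPC; no Internet Gateway is ever attached, so every
# subnet inside it is private by default.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]

# Security group to gather SageMaker resources and manage their traffic.
sg_id = ec2.create_security_group(
    GroupName="sagemaker-resources",
    Description="SageMaker resources in the data science VPC",
    VpcId=vpc_id,
)["GroupId"]

# Interface VPC endpoint (PrivateLink) so the SageMaker API is reachable
# without any Internet connectivity.
ec2.create_vpc_endpoint(
    VpcId=vpc_id,
    VpcEndpointType="Interface",
    ServiceName="com.amazonaws.us-east-1.sagemaker.api",
    SubnetIds=[subnet_id],
    SecurityGroupIds=[sg_id],
    PrivateDnsEnabled=True,
)
```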
AWS Identity and Access Management (IAM) can help you create preventive controls for many aspects of your data science environment. IAM policies can control access to your data in Amazon S3, govern who can access SageMaker resources like notebook servers, and even be applied as VPC endpoint policies to put explicit controls around the API endpoints you create in your data science environment.
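For example, a VPC endpoint policy can restrict an existing endpoint to a single bucket. A minimal boto3 sketch, where the bucket name and endpoint ID are hypothetical:

```python
import json

import boto3

ec2 = boto3.client("ec2")

# Allow the endpoint to reach only one bucket (name is hypothetical).
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-science-bucket",
                "arn:aws:s3:::my-data-science-bucket/*",
            ],
        }
    ],
}

ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0123456789abcdef0",  # placeholder endpoint ID
    PolicyDocument=json.dumps(endpoint_policy),
)
```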
There are several IAM roles you will need in order to manage permissions and ensure separation of concerns at scale; a sketch of creating one of them follows the list. Those roles are:
- Data scientist user role: grants console access and permission to start, stop, and open Jupyter notebooks
- Notebook creation role: used by a CI/CD pipeline or AWS Service Catalog to create a Jupyter notebook
- Notebook execution role: used by a Jupyter notebook to access AWS resources
- Training / transform job execution role: used by a training job or batch transform job to access AWS resources such as Amazon S3
- Endpoint creation role: used by your CI/CD pipeline to create hosted ML model endpoints
- Endpoint hosting role: used by a hosted ML model to access AWS resources such as Amazon S3
- Endpoint invocation role: used by an application to call a hosted ML model endpoint
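As a concrete, hedged example, here is a boto3 sketch of creating the notebook execution role; the role name, policy name, and bucket are illustrative assumptions, not the names the labs use:

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy: only the SageMaker service may assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="notebook-execution-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Scope the role narrowly, e.g. read-only access to a single bucket.
iam.put_role_policy(
    RoleName="notebook-execution-role",
    PolicyName="notebook-s3-read",
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["s3:GetObject", "s3:ListBucket"],
                    "Resource": [
                        "arn:aws:s3:::my-data-science-bucket",
                        "arn:aws:s3:::my-data-science-bucket/*",
                    ],
                }
            ],
        }
    ),
)
```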
You can also apply IAM condition keys in your policies to grant powerful permissions, but only under specific circumstances. You will learn more about these in the labs.
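As one illustration, here is a minimal policy sketch using the sagemaker:VpcSubnets condition key to deny any training job that is not attached to a VPC; the exact conditions used in the labs may differ:

```python
# Policy statement (shown as a Python dict) that denies CreateTrainingJob
# whenever no VPC subnets are supplied, keeping training traffic private.
deny_non_vpc_training = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "sagemaker:CreateTrainingJob",
            "Resource": "*",
            "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
        }
    ],
}
```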
A data science environment contains the highly sensitive data you use to train your ML models, but also the sensitive intellectual property you develop in the form of algorithms, libraries, and trained models. There are many ways to protect this data, such as the preventive controls described above, defined as IAM policies. In addition, you can encrypt data at rest using managed encryption keys.
Many AWS services, including Amazon S3 and Amazon SageMaker, are integrated with AWS Key Management Service (KMS), making it easy to encrypt your data at rest. You can take advantage of these integrations to ensure that your data is encrypted end to end, in the data lake and in the data science environment alike. This encryption also protects your intellectual property as it is being developed, across the many places it may be stored, such as Amazon S3, Amazon EBS volumes, or AWS CodeCommit Git repositories.
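For instance, a notebook instance can be launched with a customer managed KMS key encrypting its storage volume and with direct Internet access disabled. A minimal sketch; the names, ARNs, and IDs below are placeholders:

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_notebook_instance(
    NotebookInstanceName="secure-notebook",  # hypothetical name
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/notebook-execution-role",
    # KMS key that encrypts the EBS volume attached to the notebook.
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    SubnetId="subnet-0123456789abcdef0",      # private subnet from earlier
    SecurityGroupIds=["sg-0123456789abcdef0"],
    DirectInternetAccess="Disabled",          # keep traffic inside the VPC
)
```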
Using cloud services in a safe and responsible manner is good, but being able to demonstrate to others that you are operating in a governed manner is even better. Developers and security officers alike will need to see activity logs for models being trained and for the people interacting with these systems. Amazon CloudWatch Logs and AWS CloudTrail are there to help, receiving logs from many different parts of your data science environment, including:

- Jupyter notebook instances
- training, tuning, and batch transform jobs
- hosted ML model endpoints
- AWS API calls, recorded by AWS CloudTrail
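As a sketch of what reviewing that activity might look like with boto3 (the training job name is a hypothetical placeholder):

```python
import boto3

# Tail the CloudWatch logs emitted by a training job.
logs = boto3.client("logs")
events = logs.filter_log_events(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix="my-training-job",  # hypothetical job name
    limit=25,
)
for event in events["events"]:
    print(event["timestamp"], event["message"])

# Review recent SageMaker API activity recorded by CloudTrail.
cloudtrail = boto3.client("cloudtrail")
trail = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "sagemaker.amazonaws.com"}
    ],
    MaxResults=10,
)
for record in trail["Events"]:
    print(record.get("EventName"), record.get("Username"))
```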
Let’s now dive into implementing these key areas. You will start by creating a secure, private network environment.