Lab 1: Best Practice as Code

As the cloud platform engineering team, begin by deploying a shared services VPC that will host a PyPI mirror of approved Python packages for consumption by data science project teams. Next, create a Service Catalog portfolio that project administrators can use to easily deploy data science environments in support of new projects.

This lab assumes that other recommended security practices, such as enabling AWS CloudTrail and capturing VPC Flow Logs, are already in place. The contents of this lab focus solely on controls and guardrails directly related to data science resources.

Shared Services architecture

In this section you will get started quickly by deploying a shared PyPI mirror for use by data science project teams. In addition to the shared service, the template creates IAM roles for use by Service Catalog and by the project administrators who are responsible for creating data science project environments.

The shared PyPI mirror will be hosted in a shared services VPC and exposed to project environments using a PrivateLink-powered endpoint. The mirror will host approved Python packages that were retrieved from public package repositories and can be used by all internal Python applications, such as machine learning code running on SageMaker.
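Once the mirror endpoint is available, project environments can point pip at it instead of the public index. A minimal sketch of the client-side configuration, assuming a hypothetical endpoint DNS name (the real name comes from the PrivateLink interface endpoint the template creates):

```ini
; Hypothetical pip.conf (e.g. ~/.pip/pip.conf) -- replace the host with the
; DNS name of your PrivateLink endpoint for the PyPI mirror.
[global]
index-url = https://pypi-mirror.internal.example.com/simple/
trusted-host = pypi-mirror.internal.example.com
```

With this in place, `pip install` from inside the project VPC resolves packages only through the approved mirror.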

The architecture will look like this:

Shared Services Architecture

Deploy your shared service

As a cloud platform engineering team member, deploy the CloudFormation template linked below to provision a shared service VPC and IAM roles.

Region                  Launch Template
Oregon (us-west-2)      Deploy to AWS
Ohio (us-east-2)        Deploy to AWS
N. Virginia (us-east-1) Deploy to AWS
Ireland (eu-west-1)     Deploy to AWS
London (eu-west-2)      Deploy to AWS
Sydney (ap-southeast-2) Deploy to AWS

Deployment should take around 5 minutes.
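While the stack is creating, you can watch its status in the CloudFormation console or poll it from code. A minimal sketch in Python; the stack name `ds-shared-services` is an assumption, so use whatever name you chose when launching the template:

```python
# Classify CloudFormation StackStatus strings so a polling loop knows
# when to stop. Status names come from the CloudFormation API.
TERMINAL_OK = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}
TERMINAL_FAILED = {"CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}

def stack_state(status: str) -> str:
    """Map a StackStatus value to ready / failed / in-progress."""
    if status in TERMINAL_OK:
        return "ready"
    if status in TERMINAL_FAILED:
        return "failed"
    return "in-progress"

# With boto3 (not executed here), the status would come from describe_stacks:
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-west-2")
#   status = cfn.describe_stacks(StackName="ds-shared-services")["Stacks"][0]["StackStatus"]
#   ...poll until stack_state(status) != "in-progress"
```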

Step-by-Step instructions

Create Project Portfolio

  1. Access the Service Catalog console
  2. Click Portfolios on the left
  3. Click Create portfolio
  4. Enter a Portfolio name of Data Science Project Portfolio
  5. Enter an Owner of Cloud Operations Team
  6. Click the link for your new portfolio to view the portfolio’s details
  7. Click the Groups, roles, and users tab
  8. Click Add groups, roles, users
  9. Click Roles and type DataScienceAdmin into the search field
  10. Tick the box next to your DataScienceAdministrator role and click Add access
  11. Click Products
  12. Click Upload new product
  13. Enter a Product name of Data Science Environment
  14. For Owner enter Cloud Operations Team
  15. Click Use a CloudFormation template
  16. For the CloudFormation template URL enter the appropriate URL from the list below:
    • Region ap-southeast-2,
    • Region eu-west-1,
    • Region eu-west-2,
    • Region us-east-1,
    • Region us-east-2,
    • Region us-west-2,
  17. For Version title enter v1
  18. Click Review and Create product
  19. Click the radio button next to the new product and from the Actions drop-down select Add product to portfolio
  20. Click the radio button for your product portfolio and click Add Product to Portfolio
  21. Return to the list of Portfolios by clicking Portfolios on the left
  22. Click the link for the data science portfolio
  23. Click the Constraints tab in the portfolio detail page
  24. Click Create constraint
  25. From the Product drop-down select your product
  26. Select Launch for the Constraint type
  27. Under Launch Constraint click Select IAM role
  28. From the IAM role drop-down select ServiceCatalogLaunchRole
  29. Click Create
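The console steps above can also be scripted. A sketch of the Service Catalog API parameters a boto3 script would pass (the IDs, role ARN, and template URL below are placeholders; substitute the values from your own account and region):

```python
import json

def portfolio_params():
    # Passed to servicecatalog.create_portfolio(**...)
    return {
        "DisplayName": "Data Science Project Portfolio",
        "ProviderName": "Cloud Operations Team",
    }

def product_params(template_url):
    # Passed to servicecatalog.create_product(**...); the template URL is
    # the region-specific URL from the list in step 16.
    return {
        "Name": "Data Science Environment",
        "Owner": "Cloud Operations Team",
        "ProductType": "CLOUD_FORMATION_TEMPLATE",
        "ProvisioningArtifactParameters": {
            "Name": "v1",
            "Type": "CLOUD_FORMATION_TEMPLATE",
            "Info": {"LoadTemplateFromURL": template_url},
        },
    }

def launch_constraint_params(portfolio_id, product_id, role_arn):
    # Passed to servicecatalog.create_constraint(**...). A LAUNCH constraint
    # makes provisioning run under ServiceCatalogLaunchRole rather than the
    # end user's own permissions.
    return {
        "PortfolioId": portfolio_id,
        "ProductId": product_id,
        "Type": "LAUNCH",
        "Parameters": json.dumps({"RoleArn": role_arn}),
    }
```

You would also call `associate_principal_with_portfolio` (step 10) and `associate_product_with_portfolio` (steps 19–20) to wire the pieces together.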

Review team resources

You have now created the following AWS resources to support the project administration team. Please take a moment to review these resources and their configuration.

  • Amazon S3 buckets for training data and trained models

    Visit the S3 console and see the Amazon S3 buckets that have been created for the team. Take note of the bucket policy that has been applied to the data bucket.

  • AWS KMS key for encrypting data at rest

    A KMS key encrypts data at rest in the data science environment. Visit the console: who is allowed to take which actions on the keys that were created?

  • Parameters added to Parameter Store

    A parameter has been added to the collection in Parameter Store. Can you see what parameter has been added? How would you use this value?

  • Service Catalog Jupyter Notebook product

    A Service Catalog Portfolio containing a best practice Jupyter notebook product has been configured to give the data science team members the ability to create resources on demand.

  • Shared Services VPC

    The template has created a VPC that houses our shared applications. Visit the console and see which services are accessible from within the VPC.

  • PyPI Mirror Service

    A service has been created in the shared services VPC that hosts a PyPI mirror server. This service runs on a cluster managed by Amazon Elastic Container Service (ECS), with the server itself running as a serverless container task on AWS Fargate. Visit the ECS console to check whether the service is up and running. You can also view the container's task logs through the ECS console to check its status.
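As an example of how a project team might consume the Parameter Store value above, here is a sketch that turns a stored mirror hostname into a pip index URL. The parameter name and hostname are hypothetical; check the SSM console for the value the template actually created:

```python
def pip_index_url(mirror_host: str) -> str:
    """Build a pip index URL from the mirror hostname stored in Parameter Store."""
    return f"https://{mirror_host.rstrip('/')}/simple/"

# With boto3 (not executed here), the hostname would be fetched at runtime:
#   import boto3
#   ssm = boto3.client("ssm")
#   host = ssm.get_parameter(Name="/ds/pypi-mirror-host")["Parameter"]["Value"]  # hypothetical name
#   index_url = pip_index_url(host)  # then: pip install --index-url <index_url> ...
```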

With the resources created, let's move on to Lab 2, where, as a project administrator, we will deploy a secure data science environment for a new project team.