Fire Integration with SageMaker

Fire is fully integrated with AWS SageMaker. Fire provides a number of processors for doing model building with SageMaker.

You can do Data Preparation and Feature Engineering with Sparkflows doing compute with Apache Spark. Sparkflows then seamlessly enables you to do your model training and deployment with SageMaker.

The above forms a very powerful combinations for end to end Machine Learning.

Spark Sagemaker Examples

There are a number of SageMaker-Spark examples by AWS here :

Fire SageMaker Processors

SageMaker Processors include:

  • KMeansSageMakerEstimator
  • XGBoostSageMakerEstimator
  • LDASageMakerEstimator
  • LinearLearnerBinaryClassifier
  • LinearLearnerRegressor
  • PCASageMakerEstimator
  • SaveSageMaker

AWS Provided Policies

AWS provides managed policies for SageMaker. Example : AmazonSageMakerFullAccess

Launching EMR

When launching the EMR Cluster make sure that the Role (eg: EMR_EC2_DefaultRole) used has the AmazonSageMakerFullAccess policy.

Now that the Roles and Policies are in place, start up your EMR cluser with the EMR_DefaultRole and EMR_EC2_DefaultRole Roles.

Create New Role

Create a new Role called aws-sagmaker-full-access with the below Policy. It would be used in the Apache Spark job when accessing SageMaker.

  • AmazonSageMakerFullAccess

Use ARN of the new Role in the Workflow

We now use the ARN of the new Role when we use the SageMaker KMeans Estimator Node in the Workflow.

arn:aws:iam::account_id:role/aws-sagemaker-full-access

AWS Instance Types

AWS has various instance types:

  • p : GPU Instances
  • c : Compute Instances
  • r : Memory Optimized Instances
  • m : General Instances

Amazon SageMaker Instance Types details are here : https://aws.amazon.com/sagemaker/pricing/instance-types/

Dataset Column Names for Training with Sagemaker

Sagemaker needs the following columns to exist in the Dataset.

  • label : label column
  • features : features column, this column can also be set

Flow with Sparkflows and AWS

  • We do the Data Preparation and Feature Generation in EMR with Sparkflows.
  • When Sparkflows invokes the SageMakerEstimator, it calls SageMaker for Training and Deployment.
  • Once the model is deployed on SageMaker, the endpoint can be used for realtime predictions.

XGBoost Sagemaker Workflow

Below is a workflow which:

  • Reads in a libsvm file as input
  • Performs XGBoost Modeling
  • Reads in another libsvm file
  • Performs predictions with the model built in the previous step
  • Prints out the result
SageMaker XGBoost

XGBoost Configuration

Below are the configuration setup details of the XGBoost Processor.

SageMaker XGBoost Configuration

Executing the Workflow

Below are the results of executing the workflow.

SageMaker XGBoost Execution