# Deploy a machine learning pipeline with AWS Step Functions Data Science SDK

## CDK development environment setup

The `cdk.json` file tells the CDK Toolkit how to execute your app.

This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the `.venv` directory. To create the virtualenv it assumes that there is a `python3` (or `python` for Windows) executable in your path with access to the `venv` package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.

To manually create a virtualenv on MacOS and Linux:

```
$ python3 -m venv .venv
```

After the init process completes and the virtualenv is created, use the following command to activate your virtualenv:

```
$ source .venv/bin/activate
```

If you are on a Windows platform, activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```

Once the virtualenv is activated, install the required dependencies:

```
$ pip install -r requirements.txt
```

## Deploy a Step Functions Pipeline

At this point you can synthesize the CloudFormation template for this code:

```
$ cdk synth
```

To add additional dependencies, for example other CDK libraries, just add them to your `setup.py` file and rerun the `pip install -r requirements.txt` command.

We define two parameters in the CDK code and deploy the CDK application with the following command:

- `bucket_name` is an existing S3 bucket in your account; it must be in the same region as your CDK environment.
- `prefix` is an existing prefix (directory) in `bucket_name`.

```
cdk deploy StepFunctionsDataScienceStack --parameters BucketName={bucket_name} --parameters Prefix={prefix}
```

The ARN of the Step Functions state machine we create appears in the stack outputs once the above command completes.

## Data preparation

Once the Step Functions pipeline is deployed successfully, upload the sample data to the S3 bucket (`bucket_name` and `prefix` are the same values used in `cdk deploy`):

```bash
aws s3 cp {project_root}/data/churn_processed.csv s3://{bucket_name}/{prefix}/input/churn_processed.csv
```

## Run the Step Functions Pipeline

In the Step Functions console, select the state machine we created and start an execution with the input parameters below. Note that you need to use a different `RunJobName` each time you execute the pipeline.

```json
{
  "TrainInstanceType": "ml.m5.xlarge",
  "RunJobName": "stepfunctionsTraining148",
  "hyperparameters": {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.8",
    "silent": "0",
    "objective": "binary:logistic",
    "num_round": "100",
    "eval_metric": "auc"
  }
}
```

## Check the real-time inference endpoint

Once the pipeline has finished, you can check the real-time inference endpoint created by the Step Functions pipeline in the SageMaker console.

---

## Useful commands

* `cdk ls`     list all stacks in the app
* `cdk synth`  emits the synthesized CloudFormation template
* `cdk deploy` deploy this stack to your default AWS account/region
* `cdk diff`   compare deployed stack with current state
* `cdk docs`   open CDK documentation
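
---

## Optional: AWS CLI alternatives

The console steps above can also be driven from the AWS CLI. The sketches below are not part of the stack; the stack, state machine, and endpoint names are assumptions based on the deploy command and execution input shown earlier, so substitute the values from your own deployment.

To read the stack outputs (including the state machine ARN) after `cdk deploy`, assuming the stack is named `StepFunctionsDataScienceStack` as in the deploy command above:

```
# Print the outputs of the deployed CloudFormation stack
$ aws cloudformation describe-stacks \
    --stack-name StepFunctionsDataScienceStack \
    --query 'Stacks[0].Outputs'
```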
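
To start an execution from the CLI instead of the console, a minimal sketch assuming the execution input above is saved as `input.json`; `{state_machine_arn}` and `{execution_name}` are placeholders, and execution names (like `RunJobName`) should be unique per run:

```
# Start an execution of the deployed state machine
$ aws stepfunctions start-execution \
    --state-machine-arn {state_machine_arn} \
    --name {execution_name} \
    --input file://input.json
```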
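
Once the pipeline finishes, the endpoint can also be checked from the CLI; `{endpoint_name}` is a placeholder you can find in the SageMaker console or in the `list-endpoints` output:

```
# List SageMaker endpoints and check the status of a specific one
$ aws sagemaker list-endpoints
$ aws sagemaker describe-endpoint --endpoint-name {endpoint_name} --query 'EndpointStatus'
```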