# Step 4: Add a model building CI/CD pipeline

In this step you create an automated CI/CD pipeline for model building using [Amazon SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects.html). 

![](img/six-steps-4.png)

You are going to use a [SageMaker-provided MLOps project template for model building and training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html#sagemaker-projects-templates-code-commit) to provision a CI/CD workflow automation with [AWS CodePipeline](https://aws.amazon.com/codepipeline/) and an [AWS CodeCommit](https://aws.amazon.com/codecommit/) code repository.

SageMaker project templates offer you the following choice of code repositories, workflow automation tools, and pipeline stages:
- **Code repository**: AWS CodeCommit or third-party Git repositories such as GitHub and Bitbucket
- **CI/CD workflow automation**: AWS CodePipeline or Jenkins
- **Pipeline stages**: Model building and training, model deployment, or both

In [None]:
import boto3
import sagemaker 
from time import gmtime, strftime, sleep

In [None]:
%store -r 

%store

try:
    initialized
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-start-here notebook   ")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")

## Create an MLOps project
⭐ You can create a project programmatically in this notebook (Option 1) or in Studio (Option 2). Option 1 is recommended because it requires no manual input. Option 2 demonstrates the [**Create Project** UX flow](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-create.html).

### Option 1: Create project programmatically
Use `boto3` to create an MLOps project via a SageMaker API.

In [None]:
sm = boto3.client("sagemaker")
sc = boto3.client("servicecatalog")

sc_provider_name = "Amazon SageMaker"
sc_product_name = "MLOps template for model building and training"

In [None]:
p_ids = [p['ProductId'] for p in sc.search_products(
    Filters={
        'FullTextSearch': [sc_product_name]
    },
)['ProductViewSummaries'] if p["Name"]==sc_product_name]

In [None]:
p_ids

In [None]:
# If this cell raises an exception, go to Option 2 and create a project in the Studio UX
if not len(p_ids):
    raise Exception("No Amazon SageMaker ML Ops products found!")
elif len(p_ids) > 1:
    raise Exception("Too many matching Amazon SageMaker ML Ops products found!")
else:
    product_id = p_ids[0]
    print(f"ML Ops product id: {product_id}")

In [None]:
provisioning_artifact_id = sorted(
    [i for i in sc.list_provisioning_artifacts(
        ProductId=product_id
    )['ProvisioningArtifactDetails'] if i['Guidance']=='DEFAULT'],
    key=lambda d: d['Name'], reverse=True)[0]['Id']

In [None]:
provisioning_artifact_id
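
The selection above sorts artifact names as strings, so a name such as `v1.10` would sort before `v1.9`. A minimal sketch of an alternative, assuming each entry carries the `CreatedTime` field returned by `list_provisioning_artifacts`, picks the most recently created `DEFAULT` artifact instead. The helper name `latest_default_artifact_id` and the sample IDs and names are hypothetical:

```python
from datetime import datetime, timezone


def latest_default_artifact_id(details):
    """Return the Id of the newest artifact whose Guidance is DEFAULT."""
    candidates = [d for d in details if d.get("Guidance") == "DEFAULT"]
    if not candidates:
        raise ValueError("No DEFAULT provisioning artifact found")
    return max(candidates, key=lambda d: d["CreatedTime"])["Id"]


# Hypothetical data shaped like the 'ProvisioningArtifactDetails' list
sample_details = [
    {"Id": "pa-old", "Name": "v1.9", "Guidance": "DEFAULT",
     "CreatedTime": datetime(2022, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-new", "Name": "v1.10", "Guidance": "DEFAULT",
     "CreatedTime": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-deprecated", "Name": "v2.0", "Guidance": "DEPRECATED",
     "CreatedTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

print(latest_default_artifact_id(sample_details))
```

In the notebook, you would pass `sc.list_provisioning_artifacts(ProductId=product_id)['ProvisioningArtifactDetails']` as `details`.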

In [None]:
project_name = f"model-build-{strftime('%m-%d-%H-%M-%S', gmtime())}"
project_parameters = [] # This SageMaker built-in project template doesn't have any parameters

Finally, create a SageMaker project from the service catalog product template:

In [None]:
# create SageMaker project
r = sm.create_project(
    ProjectName=project_name,
    ProjectDescription="Model build project",
    ServiceCatalogProvisioningDetails={
        'ProductId': product_id,
        'ProvisioningArtifactId': provisioning_artifact_id,
    },
)

print(r)
project_id = r["ProjectId"]

<div class="alert alert-info"> 💡 <strong> Wait until project creation is completed </strong>
</div>




In [None]:
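# Poll the project status; stop with an error if creation fails instead of waiting forever
while True:
    status = sm.describe_project(ProjectName=project_name)['ProjectStatus']
    if status == 'CreateCompleted':
        break
    if status == 'CreateFailed':
        raise RuntimeError(f"MLOps project {project_name} creation failed")
    print("Waiting for project creation completion")
    sleep(10)
    
print(f"MLOps project {project_name} creation completed")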
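
A wait loop like this can also be bounded by a timeout. The following is a minimal sketch, not part of the workshop code: the helper `wait_for_status` and its parameters are hypothetical, and the status source is injected as a callable so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, done="CreateCompleted", failed=("CreateFailed",),
                    timeout_s=900, poll_s=10, sleep=time.sleep):
    """Poll get_status() until it returns `done`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError(f"Project creation ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for sm.describe_project(...)['ProjectStatus'] so the sketch runs offline
fake_statuses = iter(["Pending", "CreateInProgress", "CreateCompleted"])
print(wait_for_status(lambda: next(fake_statuses), sleep=lambda s: None))
```

In the notebook, you would pass `lambda: sm.describe_project(ProjectName=project_name)['ProjectStatus']` as `get_status` and keep the default `sleep`.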



### End of Option 1: Create project programmatically
Now you have instantiated a project template in your SageMaker environment. You can go to the section **Configure the MLOps project**.

---

### Option 2: Create a project in Studio
<div class="alert alert-info"> 💡 <strong> Skip this section if you created a project programmatically </strong>
</div>

To [create a project](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-create.html) in Studio:

1. In the Studio sidebar, choose the **Home** icon.
2. Select **Deployments** from the menu, and then select **Projects**.
3. Choose **Create project**.
    - The **Create project** tab opens displaying a list of available project templates.
4. For **SageMaker project templates**, choose **SageMaker templates**. 
5. Choose **MLOps template for model building and training**
6. Choose **Select project template**.

![](img/create-mlops-project.png)

![](img/create-mlops-project-2.png)

The **Create project** tab changes to display **Project details**.

![](img/project-details.png)

Enter the following information:
- For **Project details**, enter a name and description for your project. Note the name requirements.
- Optionally, add tags, which are key-value pairs that you can use to track your projects.

Choose **Create project** and wait for the project to appear in the Projects list.

### Resolve issues with project creation

#### Project creation process stuck in pending
If the project creation banner is still displayed after 5 minutes, close the Studio browser window and sign in to Studio again.

![](img/project-creation-pending.png)

#### Error messages
❗ If you see an error message similar to:
```
Your project couldn't be created
Studio encountered an error when creating your project. Try recreating the project again.

CodeBuild is not authorized to perform: sts:AssumeRole on arn:aws:iam::XXXX:role/service-role/AmazonSageMakerServiceCatalogProductsCodeBuildRole (Service: AWSCodeBuild; Status Code: 400; Error Code: InvalidInputException; Request ID: 4cf59a54-0c59-476a-a970-0ac656db4402; Proxy: null)
```

see steps 5-6 of [SageMaker Studio Permissions Required to Use Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-studio-updates.html). Make sure you have all required project roles listed in the **Apps** card under **Projects**. 

💡 If you don't have these roles, follow the [Shut Down and Update SageMaker Studio and Studio Apps](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-update.html) instructions to update the domain. You must shut down both the JupyterServer and KernelGateway apps. After you shut down all apps, go to the Amazon SageMaker console, choose **Domains**, select your domain from the list, and choose **Domain Settings**. Choose **Configure app** on the **Apps** card. Choose **Next** on each configuration pane and then choose **Submit**. This updates the domain and creates all required project roles automatically.

Alternatively, you can create the required roles programmatically by using the provided CloudFormation template [`cfn-templates/sagemaker-project-templates-roles.yaml`](cfn-templates/sagemaker-project-templates-roles.yaml). 
Run the following command from the cloned repository directory, in a terminal where you have the required permissions:

```sh
aws cloudformation deploy \
    --template-file cfn-templates/sagemaker-project-templates-roles.yaml \
    --stack-name sagemaker-project-template-roles \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
    --parameter-overrides \
    CreateCloudFormationRole=YES \
    CreateCodeBuildRole=YES \
    CreateCodePipelineRole=YES \
    CreateEventsRole=YES \
    CreateProductsExecutionRole=YES 
```

### End of Option 2: Create a project in Studio
Now that you have created the project, move to the section **Configure the MLOps project**.

---

## Configure the MLOps project
The project takes about 3-5 minutes to create. As soon as it is created, the project automatically runs the provided default model building pipeline.
The project template deploys the following architecture in your AWS account:

![](img/mlops-model-build-train.png)

The main components are:
1. The project template is made available through SageMaker Projects and AWS Service Catalog portfolio
2. A CodePipeline pipeline with two stages - `Source` to download the source code from a CodeCommit repository and `Build` to create and execute a SageMaker pipeline
3. A default SageMaker pipeline with model build, train, and register workflow
4. A seed code repository in CodeCommit with a provided default version of the scaffolding code

This project contains all the required code and infrastructure to implement an automated CI/CD pipeline. 
To start using the project with your pipeline, you need to complete the following steps:
1. Clone the project CodeCommit repository to your home directory on Studio EFS
2. Replace the ML pipeline implementation sample code with your pipeline construction code, as implemented in the step 3 notebook
3. Modify the `codebuild-buildspec.yml` file to reference the correct Python module name and to set project parameters

The next sections guide you through these steps. For detailed instructions and a hands-on example, refer to the developer guide [SageMaker MLOps Project Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-walkthrough.html).

### 1. Clone the project seed code to the Studio file system
1. Choose **Home** in the Studio sidebar
2. Select **Deployments** and then select **Projects**
3. Click on the name of the project you created to open the project details tab
4. In the project tab, choose **Repositories**
5. In the **Local path** column for the repository choose **clone repo....**
6. In the dialog box that appears choose **Clone Repository**

![](img/select-project.png)

![](img/clone-project-repo.png)

When the repository clone is complete, the local path appears in the **Local path** column. Choose the path to open the local folder that contains the repository code in Studio.

### 2. Replace pipeline construction code
If you used Option 1 (`boto3`) to create the MLOps project, `project_name` and `project_id` are set automatically. You can run the following code cell to print the values. If you followed the UX instructions to create the project, you must set `project_name` manually.

In [None]:
try:
    print(project_name)
    print(project_id)
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("You must set the project_name manually in the following code cell")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [None]:
# project_name = "<ENTER THE NAME OF THE CREATED PROJECT>"
r = sm.describe_project(ProjectName=project_name)
project_id = r['ProjectId']
project_arn = r['ProjectArn']
project_folder = f"{project_name}-{project_id}/sagemaker-{project_name}-{project_id}-modelbuild"

print(project_folder)

The following steps are required to customize the project. The next code cells execute these steps for you, so you don't need to do anything manually. The list is for your information only.

- In Studio file browser navigate to the project's code repository folder, which looks like `<project-name>-<project-id>/sagemaker-<project-name>-<project-id>-modelbuild`.
- Rename the file `codebuild-buildspec.yml` to `codebuild-buildspec-original.yml`.
- Navigate to the `pipelines` folder inside the project's code repository folder and rename the `abalone` folder to `fromideatoprod`.
- Rename the `pipeline.py` file in the `fromideatoprod` folder to `pipeline-original.py`.
- Copy the `preprocessing.py` and `evaluation.py` files that you created in the step 2 and 3 notebooks from the `amazon-sagemaker-from-idea-to-production` folder to the `pipelines/fromideatoprod` folder in the project's code repository folder.

In [None]:
# If your local path for the workshop folder is different, set the correct path (relative to your home directory) here
workshop_folder = "amazon-sagemaker-from-idea-to-production"

In [None]:
!mv ~/{project_folder}/codebuild-buildspec.yml ~/{project_folder}/codebuild-buildspec-original.yml
!mv ~/{project_folder}/pipelines/abalone ~/{project_folder}/pipelines/fromideatoprod
!mv ~/{project_folder}/pipelines/fromideatoprod/pipeline.py ~/{project_folder}/pipelines/fromideatoprod/pipeline-original.py
!cp ~/{workshop_folder}/preprocessing.py ~/{project_folder}/pipelines/fromideatoprod/
!cp ~/{workshop_folder}/evaluation.py ~/{project_folder}/pipelines/fromideatoprod/
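
The same rename and copy steps can be expressed in Python, which fails with an exception when a file is missing instead of printing a shell error. This is a minimal sketch under the assumption that the cloned repository has the layout described above. The helper `customize_repo` is hypothetical, and the demonstration runs against a throwaway directory tree rather than your real project folder.

```python
import shutil
import tempfile
from pathlib import Path


def customize_repo(repo_dir, workshop_dir, package="fromideatoprod"):
    """Apply the rename and copy steps to a cloned model-build repository."""
    repo, workshop = Path(repo_dir), Path(workshop_dir)
    # Keep the original build spec for reference
    (repo / "codebuild-buildspec.yml").rename(repo / "codebuild-buildspec-original.yml")
    # Rename the sample pipeline package and keep its original pipeline code
    target = repo / "pipelines" / package
    (repo / "pipelines" / "abalone").rename(target)
    (target / "pipeline.py").rename(target / "pipeline-original.py")
    # Copy the processing scripts created in the earlier notebooks
    for script in ("preprocessing.py", "evaluation.py"):
        shutil.copy(workshop / script, target / script)
    return target


# Demonstrate on a temporary directory tree that mimics the seed repository
with tempfile.TemporaryDirectory() as tmp:
    repo, workshop = Path(tmp, "repo"), Path(tmp, "workshop")
    (repo / "pipelines" / "abalone").mkdir(parents=True)
    workshop.mkdir()
    (repo / "codebuild-buildspec.yml").write_text("version: 0.2\n")
    (repo / "pipelines" / "abalone" / "pipeline.py").write_text("# sample\n")
    for name in ("preprocessing.py", "evaluation.py"):
        (workshop / name).write_text("# script\n")
    target = customize_repo(repo, workshop)
    copied_files = sorted(p.name for p in target.iterdir())
    print(copied_files)
```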

Execute the following cell to write pipeline construction code to the file `pipeline.py`. Re-use the code from the step 3 notebook as the function `get_pipeline()`.

In [None]:
%%writefile pipeline.py

import pandas as pd
import json
import boto3
import pathlib
import io
import sagemaker

from sagemaker.deserializers import CSVDeserializer
from sagemaker.serializers import CSVSerializer

from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import (
    ProcessingInput, 
    ProcessingOutput, 
    ScriptProcessor
)
from sagemaker.inputs import TrainingInput

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import (
    ProcessingStep, 
    TrainingStep, 
    CreateModelStep,
    CacheConfig
)
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.parameters import (
    ParameterInteger, 
    ParameterFloat, 
    ParameterString, 
    ParameterBoolean
)
from sagemaker.workflow.clarify_check_step import (
    ModelBiasCheckConfig, 
    ClarifyCheckStep, 
    ModelExplainabilityCheckConfig
)
from sagemaker import Model
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.conditions import (
    ConditionGreaterThan,
    ConditionGreaterThanOrEqualTo
)
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import (
    Join,
    JsonGet
)
from sagemaker.workflow.lambda_step import (
    LambdaStep,
    LambdaOutput,
    LambdaOutputTypeEnum,
)
from sagemaker.lambda_helper import Lambda

from sagemaker.model_metrics import (
    MetricsSource, 
    ModelMetrics, 
    FileSource
)
from sagemaker.drift_check_baselines import DriftCheckBaselines

from sagemaker.image_uris import retrieve
import os

BASE_DIR = os.path.dirname(os.path.realpath(__file__))

def get_sagemaker_client(region):
    """Gets the SageMaker boto3 client.

    Args:
        region: the aws region to create the client in

    Returns:
        a boto3 SageMaker client
    """
    boto_session = boto3.Session(region_name=region)
    sagemaker_client = boto_session.client("sagemaker")
    return sagemaker_client


def get_session(region, default_bucket):
    """Gets the sagemaker session based on the region.

    Args:
        region: the aws region to start the session
        default_bucket: the bucket to use for storing the artifacts

    Returns:
        `sagemaker.session.Session` instance
    """

    boto_session = boto3.Session(region_name=region)

    sagemaker_client = boto_session.client("sagemaker")
    runtime_client = boto_session.client("sagemaker-runtime")
    return sagemaker.session.Session(
        boto_session=boto_session,
        sagemaker_client=sagemaker_client,
        sagemaker_runtime_client=runtime_client,
        default_bucket=default_bucket,
    )

def get_pipeline_session(region, default_bucket):
    """Gets the pipeline session based on the region.

    Args:
        region: the aws region to start the session
        default_bucket: the bucket to use for storing the artifacts

    Returns:
        PipelineSession instance
    """

    boto_session = boto3.Session(region_name=region)
    sagemaker_client = boto_session.client("sagemaker")

    return PipelineSession(
        boto_session=boto_session,
        sagemaker_client=sagemaker_client,
        default_bucket=default_bucket,
    )

def get_pipeline_custom_tags(new_tags, region, sagemaker_project_arn=None):
    try:
        sm_client = get_sagemaker_client(region)
        response = sm_client.list_tags(
            ResourceArn=sagemaker_project_arn)
        project_tags = response["Tags"]
        for project_tag in project_tags:
            new_tags.append(project_tag)
    except Exception as e:
        print(f"Error getting project tags: {e}")
    return new_tags


def get_pipeline(
    region,
    sagemaker_project_arn=None,
    sagemaker_project_id=None,
    role=None,
    default_bucket=None,
    input_data_url=None,
    bucket_prefix="from-idea-to-prod/xgboost",
    model_package_group_name="from-idea-to-prod-model-group",
    pipeline_name="from-idea-to-prod-pipeline",
    base_job_prefix="from-idea-to-prod-pipeline",
    processing_instance_type="ml.c5.xlarge",
    training_instance_type="ml.m5.xlarge",
    test_score_threshold=0.75,
):
    """Gets a SageMaker ML Pipeline instance.

    Args:
        region: AWS region to create and run the pipeline.
        role: IAM role to create and run steps and pipeline.
        default_bucket: the bucket to use for storing the artifacts

    Returns:
        an instance of a pipeline
    """
    sagemaker_session = get_session(region, default_bucket)
    if role is None:
        role = sagemaker.session.get_execution_role(sagemaker_session)

    session = get_pipeline_session(region, default_bucket)
    sm = session.sagemaker_client

    # Set S3 urls for processed data
    train_s3_url = f"s3://{default_bucket}/{bucket_prefix}/train"
    validation_s3_url = f"s3://{default_bucket}/{bucket_prefix}/validation"
    test_s3_url = f"s3://{default_bucket}/{bucket_prefix}/test"
    baseline_s3_url = f"s3://{default_bucket}/{bucket_prefix}/baseline"
    evaluation_s3_url = f"s3://{default_bucket}/{bucket_prefix}/evaluation"
    prediction_baseline_s3_url = f"s3://{default_bucket}/{bucket_prefix}/prediction_baseline"
    
    # Set S3 url for model artifact
    output_s3_url = f"s3://{default_bucket}/{bucket_prefix}/output"

    # Parameters for pipeline execution
    # Set processing instance type
    process_instance_type_param = ParameterString(
        name="ProcessingInstanceType",
        default_value=processing_instance_type,
    )

    # Set training instance type
    train_instance_type_param = ParameterString(
        name="TrainingInstanceType",
        default_value=training_instance_type,
    )

    # Set training instance count
    train_instance_count_param = ParameterInteger(
        name="TrainingInstanceCount",
        default_value=1
    )

    # Set model approval param
    model_approval_status_param = ParameterString(
        name="ModelApprovalStatus", default_value="PendingManualApproval"
    )

    # Minimal threshold for model performance on the test dataset
    test_score_threshold_param = ParameterFloat(
        name="TestScoreThreshold", 
        default_value=test_score_threshold  # use the function argument instead of a hard-coded value
    )

    # Set S3 url for input dataset
    input_s3_url_param = ParameterString(
        name="InputDataUrl",
        default_value=input_data_url,
    )
    
    # Define step cache config
    cache_config = CacheConfig(
        enable_caching=True,
        expire_after="P30d" # 30-day
    )
    
    # processing step for feature engineering
    sklearn_processor = SKLearnProcessor(
        framework_version="0.23-1",
        role=role,
        instance_type=process_instance_type_param,
        instance_count=1,
        base_job_name=f"{pipeline_name}/preprocess",
        sagemaker_session=session,
    )
    
    processing_inputs=[
        ProcessingInput(source=input_s3_url_param, destination="/opt/ml/processing/input")
    ]

    processing_outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/output/train", 
                         destination=train_s3_url),
        ProcessingOutput(output_name="validation_data", source="/opt/ml/processing/output/validation",
                         destination=validation_s3_url),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test",
                         destination=test_s3_url),
        ProcessingOutput(output_name="baseline_data", source="/opt/ml/processing/output/baseline", 
                         destination=baseline_s3_url),
    ]

    processor_args = sklearn_processor.run(
        inputs=processing_inputs,
        outputs=processing_outputs,
        code=os.path.join(BASE_DIR, "preprocessing.py"),
        # arguments = ['arg1', 'arg2'],
    )

    # Define processing step
    step_process = ProcessingStep(
        name=f"{pipeline_name}-preprocess-data",
        step_args=processor_args,
        cache_config = cache_config
    )

    # Training step for generating model artifacts
    xgboost_image_uri = sagemaker.image_uris.retrieve(
        "xgboost",
        region=region, 
        version="1.5-1")

    # Instantiate an XGBoost estimator object
    estimator = sagemaker.estimator.Estimator(
        image_uri=xgboost_image_uri,
        role=role, 
        instance_type=train_instance_type_param,
        instance_count=train_instance_count_param,
        output_path=output_s3_url,
        sagemaker_session=session,
        base_job_name=f"{pipeline_name}/train",
    )

    # Define algorithm hyperparameters
    estimator.set_hyperparameters(
        num_round=150, # the number of rounds to run the training
        max_depth=5, # maximum depth of a tree
        eta=0.5, # step size shrinkage used in updates to prevent overfitting
        alpha=2.5, # L1 regularization term on weights
        objective="binary:logistic",
        eval_metric="auc", # evaluation metrics for validation data
        subsample=0.8, # subsample ratio of the training instance
        colsample_bytree=0.8, # subsample ratio of columns when constructing each tree
        min_child_weight=3, # minimum sum of instance weight (hessian) needed in a child
        early_stopping_rounds=10, # the model trains until the validation score stops improving
        verbosity=1, # verbosity of printing messages
    )

    training_inputs = {
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train_data"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation_data"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }

    training_args = estimator.fit(training_inputs)

    # Define training step
    step_train = TrainingStep(
        name=f"{pipeline_name}-train",
        step_args=training_args,
        cache_config = cache_config
    )
    
    # Evaluation step
    script_processor = ScriptProcessor(
        image_uri=xgboost_image_uri,
        role=role,
        command=["python3"],
        instance_type=process_instance_type_param,
        instance_count=1,
        base_job_name=f"{pipeline_name}/evaluate",
        sagemaker_session=session,
    )

    eval_inputs=[
        ProcessingInput(source=step_train.properties.ModelArtifacts.S3ModelArtifacts, 
                        destination="/opt/ml/processing/model"),
        ProcessingInput(source=step_process.properties.ProcessingOutputConfig.Outputs["test_data"].S3Output.S3Uri, 
                        destination="/opt/ml/processing/test"),
    ]

    eval_outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation", 
                         destination=evaluation_s3_url),
        ProcessingOutput(output_name="prediction_baseline_data", source="/opt/ml/processing/output/prediction_baseline", 
                         destination=prediction_baseline_s3_url),
    ]

    eval_args = script_processor.run(
        inputs=eval_inputs,
        outputs=eval_outputs,
        code=os.path.join(BASE_DIR, "evaluation.py"),
    )

    evaluation_report = PropertyFile(
        name="ModelEvaluationReport", output_name="evaluation", path="evaluation.json"
    )

    step_eval = ProcessingStep(
        name=f"{pipeline_name}-evaluate-model",
        step_args=eval_args,
        property_files=[evaluation_report],
        cache_config = cache_config
    )
    
    # Define register step
    model = Model(
        image_uri=xgboost_image_uri,        
        model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
        sagemaker_session=session,
        role=role,
    )

    model_metrics = ModelMetrics(
        model_statistics=MetricsSource(
            s3_uri="{}/evaluation.json".format(
                step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
            ),
            content_type="application/json",
        )
    )

    register_args = model.register(
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.t2.medium", "ml.m5.xlarge", "ml.m5.large"],
        transform_instances=["ml.m5.xlarge", "ml.m5.large"],
        model_package_group_name=model_package_group_name,
        approval_status=model_approval_status_param,
        model_metrics=model_metrics,
    )

    step_register = ModelStep(
        name=f"{pipeline_name}-register",
        step_args=register_args
    )

    # Fail step: runs when the AUC score is not greater than the threshold
    step_fail = FailStep(
        name=f"{pipeline_name}-fail",
        error_message=Join(on=" ", values=["Execution failed due to AUC Score <=", test_score_threshold_param]),
    )
    
    # Condition step: register the model only if the AUC score exceeds the threshold
    cond_gt = ConditionGreaterThan(
        left=JsonGet(
            step_name=step_eval.name,
            property_file=evaluation_report,
            json_path="classification_metrics.auc_score.value",
        ),
        right=test_score_threshold_param,
    )

    step_cond = ConditionStep(
        name=f"{pipeline_name}-check-test-score",
        conditions=[cond_gt],
        if_steps=[step_register],
        else_steps=[step_fail],
    )
    
    # Pipeline instance
    pipeline = Pipeline(
        name=pipeline_name,
        parameters=[
            process_instance_type_param,
            train_instance_type_param,
            train_instance_count_param,
            model_approval_status_param,
            test_score_threshold_param,
            input_s3_url_param,
        ],
        steps=[step_process, step_train, step_eval, step_cond],
        sagemaker_session=session,
    )
    
    return pipeline

Copy this `pipeline.py` file from the workshop folder to the `pipelines/fromideatoprod` folder in the project's code repository folder:

In [None]:
!cp ~/{workshop_folder}/pipeline.py ~/{project_folder}/pipelines/fromideatoprod/

#### Attach the model package group to the project
Project-owned resources are automatically tagged with `sagemaker:project-name` and `sagemaker:project-id` tags for cost control, attribute-based security control, and governance. 
Since the model package group already exists in the model registry, you need to tag it to attach it to this project. The following code cell calls the `AddTags` API to set the project tags on the model package group.

In [None]:
model_package_group_arn = sm.describe_model_package_group(ModelPackageGroupName=model_package_group_name).get("ModelPackageGroupArn")

if model_package_group_arn:
    print(f"Adding tags {project_arn.split('/')[-1]} and {project_id} for model package group {model_package_group_arn}")
    r = sm.add_tags(
        ResourceArn=model_package_group_arn,
        Tags=[
            {
                'Key': 'sagemaker:project-name',
                'Value': project_arn.split("/")[-1]
            },
            {
                'Key': 'sagemaker:project-id',
                'Value': project_id
            },
        ]
    )
    print(r)
else:
    print(f"The model package group {model_package_group_name} doesn't exist")
    
sm.list_tags(ResourceArn=model_package_group_arn)["Tags"]
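
To confirm the attachment, you can compare the tags returned by `list_tags` with the expected project tags. The following is a minimal sketch: the helpers `project_tags` and `is_attached`, the ARN, and the project ID are hypothetical, and the example assumes the project ARN ends with `project/<project-name>`.

```python
def project_tags(project_arn, project_id):
    """Build the two project tags from a project ARN and ID."""
    project_name = project_arn.split("/")[-1]  # the ARN ends with 'project/<name>'
    return [
        {"Key": "sagemaker:project-name", "Value": project_name},
        {"Key": "sagemaker:project-id", "Value": project_id},
    ]


def is_attached(resource_tags, expected_tags):
    """Check that every expected tag appears among the resource tags."""
    present = {(t["Key"], t["Value"]) for t in resource_tags}
    return all((t["Key"], t["Value"]) in present for t in expected_tags)


# Hypothetical ARN and ID, shaped like the values returned by describe_project
expected = project_tags(
    "arn:aws:sagemaker:us-east-1:111122223333:project/model-build-01-02-03-04-05",
    "p-abc123example",
)
print(is_attached(expected + [{"Key": "team", "Value": "ml"}], expected))
```

In the notebook, `resource_tags` would be `sm.list_tags(ResourceArn=model_package_group_arn)["Tags"]`.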

### 3. Modify the build specification file
You must modify the `codebuild-buildspec.yml` file in the project folder to reflect the new name of the Python module with your pipeline and to set project-specific parameters.

First, print the value of the `input_s3_url` variable, which contains the S3 path to the source dataset. You must pass this value to the pipeline:

In [None]:
input_s3_url

Second, replace the value of the `input_data_url` parameter in the build spec with the printed value of `input_s3_url`. 

To do so, locate the `kwargs` parameter in the code cell below that starts with `%%writefile codebuild-buildspec.yml`:

```
--kwargs "{\"region\":\"${AWS_REGION}\",\"sagemaker_project_arn\":\"${SAGEMAKER_PROJECT_ARN}\",
\"role\":\"${SAGEMAKER_PIPELINE_ROLE_ARN}\",\"default_bucket\":\"${ARTIFACT_BUCKET}\",
\"input_data_url\":\"s3://sagemaker-us-east-1-906545278380/from-idea-to-prod/xgboost/input/bank-additional-full.csv\"}"
```

and replace the value of `input_data_url` at the very end of the string with the value printed by the previous cell.
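
Editing the escaped string by hand is error-prone, so you may prefer to generate it. The following is a minimal sketch, assuming the same keys as in the build spec cell of this notebook. The helper `build_kwargs_argument` and the example S3 path are hypothetical.

```python
import json


def build_kwargs_argument(input_data_url):
    """Render the --kwargs value with the quoting used in the build spec."""
    kwargs = {
        "region": "${AWS_REGION}",
        "sagemaker_project_arn": "${SAGEMAKER_PROJECT_ARN}",
        "sagemaker_project_id": "${SAGEMAKER_PROJECT_ID}",
        "role": "${SAGEMAKER_PIPELINE_ROLE_ARN}",
        "default_bucket": "${ARTIFACT_BUCKET}",
        "input_data_url": input_data_url,
    }
    compact = json.dumps(kwargs, separators=(",", ":"))
    # Escape the inner double quotes so they survive inside the quoted shell argument
    return '"' + compact.replace('"', '\\"') + '"'


# Hypothetical S3 path; in the notebook, pass the value of input_s3_url
example = build_kwargs_argument("s3://my-example-bucket/input/data.csv")
print("--kwargs", example)
```

You can paste the printed string after `--kwargs` in the build spec cell.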

Finally, execute the cell to create a build spec file.

In [None]:
%%writefile codebuild-buildspec.yml

version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.8
    commands:
      - pip install --upgrade --force-reinstall . "awscli>1.20.30"
  
  build:
    commands:
      - export PYTHONUNBUFFERED=TRUE
      - export SAGEMAKER_PROJECT_NAME_ID="${SAGEMAKER_PROJECT_NAME}-${SAGEMAKER_PROJECT_ID}"
      - |
        run-pipeline --module-name pipelines.fromideatoprod.pipeline \
          --role-arn $SAGEMAKER_PIPELINE_ROLE_ARN \
          --tags "[{\"Key\":\"sagemaker:project-name\", \"Value\":\"${SAGEMAKER_PROJECT_NAME}\"}, {\"Key\":\"sagemaker:project-id\", \"Value\":\"${SAGEMAKER_PROJECT_ID}\"}]" \
          --kwargs "{\"region\":\"${AWS_REGION}\",\"sagemaker_project_arn\":\"${SAGEMAKER_PROJECT_ARN}\",\"sagemaker_project_id\":\"${SAGEMAKER_PROJECT_ID}\",\"role\":\"${SAGEMAKER_PIPELINE_ROLE_ARN}\",\"default_bucket\":\"${ARTIFACT_BUCKET}\",\"input_data_url\":\"s3://sagemaker-us-east-1-462832133259/from-idea-to-prod/xgboost/input/bank-additional-full.csv\"}"
      - echo "Create/Update of the SageMaker Pipeline and execution completed."

Copy the `codebuild-buildspec.yml` file from the workshop folder to the project's code repository folder:

In [None]:
!cp ~/{workshop_folder}/codebuild-buildspec.yml ~/{project_folder}/codebuild-buildspec.yml

To summarize, you have just made three changes in the build spec file:
1. Modified the `run-pipeline` `--module-name` parameter value from `pipelines.abalone.pipeline` to the new path `pipelines.fromideatoprod.pipeline`
2. Removed some parameters from the `kwargs` list to make use of the `get_pipeline()` function's default parameter values
3. Added the Amazon S3 URL of the source data to the `kwargs` parameter list

### 4. Fix the `setup.py` file
Finally, open the `setup.py` file in the project's code repository folder and replace the line `required_packages = ["sagemaker==2.XX.0"]` with `required_packages = ["sagemaker"]`. Save your changes.

Why make this change? The pinned version of the sagemaker library is a bug that is going to be fixed in future releases of the built-in SageMaker project templates. For now, you fix this template file manually. Keep in mind that the built-in project templates are provided for your convenience and to demonstrate how to use the SageMaker project mechanism to package and provision your own custom MLOps projects.
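
If you prefer to script this edit, a regular expression can remove the version pin. The following is a minimal sketch, assuming the line in `setup.py` has the form shown above. The helper `unpin_sagemaker` is hypothetical and the sample text is only an excerpt.

```python
import re

# Matches required_packages = ["sagemaker==<any version>"]
PINNED = re.compile(r'required_packages\s*=\s*\["sagemaker==[^"\]]*"\]')


def unpin_sagemaker(setup_py_text):
    """Replace a pinned sagemaker requirement with an unpinned one."""
    return PINNED.sub('required_packages = ["sagemaker"]', setup_py_text)


# Excerpt with the placeholder version used in the text above
sample = 'import setuptools\nrequired_packages = ["sagemaker==2.XX.0"]\n'
print(unpin_sagemaker(sample))
```

To apply it, read the `setup.py` file in the project's code repository folder, pass its content through `unpin_sagemaker`, and write the result back.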

Now you are ready to launch the CI/CD model building pipeline.

---

## Run CI/CD model building pipeline
❗ Make sure you are in the local folder that contains the **repository code** in the Studio file browser. The folder name looks like `sagemaker-<project-name>-<project-id>-modelbuild`.

Open a Studio system terminal via the Studio menu **File** > **New** > **Terminal** and enter the following commands. Keep the `user.email` and `user.name` values or replace them with your own.
```sh
cd ~/<PROJECT-FOLDER>/<PROJECT-CODE-REPOSITORY-FOLDER>

git config --global user.email "you@example.com"
git config --global user.name "Your Name"
  
git add -A
git commit -am "customize project"
git push
```

You can also run Git commands via the Git option in the Studio [left sidebar](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-ui.html).

After pushing your code changes, the MLOps project initiates a run of the CodePipeline pipeline that updates and executes the SageMaker model building pipeline. This new pipeline execution creates a new model version in the model package group in the SageMaker model registry.

You can follow the pipeline execution in **Home** > **Pipelines**.

Wait until the pipeline execution finishes. The execution takes about 12 minutes to complete.
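
As an alternative to watching the Studio UI, you can query the execution status from the notebook. The sketch below wraps the SageMaker `list_pipeline_executions` API. The helper name `latest_execution_status` is hypothetical, and the pipeline name is a placeholder that you must replace with the name of the pipeline created by your project.

```python
def latest_execution_status(sm_client, pipeline_name):
    """Return (execution ARN, status) of the most recent execution, or None if there is none."""
    response = sm_client.list_pipeline_executions(
        PipelineName=pipeline_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["PipelineExecutionSummaries"]
    if not summaries:
        return None
    latest = summaries[0]
    return latest["PipelineExecutionArn"], latest["PipelineExecutionStatus"]


# Usage in this notebook, where `sm` is the boto3 SageMaker client created earlier:
# print(latest_execution_status(sm, "<your-pipeline-name>"))
```

Re-run the call every minute or so until the status changes from `Executing` to `Succeeded` or `Failed`.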

## View the details of a new model version
After the pipeline execution finishes, a new model version is registered in the model registry. To see the model version details:

1. In the Studio sidebar, choose the **Home** icon
2. Choose **Models** and then select **Model registry** from the list
3. Click on the name of the model package group you created in the step 3 notebook (`from-idea-to-prod-model-group`) to open the model group
4. In the list of model versions, double-click on the latest version of the model

![](img/model-package-group.png)

![](img/model-package-group-2.png)

On the model version tab that opens, you can browse activity, [model version details](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-details.html), and [data lineage](https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html). 

![](img/model-version-details.png)
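
You can also confirm programmatically that the new version was registered. The following sketch uses the SageMaker `list_model_packages` API to fetch the most recently created version in a model package group. The helper name `latest_model_version` is hypothetical, and the group name in the usage comment assumes the `from-idea-to-prod-model-group` group mentioned above.

```python
def latest_model_version(sm_client, model_package_group_name):
    """Return a summary dict for the newest model version in the group, or None if it is empty."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    packages = response["ModelPackageSummaryList"]
    if not packages:
        return None
    latest = packages[0]
    return {
        "arn": latest["ModelPackageArn"],
        "version": latest.get("ModelPackageVersion"),
        "approval_status": latest.get("ModelApprovalStatus"),
    }


# Usage in this notebook, where `sm` is the boto3 SageMaker client created earlier:
# print(latest_model_version(sm, "from-idea-to-prod-model-group"))
```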

In a real-world project you add various model attributes and additional model version metadata such as [model quality metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html), [explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-explainability.html) and [bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) reports, load test data, and [inference recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html).

## Summary
In this notebook you implemented a CI/CD pipeline with the following features:
- The model building ML pipeline is under source control in a CodeCommit repository
- Every push to the CodeCommit repository launches a new CodeBuild build, which upserts and executes the ML pipeline
- A SageMaker project is a logical construct in Studio that holds the metadata about related ML pipelines, repositories, models, experiments, and inference endpoints

## Continue with step 5
Open the step 5 [notebook](05-deploy.ipynb).

## Further development ideas for your real-world projects
- You can use a SageMaker-provided [MLOps template for model building, training, and deployment with third-party Git repositories using Jenkins](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html#sagemaker-projects-templates-git-jenkins)
- Create a [custom SageMaker project template](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-custom.html) to cover your specific project requirements

## Additional resources
- [Amazon SageMaker Pipelines lab in SageMaker Immersion Day](https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/en-US/lab6)
- [Enhance your machine learning development by using a modular architecture with Amazon SageMaker projects](https://aws.amazon.com/blogs/machine-learning/enhance-your-machine-learning-development-by-using-a-modular-architecture-with-amazon-sagemaker-projects/)
- [Dive deep into automating MLOps](https://www.youtube.com/watch?v=3_cHnk9VSfQ)
- [SageMaker MLOps Project Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-walkthrough.html)
- [`aws-samples` GitHub repository with custom project templates examples](https://github.com/aws-samples/sagemaker-custom-project-templates)

# Shutdown kernel

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>