# Create an asynchronous SageMaker endpoint to generate synthetic defects with missing components

###### Amazon SageMaker Asynchronous Inference is an inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. It lets you save on costs by autoscaling the instance count to zero when there are no requests to process, so you pay only while the endpoint is processing requests.

In [1]:
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorchModel
import boto3

### 1 Get the execution role.

In [None]:
role = get_execution_role()

### 2 Set up environment variables 

In [3]:
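# TorchServe settings passed to the container as environment variables
env = {
    'TS_MAX_REQUEST_SIZE': '1000000000',       # maximum request payload, in bytes (~1 GB)
    'TS_MAX_RESPONSE_SIZE': '1000000000',      # maximum response payload, in bytes (~1 GB)
    'TS_DEFAULT_RESPONSE_TIMEOUT': '1000000',  # response timeout, in seconds
    'DEFAULT_WORKERS_PER_MODEL': '1',          # one worker per model
}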

### 3 Set up the PyTorch Docker image.
#### The first step is to download the big-lama model from https://github.com/saic-mdal/lama and save it in S3.
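
The packaging and upload step could look like the sketch below. It is a minimal example, not the author's script: the helper names `make_model_archive` and `upload_model_archive` are hypothetical, and it assumes the downloaded model files already sit in a local directory. SageMaker extracts the archive into `/opt/ml/model`, so the files are stored relative to the archive root here; the layout that `inference_defect_gen.py` actually expects is not shown in this notebook.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_dir: str, archive_path: str) -> str:
    """Pack the contents of model_dir into a gzip-compressed tarball (hypothetical helper)."""
    root = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(root.iterdir()):
            # Store each entry relative to the archive root, not under the local folder name
            tar.add(str(item), arcname=item.name)
    return archive_path


def upload_model_archive(s3_client, archive_path: str, bucket: str, key: str) -> str:
    """Upload the archive with an S3 client and return its S3 URI (hypothetical helper)."""
    s3_client.upload_file(archive_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With a real client this would be called as `upload_model_archive(boto3.client("s3"), "big-lama.tar.gz", "qualityinspection", "model/big-lama.tar.gz")`, which matches the `model_data` location used in the next cell.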

#### Reference deep learning containers

https://github.com/aws/deep-learning-containers

In [4]:
model = PyTorchModel(
    entry_point="./inference_defect_gen.py",
    role=role,
    source_dir="./",
    model_data="s3://qualityinspection/model/big-lama.tar.gz",
    # Explicit inference image; it takes precedence when the container is selected
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker",
    framework_version="1.11.0",  # kept in line with the PyTorch version in image_uri
    py_version="py3",
    env=env,
    model_server_workers=1,
)

### 4 Specify additional asynchronous inference specific configuration parameters and deploy the endpoint.

In [5]:
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig

# S3 location where the endpoint writes inference results
bucket = 'qualityinspection'
prefix = 'async-endpoint'

async_config = AsyncInferenceConfig(
    output_path=f"s3://{bucket}/{prefix}/output",
    max_concurrent_invocations_per_instance=10,
)

# Passing async_inference_config makes this an asynchronous endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
    model_server_workers=1,
    async_inference_config=async_config,
)

-----------!

##### It takes about 6-8 minutes to finish the endpoint deployment.
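
The introduction mentions scaling the instance count to zero when the queue is empty, but the deployment above keeps one instance running. The sketch below shows one way this could be configured with Application Auto Scaling. The function names, the maximum capacity, the target backlog, and the cooldown values are assumptions for illustration; `AllTraffic` is the default variant name the SageMaker Python SDK assigns. The AWS documentation also describes an additional step-scaling policy on the `HasBacklogWithoutCapacity` metric for scaling up from zero, which is not shown here.

```python
def scalable_target_args(endpoint_name, variant_name="AllTraffic", max_capacity=2):
    """Arguments for register_scalable_target; MinCapacity=0 allows scaling in to zero."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,
        "MaxCapacity": max_capacity,  # assumed upper bound
    }


def backlog_policy_args(endpoint_name, variant_name="AllTraffic", target_backlog=5.0):
    """Arguments for put_scaling_policy that track the queued requests per instance."""
    return {
        "PolicyName": f"{endpoint_name}-backlog-scaling",  # hypothetical policy name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_backlog,  # assumed target backlog per instance
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,   # seconds, assumed
            "ScaleOutCooldown": 300,  # seconds, assumed
        },
    }


def enable_scale_to_zero(autoscaling_client, endpoint_name, variant_name="AllTraffic",
                         max_capacity=2, target_backlog=5.0):
    """Register the endpoint variant as a scalable target and attach the backlog policy."""
    autoscaling_client.register_scalable_target(
        **scalable_target_args(endpoint_name, variant_name, max_capacity)
    )
    autoscaling_client.put_scaling_policy(
        **backlog_policy_args(endpoint_name, variant_name, target_backlog)
    )
```

In this notebook it would be called as `enable_scale_to_zero(boto3.client("application-autoscaling"), predictor.endpoint_name)`.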

### 5 Invoke the endpoint with an input file in S3

In [7]:
import boto3

In [8]:
runtime = boto3.client('sagemaker-runtime')



In [9]:
# predictor.endpoint_name is the endpoint deployed in step 4; a name string also works
response = runtime.invoke_endpoint_async(
    EndpointName=predictor.endpoint_name,
    InputLocation='s3://qualityinspection/input/input.txt',
)

##### Inference takes 3-4 minutes on an ml.g4dn.xlarge instance.
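
Because the call returns immediately, the result has to be fetched from S3 once it is ready. The response of `invoke_endpoint_async` contains an `OutputLocation` key with the S3 URI where the result will be written. The sketch below polls that location until the object appears. The helper names, the timeout, and the polling interval are assumptions, and the error handling is deliberately simple: a missing key is treated as "not ready yet", while any other error is re-raised. A failed request never produces the output object, which is why the timeout matters.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3_client, output_location, timeout_s=600, poll_s=15, sleep=time.sleep):
    """Poll S3 until the async result exists, then return its bytes (hypothetical helper)."""
    bucket, key = split_s3_uri(output_location)
    waited = 0
    while True:
        try:
            return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except Exception as err:
            # boto3 reports a missing object as a ClientError with code "NoSuchKey"
            code = getattr(err, "response", {}).get("Error", {}).get("Code")
            if code not in ("NoSuchKey", "404"):
                raise
        if waited >= timeout_s:
            raise TimeoutError(f"no result at {output_location} after {waited} seconds")
        sleep(poll_s)
        waited += poll_s
```

In this notebook it would be called as `wait_for_output(boto3.client("s3"), response["OutputLocation"])`.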

### 6 Clean Up

##### Delete the endpoint if you no longer need it.

In [11]:
import boto3
sm_boto3 = boto3.client("sagemaker")
# predictor.endpoint_name is the endpoint deployed in step 4; a name string also works
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)

{'ResponseMetadata': {'RequestId': 'e5f2a83c-85ab-4952-9289-cf6efb13062f',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'x-amzn-requestid': 'e5f2a83c-85ab-4952-9289-cf6efb13062f',
 'content-type': 'application/x-amz-json-1.1',
 'content-length': '0',
 'date': 'Sat, 01 Oct 2022 03:14:38 GMT'},
 'RetryAttempts': 0}}
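
Deleting the endpoint leaves its endpoint configuration and model registered in SageMaker. The sketch below could be used in place of the single `delete_endpoint` call above, because the configuration and model names have to be looked up before the endpoint is deleted. The function name `delete_endpoint_resources` is hypothetical; the client methods it calls are standard SageMaker API operations.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint configuration and models."""
    # Look up the dependent resource names while the endpoint still exists
    config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    variants = sm_client.describe_endpoint_config(
        EndpointConfigName=config_name
    )["ProductionVariants"]
    model_names = [variant["ModelName"] for variant in variants]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}
```

In this notebook it would be called as `delete_endpoint_resources(boto3.client("sagemaker"), predictor.endpoint_name)`.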