# Serving a Korean LLM

### References
- Model information
  - beomi/KoAlpaca-Polyglot-12.8B
    - This model is a fine-tuned version of EleutherAI/polyglot-ko-12.8b on the KoAlpaca Dataset v1.1b.
    - https://huggingface.co/beomi/KoAlpaca-Polyglot-12.8B
  - EleutherAI/polyglot-ko-12.8b
    - Polyglot-Ko-12.8B was trained for 167 billion tokens over 301,000 steps on 256 A100 GPUs with the GPT-NeoX framework. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token.
    - License: Apache 2.0
    - https://huggingface.co/EleutherAI/polyglot-ko-12.8b

- Blog
  - https://aws.amazon.com/ko/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/
- Code
  - Boto3
    - https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
  - Python SDK
    - https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/deepspeed/GPT-J-6B_DJLServing_with_PySDK.ipynb

# 1. Basic environment setup

In [2]:
%load_ext autoreload
%autoreload 2

# Add the shared source folder to the import path
import sys
sys.path.append('../common_code')

# 2. Get the DLC image URL (the inference Docker image for the SageMaker endpoint)
- We retrieve the DLC image URL for djl-deepspeed 0.21.0 and set up the SageMaker role, session, region, and bucket.

In [3]:
import sagemaker, boto3
from sagemaker import image_uris


role = sagemaker.get_execution_role() # execution role for the endpoint
session = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
region = session.boto_region_name  # AWS region of the session (public attribute instead of the private _region_name)
bucket = session.default_bucket() # bucket to house artifacts

img_uri = image_uris.retrieve(framework="djl-deepspeed", region=region, version="0.21.0")
img_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117'

# 3. Set configuration

## Select the model to test

In [4]:
serve_model = 'KoAlpaca-12-8B'
# serve_model = 'Polyglot-Kor-5-8B'
# serve_model = 'Kullm-polyglot-12-8b-v2'

In [5]:
model_artifact_name = f'{serve_model}.tar.gz'

instance_type = "ml.g5.12xlarge"
# instance_type = "ml.g5.48xlarge"

print("instance_type :", instance_type) 


instance_type : ml.g5.12xlarge
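
As a rough sanity check on the instance choice above, the sketch below estimates the weight footprint of a 12.8B-parameter model. It assumes fp16 weights (2 bytes per parameter) and an even 4-way split across the four 24 GB A10G GPUs of an `ml.g5.12xlarge`. Neither the precision nor the split is stated in this notebook, and activations and the KV cache are ignored.

```python
# Back-of-the-envelope estimate; the dtype and the 4-way split are assumptions.
NUM_PARAMS = 12.8e9      # Polyglot-Ko-12.8B parameter count
BYTES_PER_PARAM = 2      # assumed fp16 weights
NUM_GPUS = 4             # ml.g5.12xlarge has 4 NVIDIA A10G GPUs
GPU_MEMORY_GB = 24       # memory per A10G GPU


def weight_footprint_gb(num_params, bytes_per_param):
    """Return the size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


total_gb = weight_footprint_gb(NUM_PARAMS, BYTES_PER_PARAM)
per_gpu_gb = total_gb / NUM_GPUS

print(f"total weights: {total_gb:.1f} GB")
print(f"per GPU (4-way split): {per_gpu_gb:.1f} GB of {GPU_MEMORY_GB} GB")
```

Under these assumptions the weights alone are slightly larger than the memory of a single GPU, so the model has to be sharded across the GPUs of the instance.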


In [6]:
s3_location = f"s3://{bucket}/{serve_model}/"
s3_location

's3://sagemaker-us-east-1-057716757052/KoAlpaca-12-8B/'

# 4. Package the inference code and the model configuration file
- `model.py` and `serving.properties`
- The code below creates the SageMaker model archive (`{serve_model}.tar.gz`); the following cell uploads it to S3.

In [7]:
%%sh -s {serve_model} {model_artifact_name}
serve_model=$1
model_artifact_name=$2
echo $serve_model
echo $model_artifact_name

rm -rf $serve_model/.ipynb_checkpoints

tar -czvf $model_artifact_name $serve_model/



KoAlpaca-12-8B
KoAlpaca-12-8B.tar.gz
KoAlpaca-12-8B/
KoAlpaca-12-8B/model.py
KoAlpaca-12-8B/serving.properties
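
The same packaging step can be sketched in pure Python with the standard `tarfile` module. This is an illustrative alternative, not the notebook's method: `package_model_dir` is a hypothetical helper, and unlike the shell cell it skips `.ipynb_checkpoints` instead of deleting it.

```python
import tarfile
from pathlib import Path


def package_model_dir(model_dir, archive_path):
    """Create a gzip tarball of model_dir, skipping .ipynb_checkpoints."""
    model_dir = Path(model_dir)

    def skip_checkpoints(tarinfo):
        # Returning None drops the member (and, for a folder, its contents).
        if ".ipynb_checkpoints" in Path(tarinfo.name).parts:
            return None
        return tarinfo

    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps the top-level folder name, like `tar -czvf x.tar.gz dir/`.
        tar.add(str(model_dir), arcname=model_dir.name, filter=skip_checkpoints)
    return str(archive_path)


# Usage in this notebook would look like:
# package_model_dir(serve_model, model_artifact_name)
```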


## Upload the model archive to S3

In [8]:
model_tar_url = sagemaker.s3.S3Uploader.upload(model_artifact_name, s3_location)

# 5. Create the SageMaker endpoint

- Now we create our [SageMaker model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. 

In [9]:
from datetime import datetime

sm_client = boto3.client("sagemaker")

time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_name = f"{serve_model}-" + time_stamp

create_model_response = sm_client.create_model(
 ModelName=model_name,
 ExecutionRoleArn=role,
 PrimaryContainer={"Image": img_uri, "ModelDataUrl": model_tar_url},
)

Now we create an endpoint configuration that SageMaker hosting services use to deploy models. Note that we set `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model.

In [10]:
initial_instance_count = 1
variant_name = "AllTraffic"
endpoint_config_name = f"{serve_model}-config-" + time_stamp

production_variants = [
 {
 "VariantName": variant_name,
 "ModelName": model_name,
 "InitialInstanceCount": initial_instance_count,
 "InstanceType": instance_type,
 "ModelDataDownloadTimeoutInSeconds": 300,
 "ContainerStartupHealthCheckTimeoutInSeconds": 300,
 }
]

endpoint_config = {
 "EndpointConfigName": endpoint_config_name,
 "ProductionVariants": production_variants,
}

ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)

We are now ready to create an endpoint from the model and the endpoint configuration created in the steps above.

In [11]:
endpoint_name = f"{serve_model}-" + time_stamp
ep_res = sm_client.create_endpoint(
 EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

In [12]:
print("endpoint_name: ", endpoint_name)

endpoint_name: KoAlpaca-12-8B-2023-06-04-11-40-34


In [13]:
%%time 

from inference_lib import descirbe_endpoint
descirbe_endpoint(endpoint_name) 

Endpoint is Creating
Endpoint is Creating
Endpoint is Creating
Endpoint is Creating
Endpoint is Creating
Endpoint is Creating
Endpoint is Creating
Endpoint is InService
CPU times: user 977 ms, sys: 55.6 ms, total: 1.03 s
Wall time: 7min 2s
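
The source of `descirbe_endpoint` in `inference_lib` is not shown here. A minimal sketch of a polling loop that would produce similar output, assuming it wraps the boto3 `describe_endpoint` call and reads the `EndpointStatus` field, could look like this. The function name `wait_for_endpoint` and its parameters are hypothetical.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60, max_polls=30):
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        print("Endpoint is", status)
        if status != "Creating":
            return status  # for example "InService" or "Failed"
        time.sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} is still Creating after {max_polls} polls")


# Usage with the client created earlier in this notebook:
# wait_for_endpoint(sm_client, endpoint_name)
```

boto3 also ships a built-in waiter for this case, `sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)`, which blocks without printing intermediate statuses.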


# 6. Run inference against the endpoint

In [20]:
from inference_lib import invoke_inference_DJ
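
`invoke_inference_DJ` comes from the local `inference_lib` module, whose source is not included here. As a hedged sketch of what such a helper typically does, the function below sends a JSON payload through the boto3 `sagemaker-runtime` client and returns the response body as text. The name `invoke_endpoint_json` is hypothetical, and the real helper may differ (for example, it also pretty-prints the result).

```python
import json


def invoke_endpoint_json(runtime_client, endpoint_name, payload):
    """Send a JSON payload to a SageMaker endpoint and return the body as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a streaming object; read() returns bytes.
    return response["Body"].read().decode("utf-8")


# Usage (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# print(invoke_endpoint_json(runtime, endpoint_name, data))
```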

### Options for generation
* **temperature**: Controls randomness in the model. Lower values make the model more deterministic and higher values make it more random. Default value is 1.0.
* **max_new_tokens**: The maximum number of tokens to generate. Default value is 20, max value is 512.
* **repetition_penalty**: Controls the likelihood of repetition. Default value is null.
* **seed**: The seed to use for random generation. Default value is null.
* **stop**: A list of tokens at which to stop the generation. The generation stops when one of the tokens is generated.
* **top_k**: The number of highest-probability vocabulary tokens to keep for top-k filtering. Default value is null, which disables top-k filtering.
* **top_p**: The cumulative probability threshold for nucleus sampling: only the highest-probability vocabulary tokens whose probabilities add up to `top_p` are kept. Default value is null.
* **do_sample**: Whether or not to use sampling; greedy decoding is used otherwise. Default value is false.
* **best_of**: Generate `best_of` sequences and return the one with the highest token logprobs. Default value is null.
* **details**: Whether or not to return details about the generation. Default value is false.
* **return_full_text**: Whether to return the full text or only the generated part. Default value is false.
* **truncate**: Whether or not to truncate the input to the maximum length of the model. Default value is true.
* **typical_p**: The typical probability of a token. Default value is null.
* **watermark**: The watermark to use for the generation. Default value is false.
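
To make `temperature` and `top_p` concrete, here is a small self-contained sketch on a toy three-token vocabulary. It illustrates the general technique only; the serving container's actual implementation is not shown in this notebook and may differ in detail. The helper names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.1]  # toy scores, not real model output
sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
print("top token probability:", round(sharp[0], 3), round(base[0], 3), round(flat[0], 3))
print("kept by top_p=0.7:", top_p_filter([0.5, 0.3, 0.2], 0.7))
```

The top token's probability shrinks as the temperature rises, and `top_p_filter` keeps only as many tokens as are needed to reach the requested probability mass.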

In [21]:
params = {
 "do_sample":False, 
 "max_new_tokens":128,
 "temperature":1.0,
 "top_k":0,
 "top_p":0.9,
 "return_full_text":False,
 "repetition_penalty":1.1,
 "presence_penalty":None,
 "eos_token_id":2,
}

## (1) Question without context

In [22]:
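# Question: "Until what time is the Homeplus Junggye store open?"
q = "홈플러스 중계점은 몇시까지 장사해?"
# No context in this case; a commented-out sample context is kept for reference.
c = ""  # "홈플러스 영업시간은 오전 10시 부터 오후 12시까지 입니다."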
prompt_wo_c = f"### 질문: {q}\n\n### 맥락: {c}\n\n### 답변:" if c else f"### 질문: {q}\n\n### 답변:" 
data = {
 "prompt": [prompt_wo_c,],
 "params": params
}
print("prompt_wo_c: \n", data)

prompt_wo_c: 
 {'prompt': ['### 질문: 홈플러스 중계점은 몇시까지 장사해?\n\n### 답변:'], 'params': {'do_sample': False, 'max_new_tokens': 128, 'temperature': 1.0, 'top_k': 0, 'top_p': 0.9, 'return_full_text': False, 'repetition_penalty': 1.1, 'presence_penalty': None, 'eos_token_id': 2}}


In [23]:
%%time 
invoke_inference_DJ(endpoint_name, data)

[
 [
 {
 "generated_text":"홈플러스의 매장 운영 시간은 점포마다 다릅니다. 대부분의 매장은 10시에 문을 닫으며, 일부 매장은 11시까지 영업합니다. 예를 들어, 홈플러스 동대문점은 11시까지 영업하며, 홈플러스 영등포점과 강서점은 9시에 문을 닫습니다. 또한, 홈플러스 김해점과 밀양점은 8시에 문을 닫고 있습니다. "
 }
 ]
]
CPU times: user 37.2 ms, sys: 366 µs, total: 37.6 ms
Wall time: 4.89 s


'[\n [\n {\n "generated_text":"홈플러스의 매장 운영 시간은 점포마다 다릅니다. 대부분의 매장은 10시에 문을 닫으며, 일부 매장은 11시까지 영업합니다. 예를 들어, 홈플러스 동대문점은 11시까지 영업하며, 홈플러스 영등포점과 강서점은 9시에 문을 닫습니다. 또한, 홈플러스 김해점과 밀양점은 8시에 문을 닫고 있습니다. "\n }\n ]\n]'

## (2) Question with context

In [24]:
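# Question: "Until what time is the Homeplus Junggye store open?"
q = "홈플러스 중계점은 몇시까지 장사해?"
# Context: "Homeplus business hours are 10 a.m. to 10 p.m. To find a Homeplus store (and check its hours), use this address: <URL>"
c = "홈플러스 영업시간은 오전 10시 부터 오후 10시까지 입니다. 홈플러스 매장 찾기(영업시간 확인)는 이 주소를 이용하세요: http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%B5%B5%B4%F6 "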
prompt_w_c = f"### 질문: {q}\n\n### 맥락: {c}\n\n### 답변:" if c else f"### 질문: {q}\n\n### 답변:" 
data = {
 "prompt": [prompt_w_c,],
 "params": params
}
print("prompt_w_c:\n", prompt_w_c)

prompt_w_c:
 ### 질문: 홈플러스 중계점은 몇시까지 장사해?

### 맥락: 홈플러스 영업시간은 오전 10시 부터 오후 10시까지 입니다. 홈플러스 매장 찾기(영업시간 확인)는 이 주소를 이용하세요: http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%B5%B5%B4%F6 

### 답변:
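
Both prompt cells repeat the same conditional f-string. A small helper, sketched below under the hypothetical name `build_prompt`, keeps the question / context / answer template in one place. The Korean section markers (`### 질문`, `### 맥락`, `### 답변`) are copied unchanged from the cells above.

```python
def build_prompt(question, context=""):
    """Build the question / context / answer prompt used in this notebook."""
    if context:
        return f"### 질문: {question}\n\n### 맥락: {context}\n\n### 답변:"
    return f"### 질문: {question}\n\n### 답변:"


# Usage: an empty context omits the context section entirely.
prompt_without_context = build_prompt("홈플러스 중계점은 몇시까지 장사해?")
prompt_with_context = build_prompt("홈플러스 중계점은 몇시까지 장사해?", "sample context")
print(prompt_with_context)
```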


In [25]:
%%time 
invoke_inference_DJ(endpoint_name, data)

[
 [
 {
 "generated_text":"홈플러스의 영업 시간은 오전 10시부터 오후 10시까지입니다. 각 지점마다 약간씩 차이가 있을 수 있으므로, 방문 전에 영업 시간을 꼭 확인해보시기 바랍니다. 아래 링크에서 홈플러스 매장 찾기를 통해 영업 시간을 확인하실 수 있습니다: http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%"
 }
 ]
]
CPU times: user 37.4 ms, sys: 0 ns, total: 37.4 ms
Wall time: 7.58 s


'[\n [\n {\n "generated_text":"홈플러스의 영업 시간은 오전 10시부터 오후 10시까지입니다. 각 지점마다 약간씩 차이가 있을 수 있으므로, 방문 전에 영업 시간을 꼭 확인해보시기 바랍니다. 아래 링크에서 홈플러스 매장 찾기를 통해 영업 시간을 확인하실 수 있습니다: http://corporate.homeplus.co.kr/Store.aspx?isA=%C1%F6%BF%B4%C7%B0%BF%AE%C0%C7%C1%F2%"\n }\n ]\n]'
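
The helper returns the response as a JSON string, as the quoted value above shows. A minimal sketch for pulling out just the generated text, assuming the nested list-of-lists shape seen in these outputs, is given below. `extract_generated_texts` is a hypothetical name and the sample string is made up for illustration.

```python
import json


def extract_generated_texts(response_text):
    """Return every generated_text from a response shaped like [[{"generated_text": ...}]]."""
    parsed = json.loads(response_text)
    return [item["generated_text"] for batch in parsed for item in batch]


sample = '[[{"generated_text": "hello"}]]'  # illustrative stand-in for a real response
texts = extract_generated_texts(sample)
print(texts)

# Usage with the endpoint response from this notebook:
# texts = extract_generated_texts(invoke_inference_DJ(endpoint_name, data))
```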

# 7. [Important] Clean up the endpoint

In [107]:
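# Uncomment and run these lines to delete the endpoint, its configuration, and the model once you are done.
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# sm_client.delete_model(ModelName=model_name)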

{'ResponseMetadata': {'RequestId': '995450b5-7fa3-4810-be70-967959b79ebd',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'x-amzn-requestid': '995450b5-7fa3-4810-be70-967959b79ebd',
 'content-type': 'application/x-amz-json-1.1',
 'content-length': '0',
 'date': 'Thu, 01 Jun 2023 14:03:56 GMT'},
 'RetryAttempts': 0}}
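
If one of the three delete calls fails (for example because a resource was already removed), the remaining calls never run. The sketch below is one possible way to attempt all three and report what failed. `delete_endpoint_resources` is a hypothetical helper, not part of this notebook or of boto3.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its config, and the model; collect errors instead of stopping."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        (
            "endpoint config",
            lambda: sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name),
        ),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    errors = {}
    for label, call in steps:
        try:
            call()
        except Exception as exc:  # keep going so later resources are still deleted
            errors[label] = exc
    return errors


# Usage with the names defined earlier in this notebook:
# errors = delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)
# print(errors or "all resources deleted")
```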