{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [모듈 00] 데이터 변환 및 수집 워크플로를 위한 설정 환경\n", "- 이 노트북은 데이터 변환 및 수집 워크플로를 프로비저닝하는 사용자 지정 SageMaker 프로젝트에 필요한 리소스와 매개변수를 설정합니다.\n", "- 특정 사용 사례 및 요구 사항에 따라 고유한 사용자 지정 프로젝트의 경우 프로젝트 프로비저닝의 일부로 이러한 모든 리소스를 생성하는 것을 고려할 수 있습니다.\n", "\n", "노트북은 필수 리소스를 생성하기 위한 다음 활동을 안내합니다.\n", "\n", "1. 아키텍쳐 개요\n", "2. 기본 변수 설정\n", " - 데이터 업로드를 위한 Amazon S3 버킷 가져오기\n", " - 데이타 랭글러 흐름 이름 설정 \n", "3. 데이터 세트 준비\n", " - 데이터 세트 다운로드 및 데이터 탐색\n", "4. 데이터 랭글러 흐름 (Flow) 생성\n", " - 데이터 변환 및 피쳐 수집을 위한 Amazon Data Wrangler 흐름 생성\n", "5. 피쳐 스토어\n", " - 피쳐가 저장되는 피쳐 저장소에 새 피쳐 그룹 생성\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. 아키텍쳐 개요\n", "\n", " $\"solution$ \n", "\n", "1. Amazon S3 버킷에 업로드된 데이터 파일\n", "2. 데이터 처리 및 변환 프로세스 시작\n", "3. 추출, 처리 및 변환된 기능은 Feature Store에서 지정된 기능 그룹으로 수집됩니다.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. 기본 변수 설정 " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.70.0\n" ] } ], "source": [ "import os\n", "import json\n", "import boto3\n", "import pandas as pd\n", "import sagemaker\n", "from sagemaker.session import Session\n", "from sagemaker.feature_store.feature_definition import FeatureDefinition\n", "from sagemaker.feature_store.feature_definition import FeatureTypeEnum\n", "from sagemaker.feature_store.feature_group import FeatureGroup\n", "\n", "import time\n", "from time import gmtime, strftime\n", "import uuid\n", "\n", "\n", "print(sagemaker.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2.1. `domain_id` and `execution_role` 얻기\n", "- 세이지 메이커 스튜디오의 domain_id 및 스튜디오 실행을 하는 execution_role 가져옵니다." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SageMaker domain id: d-uf1m5smfybav\n", "Stored 'domain_id' (str)\n" ] } ], "source": [ "NOTEBOOK_METADATA_FILE = \"/opt/ml/metadata/resource-metadata.json\"\n", "domain_id = None\n", "\n", "if os.path.exists(NOTEBOOK_METADATA_FILE):\n", " with open(NOTEBOOK_METADATA_FILE, \"rb\") as f:\n", " domain_id = json.loads(f.read()).get('DomainId')\n", " print(f\"SageMaker domain id: {domain_id}\")\n", "\n", "%store domain_id" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stored 'execution_role' (str)\n" ] } ], "source": [ "r = boto3.client(\"sagemaker\").describe_domain(DomainId=domain_id)\n", "execution_role = r[\"DefaultUserSettings\"][\"ExecutionRole\"]\n", "\n", "%store execution_role" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2. 데이터용 S3 버킷 가져오고 변수 설정\n", "- 모든 솔루션 아티팩트 및 데이터를 저장하기 위해 SageMaker 기본 버킷을 사용합니다. 고유한 버킷을 생성하거나 사용하도록 선택할 수 있습니다. \n", "- SageMaker 실행 역할 및 `AmazonSageMakerServiceCatalogProductsUseRole` 역할에 연결된 해당 권한이 있는지 확인하여 객체를 나열하고 읽고 버킷에 넣을 수 있습니다." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data_bucket = None # you can use your own S3 bucket name\n", "sagemaker_session = Session()\n", "\n", "if data_bucket is None:\n", " data_bucket = sagemaker_session.default_bucket()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sagemaker-us-east-1-569441333767\n" ] } ], "source": [ "print(data_bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⭐ 다음 변수의 기본값을 유지하거나 원하는 경우 변경할 수 있습니다." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# set some literals\n", "s3_data_prefix = f\"{data_bucket}/feature-store-ingestion-pipeline/dataset/\"\n", "s3_flow_prefix = f\"{data_bucket}/feature-store-ingestion-pipeline/dw-flow/\"\n", "s3_fs_query_output_prefix = f\"{data_bucket}/feature-store-ingestion-pipeline/fs_query_results/\"\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data Wrangler flow upload and a feature group will have this unique suffix: 07-04-20-04-a58253ce\n" ] } ], "source": [ "unique_suffix = f\"{strftime('%d-%H-%M-%S', gmtime())}-{str(uuid.uuid4())[:8]}\"\n", "abalone_dataset_file_name = \"abalone.csv\"\n", "abalone_dataset_local_path = \"../dataset/\"\n", "abalone_dataset_local_url = f\"{abalone_dataset_local_path}{abalone_dataset_file_name}\"\n", "\n", "print(f\"Data Wrangler flow upload and a feature group will have this unique suffix: {unique_suffix}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 데이타 랭글러 흐름 이름 설정\n", "

💡 데이터 랭글러 흐름 생성시에 이름을 바꿀 예정입니다. dw2-flow 로 이름을 기억 해주세요.\n", "

\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "dw_flow_name = \"dw2-flow\" # 원하시는 이름으로 수정 해주세요.\n", "# dw_flow_name = \"dw-flow\" # change to your custom file name if you use a different one" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. 데이터 세트 준비" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1. 데이터세트 다운로드\n", "이 솔루션에서는 잘 알려진 [Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#abalone)을 사용합니다. 데이터 세트에는 4177개의 데이터 행과 8개의 기능이 있습니다.\n", "\n", "Dua, D. 및 Graff, C. (2019). UCI 기계 학습 저장소 [http://archive.ics.uci.edu/ml]. 캘리포니아 어바인: 캘리포니아 대학교 정보 및 컴퓨터 과학 학교." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!mkdir -p ../dataset\n", "!rm -fr ../dataset/*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[UCI 웹사이트](http://archive.ics.uci.edu/ml/datasets/Abalone)에서 데이터세트를 다운로드합니다." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-02-07 04:20:08-- http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data\n", "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n", "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 191873 (187K) [application/x-httpd-php]\n", "Saving to: ‘abalone.data’\n", "\n", "abalone.data 100%[===================>] 187.38K 914KB/s in 0.2s \n", "\n", "2022-02-07 04:20:09 (914 KB/s) - ‘abalone.data’ saved [191873/191873]\n", "\n", "--2022-02-07 04:20:09-- http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names\n", "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n", "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 4319 (4.2K) [application/x-httpd-php]\n", "Saving to: ‘abalone.names’\n", "\n", "abalone.names 100%[===================>] 4.22K --.-KB/s in 0s \n", "\n", "2022-02-07 04:20:10 (213 MB/s) - ‘abalone.names’ saved [4319/4319]\n", "\n" ] } ], "source": [ "!cd {abalone_dataset_local_path} && wget -t inf http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data\n", "!cd {abalone_dataset_local_path} && wget -t inf http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "데이터세트를 로드하고 처음 5개 행을 인쇄합니다." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# dictionary of dataset columns and data types\n", "columns = {\n", " \"sex\":\"string\", \n", " \"length\":\"float\", \n", " \"diameter\":\"float\", \n", " \"height\":\"float\", \n", " \"whole_weight\":\"float\", \n", " \"shucked_weight\":\"float\", \n", " \"viscera_weight\":\"float\", \n", " \"shell_weight\":\"float\",\n", " \"rings\":\"long\"\n", "}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data shape: (4177, 9)\n" ] }, { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	sex	length	diameter	height	whole_weight	shucked_weight	viscera_weight	shell_weight	rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

\n", "

" ], "text/plain": [ " sex length diameter height whole_weight shucked_weight viscera_weight \\\n", "0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 \n", "1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 \n", "2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 \n", "3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 \n", "4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 \n", "\n", " shell_weight rings \n", "0 0.150 15 \n", "1 0.070 7 \n", "2 0.210 9 \n", "3 0.155 10 \n", "4 0.055 7 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_df = pd.read_csv(f\"{abalone_dataset_local_path}abalone.data\", names=columns.keys())\n", "print(f\"Data shape: {data_df.shape}\")\n", "data_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2. 데이터세트를 로컬에 저장하고 S3에 업로드\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# save the dataframe as CSV with the header and index\n", "data_df.to_csv(abalone_dataset_local_url, index_label=\"record_id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload the data to the data S3 bucket." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upload: ../dataset/abalone.names to s3://sagemaker-us-east-1-569441333767/feature-store-ingestion-pipeline/dataset/abalone.names\n", "upload: ../dataset/abalone.csv to s3://sagemaker-us-east-1-569441333767/feature-store-ingestion-pipeline/dataset/abalone.csv\n", "upload: ../dataset/abalone.data to s3://sagemaker-us-east-1-569441333767/feature-store-ingestion-pipeline/dataset/abalone.data\n" ] } ], "source": [ "!aws s3 cp {abalone_dataset_local_path}. s3://{s3_data_prefix} --recursive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

💡 아래의 데이터 위치 출력 결과를 복사 해놓으세요.데이터 랭글러의 소스 지정시에 사용합니다.\n", "

\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data uploaded to : \n", "\n", "s3://sagemaker-us-east-1-569441333767/feature-store-ingestion-pipeline/dataset/\n" ] } ], "source": [ "print(f\"Data uploaded to : \\n\\ns3://{s3_data_prefix}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. 데이터 랭글러 흐름 (Flow) 생성\n", "- 여기서는 직접 작성을 합니다. 원문에 미리 만들어진 dw-flow.flow 사용시에는 이미 소스가 원작자의 경로로 지정되어 있기에 에러가 발생 합니다.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.1. 데이터 랭글러 흐름 만들기\n", "\n", "\n", "이 단계별 지침에 따라 새 Data Wrangler 흐름을 만들고 흐름에 데이터 변환 단계를 추가하십시오. 자세한 내용은 [데이터 랭글러 문서](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html)를 참조하십시오.\n", "\n", "1. **SageMaker 리소스** 위젯에서 **Data Wrangler**를 선택합니다.\n", "\n", "

\n", "\n", "2. **새 흐름**을 클릭합니다.\n", "\n", "

\n", "\n", "\n", "3. **Amazon S3**를 소스로 선택합니다.\n", "\n", "

\n", "\n", "4. S3 버킷 경로로 이동하여 이전 섹션에서 S3 접두사에 업로드한 데이터 세트를 가져옵니다.\n", "\n", "

\n", "\n", "5. `abalone.csv` 파일을 선택하고 **First row is header**가 선택되어 있고 **Delimiter**가 `COMMA`로 설정되어 있는지 확인합니다. **가져오기** 클릭:\n", "\n", "

\n", "\n", "6. 'gs-dw-flow.flow'로 이름을 바꾸려면 untitled.flow 흐름을 마우스 오른쪽 버튼으로 클릭합니다. ⭐ 이 경우 `dw_flow_name` 변수의 값을 그에 맞게 변경해야 합니다.\n", "

\n", "\n", "7. 다음에 대한 몇 가지 사용자 지정 Python Pandas 명령이 포함된 Data Wrangler 변환을 추가합니다.\n", " - sklearn `StandardScaler`를 사용하여 모든 숫자 열 크기 조정\n", " - 범주형 열 `sex`의 원-핫 인코딩\n", " - 기능 저장소에 필요한 타임스탬프 열 `record_time` 추가\n", " \n", "변환을 추가하려면 **데이터 흐름** 탭으로 이동하여 **날짜 유형** 상자 옆에 있는 + 기호를 클릭하고 **변환 추가**를 선택합니다.\n", "\n", "

\n", "\n", "8. **+ 단계 추가**를 클릭하고 선택 상자에서 **사용자 지정 변환** 및 **Python(Pandas)**을 선택합니다.\n", "\n", "

\n", "\n", "

\n", "\n", "편집기에 다음 코드를 입력합니다.\n", "```python\n", "import pandas as pd\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "df_scaled = df.drop(['record_id', 'sex','rings'], axis=1)\n", "df_scaled = StandardScaler().fit_transform(df_scaled.to_numpy())\n", "df_scaled = pd.DataFrame(df_scaled, columns=['length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight'])\n", "\n", "df = pd.concat((df_scaled, df[['record_id', 'sex','rings']]), 1)\n", "```\n", "\n", "Python(Pandas) 및 sklearn에서 이 Custom Transform을 사용하여 모든 숫자 열을 한 번에 확장합니다.\n", "\n", "**미리보기**를 클릭한 다음 **추가**를 클릭하여 데이터 흐름에 변환을 추가합니다.\n", "\n", "9. Data Wrangler의 기본 **Encode Categorical** 기능을 사용하여 `Sex` 변수를 핫 인코딩합니다. **+ 단계 추가**를 클릭하고 오른쪽의 **변환 추가** 아래에서 **범주형 인코딩**을 선택합니다.\n", "

\n", "\n", "변환에는 '원 핫 인코딩', 입력 열에는 '성별', 출력 스타일에는 '열'을 선택합니다.\n", " \n", "**미리보기**를 클릭하여 변경 사항을 확인한 다음 **추가**를 클릭합니다.\n", "\n", "10. 마지막으로 **+ Add step**을 클릭하고 **Custom transform** 및 **Python(Pandas)**을 선택합니다.\n", "\n", "

\n", "\n", "편집기에 다음 코드를 입력합니다.\n", "```python\n", "import time\n", "import pandas as pd\n", "\n", "record_time_feature_name = 'record_time'\n", "current_time_sec = int(round(time.time()))\n", "df[record_time_feature_name] = pd.Series([current_time_sec]*len(df), dtype=\"float\")\n", "```\n", "\n", "**미리보기**를 클릭한 다음 **추가**를 클릭하여 데이터 흐름에 변환을 추가합니다.\n", "\n", "11. 이제 **변환** 개요에 세 가지 변환 단계가 있습니다.\n", "\n", "

\n", "\n", "12. 데이터 랭글러 흐름을 저장합니다. **파일**을 선택한 다음 **데이터 랭글러 흐름 저장**을 선택합니다.\n", "\n", "**데이터 흐름으로 돌아가기**를 클릭하고 (1) 아래와 같이 \"Step(3)\"의 박스를 클릭, (2) \"3 Custom Pandas 옆의 : \" 클릭, (3) \"Export to\" 클릭, (4) SageMaker Feature Store (via Jupyter Notebook) 클릭 합니다.\n", "\n", "\n", "

\n", "\n", "\n", "13. 새로 생성된 노트북이 새 창에서 열립니다. 노트북에서 **Output: Feature Store** 섹션으로 이동하여 `output_name` 변수를 찾습니다.\n", "\n", "

\n", "\n", "'output_name' 변수의 값을 복사하여 이 노트북의 다음 코드 셀에 붙여넣습니다.\n", "\n", "### Data Wrangler 흐름 수동 생성 끝\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2. 출력 이름 설정\n", "- 추후에 사용할 Data Wrangler 프로세서는 마지막 변환 단계의 'node_id'가 필요하며, 그 후에 변환된 데이터는 출력 대상으로 내보내집니다.\n", "- 고유한 Data Wrangler 흐름을 생성했거나 흐름에 더 많은 변환 단계를 추가한 경우 12단계와 13단계의 이전 섹션에서 설명한 대로 `dw_output_name`을 올바른 `node_id` 값으로 설정해야 합니다. 그렇지 않으면 다음 코드 셀을 실행하세요. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataWrangler Data flow name: dw2-flow\n", "DataWrangler flow output name: c8880ed5-b8a0-4375-899b-1c4d86828152.default\n" ] } ], "source": [ "# Set the dw_output_name to your export node_id, otherwise keep None if you use the provided flow\n", "dw_output_name = None\n", "\n", "if dw_output_name is None:\n", " flow_content = json.loads(open(f\"{dw_flow_name}.flow\").read())\n", " dw_output_name = f\"{flow_content['nodes'][len(flow_content['nodes'])-1]['node_id']}.default\"\n", "\n", "print(f\"DataWrangler Data flow name: {dw_flow_name}\")\n", "print(f\"DataWrangler flow output name: {dw_output_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.3. S3 버킷에 DataWrangler 흐름 업로드\n", "마지막으로 Data Wrangler 흐름을 S3 버킷에 업로드합니다. 데이터 처리 파이프라인은 이 흐름 파일을 사용하여 데이터 변환을 실행합니다." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "dw_flow_file_url = f\"s3://{s3_flow_prefix}{dw_flow_name}-{unique_suffix}.flow\"" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upload: ./dw2-flow.flow to s3://sagemaker-us-east-1-569441333767/feature-store-ingestion-pipeline/dw-flow/dw2-flow-07-04-20-04-a58253ce.flow\n" ] } ], "source": [ "!aws s3 cp {dw_flow_name}.flow {dw_flow_file_url}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. 피쳐 스토어" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 데이터 피쳐를 저장하려면 SageMaker 피쳐 저장소에 새 피쳐 그룹을 생성해야 합니다. 피쳐 그룹의 각 피쳐는 지정된 데이터 유형과 이름이 있습니다.\n", "- 피쳐 그룹의 단일 레코드는 데이터 프레임의 행에 해당합니다. \n", "- SageMaker 피쳐 저장소에 대해 자세히 알아보려면 다음을 참조하십시오.\n", " - [Amazon Feature Store 설명서](http://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1 레코드 식별자 및 시간 피쳐 이름 선택" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "레코드 식별자 및 레코드 시간 피쳐 이름을 선택합니다. 피쳐 그룹 생성에 필요한 매개변수입니다." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "record_identifier_feature_name = 'record_id'\n", "if record_identifier_feature_name is None:\n", " raise SystemExit(\"Select a column name as the feature group record identifier.\")\n", "\n", "record_time_feature_name = 'record_time'\n", "if record_time_feature_name is None:\n", " raise SystemExit(\"Select a column name as the event time feature name.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2. 피쳐 이름 및 데이터 유형 정의" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "다음은 데이터 흐름을 사용하여 입력 데이터 세트를 처리할 때 생성될 최종 데이터 세트의 피쳐 이름 및 데이터 유형 목록입니다." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "try:\n", " columns\n", " \n", "except NameError:\n", " # dictionary of dataset columns and data types\n", " columns = {\n", " \"sex\":\"string\", \n", " \"length\":\"float\", \n", " \"diameter\":\"float\", \n", " \"height\":\"float\", \n", " \"whole_weight\":\"float\", \n", " \"shucked_weight\":\"float\", \n", " \"viscera_weight\":\"float\", \n", " \"shell_weight\":\"float\",\n", " \"rings\":\"long\"\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3. sex 컬럼 변경\n", "- 데이터 랭글러의 변환 작업에서 원핫 인코딩을 사용하여 Sex 관련 세 개의 컬럼을 생성 하였습니다. 이를 반영하기 위해서 기존 Sex 를 제거하고, 3개의 컬럼을 생성 합니다." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'length': 'float',\n", " 'diameter': 'float',\n", " 'height': 'float',\n", " 'whole_weight': 'float',\n", " 'shucked_weight': 'float',\n", " 'viscera_weight': 'float',\n", " 'shell_weight': 'float',\n", " 'rings': 'long',\n", " 'sex_M': 'float',\n", " 'sex_I': 'float',\n", " 'sex_F': 'float'}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# since we added one-hot encoding for the categorical column `sex`, adjust the column list for the feature group\n", "if columns.get(\"sex\") is not None: \n", " del columns[\"sex\"]\n", " \n", "for i in ('M', 'I', 'F'):\n", " columns[f\"sex_{i}\"] = \"float\"\n", "\n", "columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.4. 피쳐 스토어에 맞는 데이터 타입으로 변경 \n", "- 위의 정의된 데이터 타입(예: float) 를 피쳐 스토어에서 제공하는 데이터 타입에 맞게 변한 합니다." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'name': 'length', 'type': 'float'},\n", " {'name': 'diameter', 'type': 'float'},\n", " {'name': 'height', 'type': 'float'},\n", " {'name': 'whole_weight', 'type': 'float'},\n", " {'name': 'shucked_weight', 'type': 'float'},\n", " {'name': 'viscera_weight', 'type': 'float'},\n", " {'name': 'shell_weight', 'type': 'float'},\n", " {'name': 'rings', 'type': 'long'},\n", " {'name': 'sex_M', 'type': 'float'},\n", " {'name': 'sex_I', 'type': 'float'},\n", " {'name': 'sex_F', 'type': 'float'},\n", " {'name': 'record_id', 'type': 'long'},\n", " {'name': 'record_time', 'type': 'float'}]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_schemas = [\n", " *[{\"name\":c[0], \"type\":c[1]} for c in columns.items()],\n", " {\n", " \"name\": record_identifier_feature_name,\n", " \"type\": \"long\"\n", " },\n", " {\n", " \"name\": record_time_feature_name,\n", " \"type\": \"float\"\n", " },\n", "]\n", "\n", "column_schemas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "아래에서 해당 피쳐 정의에 대한 SDK 입력을 생성합니다. Data Wrangler의 일부 스키마 유형은\n", "피쳐 저장소에서 지원합니다. 다음은 이러한 유형에 대해 String으로 설정된 `default_feature_type`을 생성합니다." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "feature definitions: [FeatureDefinition(feature_name='length', feature_type=), FeatureDefinition(feature_name='diameter', feature_type=), FeatureDefinition(feature_name='height', feature_type=), FeatureDefinition(feature_name='whole_weight', feature_type=), FeatureDefinition(feature_name='shucked_weight', feature_type=), FeatureDefinition(feature_name='viscera_weight', feature_type=), FeatureDefinition(feature_name='shell_weight', feature_type=), FeatureDefinition(feature_name='rings', feature_type=), FeatureDefinition(feature_name='sex_M', feature_type=), FeatureDefinition(feature_name='sex_I', feature_type=), FeatureDefinition(feature_name='sex_F', feature_type=), FeatureDefinition(feature_name='record_id', feature_type=), FeatureDefinition(feature_name='record_time', feature_type=)]\n" ] } ], "source": [ "default_feature_type = FeatureTypeEnum.STRING\n", "column_to_feature_type_mapping = {\n", " \"float\": FeatureTypeEnum.FRACTIONAL,\n", " \"long\": FeatureTypeEnum.INTEGRAL\n", "}\n", "\n", "feature_definitions = [\n", " FeatureDefinition(\n", " feature_name=column_schema['name'], \n", " feature_type=column_to_feature_type_mapping.get(column_schema['type'], default_feature_type)\n", " ) for column_schema in column_schemas\n", "]\n", "print(f\"feature definitions: {feature_definitions}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5. 피쳐 그룹 이름, 기타 변수 정의 및 피쳐 그룹 생성" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "feature_group_name_prefix = \"FG-abalone\"\n", "feature_group_name = f\"{feature_group_name_prefix}-{unique_suffix}\"\n", "feature_store_offline_s3_uri = f\"s3://{data_bucket}\"\n", "\n", "# controls if online store is enabled. Enabling the online store allows quick access to \n", "# the latest value for a Record via the GetRecord API.\n", "enable_online_store = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SageMaker Python SDK를 사용하여 피쳐 그룹을 생성합니다." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:569441333767:feature-group/fg-abalone-07-04-20-04-a58253ce',\n", " 'ResponseMetadata': {'RequestId': 'c31f9f26-b523-43de-bbf9-3400fe81df7d',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': 'c31f9f26-b523-43de-bbf9-3400fe81df7d',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '108',\n", " 'date': 'Mon, 07 Feb 2022 04:21:01 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_group = FeatureGroup(\n", " name=feature_group_name,\n", " sagemaker_session=sagemaker_session,\n", " feature_definitions=feature_definitions)\n", "\n", "feature_group.create(\n", " s3_uri=feature_store_offline_s3_uri,\n", " record_identifier_name=record_identifier_feature_name,\n", " event_time_feature_name=record_time_feature_name,\n", " role_arn=execution_role,\n", " enable_online_store=enable_online_store\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "피쳐 그룹이 준비될 때까지 기다리십시오. 약 1분이 소요됩니다." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Waiting for Feature Group Creation\n", "Waiting for Feature Group Creation\n", "Waiting for Feature Group Creation\n" ] } ], "source": [ "while feature_group.describe().get(\"FeatureGroupStatus\") == \"Creating\":\n", " print(\"Waiting for Feature Group Creation\")\n", " time.sleep(5)\n", "\n", "if feature_group.describe().get(\"FeatureGroupStatus\") != \"Created\":\n", " raise SystemExit(f\"Failed to create feature group {feature_group.name}: {status}\")\n", "print(f\"FeatureGroup {feature_group.name} successfully created.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

💡 AccessDenied 예외 처리에 대한 처리 방법

\n", "\n", "- 피쳐 그룹 생성 시 `sagemaker_featurestore` 데이터베이스에 대한 Lake Formation 권한으로 인해 발생할 수 있습니다.\n", "- Lake Formation의 SageMaker 실행 역할(또는 피쳐 저장소에 액세스하는 데 사용하는 역할)에 해당 데이터베이스에 대한 권한을 부여해야 합니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.6. 피쳐 그룹의 쿼리 데이터\n", "피쳐 생성 시 피쳐 저장소의 피쳐 그룹은 비어 있으며 데이터가 포함되어 있지 않습니다. **SageMaker 리소스** 위젯에서 **피쳐 저장소**를 선택하여 피쳐 그룹 메타 데이터를 탐색할 수 있습니다.\n", "\n", "

\n", "\n", "혹은 SageMaker SDK 를 사용하세요." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "다음 두 코드 셀에 설명된 대로 Athena 쿼리를 사용하여 피쳐 그룹의 데이터를 쿼리할 수 있습니다." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build SQL query to features group\n", "fs_query = feature_group.athena_query()\n", "\n", "query_string = f'SELECT * FROM \"{fs_query.table_name}\"'\n", "print(f'Prepared query {query_string}')\n", "print(fs_query)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run Athena query. The output is loaded to a Pandas dataframe.\n", "fs_query.run(\n", " query_string=query_string, \n", " output_location=f\"s3://{s3_fs_query_output_prefix}\"\n", ")\n", "\n", "fs_query.wait()\n", "fs_df = fs_query.as_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fs_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "예상대로 피쳐 그룹에는 데이터가 포함되어 있지 않습니다.\n", "이제 피쳐 그룹에 피쳐를 수집할 데이터 변환 및 수집 파이프라인을 배포할 모든 준비가 되었습니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. 매개변수 저장\n", "`%store` 매직을 사용하여 데이터 변환 및 수집 파이프라인에 대한 매개변수를 저장합니다. 다음 노트북에서 이 매개변수를 사용할 것입니다." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store data_bucket\n", "%store dw_flow_file_url\n", "%store dw_output_name\n", "%store feature_group_name\n", "%store s3_data_prefix\n", "%store s3_flow_prefix \n", "%store s3_fs_query_output_prefix\n", "%store abalone_dataset_file_name\n", "%store abalone_dataset_local_url\n", "\n", "%store" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Next\n", "다음 노트북으로 이동 하세요 [`01-feature-store-ingest-pipeline` notebook](01-feature-store-ingest-pipeline.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }