# Building a Natural Language Processing Dashboard with Amazon SageMaker and Amazon QuickSight

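This notebook installs GiNZA, an open-source morphological analysis tool, on an Amazon SageMaker notebook, extracts words and dependency relations from text, loads the extracted results into Amazon QuickSight, and builds a dashboard for analysis. See the following blog post for details.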

https://aws.amazon.com/jp/blogs/news/amazon-sagemaker-amazon-quicksight-nlp-dashboard/

## 0. Preparation

#### Installing GiNZA
- Install the libraries used for morphological analysis. Restart the kernel after installation.

In [1]:
!pip install -U ginza ja-ginza
!pip install -U awscli
!pip install -U boto3

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting ginza
  Using cached ginza-5.1.2-py3-none-any.whl (20 kB)
Collecting ja-ginza
  Using cached ja_ginza-5.1.2-py3-none-any.whl (59.1 MB)
Collecting spacy<3.5.0,>=3.2.0
  Using cached spacy-3.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
Collecting plac>=1.3.3
  Using cached plac-1.3.5-py2.py3-none-any.whl (22 kB)
Collecting SudachiPy<0.7.0,>=0.6.2
  Using cached SudachiPy-0.6.6-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.2 MB)
Collecting SudachiDict-core>=20210802
  Using cached SudachiDict-core-20221021.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4
  Using cached pydantic-1.10.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.6 MB)
Collecting pathy>=0.3.5
  Using cached pathy-0.10.0-py3-none-any.whl (48 kB)
Collecting srsly<3.0.0,>=
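After restarting the kernel, it can help to confirm that the packages are actually visible to the new interpreter. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, `ginza` and `ja-ginza` come from the pip commands above, and `spacy` is the dependency shown in the install log.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Print what the current kernel can see; None means "not installed here".
for name in ("ginza", "ja-ginza", "spacy"):
    print(name, installed_version(name))
```

A `None` for any of these after the restart suggests that pip installed into a different environment than the one the kernel uses.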

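## 1. Getting the data
- Obtain the data in CSV format (the sample used below is a tab-separated file).
- The data is assumed to have one text field, one date field, and several structured fields.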
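The field requirements above can be checked before any processing starts. This is a small sketch under stated assumptions: `missing_fields` is a hypothetical helper, and the inline sample stands in for the header row of the real file.

```python
import csv
import io


def missing_fields(header, timestamp_field, text_field, structured_fields):
    """Return the required field names that are absent from the header row."""
    required = [timestamp_field, text_field, *structured_fields]
    return [name for name in required if name not in header]


# Tiny tab-separated sample standing in for the downloaded file.
sample = (
    "review_date\treview_body\tproduct_id\tstar_rating\n"
    "2012-12-05\ttext\tB000001GBJ\t1\n"
)
header = next(csv.reader(io.StringIO(sample), delimiter="\t"))

print(missing_fields(header, "review_date", "review_body",
                     ["product_id", "product_category"]))
```

An empty list means the file has every field the later cells rely on.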

In [1]:
!pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.

In [2]:
import urllib.request
import os
import gzip
import shutil

download_url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz" 
dir_name = "data"
file_name = "amazon_review.tsv.gz"
tsv_file_name = "amazon_review.tsv"
file_path = os.path.join(dir_name,file_name)
tsv_file_path = os.path.join(dir_name,tsv_file_name)

os.makedirs(dir_name, exist_ok=True)

if os.path.exists(file_path):
    print("File {} already exists. Skipped download.".format(file_name))
else:
    urllib.request.urlretrieve(download_url, file_path)
    print("File downloaded: {}".format(file_path))
    
if os.path.exists(tsv_file_path):
    print("File {} already exists. Skipped unzip.".format(tsv_file_name))
else:
    with gzip.open(file_path, mode='rb') as fin:
        with open(tsv_file_path, 'wb') as fout:
            shutil.copyfileobj(fin, fout)
            print("File unzipped: {}".format(tsv_file_path))

File downloaded: data/amazon_review.tsv.gz
File unzipped: data/amazon_review.tsv


In [3]:
import pandas as pd

df = pd.read_csv(tsv_file_path, sep ='\t')
df.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,JP,65317,R33RSUD4ZTRKT7,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,1,15,N,Y,残念ながら…,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,2012-12-05
1,JP,65317,R2U1VB8GPZBBEH,B000YPWBQ2,904244932,鏡の中の鏡‾ペルト作品集(SACD)(Arvo Part:Spiegel im Spiegel),Music,1,4,20,N,Y,残念ながら…,残念ながら…趣味ではありませんでした。正直退屈…眠気も起きない…,2012-12-05
2,JP,65696,R1IBRCJPPGWVJW,B0002E5O9G,108978277,Les Miserables 10th Anniversary Concert,Music,5,2,3,N,Y,ドリームキャスト,素晴らしいパフォーマンス。ミュージカル映画版の物足りない歌唱とは違います。,2013-03-02
3,JP,67162,RL02CW5XLYONU,B00004SRJ5,606528497,It Takes a Nation of Millions to Hold Us Back,Music,5,6,9,N,Y,やっぱりマスト,専門的な事を言わずにお勧めレコメを書きたいのですが、文才が無いので無理でした。ヒップホップが...,2013-08-11
4,JP,67701,R2LA2SS3HU3A3L,B0093H8H8I,509738390,Intel CPU Core I3-3225 3.3GHz 3MBキャッシュ LGA1155...,PC,4,2,4,N,Y,コスパ的には十分,今までの環境（Core2 Duo E4600)に比べれば十分に快適になりました。<br />...,2013-02-10


Remove HTML tags from the review comments.

In [4]:
from bs4 import BeautifulSoup

def filterHtmlTag(txt):
    soup = BeautifulSoup(txt, 'html.parser')
    txt = soup.get_text(strip=True)
    
    return txt

In [5]:
df['review_body'] = df['review_body'].map(filterHtmlTag)
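To see what the tag removal does to a single value without loading the full dataset, here is a standard-library stand-in. It is not the BeautifulSoup call used in the notebook: `strip_tags` and `_TextCollector` are hypothetical names, the behavior only approximates `get_text(strip=True)`, and the guard for non-string input (pandas represents a missing review body as a float NaN) is an addition that the original cell does not have.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, trimming surrounding whitespace from each piece."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        piece = data.strip()
        if piece:
            self.parts.append(piece)


def strip_tags(value):
    """Return the text content of an HTML fragment; non-strings become ''."""
    if not isinstance(value, str):
        return ""
    parser = _TextCollector()
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)


print(strip_tags("快適になりました。<br />動画のエンコード"))
```

The `<br />` in the sample is dropped and the two text pieces are joined without a separator.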



## 2. Initial settings
Specify the following items.
- project_name: an arbitrary name
- timestamp_field: the date field
- structured_fields: the structured fields
- text_field: the text field

In [6]:
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

In [7]:
project_name = 'amazon_review'
timestamp_field = 'review_date'
structured_fields = ['product_id', 'product_parent','product_title', 'product_category', 'star_rating']
text_field = 'review_body'

In [8]:
num = pd.RangeIndex(start=0, stop=len(df.index), step=1)
df['id'] = num
all_fields = ['id'] + [timestamp_field] + [text_field] + structured_fields 

In [9]:
df1 = df.loc[:, all_fields]
df1.columns = ['id','ts', 'txt'] + list(map(lambda x: f'col{str(x)}',list(range(len(structured_fields)))))
df1.head()

Unnamed: 0,id,ts,txt,col0,col1,col2,col3,col4
0,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1
1,1,2012-12-05,残念ながら…趣味ではありませんでした。正直退屈…眠気も起きない…,B000YPWBQ2,904244932,鏡の中の鏡‾ペルト作品集(SACD)(Arvo Part:Spiegel im Spiegel),Music,1
2,2,2013-03-02,素晴らしいパフォーマンス。ミュージカル映画版の物足りない歌唱とは違います。,B0002E5O9G,108978277,Les Miserables 10th Anniversary Concert,Music,5
3,3,2013-08-11,専門的な事を言わずにお勧めレコメを書きたいのですが、文才が無いので無理でした。ヒップホップが...,B00004SRJ5,606528497,It Takes a Nation of Millions to Hold Us Back,Music,5
4,4,2013-02-10,今までの環境（Core2 Duo E4600)に比べれば十分に快適になりました。動画のエンコ...,B0093H8H8I,509738390,Intel CPU Core I3-3225 3.3GHz 3MBキャッシュ LGA1155...,PC,4
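The renaming above replaces the dataset-specific column names with generic ones (`ts`, `txt`, `col0`, `col1`, ...). The following sketch prints that correspondence so it is easy to look up later; `generic_column_map` is a hypothetical helper and the field names are the ones set in the cells above.

```python
timestamp_field = "review_date"
text_field = "review_body"
structured_fields = ["product_id", "product_parent", "product_title",
                     "product_category", "star_rating"]


def generic_column_map(timestamp_field, text_field, structured_fields):
    """Map each original field name to the generic name used downstream."""
    mapping = {timestamp_field: "ts", text_field: "txt"}
    for position, name in enumerate(structured_fields):
        mapping[name] = f"col{position}"
    return mapping


column_map = generic_column_map(timestamp_field, text_field, structured_fields)
for original, generic in column_map.items():
    print(f"{generic:>4} <- {original}")
```

Keeping the mapping as a dictionary also makes it possible to rename the columns back when inspecting the output files.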


In [10]:
# Remove local working directories left over from a previous run.
for work_dir in ('input/', 'output/', 'code/'):
    if os.path.isdir(work_dir):
        shutil.rmtree(work_dir)

!aws s3 rm s3://{bucket}/{project_name}/input --recursive
!aws s3 rm s3://{bucket}/{project_name}/output --recursive

delete: s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input.csv
delete: s3://sagemaker-ap-northeast-1-797821601610/amazon_review/output/output.csv


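## 3. Extracting words and dependencies
Run either 3.1 or 3.2 to extract words and dependency relations.
- 3.1 runs the processing on the running Notebook instance (for small datasets)
- 3.2 runs the processing with SageMaker Processing (for large datasets)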

### 3.1. Running on the Notebook instance

In [11]:
import spacy
nlp = spacy.load('ja_ginza')
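The extraction loop in this section keeps a token only when the first hyphen-separated component of `token.tag_` is a noun, verb, adjective, or adverb. The sketch below isolates that filter so it can be tried without loading the model. `coarse_pos` and `is_kept` are hypothetical helper names, and the sample tags are illustrative strings in the hyphenated style the loop expects, not output captured from this notebook.

```python
KEPT_POS = ("名詞", "動詞", "形容詞", "副詞")  # noun, verb, adjective, adverb


def coarse_pos(tag):
    """Return the top-level part of speech from a hyphen-separated tag."""
    return tag.split("-")[0]


def is_kept(tag):
    """True when the token would be written to the word table."""
    return coarse_pos(tag) in KEPT_POS


# Illustrative tags only (assumed format: major-category-subcategory...).
for tag in ("名詞-普通名詞-一般", "助詞-格助詞", "動詞-一般"):
    print(tag, "->", coarse_pos(tag), is_kept(tag))
```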

In [12]:
df2 = df1[:20000]

In [13]:
os.makedirs('input', exist_ok=True)
df2.to_csv(f'input/input.csv', index=False)
!aws s3 cp ./input s3://{bucket}/{project_name}/input --recursive

upload: input/input.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input.csv


In [14]:
%%time
pos_id = []
pos_token_no = []
pos_word = []
pos_pos = []

dep_id = []
dep_token_no_1 = []
dep_token_no_2 = []
dep_words_pair = []
dep_dep = []

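# Dependency labels to keep. The values are Japanese display names:
# adverbial modifier, adjectival modifier, nominal modifier, nominal subject.
deptypes = {'advmod':'副詞修飾子', 'amod':'形容詞修飾子', 'nmod': '名詞修飾子', 'nsubj':'主語名詞'}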

for index, row in df2.iterrows():
    doc = nlp(row['txt'])
    id = row['id']
    for sent in doc.sents:
        for token in sent:
            lemma = token.lemma_
            pos = token.tag_.split('-')[0]
            dep = token.dep_
            
            # Keep nouns, verbs, adjectives, and adverbs.
            if pos in ('名詞','動詞','形容詞','副詞'):
                pos_id += [id]
                pos_token_no += [token.i]
                pos_word += [lemma]
                pos_pos += [pos]

            if dep in deptypes.keys():
                dep_id += [id]
                dep_token_no_1 += [token.i]
                dep_token_no_2 += [token.head.i]
                dep_words_pair += [token.lemma_+' - '+token.head.lemma_]
                dep_dep += [dep]

df_pos = pd.DataFrame(
    data = {'id':pos_id, 'token_no':pos_token_no, 'word':pos_word, 'pos':pos_pos},
    columns= ['id','token_no', 'word', 'pos']
)  

df_dep = pd.DataFrame(
    data = {'id':dep_id, 'token_no_1':dep_token_no_1, 'token_no_2':dep_token_no_2, 'words_pair':dep_words_pair, 'dep':dep_dep},
    columns= ['id', 'token_no_1', 'token_no_2', 'words_pair', 'dep']
)   

CPU times: user 13min 46s, sys: 3.7 s, total: 13min 49s
Wall time: 13min 50s
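The measured wall time gives a rough way to decide between 3.1 and 3.2. The sketch below extrapolates from the 20,000-row run above to the 260,000 rows used in section 3.2. It assumes processing time scales linearly with the number of rows on the same instance, which is only an approximation because review lengths vary; `estimate_seconds` is a hypothetical helper.

```python
MEASURED_ROWS = 20000
MEASURED_SECONDS = 13 * 60 + 50  # wall time reported above: 13min 50s


def estimate_seconds(rows):
    """Linear extrapolation of the single-instance wall time."""
    return rows * MEASURED_SECONDS / MEASURED_ROWS


seconds = estimate_seconds(260000)
hours, remainder = divmod(int(seconds), 3600)
minutes, secs = divmod(remainder, 60)
print(f"about {hours}h {minutes}m {secs}s on a single notebook instance")
```

Under that assumption the larger sample would take roughly three hours sequentially, which is why splitting the input and running the parts as parallel jobs is attractive.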


#### Merging the data

In [15]:
df_dep2 = pd.melt(df_dep, id_vars=['id', 'words_pair','dep'], value_vars=['token_no_1','token_no_2'], value_name='token_no' )
df_pos_dep = pd.merge(df_pos, df_dep2, how='left', on=['id','token_no'])
df3 = pd.merge(df2, df_pos_dep, how='right', on=['id'])
df3.head()

Unnamed: 0,id,ts,txt,col0,col1,col2,col3,col4,token_no,word,pos,words_pair,dep,variable
0,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,3,趣味,名詞,,,
1,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,6,ある,動詞,,,
2,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,12,ケルト,名詞,,,
3,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,13,音楽,名詞,音楽 - 範疇,nmod,token_no_1
4,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,15,範疇,名詞,音楽 - 範疇,nmod,token_no_2
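The merge step is easier to follow on a toy example. `pd.melt` turns each dependency row (one row per word pair) into two rows, one per token of the pair, so that the pair can be joined to the word table on `token_no`. The data below is a hand-made miniature modeled on the output above, not taken from the real run.

```python
import pandas as pd

# One row per kept token.
df_pos = pd.DataFrame({
    "id": [0, 0, 0],
    "token_no": [12, 13, 15],
    "word": ["ケルト", "音楽", "範疇"],
    "pos": ["名詞", "名詞", "名詞"],
})

# One row per dependency pair (child token, head token).
df_dep = pd.DataFrame({
    "id": [0],
    "token_no_1": [13],
    "token_no_2": [15],
    "words_pair": ["音楽 - 範疇"],
    "dep": ["nmod"],
})

# Wide to long: the pair now appears once for each of its two tokens.
long_dep = pd.melt(
    df_dep,
    id_vars=["id", "words_pair", "dep"],
    value_vars=["token_no_1", "token_no_2"],
    value_name="token_no",
)

# Left join keeps every word; words outside any pair get NaN in the pair columns.
merged = pd.merge(df_pos, long_dep, how="left", on=["id", "token_no"])
print(merged)
```

The `variable` column produced by `melt` records whether a word was the first or second token of the pair, which is the column visible at the right of the output above.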


In [16]:
os.makedirs('output', exist_ok=True)
df3.to_csv('output/output.csv', index=False)
!aws s3 cp ./output s3://{bucket}/{project_name}/output --recursive

upload: output/output.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/output/output.csv


### 3.2. Running with SageMaker Processing

In [17]:
import os
import boto3
import datetime
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import FrameworkProcessor

region = sagemaker.Session().boto_region_name
role = get_execution_role()

est_cls = sagemaker.sklearn.estimator.SKLearn
framework_version_str = "0.20.0"
base_job_name='job' + datetime.datetime.now().strftime('%Y%m%d%H%M%S')

script_processor = FrameworkProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    estimator_cls=est_cls,
    framework_version=framework_version_str,
    code_location = f's3://{bucket}/{project_name}/code',
    base_job_name=base_job_name
)

In [18]:
df2 = df1[:260000]

In [19]:
k = 40000  # number of rows per CSV file
n = df2.shape[0]
dfs = [df2.loc[i:i+k-1, :] for i in range(0, n, k)]

os.makedirs('input', exist_ok=True)
for i,df_i in enumerate(dfs):
    df_i.to_csv(f'input/input_{i}.csv', index=False)
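The split above can be previewed without touching the DataFrame. This sketch lists the row range each file would receive, assuming the frame really contains 260,000 rows; `chunk_ranges` is a hypothetical helper that uses half-open ranges, whereas the cell above uses inclusive `.loc` labels.

```python
def chunk_ranges(n_rows, rows_per_file):
    """Return (start, stop) pairs, stop exclusive, covering n_rows."""
    return [
        (start, min(start + rows_per_file, n_rows))
        for start in range(0, n_rows, rows_per_file)
    ]


ranges = chunk_ranges(260000, 40000)
for i, (start, stop) in enumerate(ranges):
    print(f"input_{i}.csv: rows {start}..{stop - 1} ({stop - start} rows)")
```

With these numbers there are seven files, and the last one holds the remaining 20,000 rows. One processing job is started per file later in this section.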

In [20]:
!aws s3 cp ./input s3://{bucket}/{project_name}/input --recursive
!aws s3 rm s3://{bucket}/{project_name}/output --recursive

upload: input/input_0.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_0.csv
upload: input/input.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input.csv
upload: input/input_1.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_1.csv
upload: input/input_2.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_2.csv
upload: input/input_3.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_3.csv
upload: input/input_6.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_6.csv
upload: input/input_4.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_4.csv
upload: input/input_5.csv to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/input/input_5.csv
delete: s3://sagemaker-ap-northeast-1-797821601610/amazon_review/output/output.csv


In [21]:
os.makedirs('code', exist_ok=True)

In [22]:
%%writefile code/preprocessing.py
import os
import pandas as pd
import spacy
import argparse
nlp = spacy.load('ja_ginza')

parser = argparse.ArgumentParser()
parser.add_argument('--sequence_num') 
args = parser.parse_args()
sequence_num = args.sequence_num
input_data_path = f"/opt/ml/processing/input/input_{sequence_num}.csv"
df2 = pd.read_csv(input_data_path)

pos_id = []
pos_token_no = []
pos_word = []
pos_pos = []

dep_id = []
dep_token_no_1 = []
dep_token_no_2 = []
dep_words_pair = []
dep_dep = []

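# Dependency labels to keep. The values are Japanese display names:
# adverbial modifier, adjectival modifier, nominal modifier, nominal subject.
deptypes = {'advmod':'副詞修飾子', 'amod':'形容詞修飾子', 'nmod': '名詞修飾子', 'nsubj':'主語名詞'}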
text_field = 'txt'

for index, row in df2.iterrows():
    doc = nlp(str(row[text_field]))
    id = row['id']
    for sent in doc.sents:
        for token in sent:
            lemma = token.lemma_
            pos = token.tag_.split('-')[0]
            dep = token.dep_
            
            # Keep nouns, verbs, adjectives, and adverbs.
            if pos in ('名詞','動詞','形容詞','副詞'):
                pos_id += [id]
                pos_token_no += [token.i]
                pos_word += [lemma]
                pos_pos += [pos]

            if dep in deptypes.keys():
                dep_id += [id]
                dep_token_no_1 += [token.i]
                dep_token_no_2 += [token.head.i]
                dep_words_pair += [token.lemma_+' - '+token.head.lemma_]
                #dep_dep += [deptypes[dep]]
                dep_dep += [dep]

df_pos = pd.DataFrame(
    data = {'id':pos_id, 'token_no':pos_token_no, 'word':pos_word, 'pos':pos_pos},
    columns= ['id','token_no', 'word', 'pos']
)  

df_dep = pd.DataFrame(
    data = {'id':dep_id, 'token_no_1':dep_token_no_1, 'token_no_2':dep_token_no_2, 'words_pair':dep_words_pair, 'dep':dep_dep},
    columns= ['id', 'token_no_1', 'token_no_2', 'words_pair', 'dep']
)   

df_dep2 = pd.melt(df_dep, id_vars=['id', 'words_pair','dep'], value_vars=['token_no_1','token_no_2'], value_name='token_no' )
df_pos_dep = pd.merge(df_pos, df_dep2, how='left', on=['id','token_no'])
df2 = pd.merge(df2, df_pos_dep, how='right', on=['id'])
df2.to_csv(f'/opt/ml/processing/output/output_{sequence_num}.csv', index=False)

print("Completed running the processing job")

Writing code/preprocessing.py
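Each job learns which file to read from the `--sequence_num` argument. The sketch below reproduces only that argument handling so it can be exercised locally; `input_path_for` is a hypothetical wrapper, while the argument name and path pattern are the ones in the script above.

```python
import argparse


def input_path_for(argv):
    """Parse --sequence_num and return the input path the script would read."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--sequence_num")
    args = parser.parse_args(argv)
    return f"/opt/ml/processing/input/input_{args.sequence_num}.csv"


print(input_path_for(["--sequence_num", "3"]))
```

Because the argument is parsed as a string, the caller has to pass the sequence number as text in the `arguments` list.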


In [23]:
%%writefile code/requirements.txt 
ginza
ja-ginza

Writing code/requirements.txt


In [24]:
%%capture output
from sagemaker.processing import ProcessingInput, ProcessingOutput

for i in range(len(dfs)):
    script_processor.run(
        code="preprocessing.py",
        source_dir="code",
        inputs=[ProcessingInput(source=f's3://{bucket}/{project_name}/input/input_{i}.csv', destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=f's3://{bucket}/{project_name}/output/')],
        arguments=['--sequence_num', str(i)],
        wait=False
    )

#### Waiting for the jobs to finish
Run the following cell and wait until ProcessingJobStatus becomes Completed for every job.

In [25]:
sm = boto3.Session().client('sagemaker')
jobs = sm.list_processing_jobs(NameContains=base_job_name)
pd.DataFrame(jobs['ProcessingJobSummaries'])

Unnamed: 0,ProcessingJobName,ProcessingJobArn,CreationTime,ProcessingEndTime,LastModifiedTime,ProcessingJobStatus
0,job20221130105714-2022-11-30-10-57-27-940,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:28.968000+00:00,2022-11-30 11:29:36.669000+00:00,2022-11-30 11:29:37.040000+00:00,Completed
1,job20221130105714-2022-11-30-10-57-26-802,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:27.780000+00:00,2022-11-30 11:55:39.292000+00:00,2022-11-30 11:55:39.653000+00:00,Completed
2,job20221130105714-2022-11-30-10-57-25-541,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:26.634000+00:00,2022-11-30 11:51:51.534000+00:00,2022-11-30 11:51:52.082000+00:00,Completed
3,job20221130105714-2022-11-30-10-57-25-000,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:25.368000+00:00,2022-11-30 11:48:22.984000+00:00,2022-11-30 11:48:23.376000+00:00,Completed
4,job20221130105714-2022-11-30-10-57-23-829,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:24.890000+00:00,2022-11-30 11:51:01.282000+00:00,2022-11-30 11:51:01.816000+00:00,Completed
5,job20221130105714-2022-11-30-10-57-21-848,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:23.631000+00:00,2022-11-30 11:41:58.795000+00:00,2022-11-30 11:41:59.304000+00:00,Completed
6,job20221130105714-2022-11-30-10-57-21-103,arn:aws:sagemaker:ap-northeast-1:797821601610:...,2022-11-30 10:57:21.681000+00:00,2022-11-30 11:35:00.804000+00:00,2022-11-30 11:35:01.425000+00:00,Completed
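Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. This is a sketch, not part of the original notebook: `wait_for_jobs` is a hypothetical helper that takes any callable returning the job summaries, and the demonstration uses a fake sequence of responses so that it runs without AWS access. It also stops on `Failed` and `Stopped`, so the returned statuses should still be inspected.

```python
import time

# Statuses after which a processing job will not change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(list_summaries, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll until every job reaches a terminal status, then return the statuses."""
    for _ in range(max_polls):
        statuses = [s["ProcessingJobStatus"] for s in list_summaries()]
        if statuses and all(status in TERMINAL_STATUSES for status in statuses):
            return statuses
        sleep(poll_seconds)
    raise TimeoutError("processing jobs did not finish within the polling budget")


# Fake responses for a dry run: first poll still in progress, second poll done.
fake_responses = iter([
    [{"ProcessingJobStatus": "InProgress"}, {"ProcessingJobStatus": "Completed"}],
    [{"ProcessingJobStatus": "Completed"}, {"ProcessingJobStatus": "Completed"}],
])
final = wait_for_jobs(lambda: next(fake_responses), poll_seconds=0,
                      sleep=lambda seconds: None)
print(final)
```

In the notebook, the callable could be `lambda: sm.list_processing_jobs(NameContains=base_job_name)['ProcessingJobSummaries']`, using the client and job name prefix defined above.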


In [26]:
os.makedirs('output', exist_ok=True)
!aws s3 cp s3://{bucket}/{project_name}/output/output_0.csv ./output
df3 = pd.read_csv("output/output_0.csv")
df3.head()

download: s3://sagemaker-ap-northeast-1-797821601610/amazon_review/output/output_0.csv to output/output_0.csv


Unnamed: 0,id,ts,txt,col0,col1,col2,col3,col4,token_no,word,pos,words_pair,dep,variable
0,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,3,趣味,名詞,,,
1,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,6,ある,動詞,,,
2,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,12,ケルト,名詞,,,
3,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,13,音楽,名詞,音楽 - 範疇,nmod,token_no_1
4,0,2012-12-05,残念ながら…趣味ではありませんでした。ケルト音楽の範疇にも幅があるのですね…,B000001GBJ,957145596,SONGS FROM A SECRET GARDE,Music,1,15,範疇,名詞,音楽 - 範疇,nmod,token_no_2


## 4. Creating the QuickSight dataset

#### Signing up for Amazon QuickSight
- When you access QuickSight for the first time, sign up via "Sign up for QuickSight".
- When configuring access to resources, allow the S3 bucket "sagemaker-(region name)-(account ID)".

#### IAM permission settings

- To operate QuickSight from the SageMaker notebook, create the following policy (sagemaker-quicksight-policy) from "IAM" -> "Policies" -> "Create policy".

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "quicksight:CreateAnalysis",
                "quicksight:PassDataSet",
                "quicksight:CreateDataSet",
                "quicksight:PassDataSource",
                "quicksight:CreateDataSource",
                "quicksight:DescribeTemplate"
            ],
            "Resource": "*"
        }
    ]
}
```

- Attach the policy you created to the IAM role of the SageMaker notebook (AmazonSageMaker-ExecutionRole-xxxxx).

In [27]:
import json
import boto3
import uuid

quicksight = boto3.client('quicksight')
account_id = boto3.client('sts').get_caller_identity().get('Account')

In [28]:
manifest = {
    "fileLocations": [
        {
            "URIPrefixes": [
                f's3://{bucket}/{project_name}/output/'
            ]
        }
    ]
}

In [29]:
with open('manifest.json', 'w') as f:
    json.dump(manifest, f)
!aws s3 cp ./manifest.json s3://{bucket}/{project_name}/manifest/ 

upload: ./manifest.json to s3://sagemaker-ap-northeast-1-797821601610/amazon_review/manifest/manifest.json
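The manifest above relies on QuickSight's defaults for parsing the files. If the defaults ever need to be stated explicitly, the manifest format also accepts a `globalUploadSettings` section. The following is a hedged sketch of such a variant: `build_manifest` is a hypothetical helper, the bucket name is a placeholder, and the settings shown simply spell out comma-separated files with a header row, which is what `to_csv` wrote earlier.

```python
import json


def build_manifest(uri_prefix, contains_header=True):
    """Build an S3 manifest dict with explicit CSV parsing settings."""
    return {
        "fileLocations": [{"URIPrefixes": [uri_prefix]}],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ",",
            "textqualifier": "\"",
            "containsHeader": "true" if contains_header else "false",
        },
    }


# Placeholder bucket name; the notebook uses f's3://{bucket}/{project_name}/output/'.
manifest = build_manifest("s3://example-bucket/amazon_review/output/")
print(json.dumps(manifest, indent=2))
```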


In [30]:
response = quicksight.create_data_source(
    AwsAccountId=account_id,
    DataSourceId=str(uuid.uuid4()),
    Name=project_name,
    Type='S3',
    DataSourceParameters={
        'S3Parameters': {
            'ManifestFileLocation': {
                'Bucket': bucket,
                'Key': f'{project_name}/manifest/manifest.json'
            }
        }
    }
)

data_source_arn = response['Arn']
response

{'ResponseMetadata': {'RequestId': '7cbccc1e-3a73-4ab9-8ab4-036eeedd5671',
  'HTTPStatusCode': 202,
  'HTTPHeaders': {'date': 'Wed, 30 Nov 2022 12:27:06 GMT',
   'content-type': 'application/json',
   'content-length': '249',
   'connection': 'keep-alive',
   'x-amzn-requestid': '7cbccc1e-3a73-4ab9-8ab4-036eeedd5671'},
  'RetryAttempts': 0},
 'Status': 202,
 'Arn': 'arn:aws:quicksight:ap-northeast-1:797821601610:datasource/785e02b5-a706-4afa-bc94-bd007faa949a',
 'DataSourceId': '785e02b5-a706-4afa-bc94-bd007faa949a',
 'CreationStatus': 'CREATION_IN_PROGRESS',
 'RequestId': '7cbccc1e-3a73-4ab9-8ab4-036eeedd5671'}

Set quicksight_username, quicksight_region, and date_format in the following cell. You can check quicksight_username under Username in the menu at the top right of the QuickSight console.

In [31]:
quicksight_username = 'XXXX'
quicksight_region = 'ap-northeast-1'
date_format = 'yyyy-MM-dd'

response = quicksight.create_data_set(
    AwsAccountId=account_id,
    DataSetId=str(uuid.uuid4()),
    Name=project_name,
    PhysicalTableMap={
        'phsicalTable': {
            'S3Source': {
                'DataSourceArn': data_source_arn,
                'InputColumns': list(map(lambda x: {'Name': x, 'Type': 'STRING'}, df3.columns))
            }
        }
    },
    LogicalTableMap={
        'string': {
            'Alias': project_name,
            'DataTransforms': [
                {
                    'CastColumnTypeOperation': {
                        'ColumnName': 'ts',
                        'NewColumnType': 'DATETIME',
                        'Format': date_format
                    }
                },
                {
                    'CastColumnTypeOperation': {
                        'ColumnName': 'id',
                        'NewColumnType': 'INTEGER'
                    }
                },
            ],
            'Source': {
                'PhysicalTableId': 'phsicalTable'
            }
        }
    },
    ImportMode='SPICE',
    Permissions=[
        {
            'Principal': f'arn:aws:quicksight:{quicksight_region}:{account_id}:user/default/{quicksight_username}',
            'Actions': [
                'quicksight:PassDataSet',
                'quicksight:DescribeIngestion',
                'quicksight:CreateIngestion',
                'quicksight:UpdateDataSet',
                'quicksight:DeleteDataSet',
                'quicksight:DescribeDataSet',
                'quicksight:CancelIngestion',
                'quicksight:DescribeDataSetPermissions',
                'quicksight:ListIngestions',
                'quicksight:UpdateDataSetPermissions'
            ]
        },
    ]
)

data_set_arn = response['Arn']
response

{'ResponseMetadata': {'RequestId': '724dde7a-97a6-4c5a-b8a9-8302759c741f',
  'HTTPStatusCode': 201,
  'HTTPHeaders': {'date': 'Wed, 30 Nov 2022 12:27:32 GMT',
   'content-type': 'application/json',
   'content-length': '412',
   'connection': 'keep-alive',
   'x-amzn-requestid': '724dde7a-97a6-4c5a-b8a9-8302759c741f'},
  'RetryAttempts': 0},
 'Status': 201,
 'Arn': 'arn:aws:quicksight:ap-northeast-1:797821601610:dataset/7fa11e12-c798-478e-ad69-027060dbc633',
 'DataSetId': '7fa11e12-c798-478e-ad69-027060dbc633',
 'IngestionArn': 'arn:aws:quicksight:ap-northeast-1:797821601610:dataset/7fa11e12-c798-478e-ad69-027060dbc633/ingestion/f4fa27e7-df45-4442-9a95-e4bd2fb64eec',
 'IngestionId': 'f4fa27e7-df45-4442-9a95-e4bd2fb64eec',
 'RequestId': '724dde7a-97a6-4c5a-b8a9-8302759c741f'}

## 5. Creating the QuickSight analysis
- Create the analysis from a separately prepared analysis definition.

In [32]:
!wget -q https://raw.githubusercontent.com/aws-samples/aws-ml-jp/main/tasks/nlp/nlp_amazon_review/nlp_voc_dashboard/voc-analysis.json

In [33]:
source_str = ['$IDENTIFIER', '$DATASETARN', '$COL0', '$COL1', '$COL2', '$COL3', '$COL4']
target_str = [project_name] + [data_set_arn] + structured_fields

with open('voc-analysis.json') as f:
    voc_analysis_str = f.read()
    
for i in range(len(source_str)):
    voc_analysis_str = voc_analysis_str.replace(source_str[i],target_str[i])
    
voc_analysis_dict = json.loads(voc_analysis_str)
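The substitution above fills seven placeholders, so it expects exactly five structured fields; with fewer, `target_str[i]` raises an `IndexError`. The sketch below shows the same idea with an explicit placeholder-to-value mapping and a check for placeholders that were left unfilled. The template string is a made-up miniature, not the structure of `voc-analysis.json`, and `fill_placeholders` and the ARN value are hypothetical.

```python
import json


def fill_placeholders(template, replacements):
    """Replace each placeholder token in the template with its value."""
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


# Made-up template for illustration only.
template = '{"Id": "$IDENTIFIER", "DataSetArn": "$DATASETARN", "Fields": ["$COL0", "$COL4"]}'
replacements = {
    "$IDENTIFIER": "amazon_review",
    "$DATASETARN": "arn:aws:quicksight:ap-northeast-1:123456789012:dataset/example",
    "$COL0": "product_id",
    "$COL4": "star_rating",
}

filled_text = fill_placeholders(template, replacements)
unfilled = [p for p in replacements if p in filled_text]
filled = json.loads(filled_text)
print(filled["Fields"], "unfilled:", unfilled)
```

Parsing the result with `json.loads` right after substitution, as the notebook does, catches values that would break the JSON syntax.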

In [34]:
response = quicksight.create_analysis(
    AwsAccountId=account_id,
    AnalysisId=str(uuid.uuid4()),
    Name=project_name,
    Permissions=[
        {
            'Principal': f'arn:aws:quicksight:{quicksight_region}:{account_id}:user/default/{quicksight_username}',
            'Actions': [
                'quicksight:QueryAnalysis',
                'quicksight:DescribeAnalysis',
                'quicksight:UpdateAnalysis',
                'quicksight:DeleteAnalysis',
                'quicksight:RestoreAnalysis',
                'quicksight:DescribeAnalysisPermissions',
                'quicksight:UpdateAnalysisPermissions'
            ]
        },
    ],
    Definition = voc_analysis_dict["Definition"]
)
response

{'ResponseMetadata': {'RequestId': '421227ae-e882-4a0f-89d8-96192c36b992',
  'HTTPStatusCode': 202,
  'HTTPHeaders': {'date': 'Wed, 30 Nov 2022 12:27:55 GMT',
   'content-type': 'application/json',
   'content-length': '245',
   'connection': 'keep-alive',
   'x-amzn-requestid': '421227ae-e882-4a0f-89d8-96192c36b992'},
  'RetryAttempts': 0},
 'Status': 202,
 'Arn': 'arn:aws:quicksight:ap-northeast-1:797821601610:analysis/947e497c-2733-48b4-83c2-9689ad23210e',
 'AnalysisId': '947e497c-2733-48b4-83c2-9689ad23210e',
 'CreationStatus': 'CREATION_IN_PROGRESS',
 'RequestId': '421227ae-e882-4a0f-89d8-96192c36b992'}