{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Retail Demo Store - Personalization Workshop - Lab 2\n",
    "\n",
    "In this lab we are going to build on the [prior lab](./Lab-1-Introduction-and-data-preparation.ipynb) by preparing an Amazon Personalize dataset group and importing our three datasets.\n",
    "\n",
    "## Lab 2 Objectives\n",
    "\n",
    "In this lab we will accomplish the following steps. This lab should take about 25 minutes to complete.\n",
    "\n",
    "- Create schema resources in Amazon Personalize that define the layout of our three dataset files (CSVs) created in the prior lab\n",
    "- Create a dataset group in Amazon Personalize that will be used to receive our datasets\n",
    "- Create a dataset in the Personalize dataset group for the three dataset types and schemas\n",
    "    - Items: information about the products in the Retail Demo Store\n",
    "    - Users: information about the users in the Retail Deme Store\n",
    "    - Interactions: user-item interactions representing typical storefront behavior such as viewing products, adding products to a shopping cart, purchasing products, and so on\n",
    "- Create dataset import jobs to import each of the three datasets into Personalize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Just as in the first lab, we have to prepare our environment by importing dependencies and creating clients.\n",
    "\n",
    "### Import dependencies\n",
    "\n",
    "The following libraries are needed for this lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import json\n",
    "import uuid\n",
    "import time\n",
    "from botocore.exceptions import ClientError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create clients\n",
    "\n",
    "We will need the following AWS service clients in this lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "personalize = boto3.client('personalize')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load variables saved in Lab 1\n",
    "\n",
    "At the end of Lab 1 we saved some variables that we'll need in this lab. The following cell will load those variables into this lab environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure Amazon Personalize\n",
    "\n",
    "Now that we've prepared our three datasets and uploaded them to S3 we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations.\n",
    "\n",
    "Note: if you deployed the Retail Demo Store with the \"auto create Personalize resources\" flag set to \"Yes\", the following steps have already been automatically completed for you."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Schemas for Datasets\n",
    "\n",
    "Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.\n",
    "\n",
    "Let's define and create schemas in Personalize for our datasets.\n",
    "\n",
    "Note that categorical fields include an additional attribute of `\"categorical\": true` and the textual field has an additional attribute of `\"textual\": true`. Categorical fields are those where one or more values can be specified for the field value (i.e. enumerated values). For example, one or more category names/codes for the `CATEGORY_L1` field. A textual field indicates that Personalize should apply a natural language processing (NLP) model to the field's value to extract model features from unstructured text. In this case, we're using the product description as the textual field. You can only have one textual field in the items dataset. Finally, you will notice that the `PROMOTED` field does _not_ have `categorical` or `textual` specified. In this case, the `PROMOTED` column will not be included as a feature in the model but can be used for filtering. We'll see how it is used in the lab on inference.\n",
    "\n",
    "Another detail to note is that when we call the [CreateSchema](https://docs.aws.amazon.com/personalize/latest/dg/API_CreateSchema.html) API, we pass an optional `domain` parameter with a value of `ECOMMERCE`. This tells Personalize that we are creating a schema for Retail/E-commerce domain. We will do this for all three schemas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Items Datsaset Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "items_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Items\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"PRICE\",\n",
    "            \"type\": \"float\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"CATEGORY_L1\",\n",
    "            \"type\": \"string\",\n",
    "            \"categorical\": True,\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"CATEGORY_L2\",\n",
    "            \"type\": \"string\",\n",
    "            \"categorical\": True,\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"PRODUCT_DESCRIPTION\",\n",
    "            \"type\": \"string\",\n",
    "            \"textual\": True\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"GENDER\",\n",
    "            \"type\": \"string\",\n",
    "            \"categorical\": True,\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"PROMOTED\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "try:\n",
    "    create_schema_response = personalize.create_schema(\n",
    "        name = \"retaildemostore-products-items\",\n",
    "        domain = 'ECOMMERCE',\n",
    "        schema = json.dumps(items_schema)\n",
    "    )\n",
    "    items_schema_arn = create_schema_response['schemaArn']\n",
    "    print(json.dumps(create_schema_response, indent=2))\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this schema, seemingly')\n",
    "    paginator = personalize.get_paginator('list_schemas')\n",
    "    for paginate_result in paginator.paginate():\n",
    "        for schema in paginate_result['schemas']:\n",
    "            if schema['name'] == 'retaildemostore-products-items':\n",
    "                items_schema_arn = schema['schemaArn']\n",
    "                print(f\"Using existing schema: {items_schema_arn}\")\n",
    "                break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Users Dataset Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Users\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"USER_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"AGE\",\n",
    "            \"type\": \"int\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"GENDER\",\n",
    "            \"type\": \"string\",\n",
    "            \"categorical\": True,\n",
    "        }\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "try:\n",
    "    create_schema_response = personalize.create_schema(\n",
    "        name = \"retaildemostore-products-users\",\n",
    "        domain = \"ECOMMERCE\",\n",
    "        schema = json.dumps(users_schema)\n",
    "    )\n",
    "    print(json.dumps(create_schema_response, indent=2))\n",
    "    users_schema_arn = create_schema_response['schemaArn']\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this schema, seemingly')\n",
    "    paginator = personalize.get_paginator('list_schemas')\n",
    "    for paginate_result in paginator.paginate():\n",
    "        for schema in paginate_result['schemas']:\n",
    "            if schema['name'] == 'retaildemostore-products-users':\n",
    "                users_schema_arn = schema['schemaArn']\n",
    "                print(f\"Using existing schema: {users_schema_arn}\")\n",
    "                break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Interactions Dataset Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Interactions\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"USER_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"EVENT_TYPE\",  # \"View\", \"Purchase\", etc.\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"TIMESTAMP\",\n",
    "            \"type\": \"long\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"DISCOUNT\",  # This is the contextual metadata - \"Yes\" or \"No\".\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "try:\n",
    "    create_schema_response = personalize.create_schema(\n",
    "        name = \"retaildemostore-products-interactions\",\n",
    "        domain = \"ECOMMERCE\",\n",
    "        schema = json.dumps(interactions_schema)\n",
    "    )\n",
    "    print(json.dumps(create_schema_response, indent=2))\n",
    "    interactions_schema_arn = create_schema_response['schemaArn']\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this schema, seemingly')\n",
    "    paginator = personalize.get_paginator('list_schemas')\n",
    "    for paginate_result in paginator.paginate():\n",
    "        for schema in paginate_result['schemas']:\n",
    "            if schema['name'] == 'retaildemostore-products-interactions':\n",
    "                interactions_schema_arn = schema['schemaArn']\n",
    "                print(f\"Using existing schema: {interactions_schema_arn}\")\n",
    "                break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create and Wait for Dataset Group\n",
    "\n",
    "Next we need to create the dataset group that will contain our three datasets. This is one of many Personalize operations that are asynchronous. That is, we call an API to create a resource and have to wait for it to become active."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Dataset Group\n",
    "\n",
    "Note that we are also passing `ECOMMERCE` for the `domain` parameter here too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    create_dataset_group_response = personalize.create_dataset_group(\n",
    "        name = 'retaildemostore-products',\n",
    "        domain = 'ECOMMERCE'\n",
    "    )\n",
    "    dataset_group_arn = create_dataset_group_response['datasetGroupArn']\n",
    "    print(json.dumps(create_dataset_group_response, indent=2))\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this dataset group, seemingly')\n",
    "    paginator = personalize.get_paginator('list_dataset_groups')\n",
    "    for paginate_result in paginator.paginate():\n",
    "        for dataset_group in paginate_result['datasetGroups']:\n",
    "            if dataset_group['name'] == 'retaildemostore-products':\n",
    "                dataset_group_arn = dataset_group['datasetGroupArn']\n",
    "                break\n",
    "                \n",
    "print(f'DatasetGroupArn = {dataset_group_arn}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Wait for Dataset Group to Have ACTIVE Status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "status = None\n",
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    describe_dataset_group_response = personalize.describe_dataset_group(\n",
    "        datasetGroupArn = dataset_group_arn\n",
    "    )\n",
    "    status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n",
    "    print(\"DatasetGroup: {}\".format(status))\n",
    "    \n",
    "    if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n",
    "        break\n",
    "        \n",
    "    time.sleep(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Items Dataset\n",
    "\n",
    "Next we will create the datasets in Personalize for our three dataset types. Let's start with the items dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    dataset_type = \"ITEMS\"\n",
    "    create_dataset_response = personalize.create_dataset(\n",
    "        name = \"retaildemostore-products-items\",\n",
    "        datasetType = dataset_type,\n",
    "        datasetGroupArn = dataset_group_arn,\n",
    "        schemaArn = items_schema_arn\n",
    "    )\n",
    "\n",
    "    items_dataset_arn = create_dataset_response['datasetArn']\n",
    "    print(json.dumps(create_dataset_response, indent=2))\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this dataset, seemingly')\n",
    "    paginator = personalize.get_paginator('list_datasets')\n",
    "    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):\n",
    "        for dataset in paginate_result['datasets']:\n",
    "            if dataset['name'] == 'retaildemostore-products-items':\n",
    "                items_dataset_arn = dataset['datasetArn']\n",
    "                break\n",
    "                \n",
    "print(f'Items dataset ARN = {items_dataset_arn}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Users Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    dataset_type = \"USERS\"\n",
    "    create_dataset_response = personalize.create_dataset(\n",
    "        name = \"retaildemostore-products-users\",\n",
    "        datasetType = dataset_type,\n",
    "        datasetGroupArn = dataset_group_arn,\n",
    "        schemaArn = users_schema_arn\n",
    "    )\n",
    "\n",
    "    users_dataset_arn = create_dataset_response['datasetArn']\n",
    "    print(json.dumps(create_dataset_response, indent=2))\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this dataset, seemingly')\n",
    "    paginator = personalize.get_paginator('list_datasets')\n",
    "    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):\n",
    "        for dataset in paginate_result['datasets']:\n",
    "            if dataset['name'] == 'retaildemostore-products-users':\n",
    "                users_dataset_arn = dataset['datasetArn']\n",
    "                break\n",
    "                \n",
    "print(f'Users dataset ARN = {users_dataset_arn}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Interactions Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    dataset_type = \"INTERACTIONS\"\n",
    "    create_dataset_response = personalize.create_dataset(\n",
    "        name = \"retaildemostore-products-interactions\",\n",
    "        datasetType = dataset_type,\n",
    "        datasetGroupArn = dataset_group_arn,\n",
    "        schemaArn = interactions_schema_arn\n",
    "    )\n",
    "\n",
    "    interactions_dataset_arn = create_dataset_response['datasetArn']\n",
    "    print(json.dumps(create_dataset_response, indent=2))\n",
    "except personalize.exceptions.ResourceAlreadyExistsException:\n",
    "    print('You aready created this dataset, seemingly')\n",
    "    paginator = personalize.get_paginator('list_datasets')\n",
    "    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):\n",
    "        for dataset in paginate_result['datasets']:\n",
    "            if dataset['name'] == 'retaildemostore-products-interactions':\n",
    "                interactions_dataset_arn = dataset['datasetArn']\n",
    "                break\n",
    "                \n",
    "print(f'Interactions dataset ARN = {interactions_dataset_arn}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wait for datasets to become active\n",
    "\n",
    "It can take a minute or two for the datasets to be created. Let's wait for all three to become active."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "dataset_arns = [ items_dataset_arn, users_dataset_arn, interactions_dataset_arn ]\n",
    "\n",
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    for dataset_arn in reversed(dataset_arns):\n",
    "        response = personalize.describe_dataset(\n",
    "            datasetArn = dataset_arn\n",
    "        )\n",
    "        status = response[\"dataset\"][\"status\"]\n",
    "\n",
    "        if status == \"ACTIVE\":\n",
    "            print(f'Dataset {dataset_arn} successfully completed')\n",
    "            dataset_arns.remove(dataset_arn)\n",
    "        elif status == \"CREATE FAILED\":\n",
    "            print(f'Dataset {dataset_arn} failed')\n",
    "            if response['dataset'].get('failureReason'):\n",
    "                print('   Reason: ' + response['dataset']['failureReason'])\n",
    "            dataset_arns.remove(dataset_arn)\n",
    "\n",
    "    if len(dataset_arns) > 0:\n",
    "        print('At least one dataset is still in progress')\n",
    "        time.sleep(15)\n",
    "    else:\n",
    "        print(\"All datasets have completed\")\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Datasets to Personalize\n",
    "\n",
    "In [Lab 1](./Lab-1-Introduction-and-data-preparation.ipynb) we generated CSVs containing bulk data for our users, items, and interactions and staged them in an S3 bucket. So far in this Lab we have created schemas in Personalize that define the columns in our CSVs. Then we created a datset group and three datasets in Personalize that will receive our data. In the following steps we will create import jobs with Personalize that will import the datasets from our S3 bucket into the service."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspect permissions\n",
    "\n",
    "By default, the Personalize service does not have permission to acccess the data we uploaded into the S3 bucket in our account. In order to grant access to the Personalize service to read our CSVs, we need to set a Bucket Policy and create an IAM role that the Amazon Personalize service will assume.\n",
    "\n",
    "The deployment process for the Retail Demo Store has already setup these resources for you. However, let's take a look at the bucket policy and IAM role to see the required permissions.\n",
    "\n",
    "We'll start by displaying the bucket policy in the S3 staging bucket where we uploaded the CSVs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client(\"s3\")\n",
    "\n",
    "response = s3.get_bucket_policy(Bucket = bucket)\n",
    "print(json.dumps(json.loads(response['Policy']), indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the bucket policy above you will see that there are two policy statements. The first statement denies all S3 actions that are not made using secure transport. This statement is not required for Personalize access since it already uses secure transport but it is included as a security best practice. The second statement is where access is granted to Personalize. Note the service principal of `personalize.amazonaws.com` and the actions allowed on the staging bucket. The `s3:GetObject` is needed for import jobs to allow Personalize to read objects from the bucket and the `s3:PutObject` is used for export jobs, batch inference jobs, and batch segment jobs to allow Personalize to write output files to the bucket. The `s3:ListBucket` action allows Personalize to list the contents of a folder.\n",
    "\n",
    "Next, let's look at the IAM role that Personalize will need to assume to access the S3 bucket. Again, this role was created for you during the Retail Demo Store deployment. We'll start by inspecting the role itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iam = boto3.client(\"iam\")\n",
    "\n",
    "role_name = Uid+\"-PersonalizeS3\"\n",
    "\n",
    "response = iam.get_role(RoleName = role_name)\n",
    "role_arn = response['Role']['Arn']\n",
    "print(json.dumps(response['Role'], indent=2, default = str))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the role has the same service principal as the bucket policy but this time with the `sts:AssumeRole` action. This is required so that Personalize can assume this role.\n",
    "\n",
    "Finally, we'll get the inline policy named `BucketAccess` that has the same S3 permissions as the bucket policy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = iam.get_role_policy(RoleName = role_name, PolicyName = 'BucketAccess')\n",
    "print(json.dumps(response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Import Jobs\n",
    "\n",
    "With the permissions in place to allow Personalize to access our CSV files, let's create three import jobs to import each file into its respective dataset. Each import job can take several minutes to complete so we'll create all three import jobs and then wait for them all to complete. This allows them to import in parallel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Items Dataset Import Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import_job_suffix = str(uuid.uuid4())[:8]\n",
    "\n",
    "items_create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"retaildemostore-products-items-\" + import_job_suffix,\n",
    "    datasetArn = items_dataset_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket, items_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "items_dataset_import_job_arn = items_create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(items_create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Users Dataset Import Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users_create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"retaildemostore-products-users-\" + import_job_suffix,\n",
    "    datasetArn = users_dataset_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket, users_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "users_dataset_import_job_arn = users_create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(users_create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create Interactions Dataset Import Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"retaildemostore-products-interactions-\" + import_job_suffix,\n",
    "    datasetArn = interactions_dataset_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket, interactions_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "interactions_dataset_import_job_arn = interactions_create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(interactions_create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wait for Import Jobs to Complete\n",
    "\n",
    "It will take 10-15 minutes for the import jobs to complete, while you're waiting you can learn more about Datasets and Schemas here: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html\n",
    "\n",
    "We will wait for all three jobs to finish."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Wait for Items Import Job to Complete"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "import_job_arns = [ items_dataset_import_job_arn, users_dataset_import_job_arn, interactions_dataset_import_job_arn ]\n",
    "\n",
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    for job_arn in reversed(import_job_arns):\n",
    "        import_job_response = personalize.describe_dataset_import_job(\n",
    "            datasetImportJobArn = job_arn\n",
    "        )\n",
    "        status = import_job_response[\"datasetImportJob\"]['status']\n",
    "\n",
    "        if status == \"ACTIVE\":\n",
    "            print(f'Import job {job_arn} successfully completed')\n",
    "            import_job_arns.remove(job_arn)\n",
    "        elif status == \"CREATE FAILED\":\n",
    "            print(f'Import job {job_arn} failed')\n",
    "            if import_job_response[\"datasetImportJob\"].get('failureReason'):\n",
    "                print('   Reason: ' + import_job_response[\"datasetImportJob\"]['failureReason'])\n",
    "            import_job_arns.remove(job_arn)\n",
    "\n",
    "    if len(import_job_arns) > 0:\n",
    "        print('At least one dataset import job still in progress')\n",
    "        time.sleep(60)\n",
    "    else:\n",
    "        print(\"All import jobs have ended\")\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lab 2 Summary - What have we accomplished?\n",
    "\n",
    "In this lab we created schemas in Amazon Personalize that mapped to the dataset CSVs we created in Lab 1. We also created a dataset group in Personalize as well as datasets to receive our CSVs. Since Personalize needs access to the staging S3 bucket where the CSVs were uploaded, we inspected the S3 bucket policy and IAM role that needs to be passed to Personalize. Finally, we create dataset import jobs in Personalize to upload the three datasets into Personalize.\n",
    "\n",
    "In the next lab we will create the recommenders and custom solutions and solution versions for our personalization use cases. This is where the machine learning models are trained and deployed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Store variables needed in the next lab\n",
    "\n",
    "We will pass some variables initialized in this lab by storing them in the notebook environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store dataset_group_arn\n",
    "%store role_arn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Continue to [Lab 3](./Lab-3-Create-recommenders-and-custom-solutions.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}