{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[](https://github.com/aws/aws-sdk-pandas)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 31 - OpenSearch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"* [1. Initialize](#initialize)\n",
" * [Connect to your Amazon OpenSearch domain](#connect)\n",
" * [Enter your bucket name](#bucket)\n",
" * [Initialize sample data](#sample-data)\n",
"* [2. Indexing (load)](#indexing)\n",
"\t* [Index documents (no Pandas)](#index-documents)\n",
"\t* [Index json file](#index-json)\n",
" * [Index CSV](#index-csv)\n",
"* [3. Search](#search)\n",
"\t* [3.1 Search by DSL](#search-dsl)\n",
"\t* [3.2 Search by SQL](#search-sql)\n",
"* [4. Delete Indices](#delete-index)\n",
"* [5. Bonus - Prepare data and index from DataFrame](#bonus)\n",
"\t* [Prepare the data for indexing](#prepare-data)\n",
" * [Create index with mapping](#create-index-w-mapping)\n",
" * [Index dataframe](#index-df)\n",
" * [Execute geo query](#search-geo)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Initialize"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install the optional modules first\n",
"!pip install 'awswrangler[opensearch]'"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import awswrangler as wr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to your Amazon OpenSearch domain"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"client = wr.opensearch.connect(\n",
" host='OPENSEARCH-ENDPOINT',\n",
"# username='FGAC-USERNAME(OPTIONAL)',\n",
"# password='FGAC-PASSWORD(OPTIONAL)'\n",
")\n",
"client.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enter your bucket name"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"bucket = 'BUCKET'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize sample data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"sf_restaurants_inspections = [\n",
" {\n",
" \"inspection_id\": \"24936_20160609\",\n",
" \"business_address\": \"315 California St\",\n",
" \"business_city\": \"San Francisco\",\n",
" \"business_id\": \"24936\",\n",
" \"business_location\": {\"lon\": -122.400152, \"lat\": 37.793199},\n",
" \"business_name\": \"San Francisco Soup Company\",\n",
" \"business_postal_code\": \"94104\",\n",
" \"business_state\": \"CA\",\n",
" \"inspection_date\": \"2016-06-09T00:00:00.000\",\n",
" \"inspection_score\": 77,\n",
" \"inspection_type\": \"Routine - Unscheduled\",\n",
" \"risk_category\": \"Low Risk\",\n",
" \"violation_description\": \"Improper food labeling or menu misrepresentation\",\n",
" \"violation_id\": \"24936_20160609_103141\",\n",
" },\n",
" {\n",
" \"inspection_id\": \"60354_20161123\",\n",
" \"business_address\": \"10 Mason St\",\n",
" \"business_city\": \"San Francisco\",\n",
" \"business_id\": \"60354\",\n",
" \"business_location\": {\"lon\": -122.409061, \"lat\": 37.783527},\n",
" \"business_name\": \"Soup Unlimited\",\n",
" \"business_postal_code\": \"94102\",\n",
" \"business_state\": \"CA\",\n",
" \"inspection_date\": \"2016-11-23T00:00:00.000\",\n",
" \"inspection_type\": \"Routine\",\n",
" \"inspection_score\": 95,\n",
" },\n",
" {\n",
" \"inspection_id\": \"1797_20160705\",\n",
" \"business_address\": \"2872 24th St\",\n",
" \"business_city\": \"San Francisco\",\n",
" \"business_id\": \"1797\",\n",
" \"business_location\": {\"lon\": -122.409752, \"lat\": 37.752807},\n",
" \"business_name\": \"TIO CHILOS GRILL\",\n",
" \"business_postal_code\": \"94110\",\n",
" \"business_state\": \"CA\",\n",
" \"inspection_date\": \"2016-07-05T00:00:00.000\",\n",
" \"inspection_score\": 90,\n",
" \"inspection_type\": \"Routine - Unscheduled\",\n",
" \"risk_category\": \"Low Risk\",\n",
" \"violation_description\": \"Unclean nonfood contact surfaces\",\n",
" \"violation_id\": \"1797_20160705_103142\",\n",
" },\n",
" {\n",
" \"inspection_id\": \"66198_20160527\",\n",
" \"business_address\": \"1661 Tennessee St Suite 3B\",\n",
" \"business_city\": \"San Francisco Whard Restaurant\",\n",
" \"business_id\": \"66198\",\n",
" \"business_location\": {\"lon\": -122.388478, \"lat\": 37.75072},\n",
" \"business_name\": \"San Francisco Restaurant\",\n",
" \"business_postal_code\": \"94107\",\n",
" \"business_state\": \"CA\",\n",
" \"inspection_date\": \"2016-05-27T00:00:00.000\",\n",
" \"inspection_type\": \"Routine\",\n",
" \"inspection_score\": 56,\n",
" },\n",
" {\n",
" \"inspection_id\": \"5794_20160907\",\n",
" \"business_address\": \"2162 24th Ave\",\n",
" \"business_city\": \"San Francisco\",\n",
" \"business_id\": \"5794\",\n",
" \"business_location\": {\"lon\": -122.481299, \"lat\": 37.747228},\n",
" \"business_name\": \"Soup House\",\n",
" \"business_phone_number\": \"+14155752700\",\n",
" \"business_postal_code\": \"94116\",\n",
" \"business_state\": \"CA\",\n",
" \"inspection_date\": \"2016-09-07T00:00:00.000\",\n",
" \"inspection_score\": 96,\n",
" \"inspection_type\": \"Routine - Unscheduled\",\n",
" \"risk_category\": \"Low Risk\",\n",
" \"violation_description\": \"Unapproved or unmaintained equipment or utensils\",\n",
" \"violation_id\": \"5794_20160907_103144\",\n",
" },\n",
" \n",
" # duplicate record\n",
" {\n",
" \"inspection_id\": \"5794_20160907\",\n",
" \"business_address\": \"2162 24th Ave\",\n",
" \"business_city\": \"San Francisco\",\n",
" \"business_id\": \"5794\",\n",
" \"business_location\": {\"lon\": -122.481299, \"lat\": 37.747228},\n",
" \"business_name\": \"Soup-or-Salad\",\n",
" \"business_phone_number\": \"+14155752700\",\n",
" \"business_postal_code\": \"94116\",\n",
" \"business_state\": \"CA\",\n",
" \"inspection_date\": \"2016-09-07T00:00:00.000\",\n",
" \"inspection_score\": 96,\n",
" \"inspection_type\": \"Routine - Unscheduled\",\n",
" \"risk_category\": \"Low Risk\",\n",
" \"violation_description\": \"Unapproved or unmaintained equipment or utensils\",\n",
" \"violation_id\": \"5794_20160907_103144\",\n",
" },\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Indexing (load)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Index documents (no Pandas)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Indexing: 100% (6/6)|####################################|Elapsed Time: 0:00:01"
]
},
{
"data": {
"text/plain": [
"{'success': 6, 'errors': []}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index documents w/o providing keys (_id is auto-generated)\n",
"wr.opensearch.index_documents(\n",
" client,\n",
" documents=sf_restaurants_inspections,\n",
" index=\"sf_restaurants_inspections\" \n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" _id | \n",
" business_name | \n",
" inspection_id | \n",
" business_location.lon | \n",
" business_location.lat | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 663dd72d-0da4-495b-b0ae-ed000105ae73 | \n",
" TIO CHILOS GRILL | \n",
" 1797_20160705 | \n",
" -122.409752 | \n",
" 37.752807 | \n",
"
\n",
" \n",
" 1 | \n",
" ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 | \n",
" Soup House | \n",
" 5794_20160907 | \n",
" -122.481299 | \n",
" 37.747228 | \n",
"
\n",
" \n",
" 2 | \n",
" b9e8f6a2-8fd1-4660-b041-2997a1a80984 | \n",
" San Francisco Soup Company | \n",
" 24936_20160609 | \n",
" -122.400152 | \n",
" 37.793199 | \n",
"
\n",
" \n",
" 3 | \n",
" 56b352e6-102b-4eff-8296-7e1fb2459bab | \n",
" Soup Unlimited | \n",
" 60354_20161123 | \n",
" -122.409061 | \n",
" 37.783527 | \n",
"
\n",
" \n",
" 4 | \n",
" 6fec5411-f79a-48e4-be7b-e0e44d5ebbab | \n",
" San Francisco Restaurant | \n",
" 66198_20160527 | \n",
" -122.388478 | \n",
" 37.750720 | \n",
"
\n",
" \n",
" 5 | \n",
" 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c | \n",
" Soup-or-Salad | \n",
" 5794_20160907 | \n",
" -122.481299 | \n",
" 37.747228 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" _id business_name \\\n",
"0 663dd72d-0da4-495b-b0ae-ed000105ae73 TIO CHILOS GRILL \n",
"1 ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 Soup House \n",
"2 b9e8f6a2-8fd1-4660-b041-2997a1a80984 San Francisco Soup Company \n",
"3 56b352e6-102b-4eff-8296-7e1fb2459bab Soup Unlimited \n",
"4 6fec5411-f79a-48e4-be7b-e0e44d5ebbab San Francisco Restaurant \n",
"5 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c Soup-or-Salad \n",
"\n",
" inspection_id business_location.lon business_location.lat \n",
"0 1797_20160705 -122.409752 37.752807 \n",
"1 5794_20160907 -122.481299 37.747228 \n",
"2 24936_20160609 -122.400152 37.793199 \n",
"3 60354_20161123 -122.409061 37.783527 \n",
"4 66198_20160527 -122.388478 37.750720 \n",
"5 5794_20160907 -122.481299 37.747228 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read all documents. There are total 6 documents\n",
"wr.opensearch.search(\n",
" client,\n",
" index=\"sf_restaurants_inspections\",\n",
" _source=[\"inspection_id\", \"business_name\", \"business_location\"]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Index json file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.DataFrame(sf_restaurants_inspections)\n",
"path = f\"s3://{bucket}/json/sf_restaurants_inspections.json\"\n",
"wr.s3.to_json(df, path,orient='records',lines=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Indexing: 100% (6/6)|####################################|Elapsed Time: 0:00:00"
]
},
{
"data": {
"text/plain": [
"{'success': 6, 'errors': []}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index json w/ providing keys\n",
"wr.opensearch.index_json(\n",
" client,\n",
" path=path, # path can be s3 or local\n",
" index=\"sf_restaurants_inspections_dedup\",\n",
" id_keys=[\"inspection_id\"] # can be multiple fields. arg applicable to all index_* functions\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" _id | \n",
" business_name | \n",
" inspection_id | \n",
" business_location.lon | \n",
" business_location.lat | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 24936_20160609 | \n",
" San Francisco Soup Company | \n",
" 24936_20160609 | \n",
" -122.400152 | \n",
" 37.793199 | \n",
"
\n",
" \n",
" 1 | \n",
" 66198_20160527 | \n",
" San Francisco Restaurant | \n",
" 66198_20160527 | \n",
" -122.388478 | \n",
" 37.750720 | \n",
"
\n",
" \n",
" 2 | \n",
" 5794_20160907 | \n",
" Soup-or-Salad | \n",
" 5794_20160907 | \n",
" -122.481299 | \n",
" 37.747228 | \n",
"
\n",
" \n",
" 3 | \n",
" 60354_20161123 | \n",
" Soup Unlimited | \n",
" 60354_20161123 | \n",
" -122.409061 | \n",
" 37.783527 | \n",
"
\n",
" \n",
" 4 | \n",
" 1797_20160705 | \n",
" TIO CHILOS GRILL | \n",
" 1797_20160705 | \n",
" -122.409752 | \n",
" 37.752807 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" _id business_name inspection_id \\\n",
"0 24936_20160609 San Francisco Soup Company 24936_20160609 \n",
"1 66198_20160527 San Francisco Restaurant 66198_20160527 \n",
"2 5794_20160907 Soup-or-Salad 5794_20160907 \n",
"3 60354_20161123 Soup Unlimited 60354_20161123 \n",
"4 1797_20160705 TIO CHILOS GRILL 1797_20160705 \n",
"\n",
" business_location.lon business_location.lat \n",
"0 -122.400152 37.793199 \n",
"1 -122.388478 37.750720 \n",
"2 -122.481299 37.747228 \n",
"3 -122.409061 37.783527 \n",
"4 -122.409752 37.752807 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# now there are no duplicates. There are total 5 documents\n",
"wr.opensearch.search(\n",
" client,\n",
" index=\"sf_restaurants_inspections_dedup\",\n",
" _source=[\"inspection_id\", \"business_name\", \"business_location\"]\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Index CSV"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Indexing: 100% (1000/1000)|##############################|Elapsed Time: 0:00:00"
]
},
{
"data": {
"text/plain": [
"{'success': 1000, 'errors': []}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.opensearch.index_csv(\n",
" client, \n",
" index=\"nyc_restaurants_inspections_sample\", \n",
" path='https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD', # index_csv supports local, s3 and url path\n",
" id_keys=[\"CAMIS\"],\n",
" pandas_kwargs={'na_filter': True, 'nrows': 1000}, # pandas.read_csv() args - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\n",
" bulk_size=500 # modify based on your cluster size\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" _id | \n",
" CAMIS | \n",
" DBA | \n",
" BORO | \n",
" BUILDING | \n",
" STREET | \n",
" ZIPCODE | \n",
" PHONE | \n",
" CUISINE DESCRIPTION | \n",
" INSPECTION DATE | \n",
" ... | \n",
" RECORD DATE | \n",
" INSPECTION TYPE | \n",
" Latitude | \n",
" Longitude | \n",
" Community Board | \n",
" Council District | \n",
" Census Tract | \n",
" BIN | \n",
" BBL | \n",
" NTA | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 41610426 | \n",
" 41610426 | \n",
" GLOW THAI RESTAURANT | \n",
" Brooklyn | \n",
" 7107 | \n",
" 3 AVENUE | \n",
" 11209.0 | \n",
" 7187481920 | \n",
" Thai | \n",
" 02/26/2020 | \n",
" ... | \n",
" 10/04/2021 | \n",
" Cycle Inspection / Re-inspection | \n",
" 40.633865 | \n",
" -74.026798 | \n",
" 310.0 | \n",
" 43.0 | \n",
" 6800.0 | \n",
" 3146519.0 | \n",
" 3.058910e+09 | \n",
" BK31 | \n",
"
\n",
" \n",
" 1 | \n",
" 40811162 | \n",
" 40811162 | \n",
" CARMINE'S | \n",
" Manhattan | \n",
" 2450 | \n",
" BROADWAY | \n",
" 10024.0 | \n",
" 2123622200 | \n",
" Italian | \n",
" 05/28/2019 | \n",
" ... | \n",
" 10/04/2021 | \n",
" Cycle Inspection / Initial Inspection | \n",
" 40.791168 | \n",
" -73.974308 | \n",
" 107.0 | \n",
" 6.0 | \n",
" 17900.0 | \n",
" 1033560.0 | \n",
" 1.012380e+09 | \n",
" MN12 | \n",
"
\n",
" \n",
" 2 | \n",
" 50012113 | \n",
" 50012113 | \n",
" TANG | \n",
" Queens | \n",
" 196-50 | \n",
" NORTHERN BOULEVARD | \n",
" 11358.0 | \n",
" 7182797080 | \n",
" Korean | \n",
" 08/16/2018 | \n",
" ... | \n",
" 10/04/2021 | \n",
" Cycle Inspection / Initial Inspection | \n",
" 40.757850 | \n",
" -73.784593 | \n",
" 411.0 | \n",
" 19.0 | \n",
" 145101.0 | \n",
" 4124565.0 | \n",
" 4.055200e+09 | \n",
" QN48 | \n",
"
\n",
" \n",
" 3 | \n",
" 50014618 | \n",
" 50014618 | \n",
" TOTTO RAMEN | \n",
" Manhattan | \n",
" 248 | \n",
" EAST 52 STREET | \n",
" 10022.0 | \n",
" 2124210052 | \n",
" Japanese | \n",
" 08/20/2018 | \n",
" ... | \n",
" 10/04/2021 | \n",
" Cycle Inspection / Re-inspection | \n",
" 40.756596 | \n",
" -73.968749 | \n",
" 106.0 | \n",
" 4.0 | \n",
" 9800.0 | \n",
" 1038490.0 | \n",
" 1.013250e+09 | \n",
" MN19 | \n",
"
\n",
" \n",
" 4 | \n",
" 50045782 | \n",
" 50045782 | \n",
" OLLIE'S CHINESE RESTAURANT | \n",
" Manhattan | \n",
" 2705 | \n",
" BROADWAY | \n",
" 10025.0 | \n",
" 2129323300 | \n",
" Chinese | \n",
" 10/21/2019 | \n",
" ... | \n",
" 10/04/2021 | \n",
" Cycle Inspection / Re-inspection | \n",
" 40.799318 | \n",
" -73.968440 | \n",
" 107.0 | \n",
" 6.0 | \n",
" 19100.0 | \n",
" 1056562.0 | \n",
" 1.018750e+09 | \n",
" MN12 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 27 columns
\n",
"
"
],
"text/plain": [
" _id CAMIS DBA BORO BUILDING \\\n",
"0 41610426 41610426 GLOW THAI RESTAURANT Brooklyn 7107 \n",
"1 40811162 40811162 CARMINE'S Manhattan 2450 \n",
"2 50012113 50012113 TANG Queens 196-50 \n",
"3 50014618 50014618 TOTTO RAMEN Manhattan 248 \n",
"4 50045782 50045782 OLLIE'S CHINESE RESTAURANT Manhattan 2705 \n",
"\n",
" STREET ZIPCODE PHONE CUISINE DESCRIPTION \\\n",
"0 3 AVENUE 11209.0 7187481920 Thai \n",
"1 BROADWAY 10024.0 2123622200 Italian \n",
"2 NORTHERN BOULEVARD 11358.0 7182797080 Korean \n",
"3 EAST 52 STREET 10022.0 2124210052 Japanese \n",
"4 BROADWAY 10025.0 2129323300 Chinese \n",
"\n",
" INSPECTION DATE ... RECORD DATE INSPECTION TYPE \\\n",
"0 02/26/2020 ... 10/04/2021 Cycle Inspection / Re-inspection \n",
"1 05/28/2019 ... 10/04/2021 Cycle Inspection / Initial Inspection \n",
"2 08/16/2018 ... 10/04/2021 Cycle Inspection / Initial Inspection \n",
"3 08/20/2018 ... 10/04/2021 Cycle Inspection / Re-inspection \n",
"4 10/21/2019 ... 10/04/2021 Cycle Inspection / Re-inspection \n",
"\n",
" Latitude Longitude Community Board Council District Census Tract \\\n",
"0 40.633865 -74.026798 310.0 43.0 6800.0 \n",
"1 40.791168 -73.974308 107.0 6.0 17900.0 \n",
"2 40.757850 -73.784593 411.0 19.0 145101.0 \n",
"3 40.756596 -73.968749 106.0 4.0 9800.0 \n",
"4 40.799318 -73.968440 107.0 6.0 19100.0 \n",
"\n",
" BIN BBL NTA \n",
"0 3146519.0 3.058910e+09 BK31 \n",
"1 1033560.0 1.012380e+09 MN12 \n",
"2 4124565.0 4.055200e+09 QN48 \n",
"3 1038490.0 1.013250e+09 MN19 \n",
"4 1056562.0 1.018750e+09 MN12 \n",
"\n",
"[5 rows x 27 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.opensearch.search(\n",
" client,\n",
" index=\"nyc_restaurants_inspections_sample\",\n",
" size=5\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Search\n",
"Search results are returned as Pandas DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Search by DSL"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" _id | \n",
" business_name | \n",
" inspection_id | \n",
" business_location.lon | \n",
" business_location.lat | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 | \n",
" Soup House | \n",
" 5794_20160907 | \n",
" -122.481299 | \n",
" 37.747228 | \n",
"
\n",
" \n",
" 1 | \n",
" 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c | \n",
" Soup-or-Salad | \n",
" 5794_20160907 | \n",
" -122.481299 | \n",
" 37.747228 | \n",
"
\n",
" \n",
" 2 | \n",
" b9e8f6a2-8fd1-4660-b041-2997a1a80984 | \n",
" San Francisco Soup Company | \n",
" 24936_20160609 | \n",
" -122.400152 | \n",
" 37.793199 | \n",
"
\n",
" \n",
" 3 | \n",
" 56b352e6-102b-4eff-8296-7e1fb2459bab | \n",
" Soup Unlimited | \n",
" 60354_20161123 | \n",
" -122.409061 | \n",
" 37.783527 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" _id business_name \\\n",
"0 ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 Soup House \n",
"1 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c Soup-or-Salad \n",
"2 b9e8f6a2-8fd1-4660-b041-2997a1a80984 San Francisco Soup Company \n",
"3 56b352e6-102b-4eff-8296-7e1fb2459bab Soup Unlimited \n",
"\n",
" inspection_id business_location.lon business_location.lat \n",
"0 5794_20160907 -122.481299 37.747228 \n",
"1 5794_20160907 -122.481299 37.747228 \n",
"2 24936_20160609 -122.400152 37.793199 \n",
"3 60354_20161123 -122.409061 37.783527 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# add a search query. search all soup businesses \n",
"wr.opensearch.search(\n",
" client,\n",
" index=\"sf_restaurants_inspections\",\n",
" _source=[\"inspection_id\", \"business_name\", \"business_location\"],\n",
" filter_path=[\"hits.hits._id\",\"hits.hits._source\"],\n",
" search_body={\n",
" \"query\": {\n",
" \"match\": {\n",
" \"business_name\": \"soup\"\n",
" }\n",
" }\n",
" }\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Search by SQL"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" _index | \n",
" _type | \n",
" _id | \n",
" _score | \n",
" business_name | \n",
" inspection_score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" sf_restaurants_inspections_dedup | \n",
" _doc | \n",
" 5794_20160907 | \n",
" None | \n",
" Soup-or-Salad | \n",
" 96 | \n",
"
\n",
" \n",
" 1 | \n",
" sf_restaurants_inspections_dedup | \n",
" _doc | \n",
" 60354_20161123 | \n",
" None | \n",
" Soup Unlimited | \n",
" 95 | \n",
"
\n",
" \n",
" 2 | \n",
" sf_restaurants_inspections_dedup | \n",
" _doc | \n",
" 24936_20160609 | \n",
" None | \n",
" San Francisco Soup Company | \n",
" 77 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" _index _type _id _score \\\n",
"0 sf_restaurants_inspections_dedup _doc 5794_20160907 None \n",
"1 sf_restaurants_inspections_dedup _doc 60354_20161123 None \n",
"2 sf_restaurants_inspections_dedup _doc 24936_20160609 None \n",
"\n",
" business_name inspection_score \n",
"0 Soup-or-Salad 96 \n",
"1 Soup Unlimited 95 \n",
"2 San Francisco Soup Company 77 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.opensearch.search_by_sql(\n",
" client,\n",
" sql_query=\"\"\"SELECT business_name, inspection_score \n",
" FROM sf_restaurants_inspections_dedup\n",
" WHERE business_name LIKE '%soup%'\n",
" ORDER BY inspection_score DESC LIMIT 5\"\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Delete Indices"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'acknowledged': True}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.opensearch.delete_index(\n",
" client=client,\n",
" index=\"sf_restaurants_inspections\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Bonus - Prepare data and index from DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this exercise we'll use [DOHMH New York City Restaurant Inspection Results dataset](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the data for indexing"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# fields names underscore casing \n",
"df.columns = [col.lower().replace(' ', '_') for col in df.columns]\n",
"\n",
"# convert lon/lat to OpenSearch geo_point\n",
"df['business_location'] = \"POINT (\" + df.longitude.fillna('0').astype(str) + \" \" + df.latitude.fillna('0').astype(str) + \")\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create index with mapping"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'acknowledged': True,\n",
" 'shards_acknowledged': True,\n",
" 'index': 'nyc_restaurants_inspections'}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# delete index if exists\n",
"wr.opensearch.delete_index(\n",
" client=client,\n",
" index=\"nyc_restaurants\"\n",
" \n",
")\n",
"\n",
"# use dynamic_template to map date fields\n",
"# define business_location as geo_point\n",
"wr.opensearch.create_index(\n",
" client=client,\n",
" index=\"nyc_restaurants_inspections\",\n",
" mappings={\n",
" \"dynamic_templates\" : [\n",
" {\n",
" \"dates\" : {\n",
" \"match\" : \"*date\",\n",
" \"mapping\" : {\n",
" \"type\" : \"date\",\n",
" \"format\" : 'MM/dd/yyyy'\n",
" }\n",
" }\n",
" }\n",
" ],\n",
" \"properties\": {\n",
" \"business_location\": {\n",
" \"type\": \"geo_point\"\n",
" }\n",
" }\n",
" } \n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Index dataframe"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Indexing: 100% (382655/382655)|##########################|Elapsed Time: 0:04:15"
]
},
{
"data": {
"text/plain": [
"{'success': 382655, 'errors': []}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.opensearch.index_df(\n",
" client,\n",
" df=df,\n",
" index=\"nyc_restaurants_inspections\",\n",
" id_keys=[\"camis\"],\n",
" bulk_size=1000\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Execute geo query\n",
"#### Sort restaurants by distance from Times-Square"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" camis | \n",
" dba | \n",
" boro | \n",
" building | \n",
" street | \n",
" zipcode | \n",
" phone | \n",
" cuisine_description | \n",
" inspection_date | \n",
" action | \n",
" ... | \n",
" inspection_type | \n",
" latitude | \n",
" longitude | \n",
" community_board | \n",
" council_district | \n",
" census_tract | \n",
" bin | \n",
" bbl | \n",
" nta | \n",
" business_location | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 41551304 | \n",
" THE COUNTER | \n",
" Manhattan | \n",
" 7 | \n",
" TIMES SQUARE | \n",
" 10036.0 | \n",
" 2129976801 | \n",
" American | \n",
" 12/22/2016 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Cycle Inspection / Initial Inspection | \n",
" 40.755908 | \n",
" -73.986681 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1086069.0 | \n",
" 1.009940e+09 | \n",
" MN17 | \n",
" POINT (-73.986680953809 40.755907817312) | \n",
"
\n",
" \n",
" 1 | \n",
" 50055665 | \n",
" ANN INC CAFE | \n",
" Manhattan | \n",
" 7 | \n",
" TIMES SQUARE | \n",
" 10036.0 | \n",
" 2125413287 | \n",
" American | \n",
" 12/11/2019 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Cycle Inspection / Initial Inspection | \n",
" 40.755908 | \n",
" -73.986681 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1086069.0 | \n",
" 1.009940e+09 | \n",
" MN17 | \n",
" POINT (-73.986680953809 40.755907817312) | \n",
"
\n",
" \n",
" 2 | \n",
" 50049552 | \n",
" ERNST AND YOUNG | \n",
" Manhattan | \n",
" 5 | \n",
" TIMES SQ | \n",
" 10036.0 | \n",
" 2127739994 | \n",
" Coffee/Tea | \n",
" 11/30/2018 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Cycle Inspection / Initial Inspection | \n",
" 40.755702 | \n",
" -73.987208 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1024656.0 | \n",
" 1.010130e+09 | \n",
" MN17 | \n",
" POINT (-73.987207980138 40.755702020307) | \n",
"
\n",
" \n",
" 3 | \n",
" 50014078 | \n",
" RED LOBSTER | \n",
" Manhattan | \n",
" 5 | \n",
" TIMES SQ | \n",
" 10036.0 | \n",
" 2127306706 | \n",
" Seafood | \n",
" 10/03/2017 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Cycle Inspection / Initial Inspection | \n",
" 40.755702 | \n",
" -73.987208 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1024656.0 | \n",
" 1.010130e+09 | \n",
" MN17 | \n",
" POINT (-73.987207980138 40.755702020307) | \n",
"
\n",
" \n",
" 4 | \n",
" 50015171 | \n",
" NEW AMSTERDAM THEATER | \n",
" Manhattan | \n",
" 214 | \n",
" WEST 42 STREET | \n",
" 10036.0 | \n",
" 2125825472 | \n",
" American | \n",
" 06/26/2018 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Cycle Inspection / Re-inspection | \n",
" 40.756317 | \n",
" -73.987652 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1024660.0 | \n",
" 1.010130e+09 | \n",
" MN17 | \n",
" POINT (-73.987651832547 40.756316895053) | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 95 | \n",
" 41552060 | \n",
" PROSKAUER ROSE | \n",
" Manhattan | \n",
" 11 | \n",
" TIMES SQUARE | \n",
" 10036.0 | \n",
" 2129695493 | \n",
" American | \n",
" 08/11/2017 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Administrative Miscellaneous / Initial Inspection | \n",
" 40.756891 | \n",
" -73.990023 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1087978.0 | \n",
" 1.010138e+09 | \n",
" MN17 | \n",
" POINT (-73.990023200823 40.756890780426) | \n",
"
\n",
" \n",
" 96 | \n",
" 41242148 | \n",
" GABBY O'HARA'S | \n",
" Manhattan | \n",
" 123 | \n",
" WEST 39 STREET | \n",
" 10018.0 | \n",
" 2122788984 | \n",
" Irish | \n",
" 07/30/2019 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Cycle Inspection / Re-inspection | \n",
" 40.753405 | \n",
" -73.986602 | \n",
" 105.0 | \n",
" 4.0 | \n",
" 11300.0 | \n",
" 1080611.0 | \n",
" 1.008150e+09 | \n",
" MN17 | \n",
" POINT (-73.986602050292 40.753404587174) | \n",
"
\n",
" \n",
" 97 | \n",
" 50095860 | \n",
" THE TIMES EATERY | \n",
" Manhattan | \n",
" 680 | \n",
" 8 AVENUE | \n",
" 10036.0 | \n",
" 6463867787 | \n",
" American | \n",
" 02/28/2020 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Pre-permit (Operational) / Initial Inspection | \n",
" 40.757991 | \n",
" -73.989218 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11900.0 | \n",
" 1024703.0 | \n",
" 1.010150e+09 | \n",
" MN17 | \n",
" POINT (-73.989218092096 40.757991356019) | \n",
"
\n",
" \n",
" 98 | \n",
" 50072861 | \n",
" ITSU | \n",
" Manhattan | \n",
" 530 | \n",
" 7 AVENUE | \n",
" 10018.0 | \n",
" 9176393645 | \n",
" Asian/Asian Fusion | \n",
" 09/10/2018 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Pre-permit (Operational) / Initial Inspection | \n",
" 40.753844 | \n",
" -73.988551 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1014485.0 | \n",
" 1.007880e+09 | \n",
" MN17 | \n",
" POINT (-73.988551029682 40.753843959794) | \n",
"
\n",
" \n",
" 99 | \n",
" 50068109 | \n",
" LUKE'S LOBSTER | \n",
" Manhattan | \n",
" 1407 | \n",
" BROADWAY | \n",
" 10018.0 | \n",
" 9174759192 | \n",
" Seafood | \n",
" 09/06/2017 | \n",
" Violations were cited in the following area(s). | \n",
" ... | \n",
" Pre-permit (Operational) / Initial Inspection | \n",
" 40.753432 | \n",
" -73.987151 | \n",
" 105.0 | \n",
" 3.0 | \n",
" 11300.0 | \n",
" 1015265.0 | \n",
" 1.008140e+09 | \n",
" MN17 | \n",
" POINT (-73.98715066791 40.753432097521) | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 27 columns
\n",
"
"
],
"text/plain": [
" camis dba boro building street \\\n",
"0 41551304 THE COUNTER Manhattan 7 TIMES SQUARE \n",
"1 50055665 ANN INC CAFE Manhattan 7 TIMES SQUARE \n",
"2 50049552 ERNST AND YOUNG Manhattan 5 TIMES SQ \n",
"3 50014078 RED LOBSTER Manhattan 5 TIMES SQ \n",
"4 50015171 NEW AMSTERDAM THEATER Manhattan 214 WEST 42 STREET \n",
".. ... ... ... ... ... \n",
"95 41552060 PROSKAUER ROSE Manhattan 11 TIMES SQUARE \n",
"96 41242148 GABBY O'HARA'S Manhattan 123 WEST 39 STREET \n",
"97 50095860 THE TIMES EATERY Manhattan 680 8 AVENUE \n",
"98 50072861 ITSU Manhattan 530 7 AVENUE \n",
"99 50068109 LUKE'S LOBSTER Manhattan 1407 BROADWAY \n",
"\n",
" zipcode phone cuisine_description inspection_date \\\n",
"0 10036.0 2129976801 American 12/22/2016 \n",
"1 10036.0 2125413287 American 12/11/2019 \n",
"2 10036.0 2127739994 Coffee/Tea 11/30/2018 \n",
"3 10036.0 2127306706 Seafood 10/03/2017 \n",
"4 10036.0 2125825472 American 06/26/2018 \n",
".. ... ... ... ... \n",
"95 10036.0 2129695493 American 08/11/2017 \n",
"96 10018.0 2122788984 Irish 07/30/2019 \n",
"97 10036.0 6463867787 American 02/28/2020 \n",
"98 10018.0 9176393645 Asian/Asian Fusion 09/10/2018 \n",
"99 10018.0 9174759192 Seafood 09/06/2017 \n",
"\n",
" action ... \\\n",
"0 Violations were cited in the following area(s). ... \n",
"1 Violations were cited in the following area(s). ... \n",
"2 Violations were cited in the following area(s). ... \n",
"3 Violations were cited in the following area(s). ... \n",
"4 Violations were cited in the following area(s). ... \n",
".. ... ... \n",
"95 Violations were cited in the following area(s). ... \n",
"96 Violations were cited in the following area(s). ... \n",
"97 Violations were cited in the following area(s). ... \n",
"98 Violations were cited in the following area(s). ... \n",
"99 Violations were cited in the following area(s). ... \n",
"\n",
" inspection_type latitude longitude \\\n",
"0 Cycle Inspection / Initial Inspection 40.755908 -73.986681 \n",
"1 Cycle Inspection / Initial Inspection 40.755908 -73.986681 \n",
"2 Cycle Inspection / Initial Inspection 40.755702 -73.987208 \n",
"3 Cycle Inspection / Initial Inspection 40.755702 -73.987208 \n",
"4 Cycle Inspection / Re-inspection 40.756317 -73.987652 \n",
".. ... ... ... \n",
"95 Administrative Miscellaneous / Initial Inspection 40.756891 -73.990023 \n",
"96 Cycle Inspection / Re-inspection 40.753405 -73.986602 \n",
"97 Pre-permit (Operational) / Initial Inspection 40.757991 -73.989218 \n",
"98 Pre-permit (Operational) / Initial Inspection 40.753844 -73.988551 \n",
"99 Pre-permit (Operational) / Initial Inspection 40.753432 -73.987151 \n",
"\n",
" community_board council_district census_tract bin bbl \\\n",
"0 105.0 3.0 11300.0 1086069.0 1.009940e+09 \n",
"1 105.0 3.0 11300.0 1086069.0 1.009940e+09 \n",
"2 105.0 3.0 11300.0 1024656.0 1.010130e+09 \n",
"3 105.0 3.0 11300.0 1024656.0 1.010130e+09 \n",
"4 105.0 3.0 11300.0 1024660.0 1.010130e+09 \n",
".. ... ... ... ... ... \n",
"95 105.0 3.0 11300.0 1087978.0 1.010138e+09 \n",
"96 105.0 4.0 11300.0 1080611.0 1.008150e+09 \n",
"97 105.0 3.0 11900.0 1024703.0 1.010150e+09 \n",
"98 105.0 3.0 11300.0 1014485.0 1.007880e+09 \n",
"99 105.0 3.0 11300.0 1015265.0 1.008140e+09 \n",
"\n",
" nta business_location \n",
"0 MN17 POINT (-73.986680953809 40.755907817312) \n",
"1 MN17 POINT (-73.986680953809 40.755907817312) \n",
"2 MN17 POINT (-73.987207980138 40.755702020307) \n",
"3 MN17 POINT (-73.987207980138 40.755702020307) \n",
"4 MN17 POINT (-73.987651832547 40.756316895053) \n",
".. ... ... \n",
"95 MN17 POINT (-73.990023200823 40.756890780426) \n",
"96 MN17 POINT (-73.986602050292 40.753404587174) \n",
"97 MN17 POINT (-73.989218092096 40.757991356019) \n",
"98 MN17 POINT (-73.988551029682 40.753843959794) \n",
"99 MN17 POINT (-73.98715066791 40.753432097521) \n",
"\n",
"[100 rows x 27 columns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.opensearch.search(\n",
" client,\n",
" index=\"nyc_restaurants_inspections\",\n",
" filter_path=[\"hits.hits._source\"],\n",
" size=100,\n",
" search_body={\n",
" \"query\": {\n",
" \"match_all\": {}\n",
" },\n",
" \"sort\": [\n",
" {\n",
" \"_geo_distance\": {\n",
" \"business_location\": { # Times-Square - https://geojson.io/#map=16/40.7563/-73.9862\n",
" \"lat\": 40.75613228383523,\n",
" \"lon\": -73.9865791797638\n",
" },\n",
" \"order\": \"asc\"\n",
" }\n",
" }\n",
" ]\n",
" }\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.14",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.14"
}
},
"nbformat": 4,
"nbformat_minor": 4
}