{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 31 - OpenSearch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "* [1. Initialize](#initialize)\n", " * [Connect to your Amazon OpenSearch domain](#connect)\n", " * [Enter your bucket name](#bucket)\n", " * [Initialize sample data](#sample-data)\n", "* [2. Indexing (load)](#indexing)\n", "\t* [Index documents (no Pandas)](#index-documents)\n", "\t* [Index json file](#index-json)\n", " * [Index CSV](#index-csv)\n", "* [3. Search](#search)\n", "\t* [3.1 Search by DSL](#search-dsl)\n", "\t* [3.2 Search by SQL](#search-sql)\n", "* [4. Delete Indices](#delete-index)\n", "* [5. Bonus - Prepare data and index from DataFrame](#bonus)\n", "\t* [Prepare the data for indexing](#prepare-data)\n", " * [Create index with mapping](#create-index-w-mapping)\n", " * [Index dataframe](#index-df)\n", " * [Execute geo query](#search-geo)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Initialize" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install the optional modules first\n", "!pip install 'awswrangler[opensearch]'" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import awswrangler as wr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to your Amazon OpenSearch domain" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "client = wr.opensearch.connect(\n", " host='OPENSEARCH-ENDPOINT',\n", "# username='FGAC-USERNAME(OPTIONAL)',\n", "# password='FGAC-PASSWORD(OPTIONAL)'\n", ")\n", "client.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Enter your bucket name" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "bucket = 'BUCKET'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize sample data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sf_restaurants_inspections = [\n", " {\n", " \"inspection_id\": \"24936_20160609\",\n", " \"business_address\": \"315 California St\",\n", " \"business_city\": \"San Francisco\",\n", " \"business_id\": \"24936\",\n", " \"business_location\": {\"lon\": -122.400152, \"lat\": 37.793199},\n", " \"business_name\": \"San Francisco Soup Company\",\n", " \"business_postal_code\": \"94104\",\n", " \"business_state\": \"CA\",\n", " \"inspection_date\": \"2016-06-09T00:00:00.000\",\n", " \"inspection_score\": 77,\n", " \"inspection_type\": \"Routine - Unscheduled\",\n", " \"risk_category\": \"Low Risk\",\n", " \"violation_description\": \"Improper food labeling or menu misrepresentation\",\n", " \"violation_id\": \"24936_20160609_103141\",\n", " },\n", " {\n", " \"inspection_id\": \"60354_20161123\",\n", " \"business_address\": \"10 Mason St\",\n", " \"business_city\": \"San Francisco\",\n", " \"business_id\": \"60354\",\n", " \"business_location\": {\"lon\": -122.409061, \"lat\": 37.783527},\n", " \"business_name\": \"Soup Unlimited\",\n", " \"business_postal_code\": \"94102\",\n", " \"business_state\": \"CA\",\n", " \"inspection_date\": \"2016-11-23T00:00:00.000\",\n", " \"inspection_type\": \"Routine\",\n", " \"inspection_score\": 95,\n", " },\n", " {\n", " \"inspection_id\": \"1797_20160705\",\n", " \"business_address\": \"2872 24th St\",\n", " \"business_city\": \"San Francisco\",\n", " \"business_id\": \"1797\",\n", " \"business_location\": {\"lon\": -122.409752, \"lat\": 37.752807},\n", " \"business_name\": \"TIO CHILOS GRILL\",\n", " \"business_postal_code\": \"94110\",\n", " \"business_state\": \"CA\",\n", " \"inspection_date\": \"2016-07-05T00:00:00.000\",\n", " \"inspection_score\": 90,\n", " \"inspection_type\": \"Routine - Unscheduled\",\n", " \"risk_category\": \"Low Risk\",\n", " \"violation_description\": \"Unclean nonfood contact surfaces\",\n", " \"violation_id\": \"1797_20160705_103142\",\n", " },\n", " {\n", " \"inspection_id\": \"66198_20160527\",\n", " \"business_address\": \"1661 Tennessee St Suite 3B\",\n", " \"business_city\": \"San Francisco Whard Restaurant\",\n", " \"business_id\": \"66198\",\n", " \"business_location\": {\"lon\": -122.388478, \"lat\": 37.75072},\n", " \"business_name\": \"San Francisco Restaurant\",\n", " \"business_postal_code\": \"94107\",\n", " \"business_state\": \"CA\",\n", " \"inspection_date\": \"2016-05-27T00:00:00.000\",\n", " \"inspection_type\": \"Routine\",\n", " \"inspection_score\": 56,\n", " },\n", " {\n", " \"inspection_id\": \"5794_20160907\",\n", " \"business_address\": \"2162 24th Ave\",\n", " \"business_city\": \"San Francisco\",\n", " \"business_id\": \"5794\",\n", " \"business_location\": {\"lon\": -122.481299, \"lat\": 37.747228},\n", " \"business_name\": \"Soup House\",\n", " \"business_phone_number\": \"+14155752700\",\n", " \"business_postal_code\": \"94116\",\n", " \"business_state\": \"CA\",\n", " \"inspection_date\": \"2016-09-07T00:00:00.000\",\n", " \"inspection_score\": 96,\n", " \"inspection_type\": \"Routine - Unscheduled\",\n", " \"risk_category\": \"Low Risk\",\n", " \"violation_description\": \"Unapproved or unmaintained equipment or utensils\",\n", " \"violation_id\": \"5794_20160907_103144\",\n", " },\n", " \n", " # duplicate record\n", " {\n", " \"inspection_id\": \"5794_20160907\",\n", " \"business_address\": \"2162 24th Ave\",\n", " \"business_city\": \"San Francisco\",\n", " \"business_id\": \"5794\",\n", " \"business_location\": {\"lon\": -122.481299, \"lat\": 37.747228},\n", " \"business_name\": \"Soup-or-Salad\",\n", " \"business_phone_number\": \"+14155752700\",\n", " \"business_postal_code\": \"94116\",\n", " \"business_state\": \"CA\",\n", " \"inspection_date\": \"2016-09-07T00:00:00.000\",\n", " \"inspection_score\": 96,\n", " \"inspection_type\": \"Routine - Unscheduled\",\n", " \"risk_category\": \"Low Risk\",\n", " \"violation_description\": \"Unapproved or unmaintained equipment or utensils\",\n", " \"violation_id\": \"5794_20160907_103144\",\n", " },\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Indexing (load)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Index documents (no Pandas)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Indexing: 100% (6/6)|####################################|Elapsed Time: 0:00:01" ] }, { "data": { "text/plain": [ "{'success': 6, 'errors': []}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# index documents w/o providing keys (_id is auto-generated)\n", "wr.opensearch.index_documents(\n", " client,\n", " documents=sf_restaurants_inspections,\n", " index=\"sf_restaurants_inspections\" \n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idbusiness_nameinspection_idbusiness_location.lonbusiness_location.lat
0663dd72d-0da4-495b-b0ae-ed000105ae73TIO CHILOS GRILL1797_20160705-122.40975237.752807
1ff2f50f6-5415-4706-9bcb-af7c5eb0afa3Soup House5794_20160907-122.48129937.747228
2b9e8f6a2-8fd1-4660-b041-2997a1a80984San Francisco Soup Company24936_20160609-122.40015237.793199
356b352e6-102b-4eff-8296-7e1fb2459babSoup Unlimited60354_20161123-122.40906137.783527
46fec5411-f79a-48e4-be7b-e0e44d5ebbabSan Francisco Restaurant66198_20160527-122.38847837.750720
57ba4fb17-f9a9-49da-b90e-8b3553d6d97cSoup-or-Salad5794_20160907-122.48129937.747228
\n", "
" ], "text/plain": [ " _id business_name \\\n", "0 663dd72d-0da4-495b-b0ae-ed000105ae73 TIO CHILOS GRILL \n", "1 ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 Soup House \n", "2 b9e8f6a2-8fd1-4660-b041-2997a1a80984 San Francisco Soup Company \n", "3 56b352e6-102b-4eff-8296-7e1fb2459bab Soup Unlimited \n", "4 6fec5411-f79a-48e4-be7b-e0e44d5ebbab San Francisco Restaurant \n", "5 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c Soup-or-Salad \n", "\n", " inspection_id business_location.lon business_location.lat \n", "0 1797_20160705 -122.409752 37.752807 \n", "1 5794_20160907 -122.481299 37.747228 \n", "2 24936_20160609 -122.400152 37.793199 \n", "3 60354_20161123 -122.409061 37.783527 \n", "4 66198_20160527 -122.388478 37.750720 \n", "5 5794_20160907 -122.481299 37.747228 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read all documents. There are total 6 documents\n", "wr.opensearch.search(\n", " client,\n", " index=\"sf_restaurants_inspections\",\n", " _source=[\"inspection_id\", \"business_name\", \"business_location\"]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Index json file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame(sf_restaurants_inspections)\n", "path = f\"s3://{bucket}/json/sf_restaurants_inspections.json\"\n", "wr.s3.to_json(df, path,orient='records',lines=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Indexing: 100% (6/6)|####################################|Elapsed Time: 0:00:00" ] }, { "data": { "text/plain": [ "{'success': 6, 'errors': []}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# index json w/ providing keys\n", "wr.opensearch.index_json(\n", " client,\n", " path=path, # path can be s3 or local\n", " index=\"sf_restaurants_inspections_dedup\",\n", " id_keys=[\"inspection_id\"] # can be multiple fields. arg applicable to all index_* functions\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idbusiness_nameinspection_idbusiness_location.lonbusiness_location.lat
024936_20160609San Francisco Soup Company24936_20160609-122.40015237.793199
166198_20160527San Francisco Restaurant66198_20160527-122.38847837.750720
25794_20160907Soup-or-Salad5794_20160907-122.48129937.747228
360354_20161123Soup Unlimited60354_20161123-122.40906137.783527
41797_20160705TIO CHILOS GRILL1797_20160705-122.40975237.752807
\n", "
" ], "text/plain": [ " _id business_name inspection_id \\\n", "0 24936_20160609 San Francisco Soup Company 24936_20160609 \n", "1 66198_20160527 San Francisco Restaurant 66198_20160527 \n", "2 5794_20160907 Soup-or-Salad 5794_20160907 \n", "3 60354_20161123 Soup Unlimited 60354_20161123 \n", "4 1797_20160705 TIO CHILOS GRILL 1797_20160705 \n", "\n", " business_location.lon business_location.lat \n", "0 -122.400152 37.793199 \n", "1 -122.388478 37.750720 \n", "2 -122.481299 37.747228 \n", "3 -122.409061 37.783527 \n", "4 -122.409752 37.752807 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# now there are no duplicates. There are total 5 documents\n", "wr.opensearch.search(\n", " client,\n", " index=\"sf_restaurants_inspections_dedup\",\n", " _source=[\"inspection_id\", \"business_name\", \"business_location\"]\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Index CSV" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Indexing: 100% (1000/1000)|##############################|Elapsed Time: 0:00:00" ] }, { "data": { "text/plain": [ "{'success': 1000, 'errors': []}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.opensearch.index_csv(\n", " client, \n", " index=\"nyc_restaurants_inspections_sample\", \n", " path='https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD', # index_csv supports local, s3 and url path\n", " id_keys=[\"CAMIS\"],\n", " pandas_kwargs={'na_filter': True, 'nrows': 1000}, # pandas.read_csv() args - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\n", " bulk_size=500 # modify based on your cluster size\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idCAMISDBABOROBUILDINGSTREETZIPCODEPHONECUISINE DESCRIPTIONINSPECTION DATE...RECORD DATEINSPECTION TYPELatitudeLongitudeCommunity BoardCouncil DistrictCensus TractBINBBLNTA
04161042641610426GLOW THAI RESTAURANTBrooklyn71073 AVENUE11209.07187481920Thai02/26/2020...10/04/2021Cycle Inspection / Re-inspection40.633865-74.026798310.043.06800.03146519.03.058910e+09BK31
14081116240811162CARMINE'SManhattan2450BROADWAY10024.02123622200Italian05/28/2019...10/04/2021Cycle Inspection / Initial Inspection40.791168-73.974308107.06.017900.01033560.01.012380e+09MN12
25001211350012113TANGQueens196-50NORTHERN BOULEVARD11358.07182797080Korean08/16/2018...10/04/2021Cycle Inspection / Initial Inspection40.757850-73.784593411.019.0145101.04124565.04.055200e+09QN48
35001461850014618TOTTO RAMENManhattan248EAST 52 STREET10022.02124210052Japanese08/20/2018...10/04/2021Cycle Inspection / Re-inspection40.756596-73.968749106.04.09800.01038490.01.013250e+09MN19
45004578250045782OLLIE'S CHINESE RESTAURANTManhattan2705BROADWAY10025.02129323300Chinese10/21/2019...10/04/2021Cycle Inspection / Re-inspection40.799318-73.968440107.06.019100.01056562.01.018750e+09MN12
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ " _id CAMIS DBA BORO BUILDING \\\n", "0 41610426 41610426 GLOW THAI RESTAURANT Brooklyn 7107 \n", "1 40811162 40811162 CARMINE'S Manhattan 2450 \n", "2 50012113 50012113 TANG Queens 196-50 \n", "3 50014618 50014618 TOTTO RAMEN Manhattan 248 \n", "4 50045782 50045782 OLLIE'S CHINESE RESTAURANT Manhattan 2705 \n", "\n", " STREET ZIPCODE PHONE CUISINE DESCRIPTION \\\n", "0 3 AVENUE 11209.0 7187481920 Thai \n", "1 BROADWAY 10024.0 2123622200 Italian \n", "2 NORTHERN BOULEVARD 11358.0 7182797080 Korean \n", "3 EAST 52 STREET 10022.0 2124210052 Japanese \n", "4 BROADWAY 10025.0 2129323300 Chinese \n", "\n", " INSPECTION DATE ... RECORD DATE INSPECTION TYPE \\\n", "0 02/26/2020 ... 10/04/2021 Cycle Inspection / Re-inspection \n", "1 05/28/2019 ... 10/04/2021 Cycle Inspection / Initial Inspection \n", "2 08/16/2018 ... 10/04/2021 Cycle Inspection / Initial Inspection \n", "3 08/20/2018 ... 10/04/2021 Cycle Inspection / Re-inspection \n", "4 10/21/2019 ... 10/04/2021 Cycle Inspection / Re-inspection \n", "\n", " Latitude Longitude Community Board Council District Census Tract \\\n", "0 40.633865 -74.026798 310.0 43.0 6800.0 \n", "1 40.791168 -73.974308 107.0 6.0 17900.0 \n", "2 40.757850 -73.784593 411.0 19.0 145101.0 \n", "3 40.756596 -73.968749 106.0 4.0 9800.0 \n", "4 40.799318 -73.968440 107.0 6.0 19100.0 \n", "\n", " BIN BBL NTA \n", "0 3146519.0 3.058910e+09 BK31 \n", "1 1033560.0 1.012380e+09 MN12 \n", "2 4124565.0 4.055200e+09 QN48 \n", "3 1038490.0 1.013250e+09 MN19 \n", "4 1056562.0 1.018750e+09 MN12 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.opensearch.search(\n", " client,\n", " index=\"nyc_restaurants_inspections_sample\",\n", " size=5\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Search\n", "Search results are returned as Pandas DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Search by DSL" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idbusiness_nameinspection_idbusiness_location.lonbusiness_location.lat
0ff2f50f6-5415-4706-9bcb-af7c5eb0afa3Soup House5794_20160907-122.48129937.747228
17ba4fb17-f9a9-49da-b90e-8b3553d6d97cSoup-or-Salad5794_20160907-122.48129937.747228
2b9e8f6a2-8fd1-4660-b041-2997a1a80984San Francisco Soup Company24936_20160609-122.40015237.793199
356b352e6-102b-4eff-8296-7e1fb2459babSoup Unlimited60354_20161123-122.40906137.783527
\n", "
" ], "text/plain": [ " _id business_name \\\n", "0 ff2f50f6-5415-4706-9bcb-af7c5eb0afa3 Soup House \n", "1 7ba4fb17-f9a9-49da-b90e-8b3553d6d97c Soup-or-Salad \n", "2 b9e8f6a2-8fd1-4660-b041-2997a1a80984 San Francisco Soup Company \n", "3 56b352e6-102b-4eff-8296-7e1fb2459bab Soup Unlimited \n", "\n", " inspection_id business_location.lon business_location.lat \n", "0 5794_20160907 -122.481299 37.747228 \n", "1 5794_20160907 -122.481299 37.747228 \n", "2 24936_20160609 -122.400152 37.793199 \n", "3 60354_20161123 -122.409061 37.783527 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add a search query. search all soup businesses \n", "wr.opensearch.search(\n", " client,\n", " index=\"sf_restaurants_inspections\",\n", " _source=[\"inspection_id\", \"business_name\", \"business_location\"],\n", " filter_path=[\"hits.hits._id\",\"hits.hits._source\"],\n", " search_body={\n", " \"query\": {\n", " \"match\": {\n", " \"business_name\": \"soup\"\n", " }\n", " }\n", " }\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Search by SQL" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_index_type_id_scorebusiness_nameinspection_score
0sf_restaurants_inspections_dedup_doc5794_20160907NoneSoup-or-Salad96
1sf_restaurants_inspections_dedup_doc60354_20161123NoneSoup Unlimited95
2sf_restaurants_inspections_dedup_doc24936_20160609NoneSan Francisco Soup Company77
\n", "
" ], "text/plain": [ " _index _type _id _score \\\n", "0 sf_restaurants_inspections_dedup _doc 5794_20160907 None \n", "1 sf_restaurants_inspections_dedup _doc 60354_20161123 None \n", "2 sf_restaurants_inspections_dedup _doc 24936_20160609 None \n", "\n", " business_name inspection_score \n", "0 Soup-or-Salad 96 \n", "1 Soup Unlimited 95 \n", "2 San Francisco Soup Company 77 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.opensearch.search_by_sql(\n", " client,\n", " sql_query=\"\"\"SELECT business_name, inspection_score \n", " FROM sf_restaurants_inspections_dedup\n", " WHERE business_name LIKE '%soup%'\n", " ORDER BY inspection_score DESC LIMIT 5\"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Delete Indices" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'acknowledged': True}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.opensearch.delete_index(\n", " client=client,\n", " index=\"sf_restaurants_inspections\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Bonus - Prepare data and index from DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise we'll use [DOHMH New York City Restaurant Inspection Results dataset](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare the data for indexing" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# fields names underscore casing \n", "df.columns = [col.lower().replace(' ', '_') for col in df.columns]\n", "\n", "# convert lon/lat to OpenSearch geo_point\n", "df['business_location'] = \"POINT (\" + df.longitude.fillna('0').astype(str) + \" \" + df.latitude.fillna('0').astype(str) + \")\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create index with mapping" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'acknowledged': True,\n", " 'shards_acknowledged': True,\n", " 'index': 'nyc_restaurants_inspections'}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# delete index if exists\n", "wr.opensearch.delete_index(\n", " client=client,\n", " index=\"nyc_restaurants\"\n", " \n", ")\n", "\n", "# use dynamic_template to map date fields\n", "# define business_location as geo_point\n", "wr.opensearch.create_index(\n", " client=client,\n", " index=\"nyc_restaurants_inspections\",\n", " mappings={\n", " \"dynamic_templates\" : [\n", " {\n", " \"dates\" : {\n", " \"match\" : \"*date\",\n", " \"mapping\" : {\n", " \"type\" : \"date\",\n", " \"format\" : 'MM/dd/yyyy'\n", " }\n", " }\n", " }\n", " ],\n", " \"properties\": {\n", " \"business_location\": {\n", " \"type\": \"geo_point\"\n", " }\n", " }\n", " } \n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Index dataframe" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Indexing: 100% (382655/382655)|##########################|Elapsed Time: 0:04:15" ] }, { "data": { "text/plain": [ "{'success': 382655, 'errors': []}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.opensearch.index_df(\n", " client,\n", " df=df,\n", " index=\"nyc_restaurants_inspections\",\n", " id_keys=[\"camis\"],\n", " bulk_size=1000\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Execute geo query\n", "#### Sort restaurants by distance from Times-Square" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
camisdbaborobuildingstreetzipcodephonecuisine_descriptioninspection_dateaction...inspection_typelatitudelongitudecommunity_boardcouncil_districtcensus_tractbinbblntabusiness_location
041551304THE COUNTERManhattan7TIMES SQUARE10036.02129976801American12/22/2016Violations were cited in the following area(s)....Cycle Inspection / Initial Inspection40.755908-73.986681105.03.011300.01086069.01.009940e+09MN17POINT (-73.986680953809 40.755907817312)
150055665ANN INC CAFEManhattan7TIMES SQUARE10036.02125413287American12/11/2019Violations were cited in the following area(s)....Cycle Inspection / Initial Inspection40.755908-73.986681105.03.011300.01086069.01.009940e+09MN17POINT (-73.986680953809 40.755907817312)
250049552ERNST AND YOUNGManhattan5TIMES SQ10036.02127739994Coffee/Tea11/30/2018Violations were cited in the following area(s)....Cycle Inspection / Initial Inspection40.755702-73.987208105.03.011300.01024656.01.010130e+09MN17POINT (-73.987207980138 40.755702020307)
350014078RED LOBSTERManhattan5TIMES SQ10036.02127306706Seafood10/03/2017Violations were cited in the following area(s)....Cycle Inspection / Initial Inspection40.755702-73.987208105.03.011300.01024656.01.010130e+09MN17POINT (-73.987207980138 40.755702020307)
450015171NEW AMSTERDAM THEATERManhattan214WEST 42 STREET10036.02125825472American06/26/2018Violations were cited in the following area(s)....Cycle Inspection / Re-inspection40.756317-73.987652105.03.011300.01024660.01.010130e+09MN17POINT (-73.987651832547 40.756316895053)
..................................................................
9541552060PROSKAUER ROSEManhattan11TIMES SQUARE10036.02129695493American08/11/2017Violations were cited in the following area(s)....Administrative Miscellaneous / Initial Inspection40.756891-73.990023105.03.011300.01087978.01.010138e+09MN17POINT (-73.990023200823 40.756890780426)
9641242148GABBY O'HARA'SManhattan123WEST 39 STREET10018.02122788984Irish07/30/2019Violations were cited in the following area(s)....Cycle Inspection / Re-inspection40.753405-73.986602105.04.011300.01080611.01.008150e+09MN17POINT (-73.986602050292 40.753404587174)
9750095860THE TIMES EATERYManhattan6808 AVENUE10036.06463867787American02/28/2020Violations were cited in the following area(s)....Pre-permit (Operational) / Initial Inspection40.757991-73.989218105.03.011900.01024703.01.010150e+09MN17POINT (-73.989218092096 40.757991356019)
9850072861ITSUManhattan5307 AVENUE10018.09176393645Asian/Asian Fusion09/10/2018Violations were cited in the following area(s)....Pre-permit (Operational) / Initial Inspection40.753844-73.988551105.03.011300.01014485.01.007880e+09MN17POINT (-73.988551029682 40.753843959794)
9950068109LUKE'S LOBSTERManhattan1407BROADWAY10018.09174759192Seafood09/06/2017Violations were cited in the following area(s)....Pre-permit (Operational) / Initial Inspection40.753432-73.987151105.03.011300.01015265.01.008140e+09MN17POINT (-73.98715066791 40.753432097521)
\n", "

100 rows × 27 columns

\n", "
" ], "text/plain": [ " camis dba boro building street \\\n", "0 41551304 THE COUNTER Manhattan 7 TIMES SQUARE \n", "1 50055665 ANN INC CAFE Manhattan 7 TIMES SQUARE \n", "2 50049552 ERNST AND YOUNG Manhattan 5 TIMES SQ \n", "3 50014078 RED LOBSTER Manhattan 5 TIMES SQ \n", "4 50015171 NEW AMSTERDAM THEATER Manhattan 214 WEST 42 STREET \n", ".. ... ... ... ... ... \n", "95 41552060 PROSKAUER ROSE Manhattan 11 TIMES SQUARE \n", "96 41242148 GABBY O'HARA'S Manhattan 123 WEST 39 STREET \n", "97 50095860 THE TIMES EATERY Manhattan 680 8 AVENUE \n", "98 50072861 ITSU Manhattan 530 7 AVENUE \n", "99 50068109 LUKE'S LOBSTER Manhattan 1407 BROADWAY \n", "\n", " zipcode phone cuisine_description inspection_date \\\n", "0 10036.0 2129976801 American 12/22/2016 \n", "1 10036.0 2125413287 American 12/11/2019 \n", "2 10036.0 2127739994 Coffee/Tea 11/30/2018 \n", "3 10036.0 2127306706 Seafood 10/03/2017 \n", "4 10036.0 2125825472 American 06/26/2018 \n", ".. ... ... ... ... \n", "95 10036.0 2129695493 American 08/11/2017 \n", "96 10018.0 2122788984 Irish 07/30/2019 \n", "97 10036.0 6463867787 American 02/28/2020 \n", "98 10018.0 9176393645 Asian/Asian Fusion 09/10/2018 \n", "99 10018.0 9174759192 Seafood 09/06/2017 \n", "\n", " action ... \\\n", "0 Violations were cited in the following area(s). ... \n", "1 Violations were cited in the following area(s). ... \n", "2 Violations were cited in the following area(s). ... \n", "3 Violations were cited in the following area(s). ... \n", "4 Violations were cited in the following area(s). ... \n", ".. ... ... \n", "95 Violations were cited in the following area(s). ... \n", "96 Violations were cited in the following area(s). ... \n", "97 Violations were cited in the following area(s). ... \n", "98 Violations were cited in the following area(s). ... \n", "99 Violations were cited in the following area(s). ... \n", "\n", " inspection_type latitude longitude \\\n", "0 Cycle Inspection / Initial Inspection 40.755908 -73.986681 \n", "1 Cycle Inspection / Initial Inspection 40.755908 -73.986681 \n", "2 Cycle Inspection / Initial Inspection 40.755702 -73.987208 \n", "3 Cycle Inspection / Initial Inspection 40.755702 -73.987208 \n", "4 Cycle Inspection / Re-inspection 40.756317 -73.987652 \n", ".. ... ... ... \n", "95 Administrative Miscellaneous / Initial Inspection 40.756891 -73.990023 \n", "96 Cycle Inspection / Re-inspection 40.753405 -73.986602 \n", "97 Pre-permit (Operational) / Initial Inspection 40.757991 -73.989218 \n", "98 Pre-permit (Operational) / Initial Inspection 40.753844 -73.988551 \n", "99 Pre-permit (Operational) / Initial Inspection 40.753432 -73.987151 \n", "\n", " community_board council_district census_tract bin bbl \\\n", "0 105.0 3.0 11300.0 1086069.0 1.009940e+09 \n", "1 105.0 3.0 11300.0 1086069.0 1.009940e+09 \n", "2 105.0 3.0 11300.0 1024656.0 1.010130e+09 \n", "3 105.0 3.0 11300.0 1024656.0 1.010130e+09 \n", "4 105.0 3.0 11300.0 1024660.0 1.010130e+09 \n", ".. ... ... ... ... ... \n", "95 105.0 3.0 11300.0 1087978.0 1.010138e+09 \n", "96 105.0 4.0 11300.0 1080611.0 1.008150e+09 \n", "97 105.0 3.0 11900.0 1024703.0 1.010150e+09 \n", "98 105.0 3.0 11300.0 1014485.0 1.007880e+09 \n", "99 105.0 3.0 11300.0 1015265.0 1.008140e+09 \n", "\n", " nta business_location \n", "0 MN17 POINT (-73.986680953809 40.755907817312) \n", "1 MN17 POINT (-73.986680953809 40.755907817312) \n", "2 MN17 POINT (-73.987207980138 40.755702020307) \n", "3 MN17 POINT (-73.987207980138 40.755702020307) \n", "4 MN17 POINT (-73.987651832547 40.756316895053) \n", ".. ... ... \n", "95 MN17 POINT (-73.990023200823 40.756890780426) \n", "96 MN17 POINT (-73.986602050292 40.753404587174) \n", "97 MN17 POINT (-73.989218092096 40.757991356019) \n", "98 MN17 POINT (-73.988551029682 40.753843959794) \n", "99 MN17 POINT (-73.98715066791 40.753432097521) \n", "\n", "[100 rows x 27 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wr.opensearch.search(\n", " client,\n", " index=\"nyc_restaurants_inspections\",\n", " filter_path=[\"hits.hits._source\"],\n", " size=100,\n", " search_body={\n", " \"query\": {\n", " \"match_all\": {}\n", " },\n", " \"sort\": [\n", " {\n", " \"_geo_distance\": {\n", " \"business_location\": { # Times-Square - https://geojson.io/#map=16/40.7563/-73.9862\n", " \"lat\": 40.75613228383523,\n", " \"lon\": -73.9865791797638\n", " },\n", " \"order\": \"asc\"\n", " }\n", " }\n", " ]\n", " }\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.14", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.14" } }, "nbformat": 4, "nbformat_minor": 4 }