{ "cells": [ { "cell_type": "markdown", "id": "94a47c81", "metadata": {}, "source": [ "\n", "\n", "# Bringing Media2Cloud Video Analysis into an Amazon Neptune Knowledge Graph\n", "This notebook accompanies my blog post and shows how to move Media2Cloud (M2C) video analysis results into a knowledge graph on Amazon Neptune.\n", "\n", "In this notebook you insert two types of data into Neptune:\n", "- Seed data: an initial graph of organizations, their products, and their key people\n", "- M2C video analysis results\n", "\n", "Then we bring them together! Refer to the blog post for a detailed discussion.\n", "\n", "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0\n" ] }, { "cell_type": "markdown", "id": "808f1776", "metadata": {}, "source": [ "## Seed initial data\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "3b71e2ed", "metadata": {}, "source": [ "### Extract names of S3 buckets from environment" ] }, { "cell_type": "code", "execution_count": null, "id": "72d6a587", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Read the STAGE_BUCKET and M2C_ANALYSIS_BUCKET environment variables (set via ~/.bashrc)\n", "stream = os.popen(\"source ~/.bashrc ; echo $STAGE_BUCKET; echo $M2C_ANALYSIS_BUCKET\")\n", "lines = stream.read().split(\"\\n\")\n", "STAGING_BUCKET = lines[0]\n", "ANALYSIS_BUCKET = lines[1]" ] }, { "cell_type": "markdown", "id": "28c3719f", "metadata": {}, "source": [ "### Create local folder for analysis results" ] }, { "cell_type": "code", "execution_count": null, "id": "613dea3a", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "mkdir -p m2c/analysis" ] }, { "cell_type": "markdown", "id": "df757617", "metadata": {}, "source": [ "### Bulk-load to Neptune from S3" ] }, { "cell_type": "code", "execution_count": null, "id": "cf1bc978", "metadata": {}, "outputs": [], "source": [ "%load -s s3://{STAGING_BUCKET}/data/seeddata.ttl -f turtle --store-to loadres --run" ] }, { "cell_type": "markdown", "id": "5ac19737", "metadata": {}, "source": [ "### Check status of load" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "e6078fff", "metadata": {}, "outputs": [], "source": [ "%load_status {loadres['payload']['loadId']} --errors --details" ] }, { "cell_type": "markdown", "id": "67478e75", "metadata": {}, "source": [ "### Query the seed data" ] }, { "cell_type": "markdown", "id": "364b5091", "metadata": {}, "source": [ "#### Persons and role in org" ] }, { "cell_type": "code", "execution_count": null, "id": "64b8693e", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select * where {\n", " ?person a :Person .\n", " ?person :name ?label .\n", " OPTIONAL {?person skos:altLabel ?altlabel .} .\n", " OPTIONAL {?person :hasWikidataRef ?ref .} .\n", " ?person :hasRole ?role .\n", " ?role :hasRoleType ?roleType .\n", " ?role :hasRoleOrg ?org .\n", " \n", "} order by ?person \n" ] }, { "cell_type": "markdown", "id": "65fe4b6b", "metadata": {}, "source": [ "#### Federated query of person" ] }, { "cell_type": "code", "execution_count": null, "id": "1e7dc7cc", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "\n", "SELECT ?wiki ?p ?o\n", "WHERE \n", "{\n", " ?person a :Person .\n", " ?person :name \"Steve Jobs\" .\n", " ?person :hasWikidataRef ?wiki .\n", " \n", " SERVICE <https://query.wikidata.org/sparql> {\n", " ?wiki ?p ?o . \n", " } \n", "}" ] }, { "cell_type": "markdown", "id": "6d7f41f9", "metadata": {}, "source": [ "#### Orgs and products" ] }, { "cell_type": "code", "execution_count": null, "id": "fe51ed05", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select distinct ?org ?ref ?product ?productRef where {\n", " ?org a :Organization .\n", " OPTIONAL {?org :hasWikidataRef ?ref .} .\n", " OPTIONAL {\n", " ?product :producedBy ?org .\n", " OPTIONAL {?product :hasWikidataRef ?productRef . 
} . \n", " }\n", " \n", "} order by ?org ?product\n" ] }, { "cell_type": "markdown", "id": "bdf195c4", "metadata": {}, "source": [ "## Now add in the M2C analysis" ] }, { "cell_type": "markdown", "id": "25aa4109", "metadata": {}, "source": [ "### Boto3 helpers to bring in the video files from S3" ] }, { "cell_type": "code", "execution_count": null, "id": "00707588", "metadata": {}, "outputs": [], "source": [ "# S3 Helpers\n", "\n", "import boto3\n", "import json\n", "\n", "# categories of findings we're interested in from Rekognition and Comprehend\n", "REKOGNITION_CATS = ['celeb', 'person', 'face', 'label', 'text']\n", "COMPREHEND_CATS = ['entity', 'keyphrase', 'sentiment']\n", "\n", "# connection to analysis bucket in S3. Make sure you set ANALYSIS_BUCKET above.\n", "s3 = boto3.client('s3')\n", "s3r = boto3.resource('s3')\n", "s3_ab = s3r.Bucket(ANALYSIS_BUCKET)\n", "\n", "# Read S3 object at given key and return as JSON object\n", "def open_json(key):\n", "    obj = s3r.Object(ANALYSIS_BUCKET, key)\n", "    file_content = obj.get()['Body'].read().decode('utf-8')\n", "    return json.loads(file_content)\n", "\n", "# return list of top-level folders in analysis bucket (empty list if the bucket has none)\n", "def get_video_folders():\n", "    return s3.list_objects_v2(Bucket=ANALYSIS_BUCKET, Delimiter='/').get('CommonPrefixes', [])\n", " \n", "# For a given video (in the specified folder), build list of S3 files that have data for \n", "# the Rekognition and Comprehend categories.\n", "def build_video_file_summary(folder_obj):\n", "    # build initial video object\n", "    path = folder_obj['Prefix']\n", " \n", "    # skip the neptune folder, which we use to stage data for Neptune\n", "    if path == 'neptune/':\n", "        return None\n", " \n", "    video = {'id': path.split(\"/\")[0], 'rekognition': {}, 'comprehend': {}}\n", "    for r in REKOGNITION_CATS:\n", "        video['rekognition'][r] = {'metadata': [], 'timeseries': [], 'raw': []}\n", "    for c in COMPREHEND_CATS:\n", "        video['comprehend'][c] = {'metadata': [], 'raw': []}\n", " \n", "    # 
List those files and organize them by category\n", "    for object_summary in s3_ab.objects.filter(Prefix=path):\n", "        opath = object_summary.key\n", "        splits = opath.split(\"/\")\n", "        subfolder = splits[2]\n", "        video['filename'] = splits[1]\n", "        if subfolder in ['metadata', 'timeseries']:\n", "            cat = splits[3]\n", "            item = splits[4]\n", "            service = 'rekognition' if cat in video['rekognition'] else 'comprehend' if cat in video['comprehend'] else None\n", "            if not(service is None) and cat in video[service]:\n", "                video[service][cat][subfolder].append(opath)\n", "        elif subfolder == 'raw':\n", "            raw_ts = splits[3]\n", "            raw_service = splits[4]\n", "            if raw_service in ['rekognition', 'comprehend']:\n", "                raw_cat = splits[5]\n", "                raw_item = splits[6]\n", "                if raw_cat in video[raw_service]:\n", "                    video[raw_service][raw_cat]['raw'].append(opath)\n", "    return video\n", " \n", "\n", "#test\n", "get_video_folders()\n" ] }, { "cell_type": "markdown", "id": "40b8c38e", "metadata": {}, "source": [ "### Install RDFLib" ] }, { "cell_type": "code", "execution_count": null, "id": "ad0ce8b7", "metadata": {}, "outputs": [], "source": [ "!pip install rdflib" ] }, { "cell_type": "markdown", "id": "7177b3f0", "metadata": {}, "source": [ "### RDFLib helpers to build the triples for M2C video analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "45d99959", "metadata": {}, "outputs": [], "source": [ "# RDF helpers\n", "\n", "from rdflib import Graph, URIRef, Literal, BNode\n", "from rdflib.namespace import RDF, RDFS, OWL, SKOS\n", "import urllib.parse\n", "\n", "# main URI namespace\n", "NAMESPACE=\"http://amazon.com/aws/wwso/neptune/demo/m2c/\"\n", "\n", "# List to create common triples. 
Common across all videos.\n", "# We will also be creating lists per video in next cell.\n", "common_g = Graph()\n", "\n", "# OWL classes\n", "CLASS_VIDEO_ANALYSIS = URIRef(NAMESPACE + \"VideoAnalysis\")\n", "CLASS_CELEB = URIRef(NAMESPACE + \"Celebrity\")\n", "CLASS_LABEL = URIRef(NAMESPACE + \"Label\")\n", "CLASS_APPEARANCE = URIRef(NAMESPACE + \"Appearance\")\n", "CLASS_EMOTION = URIRef(NAMESPACE + \"Emotion\")\n", "CLASS_SENTIMENT = URIRef(NAMESPACE + \"Sentiment\")\n", "CLASS_ENTITY = URIRef(NAMESPACE + \"Entity\")\n", "\n", "# datatype properties\n", "PROP_ID=URIRef(NAMESPACE + \"id\")\n", "PROP_FILENAME=URIRef(NAMESPACE + \"filename\")\n", "PROP_NAME=URIRef(NAMESPACE + \"name\")\n", "PROP_APPEARANCE=URIRef(NAMESPACE + \"appearance\")\n", "PROP_APPEARANCE_PCT=URIRef(NAMESPACE + \"appearancePct\")\n", "PROP_PERSON_COUNT=URIRef(NAMESPACE + \"personCount\")\n", "PROP_COUNT=URIRef(NAMESPACE + \"count\")\n", "PROP_SUBTYPE=URIRef(NAMESPACE + \"subtype\")\n", "PROP_OBSERVED_TEXT=URIRef(NAMESPACE + \"observedText\")\n", "PROP_EXTRACTED_KEYPHRASE=URIRef(NAMESPACE + \"extractedKeyphrase\")\n", "PROP_SENTIMENT=URIRef(NAMESPACE + \"hasSentiment\")\n", "PROP_SENTIMENT_POSITIVE=URIRef(NAMESPACE + \"sentimentCountPositive\")\n", "PROP_SENTIMENT_NEGATIVE=URIRef(NAMESPACE + \"sentimentCountNegative\")\n", "PROP_SENTIMENT_MIXED=URIRef(NAMESPACE + \"sentimentCountMixed\")\n", "PROP_SENTIMENT_NEUTRAL=URIRef(NAMESPACE + \"sentimentCountNeutral\")\n", "\n", "# Object properties\n", "PROP_HAS_GENDER=URIRef(NAMESPACE + \"hasGender\")\n", "PROP_HAS_EXT_URL=URIRef(NAMESPACE + \"hasExternalURL\")\n", "PROP_HAS_WIKIDATA_REF=URIRef(NAMESPACE + \"hasWikidataRef\")\n", "PROP_HAS_APPEARANCE=URIRef(NAMESPACE + \"hasAppearance\")\n", "PROP_HAS_CELEB_APPEARANCE=URIRef(NAMESPACE + \"hasCelebAppearance\")\n", "PROP_HAS_LABEL_APPEARANCE=URIRef(NAMESPACE + \"hasLabelAppearance\")\n", "PROP_HAS_APPEARANCE_SUBJECT=URIRef(NAMESPACE + \"hasAppearanceSubject\")\n", "PROP_HAS_EMOTION=URIRef(NAMESPACE + 
\"hasEmotion\")\n", "PROP_HAS_EXTRACTED_ENTITY=URIRef(NAMESPACE + \"hasExtractedEntity\")\n", "\n", "# add triples defining an appearance. Add to the graph g for the video.\n", "def add_appearance(g, prop, uri, j, vuri):\n", "    auri = BNode()\n", "    g.add((auri, RDF.type, CLASS_APPEARANCE))\n", "    g.add((vuri, prop, auri))\n", "    g.add((auri, PROP_APPEARANCE, Literal(j['appearance'])))\n", "    g.add((auri, PROP_APPEARANCE_PCT, Literal(100.0 * j['appearance']/j['duration'])))\n", "    g.add((auri, PROP_HAS_APPEARANCE_SUBJECT, uri))\n", "\n", "# add a celebrity to the common graph\n", "# note we also need to convert Wikidata web URL to an RDF URI\n", "common_celeb = {}\n", "def get_gender_uri(g):\n", "    return URIRef(NAMESPACE + \"gender/\" + g)\n", "def add_celeb(c):\n", "    if c['Id'] in common_celeb:\n", "        return common_celeb[c['Id']]\n", " \n", "    curi = URIRef(NAMESPACE + \"celeb/\" + c['Id'])\n", "    common_g.add((curi, RDF.type, CLASS_CELEB))\n", "    common_g.add((curi, PROP_ID, Literal(c['Id'])))\n", "    common_g.add((curi, PROP_NAME, Literal(c['Name'])))\n", "    common_g.add((curi, PROP_HAS_GENDER, get_gender_uri(c['KnownGender']['Type'])))\n", "    for u in c['Urls']:\n", "        # Convert the wikidata to a URI\n", "        if u.startswith(\"www.wikidata.org/wiki\"):\n", "            spl = u.split(\"/\")\n", "            spl[1] = 'entity'\n", "            w = \"http://\" + \"/\".join(spl)\n", "            common_g.add((curi, PROP_HAS_WIKIDATA_REF, URIRef(w)))\n", "        common_g.add((curi, PROP_HAS_EXT_URL, URIRef(u)))\n", "    common_celeb[c['Id']] = curi\n", "    return curi\n", " \n", "# add a label to the common graph\n", "# and build SKOS broader from label parents.\n", "common_label = {}\n", "def get_label_uri(l):\n", "    return URIRef(NAMESPACE + \"label/\" +urllib.parse.quote(l))\n", " \n", "def add_label(c):\n", "    if c['Name'] in common_label:\n", "        return common_label[c['Name']]\n", "\n", "    luri = get_label_uri(c['Name'])\n", "    common_g.add((luri, RDF.type, CLASS_LABEL))\n", "    common_g.add((luri, RDF.type, SKOS.Concept))\n", "    
common_g.add((luri, PROP_NAME, Literal(c['Name'])))\n", "    common_g.add((luri, SKOS.prefLabel, Literal(c['Name'])))\n", "    kid = luri\n", "    for p in c['Parents']:\n", "        parent = get_label_uri(p['Name'])\n", "        if not(p['Name'] in common_label):\n", "            common_g.add((parent, RDF.type, CLASS_LABEL))\n", "            common_g.add((parent, RDF.type, SKOS.Concept))\n", "            common_g.add((parent, PROP_NAME, Literal(p['Name'])))\n", "            common_g.add((parent, SKOS.prefLabel, Literal(p['Name'])))\n", "        #print(\"broader \" + str(kid) + \" \" + str(parent))\n", "        common_g.add((kid, SKOS.broader, parent))\n", "        kid = parent\n", "    # remember the label so repeat calls return early, as add_celeb does\n", "    common_label[c['Name']] = luri\n", "    return luri\n", "\n", "# Add an emotion to video graph g\n", "def add_emotion(g, gender, subtype, count, vuri):\n", "    auri = BNode()\n", "    g.add((auri, RDF.type, CLASS_EMOTION))\n", "    g.add((vuri, PROP_HAS_EMOTION, auri))\n", "    g.add((auri, PROP_COUNT, Literal(count)))\n", "    g.add((auri, PROP_SUBTYPE, Literal(subtype)))\n", "    g.add((auri, PROP_HAS_GENDER, get_gender_uri(gender)))\n", "\n", "# Add an entity to video graph g\n", "def add_entity(g, record, vuri):\n", "    auri = BNode()\n", "    g.add((auri, RDF.type, CLASS_ENTITY))\n", "    g.add((vuri, PROP_HAS_EXTRACTED_ENTITY, auri))\n", "    g.add((auri, PROP_SUBTYPE, Literal(record['type'])))\n", "    # use the record argument; the original read the caller's loop variable ent\n", "    g.add((auri, PROP_NAME, Literal(record['text'])))\n", " \n", "# Give URI of the sentimentCount property. 
We define 4 of these for Positive, Negative, Neutral, Mixed\n", "def get_sentiment_uri(s):\n", "    return URIRef(NAMESPACE + \"sentimentCount\" + s)\n", " \n", "# save common graph to TTL file on the notebook instance\n", "def save_common():\n", "    common_g.serialize(format='ttl', destination='m2c/analysis/common.ttl')\n" ] }, { "cell_type": "markdown", "id": "7001e12d", "metadata": {}, "source": [ "### For each video in analysis bucket, build triples and write to RDF file on the notebook instance" ] }, { "cell_type": "code", "execution_count": null, "id": "9ad2e306", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import traceback\n", "\n", "# Get all the video folders. If you have a lot of these, you might want to filter this list\n", "# Loop through the videos and build RDF for each.\n", "videos = get_video_folders()\n", "for v in videos:\n", "    try:\n", "        # determine which S3 objects are needed\n", "        summary = build_video_file_summary(v)\n", "        if summary is None:\n", "            continue\n", " \n", "        # Create video-specific RDF graph we will be building on\n", "        g = Graph()\n", "        video_uri = URIRef(NAMESPACE + \"video/\" + summary['id'])\n", "        g.add((video_uri, RDF.type, CLASS_VIDEO_ANALYSIS))\n", "        g.add((video_uri, PROP_FILENAME, Literal(summary['filename'])))\n", "        g.add((video_uri, PROP_ID, Literal(summary['id'])))\n", "\n", "        # Build celebs from Rekognition\n", "        # Add a celebrity appearance to the video graph. 
Also add celeb to common graph.\n", "        celebs = {}\n", "        celebs_byname = {}\n", "        for f in summary['rekognition']['celeb']['raw']:\n", "            if f.endswith('mapFile.json'):\n", "                continue # skip\n", "            j = open_json(f)\n", "            for cr in j['Celebrities']:\n", "                c = cr['Celebrity']\n", "                curi = add_celeb(c)\n", "                if not(c['Id'] in celebs):\n", "                    celebs[c['Id']] = curi\n", "                    celebs_byname[c['Name']] = curi\n", "        for f in summary['rekognition']['celeb']['timeseries']:\n", "            j = open_json(f)\n", "            # .get returns None for an unknown name instead of raising KeyError\n", "            curi = celebs_byname.get(j['label'])\n", "            if not(curi is None):\n", "                add_appearance(g, PROP_HAS_CELEB_APPEARANCE, curi, j, video_uri)\n", "\n", "        # Build labels from Rekognition\n", "        # Add a label appearance to the video graph. Also add label to common graph.\n", "        for f in summary['rekognition']['label']['raw']:\n", "            if f.endswith('mapFile.json'):\n", "                continue\n", "            j = open_json(f)\n", "            labels = {}\n", "            for cr in j['Labels']:\n", "                c = cr['Label']\n", "                luri = add_label(c)\n", "                if not(c['Name'] in labels):\n", "                    labels[c['Name']] = c\n", "        for f in summary['rekognition']['label']['timeseries']:\n", "            j = open_json(f)\n", "            luri = get_label_uri(j['label'])\n", "            if not(luri is None):\n", "                add_appearance(g, PROP_HAS_LABEL_APPEARANCE, luri, j, video_uri)\n", " \n", "        # Build persons from Rekognition.\n", "        # Lots of detail, but just count is OK\n", "        for f in summary['rekognition']['person']['raw']:\n", "            if f.endswith(\"mapFile.json\"):\n", "                j = open_json(f)\n", "                g.add((video_uri, PROP_PERSON_COUNT, Literal(len(j))))\n", "\n", "        # Build faces from Rekognition\n", "        # We express this in terms of Emotion rather than face!\n", "        # We will just build count of emotions per gender\n", "        emotions = {}\n", "        for f in summary['rekognition']['face']['raw']:\n", "            if f.endswith(\"mapFile.json\"):\n", "                continue\n", "            j = open_json(f)\n", "            for ff in j['Faces']:\n", "                f = ff['Face']\n", "                gender = f['Gender']['Value']\n", "                if not(gender in emotions):\n", "                    emotions[gender] = {}\n", "                
for em in f['Emotions']:\n", "                    # Rekognition reports Confidence on a 0-100 scale, so the cutoff is 75, not 0.75\n", "                    if em['Confidence'] >= 75:\n", "                        emv = em['Type']\n", "                        if not(emv in emotions[gender]):\n", "                            emotions[gender][emv] = 0\n", "                        emotions[gender][emv] = emotions[gender][emv] + 1\n", "        for gender in emotions:\n", "            for emo in emotions[gender]:\n", "                add_emotion(g, gender, emo, emotions[gender][emo], video_uri)\n", "\n", "        # Build text from Rekognition\n", "        for f in summary['rekognition']['text']['raw']:\n", "            if f.endswith('mapFile.json'):\n", "                j = open_json(f)\n", "                for txt in j:\n", "                    g.add((video_uri, PROP_OBSERVED_TEXT, Literal(txt)))\n", " \n", "        # Build entity from Comprehend\n", "        for f in summary['comprehend']['entity']['metadata']:\n", "            j = open_json(f)\n", "            for ent in j:\n", "                add_entity(g, ent, video_uri)\n", " \n", "        # Build keyphrases from Comprehend\n", "        for f in summary['comprehend']['keyphrase']['metadata']:\n", "            j = open_json(f)\n", "            for kpr in j:\n", "                g.add((video_uri, PROP_EXTRACTED_KEYPHRASE, Literal(kpr['text'])))\n", "\n", "        # Build sentiment from Comprehend - just totals\n", "        for f in summary['comprehend']['sentiment']['metadata']:\n", "            j = open_json(f)\n", "            sentiment = {}\n", "            for s in j:\n", "                if not(s['text'] in sentiment):\n", "                    sentiment[s['text']] = 0\n", "                sentiment[s['text']] = sentiment[s['text']] + s['end'] - s['begin']\n", "            for s in sentiment:\n", "                g.add((video_uri, get_sentiment_uri(s), Literal(sentiment[s])))\n", "\n", "\n", "        # save the triples for video-specific graph\n", "        print(\"serialize \" + summary['id'])\n", "        g.serialize(format=\"ttl\", destination=\"m2c/analysis/\" + summary['id'] + \".ttl\")\n", "    except Exception as e:\n", "        traceback.print_exc()\n", "\n", "# serialize the common\n", "print(\"common\")\n", "save_common()\n" ] }, { "cell_type": "markdown", "id": "01308529", "metadata": {}, "source": [ "### Upload RDF files to S3 " ] }, { "cell_type": "code", "execution_count": null, "id": "7fe67ed3", "metadata": {}, "outputs": [], "source": [ "%%bash -s 
\"$STAGING_BUCKET\"\n", "\n", "cd m2c/analysis\n", "aws s3 sync . s3://$1/analysis" ] }, { "cell_type": "markdown", "id": "f0725086", "metadata": {}, "source": [ "### Bulk-load these to Neptune" ] }, { "cell_type": "code", "execution_count": null, "id": "f849ad48", "metadata": {}, "outputs": [], "source": [ "%load -s s3://{STAGING_BUCKET}/analysis -f turtle --store-to loadres2 --run" ] }, { "cell_type": "markdown", "id": "b920f355", "metadata": {}, "source": [ "### Check load status" ] }, { "cell_type": "code", "execution_count": null, "id": "8927b888", "metadata": {}, "outputs": [], "source": [ "%load_status {loadres2['payload']['loadId']} --errors --details" ] }, { "cell_type": "markdown", "id": "3bddf0d4", "metadata": {}, "source": [ "### Query the combined data\n" ] }, { "cell_type": "markdown", "id": "24b061cf", "metadata": {}, "source": [ "#### Summary of videos" ] }, { "cell_type": "code", "execution_count": null, "id": "157f9ce1", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "\n", "SELECT * where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " OPTIONAL {?video :personCount ?personCount . } .\n", " OPTIONAL {?video :sentimentCountPositive ?sentimentPos . } .\n", " OPTIONAL {?video :sentimentCountNegative ?sentimentNeg . } .\n", " OPTIONAL {?video :sentimentCountMixed ?sentimentMixed . } .\n", " OPTIONAL {?video :sentimentCountNeutral ?sentimentNeutral . } .\n", "} " ] }, { "cell_type": "markdown", "id": "beae10ce", "metadata": {}, "source": [ "#### Describe one\n", "Choose a video URI from the previous result and paste it into the DESCRIBE query. Scroll through the Table view and notice the sentiment, observed text, extracted entities and keyphrases, label and celebrity appearances, and emotions. 
Try the Graph view to see the result visually.\n", "\n", "If you are on a Neptune engine release prior to 1.2.1.0, comment out the hint:Query clause." ] }, { "cell_type": "code", "execution_count": null, "id": "b9eaadee", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "PREFIX hint: <http://aws.amazon.com/neptune/vocab/v01/QueryHints#>\n", "\n", "DESCRIBE <REPLACE_WITH_VIDEO_URI>\n", "{\n", " hint:Query hint:describeMode \"CBD\"\n", "}\n" ] }, { "cell_type": "markdown", "id": "8dc65cdf", "metadata": {}, "source": [ "#### Show celeb appearances in the videos" ] }, { "cell_type": "code", "execution_count": null, "id": "58e52f0c", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select ?filename ?celebName ?celebRef ?appPct where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " OPTIONAL {\n", " ?video :hasCelebAppearance ?appo .\n", " ?appo :appearancePct ?appPct .\n", " ?appo :hasAppearanceSubject ?sub .\n", " OPTIONAL { ?sub :hasWikidataRef ?celebRef } .\n", " ?sub :name ?celebName\n", " \n", " } . \n", "} " ] }, { "cell_type": "markdown", "id": "6b061a71", "metadata": {}, "source": [ "#### LINK: Tie celebs in videos to persons from seed, matching on wikidata ref" ] }, { "cell_type": "code", "execution_count": null, "id": "7b824b41", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select (?filename AS ?s) (?ref as ?p) (?person as ?o) where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " ?video :hasCelebAppearance/:hasAppearanceSubject/:hasWikidataRef ?ref.\n", " ?person :hasWikidataRef ?ref .\n", " ?person a :Person .\n", "} " ] }, { "cell_type": "markdown", "id": "70716c69", "metadata": {}, "source": [ "#### LINK: Tie extracted entities to persons, orgs, products.\n", "This uses exact string match on name or alt label." 
] }, { "cell_type": "code", "execution_count": null, "id": "29db083f", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select distinct ?filename ?entName ?match ?entType where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " ?video :hasExtractedEntity ?ent .\n", " ?ent :name ?entName .\n", " ?ent :subtype ?entType .\n", " VALUES ?entType { \"PERSON\" \"ORGANIZATION\" \"OTHER\" \"COMMERCIAL_ITEM\" \"EVENT\" } .\n", " VALUES ?seedClass { :Person :Organization :Product } .\n", " ?match :name|skos:altLabel ?entName .\n", " ?match a ?seedClass .\n", "} " ] }, { "cell_type": "markdown", "id": "c1f08383", "metadata": {}, "source": [ "#### LINK: Tie any text in video analysis to seed object. \n", "Uses lower-case match. In practice, run this for a specific video." ] }, { "cell_type": "code", "execution_count": null, "id": "26d3f609", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select distinct ?filename ?text ?match ?entName where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " ?video :hasExtractedEntity/:name|:extractedKeyphrase|:observedText|:hasLabelAppearance/:hasAppearanceSubject/:name ?text .\n", " VALUES ?seedClass { :Person :Organization :Product } .\n", " ?match :name|skos:altLabel ?entName .\n", " ?match a ?seedClass .\n", " FILTER(lcase(?text) = lcase(?entName)) .\n", "} order by ?filename ?match" ] }, { "cell_type": "markdown", "id": "2ba6a61d", "metadata": {}, "source": [ "#### LINK: Match on m2c label and bring in m2c taxonomy for reference" ] }, { "cell_type": "code", "execution_count": null, "id": "ecbcf122", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix 
xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select distinct ?filename ?term ?entName (GROUP_CONCAT(?parentTerm;SEPARATOR=\",\") AS ?parentTerms)\n", "where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " ?video :hasLabelAppearance/:hasAppearanceSubject ?term .\n", " ?term :name ?label .\n", " OPTIONAL {?term skos:broader+ ?parentTerm .} .\n", " \n", " VALUES ?seedClass { :Person :Organization :Product } .\n", " ?match :name|skos:altLabel ?entName .\n", " ?match a ?seedClass .\n", " \n", " FILTER (lcase(?entName) = lcase(?label)) .\n", "} GROUP BY ?filename ?term ?entName " ] }, { "cell_type": "markdown", "id": "4d14219a", "metadata": {}, "source": [ "#### LINK: Match on extracted entity and show what else is there" ] }, { "cell_type": "code", "execution_count": null, "id": "62f512a1", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "prefix : <http://amazon.com/aws/wwso/neptune/demo/m2c/>\n", "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "prefix owl: <http://www.w3.org/2002/07/owl#>\n", "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n", "prefix skos: <http://www.w3.org/2004/02/skos/core#>\n", "\n", "select distinct ?filename ?entName ?match (GROUP_CONCAT(?anyEntity;SEPARATOR=\",\") AS ?allEntities) \n", "where {\n", " ?video a :VideoAnalysis .\n", " ?video :filename ?filename .\n", " ?video :hasExtractedEntity ?ent .\n", " ?ent :name ?entName .\n", " ?ent :subtype ?entType .\n", " VALUES ?entType { \"PERSON\" \"ORGANIZATION\" \"OTHER\" \"COMMERCIAL_ITEM\" \"EVENT\" } .\n", " VALUES ?seedClass { :Person :Organization :Product } .\n", " ?match :name|skos:altLabel ?entName .\n", " ?match a ?seedClass .\n", " ?video :hasExtractedEntity/:name ?anyEntity .\n", "} GROUP BY ?filename ?entName ?match \n", "LIMIT 10" ] }, { "cell_type": "markdown", "id": "f001571a", "metadata": {}, "source": [ "## Cleanup (if necessary)\n", "This deletes every triple in the database in a single transaction. Run it only if the database holds nothing but this small dataset. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "01386517", "metadata": {}, "outputs": [], "source": [ "%%sparql\n", "\n", "delete {?s ?p ?o} where {?s ?p ?o}" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }