{ "cells": [ { "cell_type": "markdown", "id": "c4a68ee1-09fb-4d8f-a614-d137c731d243", "metadata": {}, "source": [ "# Unsupervised evaluation of LLMs using LLMs" ] }, { "cell_type": "markdown", "id": "ded26b0e", "metadata": {}, "source": [ "_License information_\n", "\n", " Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", " SPDX-License-Identifier: MIT-0" ] }, { "cell_type": "markdown", "id": "ecaa6385-94c8-43d1-8f25-9abf3b9f02c1", "metadata": {}, "source": [ "This notebook shows how to use an LLM to evaluate the work of other LLMs.\n", "\n", "In this simple example, we will load a canned summarization dataset from Hugging Face. We will obtain summaries from two LLMs, Falcon 40B Instruct BF16 and Flan T5 XL. We will ask Anthropic's Claude model to evaluate those summaries, along with the ground truth summary." ] }, { "cell_type": "markdown", "id": "6ea04ab4-09e7-4dee-b725-5e31bb6f04e6", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "You will need an [API key](https://docs.anthropic.com/claude/docs/getting-access-to-claude) to use Claude. In the future you can use [Amazon Bedrock](https://aws.amazon.com/bedrock/) instead, as it offers Claude as a supported model." ] }, { "cell_type": "code", "execution_count": 3, "id": "4edc13f1-283c-43a4-9e61-4461850231cf", "metadata": { "tags": [] }, "outputs": [], "source": [ "claude_api_key = '' ## Enter your API key here" ] }, { "cell_type": "markdown", "id": "1850362f-d73b-4fac-904b-fb00f621e66d", "metadata": { "tags": [] }, "source": [ "You will also need to install several Python modules." ] }, { "cell_type": "code", "execution_count": null, "id": "298f4860-c7c4-4575-9590-08a50862f703", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install datasets" ] }, { "cell_type": "code", "execution_count": null, "id": "7f892d3a-4dcc-4699-a763-9f04e3b7db55", "metadata": { "tags": [] }, "outputs": [], "source": [ "! 
pip install -U anthropic" ] }, { "cell_type": "code", "execution_count": null, "id": "218f5f95-aa09-4028-8342-fdebce701f63", "metadata": { "tags": [] }, "outputs": [], "source": [ "! pip install -U pydantic==1.10" ] }, { "cell_type": "markdown", "id": "de6ac61a-49d6-4bbb-a2ad-1e19784ea35d", "metadata": {}, "source": [ "Finally, define the names of the SageMaker endpoints you're using for Falcon and Flan-T5." ] }, { "cell_type": "code", "execution_count": 4, "id": "bd36946a-d42a-4113-9575-6062a3605fad", "metadata": { "tags": [] }, "outputs": [], "source": [ "t5_ep_name = '' # Enter the name of the SageMaker inference endpoint you deployed with Flan-T5" ] }, { "cell_type": "code", "execution_count": 5, "id": "922690e3-e1e5-46d8-b7f3-aa8006ca5be4", "metadata": { "tags": [] }, "outputs": [], "source": [ "falcon_ep_name = '' # Enter the name of the SageMaker inference endpoint you deployed with Falcon 40B" ] }, { "cell_type": "markdown", "id": "8cae7a3d-ce06-4655-a203-8f6e13aa15b7", "metadata": { "tags": [] }, "source": [ "## Dataset\n", "\n", "We'll use the cnn_dailymail dataset. We'll only process five samples to save time." 
] }, { "cell_type": "code", "execution_count": 6, "id": "57890bee-eaee-41a4-bb11-455475df0e6f", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3\n", " warnings.warn(f\"A NumPy version >={np_minversion} and <{np_maxversion}\"\n", "Found cached dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "29e891a9f3d148beb24e9d36b0e4cbb5", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from datasets import load_dataset\n", "\n", "dataset = load_dataset('cnn_dailymail', '3.0.0')" ] }, { "cell_type": "code", "execution_count": 7, "id": "0bc6de13-a24f-40fb-bb1f-a72b17cd3000", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['article', 'highlights', 'id'],\n", " num_rows: 287113\n", " })\n", " validation: Dataset({\n", " features: ['article', 'highlights', 'id'],\n", " num_rows: 13368\n", " })\n", " test: Dataset({\n", " features: ['article', 'highlights', 'id'],\n", " num_rows: 11490\n", " })\n", "})" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset" ] }, { "cell_type": "code", "execution_count": 8, "id": "ba4ce648-d68e-487f-b317-48ca451cee8e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\\'t cast a spell on him. 
Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don\\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don\\'t think I\\'ll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he\\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I\\'ll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe\\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say \\'kid star goes off the rails,\\'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films. Watch I-Reporter give her review of Potter\\'s latest » . There is life beyond Potter, however. The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. 
Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer\\'s \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he\\'s legally an adult: \"I just think I\\'m going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset['train']['article'][0]" ] }, { "cell_type": "code", "execution_count": 9, "id": "8bb890e9-5b7b-4602-a39c-c66f58620293", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\"Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\\nYoung actor says he has no plans to fritter his cash away .\\nRadcliffe's earnings from first five Potter films have been held in trust fund .\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset['train']['highlights'][0]" ] }, { "cell_type": "code", "execution_count": 10, "id": "1d366892-6b94-432e-8291-9dfa03ed779b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([143884, 60192, 54750, 117176, 138558])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "num_to_eval = 5\n", "eval_idxs = np.random.randint(low=0, high=len(dataset['train']), size=num_to_eval)\n", "eval_idxs" ] }, { "cell_type": "code", "execution_count": 11, "id": "d5bda61f-ed8a-4e2c-bd8e-92976095919b", "metadata": { "tags": [] }, "outputs": [], "source": [ "docs_to_summarize = [dataset['train']['article'][i] for i in eval_idxs]" ] }, { "cell_type": "code", "execution_count": 12, "id": "2376fd18-c725-47f6-865c-8e8f0cf0e381", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"len(docs_to_summarize)" ] }, { "cell_type": "code", "execution_count": 13, "id": "492b972a-b408-4bae-8c1c-c21aa24ac42e", "metadata": { "tags": [] }, "outputs": [], "source": [ "docs_gt = [dataset['train']['highlights'][i] for i in eval_idxs]" ] }, { "cell_type": "markdown", "id": "a741a9f5-051b-41cd-bcd5-75f22e6f046a", "metadata": { "tags": [] }, "source": [ "## Get summaries" ] }, { "cell_type": "markdown", "id": "1e23632f-2efa-46d8-b533-4b33027eda62", "metadata": {}, "source": [ "### Flan-T5" ] }, { "cell_type": "code", "execution_count": 14, "id": "c2c7d02d-6dce-4379-b9f0-cc0c4a95fe9d", "metadata": { "tags": [] }, "outputs": [], "source": [ "import time" ] }, { "cell_type": "code", "execution_count": 15, "id": "c4d6348e-5f2b-4b2e-a754-3af0df8cdb4e", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.serializers import JSONSerializer\n", "import sagemaker\n", "\n", "t5_ep = t5_ep_name\n", "t5_predictor = sagemaker.predictor.Predictor(t5_ep)\n", "t5_predictor.serializer = JSONSerializer()\n", "t5_predictor.content_type = \"application/json\"" ] }, { "cell_type": "code", "execution_count": 16, "id": "3366dabc-43e2-45ea-ac9b-0ab124f8cd23", "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "\n", "def query_t5(predictor, doc):\n", "    # Ask Flan-T5 for a sampled summary of the full article text.\n", "    payload = {\n", "        \"text_inputs\": f\"Write a short summary for this text: \\\"{doc}\\\"\",\n", "        \"max_length\": 5000,\n", "        \"do_sample\": True,\n", "        \"top_k\": 10,\n", "    }\n", "    response = json.loads(predictor.predict(payload))\n", "    return response['generated_texts'][0]" ] }, { "cell_type": "code", "execution_count": 17, "id": "d57eb662-c6ea-45e5-a85d-fbcba94c56ef", "metadata": { "tags": [] }, "outputs": [], "source": [ "t5_sums = []\n", "for doc in docs_to_summarize:\n", "    response = query_t5(t5_predictor, doc)\n", "    t5_sums.append(response)\n", "    time.sleep(2)" ] }, { "cell_type": "code", "execution_count": 18, "id": "bf52736b-6efc-44b8-b21c-046d81993d4f", 
"metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'Jury selection will begin in the trial of Andre Leteve, who is accused of gunning down his two boys in 2010. He is accused of doing it as an act of vengeance against his wife Laurie, with whom he was locked in a bitter divorce battle.'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t5_sums[0]" ] }, { "cell_type": "markdown", "id": "ca1ea9f6-eefc-4dbb-a145-b8ceb6d678e6", "metadata": {}, "source": [ "### Falcon" ] }, { "cell_type": "code", "execution_count": 19, "id": "5f150f87-1129-4f48-9797-7430d8d3029a", "metadata": {}, "outputs": [], "source": [ "from sagemaker.serializers import JSONSerializer\n", "import sagemaker\n", "\n", "falcon_ep = falcon_ep_name\n", "falcon_predictor = sagemaker.predictor.Predictor(falcon_ep)\n", "falcon_predictor.serializer = JSONSerializer()\n", "falcon_predictor.content_type = \"application/json\"" ] }, { "cell_type": "code", "execution_count": 23, "id": "4049fa31-c0c4-4bdc-bfad-39a0dc5d2060", "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "\n", "def query_falcon(predictor, doc):\n", " payload = {\n", " \"inputs\": f\"\\\"{doc[:950]}\\\". 
Summarize the article above:\",\n", " \"max_new_tokens\": 300,\n", " \"return_full_text\": False\n", " }\n", " response = json.loads(predictor.predict(payload))\n", " return response[0]['generated_text']" ] }, { "cell_type": "code", "execution_count": 24, "id": "f594b27b-e42d-4b21-838d-94f5439a65c3", "metadata": {}, "outputs": [], "source": [ "falcon_sums = []\n", "for doc in docs_to_summarize:\n", "    response = query_falcon(falcon_predictor, doc)\n", "    falcon_sums.append(response)\n", "    time.sleep(2)" ] }, { "cell_type": "code", "execution_count": 25, "id": "93c79085-b54e-461e-8440-a57246d6cebb", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'\\nThe article above is about an Arizona man named Andre Leteve who is accused of murdering his'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "falcon_sums[0]" ] }, { "cell_type": "markdown", "id": "aa341ff9-3615-482c-af71-c47b5f8be003", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": 26, "id": "8db497fb-96ee-450c-b0af-79398d21ba54", "metadata": { "tags": [] }, "outputs": [], "source": [ "from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT\n", "\n", "anthropic = Anthropic(api_key=claude_api_key)\n" ] }, { "cell_type": "code", "execution_count": 27, "id": "46ef5253-30cb-4e08-aed8-eaa9e6e002ce", "metadata": {}, "outputs": [], "source": [ "prompt_eng_base = '''You will be given a summary of a news article. Your task is to evaluate the summary in four dimensions: accuracy, coherence, factuality, and completeness. 
Provide a score of 1-5 in each dimension, with 5 being the best score.\n", "\n", "Original discussion: [DISCUSSION]\n", "\n", "Summary: [SUMMARY]\n", "\n", "Evaluation form (scores only):\n", "\n", "- Coherence: \n", "- Accuracy:\n", "- Factuality:\n", "- Completeness:\n", "'''" ] }, { "cell_type": "code", "execution_count": 28, "id": "fc6f1d57-ab20-437b-9c66-c14a053b0468", "metadata": {}, "outputs": [], "source": [ "def make_prompt(search, context, prompt_eng_base):\n", "    # `search` is the summary to score; `context` is the original article.\n", "    search = search.replace(\"\\\"\", \"'\")\n", "    context = context.replace(\"\\\"\", \"'\")\n", "    prompt = prompt_eng_base.replace('[DISCUSSION]', context)\n", "    prompt = prompt.replace('[SUMMARY]', search)\n", "    return prompt" ] }, { "cell_type": "code", "execution_count": 29, "id": "27b46b4d-c3c9-4287-ae79-cb95e24faea2", "metadata": {}, "outputs": [], "source": [ "def get_eval(prompt):\n", "    # Send the evaluation prompt to Claude and return the raw text reply.\n", "    completion = anthropic.completions.create(\n", "        model=\"claude-1\",\n", "        max_tokens_to_sample=300,\n", "        prompt=f\"{HUMAN_PROMPT} {prompt} {AI_PROMPT}\",\n", "    )\n", "    return completion.completion" ] }, { "cell_type": "markdown", "id": "50b23f9a-de5f-4cb1-b1d4-c45f82188ef1", "metadata": {}, "source": [ "Example output:\n", "\n", "    \"\\nSummary Scores:\\n\\nCoherence: 5\\nAccuracy: 5 \\nFactuality: 5\\nCompleteness: 2\\n\\nComments: The summary is coherent, accurate and factual in reporting the key details about Daniel Radcliffe gaining access to his trust fund as he turns 18, and his stated plans to not spend extravagantly. However, the summary lacks details on Radcliffe's acting work beyond the Harry Potter series, his upcoming films My Boy Jack and December Boys, as well as his stage debut in Equus. The summary is incomplete in covering the breadth of topics discussed in the original report. 
Overall, this is a decent high-level summary but could be more complete.\"" ] }, { "cell_type": "code", "execution_count": 30, "id": "46225833-2899-43b6-83e3-bcd3cecce5c1", "metadata": { "tags": [] }, "outputs": [], "source": [ "import re\n", "\n", "def parse_result(r):\n", "    # Extract the single-digit score for each dimension from Claude's reply.\n", "    # A dimension that cannot be found is recorded as 0.\n", "    m = re.search(r\"Accuracy: (\\d)\", r)\n", "    if m is None:\n", "        accuracy = 0\n", "    else:\n", "        accuracy = m.group(1)\n", "\n", "    m = re.search(r\"Coherence: (\\d)\", r)\n", "    if m is None:\n", "        coherence = 0\n", "    else:\n", "        coherence = m.group(1)\n", "\n", "    m = re.search(r\"Factuality: (\\d)\", r)\n", "    if m is None:\n", "        factuality = 0\n", "    else:\n", "        factuality = m.group(1)\n", "\n", "    m = re.search(r\"Completeness: (\\d)\", r)\n", "    if m is None:\n", "        completeness = 0\n", "    else:\n", "        completeness = m.group(1)\n", "\n", "    return accuracy, coherence, factuality, completeness" ] }, { "cell_type": "code", "execution_count": 31, "id": "7733355c-cc47-455c-bc2c-3317ee38267e", "metadata": { "tags": [] }, "outputs": [], "source": [ "result_map = []\n", "for doc, t5_sum, falcon_sum, gt in zip(docs_to_summarize, t5_sums, falcon_sums, docs_gt):\n", "    # Build one evaluation prompt per summary of the same article.\n", "    p_t5 = make_prompt(t5_sum, doc, prompt_eng_base)\n", "    p_falcon = make_prompt(falcon_sum, doc, prompt_eng_base)\n", "    p_gt = make_prompt(gt, doc, prompt_eng_base)\n", "\n", "    # Pause between calls to avoid hitting API rate limits.\n", "    r_t5 = get_eval(p_t5)\n", "    time.sleep(5)\n", "    r_falcon = get_eval(p_falcon)\n", "    time.sleep(5)\n", "    r_gt = get_eval(p_gt)\n", "    time.sleep(5)\n", "\n", "    metrics_t5 = parse_result(r_t5)\n", "    metrics_falcon = parse_result(r_falcon)\n", "    metrics_gt = parse_result(r_gt)\n", "\n", "    result_map.append({\n", "        'doc': doc,\n", "        't5': {\n", "            'summary': t5_sum,\n", "            'eval': r_t5,\n", "            'accuracy': metrics_t5[0],\n", "            'coherence': metrics_t5[1],\n", "            'factuality': metrics_t5[2],\n", "            'completeness': metrics_t5[3],\n", "        },\n", "        'falcon': {\n", "            'summary': falcon_sum,\n", "            'eval': r_falcon,\n", "            'accuracy': metrics_falcon[0],\n", "            'coherence': metrics_falcon[1],\n", "            
'factuality': metrics_falcon[2],\n", " 'completeness': metrics_falcon[3],\n", " },\n", " 'gt': {\n", " 'summary': gt,\n", " 'eval': r_gt,\n", " 'accuracy': metrics_gt[0],\n", " 'coherence': metrics_gt[1],\n", " 'factuality': metrics_gt[2],\n", " 'completeness': metrics_gt[3],\n", " }\n", " \n", " })" ] }, { "cell_type": "code", "execution_count": 32, "id": "b1b138f4-62e9-4337-86a0-e00e35a74164", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'doc': \"By . Daily Mail Reporter . PUBLISHED: . 12:03 EST, 10 September 2012 . | . UPDATED: . 13:03 EST, 10 September 2012 . Mug shot: Andre Leteve suffered facial injuries after trying to kill himself following the murder of his sons . An Arizona man who stands accused of murdering his two sons at the height of a bitter divorce battle with his wife will stand trial beginning this week. Jury selection is set to begin in the trial of Andre Leteve, who is accused of gunning down his two young sons Alec, 5, and Asher, 1, in Scottsdale on March 31, 2010. Prosecutors are out to prove that Leteve, who was 39 at the time, shot his sons as an act of vengeance against his estranged wife Laurie, with whom he was locked in a bitter divorce battle. Their marriage reportedly soured after she discovered that he had cheated on her with prostitutes. A court document obtained by The Arizona Republic read: 'The state claims one of the motives for killing the sons was revenge because Laurie, [his] wife and the children's mother, left him after finding out that [he] had sex with multiple prostitutes.' The document went on: 'Laurie was dating someone else at the time of the shootings. The other motive stems from the defendant's poor finances at the time of the shooting.' Leteve reportedly has difficulty speaking after he tried to kill himself. Cops said that moments after killing his Alec and Asher, Leteve stuck the gun under his chin and pulled the trigger, but the bullet went through his nose and he survived. 
Scroll down for video . Shattered: Prosecutors are out to prove that Leteve shot his sons as an act of vengeance against his wife Laurie, pictured, with whom he was locked in a bitter divorce battle . In a 911 call made moments after the shootings, Leteve can be heard saying: 'I just shot and killed my sons and shot myself.' Leteve was taken to a hospital for treatment of his facial injuries, and taken into custody after his release. KTVK reported that Leteve told detectives that he killed the boys because their mother was preparing to move them with her to Florida. Relatives . told the network that the divorce was not a messy one, and that there . were no indications that Leteve would commit such an act. Murdered: Leteve is accused of gunning down his two young sons Alec, 5, left, and Asher, 1, right, in Scottsdale, Arizona, on March 31, 2010 . Defense attorney Greg Parzych told The Republic: 'We don't feel comfortable talking about our theories of the case on the brink of jury selection, other than to say that for the past two years, we have been working diligently and thoroughly to ensure that Mr Leteve gets the best possible trial he can.' 100 witnesses are reportedly expected to testify. Opening statements are slated to begin on October 9. If convicted, Leteve could face the death penalty. Watch video here .\",\n", " 't5': {'summary': 'Jury selection will begin in the trial of Andre Leteve, who is accused of gunning down his two boys in 2010. 
He is accused of doing it as an act of vengeance against his wife Laurie, with whom he was locked in a bitter divorce battle.',\n", " 'eval': ' Here are the scores for the summary:\\n\\nCoherence: 5\\nAccuracy: 5 \\nFactuality: 5\\nCompleteness: 3',\n", " 'accuracy': '5',\n", " 'coherence': '5',\n", " 'factuality': '5',\n", " 'completeness': '3'},\n", " 'falcon': {'summary': '\\nThe article above is about an Arizona man named Andre Leteve who is accused of murdering his',\n", " 'eval': ' Here are the scores for the summary:\\n\\nCoherence: 5\\nAccuracy: 4 \\nFactuality: 5\\nCompleteness: 2',\n", " 'accuracy': '4',\n", " 'coherence': '5',\n", " 'factuality': '5',\n", " 'completeness': '2'},\n", " 'gt': {'summary': 'Andre Leteve charged with murdering sons Alec, 5, and Asher, 1, on March 31, 2010 .\\nLeteve attempted suicide after the murders but survived .\\nProsecutors to argue that Leteve shots his sons as an act of revenge against estranged wife .',\n", " 'eval': ' Here are the scores I would assign to the summary:\\n\\n- Coherence: 4\\n- Accuracy: 5 \\n- Factuality: 4\\n- Completeness: 3',\n", " 'accuracy': '5',\n", " 'coherence': '4',\n", " 'factuality': '4',\n", " 'completeness': '3'}}" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_map[0]" ] }, { "cell_type": "markdown", "id": "108c1e95-ef63-499e-96d6-56d5287ef313", "metadata": {}, "source": [ "## Results\n", "\n", "*Note*: Do not take these results as a comprehensive model evaluation. The purpose of this notebook is to illustrate the evaluation technique. We did not spend any time on model tuning or prompt engineering." 
] }, { "cell_type": "code", "execution_count": 33, "id": "e173fd58-ea90-4ef0-9889-71e4c5e255d2", "metadata": { "tags": [] }, "outputs": [], "source": [ "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 34, "id": "4cf60863-1e82-47f4-a9cf-c5051c0ddbd0", "metadata": { "tags": [] }, "outputs": [], "source": [ "result_values = []\n", "result_types = []\n", "result_models = []\n", "result_docs = []" ] }, { "cell_type": "code", "execution_count": 35, "id": "cd70e840-f4a2-4b95-ada0-86492b77548f", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Flatten the nested results into one row per (document, model, dimension).\n", "for r in result_map:\n", "    for m in ['t5', 'falcon', 'gt']:\n", "        for t in ['accuracy', 'coherence', 'factuality', 'completeness']:\n", "            result_docs.append(r['doc'])\n", "            # Scores are parsed as digit strings (or 0 when missing); cast to int.\n", "            result_values.append(int(r[m][t]))\n", "            result_types.append(t)\n", "            result_models.append(m)" ] }, { "cell_type": "code", "execution_count": 36, "id": "b4a43e69-6196-44b5-ae41-46c7f25145d5", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame({'doc': result_docs, 'value': result_values, 'type': result_types, 'model': result_models})" ] }, { "cell_type": "code", "execution_count": 37, "id": "9aec6f63-a568-45b1-87bd-e998a858361d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", " | doc | \n", "value | \n", "type | \n", "model | \n", "
---|---|---|---|---|
0 | \n", "By . Daily Mail Reporter . PUBLISHED: . 12:03 ... | \n", "5 | \n", "accuracy | \n", "t5 | \n", "
1 | \n", "By . Daily Mail Reporter . PUBLISHED: . 12:03 ... | \n", "5 | \n", "coherence | \n", "t5 | \n", "
2 | \n", "By . Daily Mail Reporter . PUBLISHED: . 12:03 ... | \n", "5 | \n", "factuality | \n", "t5 | \n", "
3 | \n", "By . Daily Mail Reporter . PUBLISHED: . 12:03 ... | \n", "3 | \n", "completeness | \n", "t5 | \n", "
4 | \n", "By . Daily Mail Reporter . PUBLISHED: . 12:03 ... | \n", "4 | \n", "accuracy | \n", "falcon | \n", "