{ "cells": [ { "cell_type": "markdown", "id": "78b0280b-6461-4fc0-bca3-81fe4d72e92b", "metadata": {}, "source": [ "# OpenAI API Fine-Tuning for the JAQKET Dataset\n", "\n", "Sample code for fine-tuning a model through the [OpenAI API](https://platform.openai.com/) on the [JAQKET](https://www.nlp.ecei.tohoku.ac.jp/projects/jaqket/) dataset.\n", "\n", "Before running the notebook, look up your OpenAI API key and Organization ID; both are shown under the Manage Account menu. Set them in advance as the `OPENAI_API_KEY` and `OPENAI_ORGANIZATION` environment variables, for example with `os.environ`. Both are sensitive, so take care not to commit them to GitHub or similar services by accident.\n", "\n", "## Setup\n" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "c8310b9d-c52c-4d89-ab53-78dfcfecd348", "metadata": { "tags": [] }, "outputs": [], "source": [ "# The legacy fine_tunes CLI and openai.Completion used below need openai < 1.0.\n", "!pip install \"openai<1\" requests pandas tqdm" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "c8a6a1d6-73e2-4c03-8172-2e143c83de12", "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import requests\n", "import pandas as pd\n", "import openai\n", "\n", "\n", "OPENAI_ORGANIZATION = os.getenv(\"OPENAI_ORGANIZATION\")\n", "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n", "\n", "JAQKET_TRAIN_DATASET = (\n", "    \"https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_02/aio_02_train.jsonl\"\n", ")\n", "JAQKET_DEV_DATASET = (\n", "    \"https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_02/aio_02_dev_v1.0.jsonl\"\n", ")\n", "\n", "\n", "if OPENAI_ORGANIZATION is None or OPENAI_API_KEY is None:\n", "    raise Exception(\n", "        \"Please set the OPENAI_ORGANIZATION and OPENAI_API_KEY environment variables.\"\n", "    )" ] }, 
{ "cell_type": "markdown", "id": "4e0c91fd-8064-4f08-bb97-37d655f30f3e", "metadata": {}, "source": [ "## Prepare dataset" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "dadd02d7-704c-426f-a2b1-def45b9d90c6", "metadata": { "tags": [] }, "outputs": [], "source": [ "def read_jaqket_dataset(dataset_url: str) -> pd.DataFrame:\n", "    # Download the JSONL file into data/ once, then reuse the cached copy.\n", "    file_name = os.path.basename(dataset_url)\n", "    location = os.path.join(\"data\", file_name)\n", "    if not os.path.exists(location):\n", "        os.makedirs(\"data\", exist_ok=True)\n", "        response = requests.get(dataset_url)\n", "        response.raise_for_status()  # do not cache an HTTP error page\n", "        with open(location, mode=\"wb\") as f:\n", "            f.write(response.content)\n", "\n", "    return pd.read_json(location, lines=True)" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "d466279e-82b8-461b-b9cc-ca6466baf125", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_train = read_jaqket_dataset(JAQKET_TRAIN_DATASET)\n", "df_dev = read_jaqket_dataset(JAQKET_DEV_DATASET)" ] }, 
{ "cell_type": "markdown", "id": "44190358-7cfe-472a-ac6c-3040399c7f12", "metadata": {}, "source": [ "## Fine tuning" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "f80d71fd-30a1-46bd-9b8e-04057cc1a179", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Japanese prompt, roughly: Answer this Japanese quiz question. <question> The answer is 「\n", "PROMPT_TEMPLATE = \"日本語のクイズに答えてください。\\n{instruction}\\n答えは「\"\n", "\n", "\n", "def convert_for_fine_tune(df: pd.DataFrame) -> pd.DataFrame:\n", "    # Legacy fine-tuning expects one prompt/completion pair per record.\n", "    df_fine_tune = df[[\"question\", \"answers\"]].rename(\n", "        columns={\"question\": \"prompt\", \"answers\": \"completion\"}\n", "    )\n", "    df_fine_tune.prompt = df_fine_tune.prompt.map(\n", "        lambda p: PROMPT_TEMPLATE.format(instruction=p)\n", "    )\n", "    # Use the first reference answer and close the bracket opened by the prompt.\n", "    df_fine_tune.completion = df_fine_tune.completion.map(lambda c: f\"{c[0]}」\")\n", "    return df_fine_tune\n", "\n", "\n", "fine_tune_file_name = \"jaqket_fine_tune.jsonl\"\n", "df_fine_tune = convert_for_fine_tune(df_train)\n", "df_fine_tune = df_fine_tune.drop_duplicates(\"prompt\")\n", "df_fine_tune.head(3)" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "b72ede98-3d04-47be-8ec1-b32f2f159b1d", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_fine_tune.to_json(f\"data/{fine_tune_file_name}\", orient=\"records\", lines=True)" ] }, 
{ "cell_type": "markdown", "id": "dba92d3d-0240-4ddf-a7d7-c7296120730b", "metadata": {}, "source": [ "Before fine-tuning with the OpenAI API, check the contents of the dataset." ] }, 
{ "cell_type": "code", "execution_count": null, "id": "2fe8a2ec-3fae-442c-b7c2-6e90d18e06ef", "metadata": { "tags": [] }, "outputs": [], "source": [ "!openai tools fine_tunes.prepare_data -f data/{fine_tune_file_name} -q" ] }, 
{ "cell_type": "markdown", "id": "6a31fe3d-ea47-4cdf-921d-8d618c974f04", "metadata": {}, "source": [ "Run the fine-tuning job. At June 2023 prices, this costs roughly 28 USD with the default `Curie` model." ] }, 
{ "cell_type": "code", "execution_count": null, "id": "1d3f57ad-1580-41c3-91b1-b2c795342b12", "metadata": { "tags": [] }, "outputs": [], "source": [ "!openai api fine_tunes.create -t data/{fine_tune_file_name}" ] }, 
{ "cell_type": "markdown", "id": "9d13400e-9604-43ae-965d-85ee9507b2e7", "metadata": {}, "source": [ "List all fine-tuning jobs with `fine_tunes.list`. To inspect a single job, pass its ID to `fine_tunes.get`." ] }, 
{ "cell_type": "code", "execution_count": null, "id": "1cf90bdd-228b-4ffd-a2bb-2892a977ca4c", "metadata": { "tags": [] }, "outputs": [], "source": [ "!openai api fine_tunes.list" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "5fbe121e-9c03-46d1-8692-3d58078ed39b", "metadata": { "tags": [] }, "outputs": [], "source": [ "!openai api fine_tunes.get -i ft-tlIvgtEU7m2IpkanRAxqQfbF" ] }, 
{ "cell_type": "markdown", "id": "675f0017-3e66-4277-ae42-039fc4922ba3", "metadata": {}, "source": [ "Set the ID of the model produced by the fine-tuning job." ] }, 
{ "cell_type": "code", "execution_count": null, "id": "6ab06eab-ba4b-4032-9235-7c35010e215d", "metadata": { "tags": [] }, "outputs": [], "source": [ "model_name = \"\"  # Ex: curie:ft-personal-2023-06-11-08-56-42" ] }, 
{ "cell_type": "markdown", "id": "2c224fae-518f-46ed-99e3-98249222515b", "metadata": {}, "source": [ "## Answer a quiz question with the fine-tuned model" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "159dad9d-5d2d-4c41-a3e8-ffd438baa1dc", "metadata": { "tags": [] }, "outputs": [], "source": [ "import re\n", "\n", "\n", "def answer(model: str, question: str) -> str:\n", "    openai.organization = OPENAI_ORGANIZATION\n", "    openai.api_key = OPENAI_API_KEY\n", "\n", "    # Same prompt as PROMPT_TEMPLATE, which was used to build the training data.\n", "    template = \"日本語のクイズに答えてください。\\n{instruction}\\n答えは「\"\n", "    prompt = template.format(instruction=question)\n", "\n", "    response = openai.Completion.create(\n", "        model=model,\n", "        prompt=prompt,\n", "        max_tokens=64,\n", "        temperature=0,\n", "        top_p=1,\n", "        n=1,\n", "        stop=\"\\n\",\n", "    )\n", "\n", "    _answer = response[\"choices\"][0][\"text\"]\n", "    # The prompt ends with 「, so the last 「...」 pair holds the answer.\n", "    found = re.findall(\"「(.*?)」\", prompt + _answer)\n", "    # Fall back to the raw completion when no closing 」 was generated.\n", "    return found[-1] if found else _answer.strip()" ] }, 
{ "cell_type": "markdown", "id": "7263497e-071e-4064-b9b3-41196c10a96c", "metadata": {}, "source": [ "## Answer the dev dataset" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "bcac4083-091c-48e7-9e19-def56674d168", "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "\n", "\n", "def answer_jaqket(model: str, question_df: pd.DataFrame) -> pd.DataFrame:\n", "    chatgpt_answers = []\n", "    matches = []\n", "    for idx, row in tqdm(question_df.iterrows(), total=len(question_df)):\n", "        chatgpt_answer = answer(model, row[\"question\"])\n", "        chatgpt_answers.append(chatgpt_answer)\n", "        # Exact match against any of the reference answers.\n", "        matches.append(chatgpt_answer in row[\"answers\"])\n", "\n", "    # Assign plain lists so values follow row order, not index labels.\n", "    question_df[\"chatgpt_answer\"] = chatgpt_answers\n", "    question_df[\"match\"] = matches\n", "    return question_df\n", "\n", "\n", "answer_file_name = \"jaqket_answers_with_fine_tune.csv\"\n", "answers = answer_jaqket(model_name, df_dev)\n", "answers.to_csv(f\"data/{answer_file_name}\", index=False)" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "a016cf95-dff3-4ed3-82b7-f019f013994b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General 
purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, 
"memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", 
"gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, 
"category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, 
"category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], 
"instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }