# OpenAI API Fine Tuning for JAQKET dataset

Sample code for fine-tuning a model through the [OpenAI API](https://platform.openai.com/) on the [JAQKET](https://www.nlp.ecei.tohoku.ac.jp/projects/jaqket/) dataset.

Before running the notebook, you need your OpenAI API key and Organization ID. Both can be found under the Manage Account menu. Set them in advance, for example through `os.environ`. Because they are sensitive, take care not to commit them to GitHub or any other repository by accident.
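As one way to keep the credentials out of the notebook source, the sketch below reads a value from the environment and prompts for it interactively only when it is missing. The helper name `ensure_env` and its `prompt_fn` parameter are hypothetical, introduced for illustration only.

```python
import getpass
import os
from typing import Callable


def ensure_env(name: str, prompt_fn: Callable[[str], str] = getpass.getpass) -> str:
    """Return environment variable `name`, prompting for it if it is unset.

    `ensure_env` is a hypothetical helper, not part of the openai package.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value, so it is not echoed into the cell output.
        value = prompt_fn(f"{name}: ")
        os.environ[name] = value
    return value


# Usage (commented out because it waits for interactive input):
# ensure_env("OPENAI_API_KEY")
# ensure_env("OPENAI_ORGANIZATION")
```

With this approach the secret lives only in the running process and is never written into a cell.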

## Setup


In [None]:
!pip install openai requests

In [None]:
import os
import json
import requests
import pandas as pd
import openai


OPENAI_ORGANIZATION = os.getenv("OPENAI_ORGANIZATION")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

JAQKET_TRAIN_DATASET = (
    "https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_02/aio_02_train.jsonl"
)
JAQKET_DEV_DATASET = (
    "https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_02/aio_02_dev_v1.0.jsonl"
)


if OPENAI_ORGANIZATION is None or OPENAI_API_KEY is None:
    raise Exception(
        "Please set the OPENAI_ORGANIZATION and OPENAI_API_KEY environment variables for organization and api key."
    )

## Prepare dataset

In [None]:
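def read_jaqket_dataset(dataset_url: str) -> pd.DataFrame:
    """Download a JAQKET JSONL file into data/ (unless cached) and load it."""
    file_name = os.path.basename(dataset_url)
    location = os.path.join("data", file_name)
    if not os.path.exists(location):
        # Create the cache directory on first use so that open() does not fail.
        os.makedirs("data", exist_ok=True)
        response = requests.get(dataset_url)
        # Fail loudly on HTTP errors instead of caching an error page.
        response.raise_for_status()
        with open(location, mode="wb") as f:
            f.write(response.content)

    # Each line of the file is one JSON record.
    return pd.read_json(location, lines=True)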

In [None]:
df_train = read_jaqket_dataset(JAQKET_TRAIN_DATASET)
df_dev = read_jaqket_dataset(JAQKET_DEV_DATASET)
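
To illustrate what `pd.read_json(..., lines=True)` does with a JSON Lines file, here is a minimal offline sketch. The two records are made up for illustration and only mimic the `question` and `answers` fields that the rest of this notebook relies on; the real files may contain additional fields.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form (one JSON object per line).
sample_jsonl = (
    '{"question": "日本の首都はどこ?", "answers": ["東京", "東京都"]}\n'
    '{"question": "富士山の標高は約何メートル?", "answers": ["3776メートル"]}\n'
)

# lines=True parses each line as a separate record, giving one row per question.
df_sample = pd.read_json(io.StringIO(sample_jsonl), lines=True)

print(df_sample.shape)
print(df_sample.loc[0, "answers"])
```

Note that `answers` is a list per row, which is why the later cells index into it or test membership with `in`.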

## Fine tuning

In [None]:
PROMPT_TEMPLATE = "日本語のクイズに答えてください。\n{instruction}\n答えは「"


def convert_for_fine_tune(df: pd.DataFrame) -> pd.DataFrame:
    df_fine_tune = df[["question", "answers"]].rename(
        columns={"question": "prompt", "answers": "completion"}
    )
    df_fine_tune.prompt = df_fine_tune.prompt.map(
        lambda p: PROMPT_TEMPLATE.format(instruction=p)
    )
    df_fine_tune.completion = df_fine_tune.completion.map(lambda c: f"{c[0]}」")
    return df_fine_tune


fine_tune_file_name = "jaqket_fine_tune.jsonl"
df_fine_tune = convert_for_fine_tune(df_train)
df_fine_tune = df_fine_tune.drop_duplicates("prompt")
df_fine_tune.head(3)
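
The conversion above turns each question into a prompt that ends with an opening bracket `「`, and each answer into a completion that ends with the closing bracket `」`. The sketch below shows the resulting record for a single made-up question (not taken from JAQKET); `build_record` is a hypothetical helper that mirrors `convert_for_fine_tune` for one row.

```python
import json

PROMPT_TEMPLATE = "日本語のクイズに答えてください。\n{instruction}\n答えは「"


def build_record(question: str, answers: list) -> dict:
    """Build one prompt/completion pair (hypothetical single-row helper)."""
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=question),
        # Only the first listed answer is used as the training target.
        "completion": f"{answers[0]}」",
    }


record = build_record("日本の首都はどこ?", ["東京", "東京都"])
# ensure_ascii=False keeps the Japanese text readable in the printed line.
print(json.dumps(record, ensure_ascii=False))
```

Because the closing bracket marks where the answer stops, it can later serve as the delimiter when parsing the model output.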

In [None]:
df_fine_tune.to_json(f"data/{fine_tune_file_name}", orient="records", lines=True)

Before fine-tuning with the OpenAI API, check the contents of the dataset.

In [None]:
!openai tools fine_tunes.prepare_data -f data/{fine_tune_file_name} -q
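
Alongside the CLI check, a quick manual validation can catch malformed rows before uploading. The following is a minimal sketch, assuming each line should be a JSON object with non-empty string `prompt` and `completion` fields. `validate_fine_tune_lines` is a hypothetical helper, and its checks are not a reproduction of what `prepare_data` does.

```python
import json
from typing import Iterable, List


def validate_fine_tune_lines(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems found in JSONL lines (hypothetical helper)."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        for key in ("prompt", "completion"):
            value = record.get(key)
            # Both fields must be present and be non-empty strings.
            if not isinstance(value, str) or not value:
                problems.append(f"line {number}: missing or empty '{key}'")
    return problems


# Usage with the file written above (assumes it exists on disk):
# with open("data/jaqket_fine_tune.jsonl", encoding="utf-8") as f:
#     print(validate_fine_tune_lines(f))
```

An empty list means that no problems were found by these particular checks.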

Run the fine-tuning job. At June 2023 prices, this costs roughly 28 USD with the default `Curie` model.

In [None]:
!openai api fine_tunes.create -t data/{fine_tune_file_name}
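
The cost of a job depends on the number of training tokens, the number of epochs, and the per-token training price. The sketch below shows the arithmetic only. The token count is a made-up placeholder, and the defaults of 4 epochs and 0.003 USD per 1,000 tokens are assumptions about the Curie settings at the time; check the job settings and the current pricing page before relying on them.

```python
def estimate_fine_tune_cost(
    total_tokens: int,
    n_epochs: int = 4,  # assumed default number of epochs
    usd_per_1k_tokens: float = 0.003,  # assumed Curie training price
) -> float:
    """Estimate the training cost in USD: tokens x epochs x price per token."""
    return total_tokens / 1000 * n_epochs * usd_per_1k_tokens


# Hypothetical token count, not measured from the JAQKET training file.
example_tokens = 1_000_000
print(f"{estimate_fine_tune_cost(example_tokens):.2f} USD")
```

To apply this to the real file, the token count would have to be measured with a tokenizer that matches the base model.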

List the fine-tuning jobs with `fine_tunes.list`. To check an individual job, pass its ID to `fine_tunes.get`.

In [None]:
!openai api fine_tunes.list

In [None]:
!openai api fine_tunes.get -i ft-tlIvgtEU7m2IpkanRAxqQfbF

Set the ID of the model produced by the fine-tuning job.

In [None]:
model_name = ""  # Ex: curie:ft-personal-2023-06-11-08-56-42
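
Rather than copying the model ID by hand, it can be read from the job description. The sketch below assumes the job is available as a dictionary with `status` and `fine_tuned_model` fields, as in the legacy fine-tunes response. `fine_tuned_model_name` is a hypothetical helper, and the sample job is made up, reusing the example ID format shown above.

```python
from typing import Mapping, Optional


def fine_tuned_model_name(job: Mapping) -> Optional[str]:
    """Return the fine-tuned model ID once the job has succeeded, else None."""
    if job.get("status") != "succeeded":
        # The model ID is only meaningful after the job has finished successfully.
        return None
    return job.get("fine_tuned_model")


# Made-up job description for illustration. With the legacy openai package
# (assumed to be a pre-1.0 version), a real one could be fetched with
# openai.FineTune.retrieve(id="<job id>").
sample_job = {
    "status": "succeeded",
    "fine_tuned_model": "curie:ft-personal-2023-06-11-08-56-42",
}
print(fine_tuned_model_name(sample_job))
```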

## Answer a quiz question with the fine-tuned model

In [None]:
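def answer(model: str, question: str) -> str:
    """Ask the fine-tuned model one quiz question and return the extracted answer."""
    openai.organization = OPENAI_ORGANIZATION
    openai.api_key = OPENAI_API_KEY

    # Same prompt format as the training data: the prompt ends with an opening 「.
    template = "日本語のクイズに答えてください。\n{instruction}\n答えは「"
    prompt = template.format(instruction=question)

    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=64,
        temperature=0,
        top_p=1,
        n=1,
        stop="\n",
    )

    completion = response["choices"][0]["text"]
    # The answer is the text before the first closing 」. Splitting the completion
    # avoids an IndexError when 」 is missing and ignores any brackets that appear
    # in the question or later in the completion.
    return completion.split("」")[0].strip()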

## Answer to dataset

In [None]:
from tqdm import tqdm


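def answer_jaqket(model: str, question_df: pd.DataFrame) -> pd.DataFrame:
    """Answer every question in question_df and flag whether each answer is accepted."""
    chatgpt_answers = []
    matches = []
    # total= lets tqdm show a proper progress bar, since iterrows() has no length.
    for _, row in tqdm(question_df.iterrows(), total=len(question_df)):
        chatgpt_answer = answer(model, row["question"])
        chatgpt_answers.append(chatgpt_answer)
        # Exact match against any of the accepted answers.
        matches.append(chatgpt_answer in row["answers"])

    # Work on a copy and assign plain lists so that the caller's frame is left
    # untouched and rows are matched by position rather than by index label.
    result_df = question_df.copy()
    result_df["chatgpt_answer"] = chatgpt_answers
    result_df["match"] = matches
    return result_df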


answer_file_name = "jaqket_answers_with_fine_tune.csv"
answers = answer_jaqket(model_name, df_dev)
answers.to_csv(f"data/{answer_file_name}", index=False)
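
Once the answers are collected, the `match` column can be summarised as exact-match accuracy. The following is a minimal sketch on a made-up results table; the rows are illustrative and are not actual model outputs.

```python
import pandas as pd

# Made-up results with the same columns that answer_jaqket adds.
results = pd.DataFrame(
    {
        "chatgpt_answer": ["東京", "大阪", "京都", "富士山"],
        "match": [True, False, True, True],
    }
)

# The mean of a boolean column is the fraction of True values.
accuracy = results["match"].mean()
print(f"accuracy: {accuracy:.2%} ({results['match'].sum()}/{len(results)})")
```

The same two lines applied to `answers` would give the score of the fine-tuned model on the dev set.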