{ "cells": [ { "cell_type": "markdown", "id": "66a65080-4234-43c8-ac3a-1e7e03acd8cf", "metadata": {}, "source": [ "# 1. Generating TrOCR Training Images\n", "\n", "--- \n", "\n", "This notebook generates OCR training images from `ocr_dataset_poc.csv`, the file created in module 0. The images are rendered with the open-source TextRecognitionDataGenerator.\n", "\n", "- Reference: https://github.com/Belval/TextRecognitionDataGenerator" ] },
{ "cell_type": "code", "execution_count": 1, "id": "1abbdeb2-4261-4408-8366-231165958f42", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "!pip install -r requirements.txt" ] },
{ "cell_type": "code", "execution_count": 2, "id": "99005511-1fc6-4caf-9cb0-393b5372ac77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
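<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>document</th>\n", "      <th>category</th>\n", "    </tr>\n", "  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>398118</th>\n", "      <td>누리호 발사 성공을 축하합니다</td>\n", "      <td>news</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>447804</th>\n", "      <td>그동안 두 자릿수였던 수출 증가율이 지난달 한 자릿수로 떨어지는 등 둔화세가 본격화...</td>\n", "      <td>news</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>464420</th>\n", "      <td>동원참치 캔 개발·생산 경험이 있는 동원시스템즈는 차별화된 기술을 적용한 4680 ...</td>\n", "      <td>news</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>534239</th>\n", "      <td>봉고 III EV 냉동탑차는 135kW 모터와 588kWh 배터리를 탑재해 완충 시...</td>\n", "      <td>news</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>235107</th>\n", "      <td>아 점깍은이유는 네이버가 영화 분류를 코믹으로 안하고 공포에 집어넣은 실수 때문에</td>\n", "      <td>nsmc</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>618045</th>\n", "      <td>유경준 의원은 개회사에서 “전 세계적으로 성장률 저하와 물가 상승이 동반되는 스테그...</td>\n", "      <td>news</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>633123</th>\n", "      <td>사랑은 변하고 사람은 안 변해요</td>\n", "      <td>chatbot</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>20017</th>\n", "      <td>꼴깝 떨고 앉았있네 별 시덥잖은 의미</td>\n", "      <td>nsmc</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>184000</th>\n", "      <td>청춘의지표는 여기서부터 시작이아닐까 싶다</td>\n", "      <td>nsmc</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>115726</th>\n", "      <td>차화연의 빛나는 미모</td>\n", "      <td>nsmc</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>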
" ], "text/plain": [ " document category\n", "398118 누리호 발사 성공을 축하합니다 news\n", "447804 그동안 두 자릿수였던 수출 증가율이 지난달 한 자릿수로 떨어지는 등 둔화세가 본격화... news\n", "464420 동원참치 캔 개발·생산 경험이 있는 동원시스템즈는 차별화된 기술을 적용한 4680 ... news\n", "534239 봉고 III EV 냉동탑차는 135kW 모터와 588kWh 배터리를 탑재해 완충 시... news\n", "235107 아 점깍은이유는 네이버가 영화 분류를 코믹으로 안하고 공포에 집어넣은 실수 때문에 nsmc\n", "618045 유경준 의원은 개회사에서 “전 세계적으로 성장률 저하와 물가 상승이 동반되는 스테그... news\n", "633123 사랑은 변하고 사람은 안 변해요 chatbot\n", "20017 꼴깝 떨고 앉았있네 별 시덥잖은 의미 nsmc\n", "184000 청춘의지표는 여기서부터 시작이아닐까 싶다 nsmc\n", "115726 차화연의 빛나는 미모 nsmc" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import multiprocessing\n", "\n", "df = pd.read_csv('ocr_dataset_poc.csv')\n", "display(df.sample(n=10, random_state=42))\n", "df['document'].to_csv('ocr_dataset_poc.txt', header=False, index=False)" ] }, { "cell_type": "markdown", "id": "cfc92037-7824-43b8-bbcc-75ffaaa8dd18", "metadata": {}, "source": [ "전체 데이터셋은 60만건을 초과하지만, 핸즈온 시에는 1000건의 샘플만 사용합니다." ] }, { "cell_type": "code", "execution_count": 3, "id": "f3484867-11ed-4c76-951f-28ae3eb0c183", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "news 333690\n", "nsmc 284530\n", "chatbot 19181\n", "Name: category, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['category'].value_counts()" ] }, { "cell_type": "code", "execution_count": 4, "id": "facf996e-d0bc-466b-949d-c83240b70059", "metadata": {}, "outputs": [], "source": [ "FULL_TRAINING = False\n", "\n", "if not FULL_TRAINING:\n", " df = df.sample(n=1000)" ] }, { "cell_type": "code", "execution_count": 5, "id": "b856cdbb-d865-447d-96fb-dcf625564888", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Missing modules for handwritten text generation.\n", "100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 1058.05it/s]\n" ] } ], "source": [ "num_cores = multiprocessing.cpu_count()\n", "dataset_dir = 'train'\n", 
"num_samples = len(df)\n", "\n", "!rm -rf {dataset_dir}\n", "!python3 ./trdg/run.py -i ocr_dataset_poc.txt -w 5 -t {num_cores} -f 64 -l ko -c {num_samples} -na 2 --output_dir {dataset_dir}\n", "!cp {dataset_dir}/labels.txt ./" ] }, { "cell_type": "markdown", "id": "d445385d-92fa-4570-89e3-5b0ef2fb34ca", "metadata": {}, "source": [ "선택적으로 랜덤하게 생성한 문장을 훈련 데이터셋에 추가합니다. 좀 더 다양한 형태의 데이터로 훈련해야 할 때 참고하세요. " ] }, { "cell_type": "code", "execution_count": null, "id": "23bf369c-2b00-495f-8eab-204d243c1a19", "metadata": {}, "outputs": [], "source": [ "# rand_dataset_dir = 'rand_train'\n", "# num_rand_samples = 500000\n", "\n", "# !rm -rf {rand_dataset_dir}\n", "# !python3 ./trdg/run.py -w 5 -t {num_cores} -f 64 -l ko -c {num_rand_samples} -na 2 --output_dir {rand_dataset_dir}\n", "# !cp {rand_dataset_dir}/labels.txt ./rand_labels.txt" ] } ], "metadata": { "kernelspec": { "display_name": "conda_amazonei_pytorch_latest_p37", "language": "python", "name": "conda_amazonei_pytorch_latest_p37" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }