{ "cells": [ { "cell_type": "markdown", "id": "66a65080-4234-43c8-ac3a-1e7e03acd8cf", "metadata": {}, "source": [ "# 1. Generating TrOCR Training Images\n", "\n", "--- \n", "\n", "This module generates OCR training images from the `ocr_dataset_poc.csv` file created in module 0. The images are generated with the open-source TextRecognitionDataGenerator.\n", "\n", "- Reference: https://github.com/Belval/TextRecognitionDataGenerator" ] }, { "cell_type": "code", "execution_count": 1, "id": "1abbdeb2-4261-4408-8366-231165958f42", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com\n", "Requirement already satisfied: lmdb in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 1)) (1.4.0)\n", "Requirement already satisfied: nltk in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 2)) (3.8.1)\n", "Requirement already satisfied: natsort in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (8.2.0)\n", "Requirement already satisfied: pillow>=7.0.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 4)) (9.3.0)\n", "Requirement already satisfied: requests>=2.20.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 5)) (2.28.1)\n", "Requirement already satisfied: tqdm>=4.23.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 6)) (4.64.1)\n", "Requirement already satisfied: beautifulsoup4>=4.6.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 7)) (4.11.1)\n", "Requirement already satisfied: 
diffimg in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 8)) (0.3.0)\n", "Requirement already satisfied: fire in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 9)) (0.5.0)\n", "Requirement already satisfied: kiwipiepy in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 10)) (0.14.1)\n", "Requirement already satisfied: transformers in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 11)) (4.23.1)\n", "Requirement already satisfied: datasets in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 12)) (2.9.0)\n", "Requirement already satisfied: jiwer in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 13)) (2.5.1)\n", "Requirement already satisfied: nvidia-ml-py3 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from -r requirements.txt (line 14)) (7.352.0)\n", "Collecting wikipedia\n", " Downloading wikipedia-1.4.0.tar.gz (27 kB)\n", " Preparing metadata (setup.py) ... 
done\n", "Requirement already satisfied: joblib in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from nltk->-r requirements.txt (line 2)) (1.2.0)\n", "Requirement already satisfied: click in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from nltk->-r requirements.txt (line 2)) (8.1.3)\n", "Requirement already satisfied: regex>=2021.8.3 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from nltk->-r requirements.txt (line 2)) (2022.10.31)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from requests>=2.20.0->-r requirements.txt (line 5)) (2022.9.24)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from requests>=2.20.0->-r requirements.txt (line 5)) (2.1.1)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from requests>=2.20.0->-r requirements.txt (line 5)) (1.26.8)\n", "Requirement already satisfied: idna<4,>=2.5 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from requests>=2.20.0->-r requirements.txt (line 5)) (3.4)\n", "Requirement already satisfied: soupsieve>1.2 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from beautifulsoup4>=4.6.0->-r requirements.txt (line 7)) (2.3.2.post1)\n", "Requirement already satisfied: termcolor in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from fire->-r requirements.txt (line 9)) (2.2.0)\n", "Requirement already satisfied: six in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from fire->-r requirements.txt (line 9)) (1.16.0)\n", 
"Requirement already satisfied: kiwipiepy-model~=0.14 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from kiwipiepy->-r requirements.txt (line 10)) (0.14.0)\n", "Requirement already satisfied: numpy in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from kiwipiepy->-r requirements.txt (line 10)) (1.21.6)\n", "Requirement already satisfied: dataclasses in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from kiwipiepy->-r requirements.txt (line 10)) (0.6)\n", "Requirement already satisfied: filelock in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from transformers->-r requirements.txt (line 11)) (3.8.0)\n", "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from transformers->-r requirements.txt (line 11)) (0.13.1)\n", "Requirement already satisfied: pyyaml>=5.1 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from transformers->-r requirements.txt (line 11)) (5.4.1)\n", "Requirement already satisfied: packaging>=20.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from transformers->-r requirements.txt (line 11)) (21.3)\n", "Requirement already satisfied: importlib-metadata in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from transformers->-r requirements.txt (line 11)) (4.11.4)\n", "Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from transformers->-r requirements.txt (line 11)) (0.10.1)\n", "Requirement already satisfied: xxhash in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (3.2.0)\n", 
"Requirement already satisfied: fsspec[http]>=2021.11.1 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (2022.10.0)\n", "Requirement already satisfied: aiohttp in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (3.8.3)\n", "Requirement already satisfied: multiprocess in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (0.70.14)\n", "Requirement already satisfied: pandas in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (1.3.5)\n", "Requirement already satisfied: dill<0.3.7 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (0.3.6)\n", "Requirement already satisfied: pyarrow>=6.0.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (11.0.0)\n", "Requirement already satisfied: responses<0.19 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from datasets->-r requirements.txt (line 12)) (0.18.0)\n", "Requirement already satisfied: levenshtein==0.20.2 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from jiwer->-r requirements.txt (line 13)) (0.20.2)\n", "Requirement already satisfied: rapidfuzz<3.0.0,>=2.3.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from levenshtein==0.20.2->jiwer->-r requirements.txt (line 13)) (2.13.7)\n", "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (4.0.2)\n", 
"Requirement already satisfied: yarl<2.0,>=1.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (1.8.1)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (1.3.1)\n", "Requirement already satisfied: attrs>=17.3.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (22.1.0)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (6.0.2)\n", "Requirement already satisfied: asynctest==0.13.0 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (0.13.0)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (1.2.0)\n", "Requirement already satisfied: typing-extensions>=3.7.4 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from aiohttp->datasets->-r requirements.txt (line 12)) (4.4.0)\n", "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from packaging>=20.0->transformers->-r requirements.txt (line 11)) (3.0.9)\n", "Requirement already satisfied: zipp>=0.5 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from importlib-metadata->transformers->-r requirements.txt (line 11)) (3.10.0)\n", "Requirement already satisfied: python-dateutil>=2.7.3 in 
/home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from pandas->datasets->-r requirements.txt (line 12)) (2.8.2)\n", "Requirement already satisfied: pytz>=2017.3 in /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages (from pandas->datasets->-r requirements.txt (line 12)) (2022.5)\n", "Building wheels for collected packages: wikipedia\n", " Building wheel for wikipedia (setup.py) ... done\n", " Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11680 sha256=aa79590ae292e442b5ac1e139d2342f97bdb15aabf86d5b462ac13d45e7584c4\n", " Stored in directory: /home/ec2-user/.cache/pip/wheels/76/d9/1c/4059e4887aaf06a2481aec77e84f3b9f9a010e2b6ad523b95b\n", "Successfully built wikipedia\n", "Installing collected packages: wikipedia\n", "Successfully installed wikipedia-1.4.0\n" ] } ], "source": [ "!pip install -r requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "id": "99005511-1fc6-4caf-9cb0-393b5372ac77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|  | document | category |\n", "|---|---|---|\n",
"| 398118 | 누리호 발사 성공을 축하합니다 | news |\n",
"| 447804 | 그동안 두 자릿수였던 수출 증가율이 지난달 한 자릿수로 떨어지는 등 둔화세가 본격화... | news |\n",
"| 464420 | 동원참치 캔 개발·생산 경험이 있는 동원시스템즈는 차별화된 기술을 적용한 4680 ... | news |\n",
"| 534239 | 봉고 III EV 냉동탑차는 135kW 모터와 588kWh 배터리를 탑재해 완충 시... | news |\n",
"| 235107 | 아 점깍은이유는 네이버가 영화 분류를 코믹으로 안하고 공포에 집어넣은 실수 때문에 | nsmc |\n",
"| 618045 | 유경준 의원은 개회사에서 “전 세계적으로 성장률 저하와 물가 상승이 동반되는 스테그... | news |\n",
"| 633123 | 사랑은 변하고 사람은 안 변해요 | chatbot |\n",
"| 20017 | 꼴깝 떨고 앉았있네 별 시덥잖은 의미 | nsmc |\n",
"| 184000 | 청춘의지표는 여기서부터 시작이아닐까 싶다 | nsmc |\n",
"| 115726 | 차화연의 빛나는 미모 | nsmc |\n", "