U C^@sddlmZmZddlZddlmZddlmZddlm Z ddl m Z dd l m Z dd lmZdd lmZdd lmZed dddgZddZddZddZGdddeZGddde jZGddde ZddZeeedgZdS))unicode_literalsprint_functionN) namedtuple) STOP_WORDS)TAG_MAP)LANG)Language)Doc)copy_reg)DummyTokenizer ShortUnitWordsurfacelemmaposcCs0zddl}|WStk r*tdYnXdS)zuMecab is required for Japanese support, so check for it. It it's not available blow up and explain how to fix it.rNzJJapanese support requires MeCab: https://github.com/SamuraiT/mecab-python3)MeCab ImportError)rr9/tmp/pip-install-6_kvzl1k/spacy/spacy/lang/ja/__init__.pytry_mecab_importsrcCsX|jdkrdS|jdkrRtd|jr0|jdStd|jrH|jdS|jdS|jS)a3If necessary, add a field to the POS tag for UD mapping. Under Universal Dependencies, sometimes the same Unidic POS tag can be mapped differently depending on the literal token or its context in the sentence. This function adds information to the POS tag to resolve ambiguous mappings. 空白u連体詞,*,*,*u[こそあど此其彼]のz,DETu[こそあど此其彼]z,PRONz,ADJ)rrematchr)tokenrrr resolve_pos!s     rc Cs||}|j}g}g}|jdkr|j}|}|jd}d|dd}t|dkr^|d}|t ||||jj |jj } |t | | dkr|t ddd|d| d8} q|j}q||fS) z@Format Mecab output into a nice data structure, based on Janome.r,r rF) parseToNodenextZposidrZfeaturesplitjoinlenappendrZrlengthlengthbool) tokenizertextnodewordsspacesrbasepartsrZscountrrrdetailed_tokens9s(      r/c@seZdZdddZddZdS)JapaneseTokenizerNcCs6|dk r|jn|||_t|_|jddS)N)vocabZ create_vocabrZTaggerr(r )selfclsnlprrr__init__Zs zJapaneseTokenizer.__init__c Csrt|j|\}}dd|D}t|j||d}g}t||D]&\}}||jt||_|j |_ q<||j d<|S)NcSsg|] }|jqSr)r).0xrrr asz.JapaneseTokenizer.__call__..)r+r, mecab_tags) r/r(r r2zipr%rrZtag_rZlemma_ user_data) r3r)Zdtokensr,r+docr:rZdtokenrrr__call___s    zJapaneseTokenizer.__call__)N)__name__ __module__ __qualname__r6r>rrrrr0Ys r0c@sFeZdZeejjZddee<eZ e Z ddddZ e d ddZdS) JapaneseDefaultscCsdS)Njar)Z_textrrrnzJapaneseDefaults.ZltrF) directionZhas_caseZ has_lettersNcCs t||SN)r0)r4r5rrrcreate_tokenizerssz!JapaneseDefaults.create_tokenizer)N)r?r@rAdictr DefaultsZlex_attr_gettersr r stop_wordsrtag_mapZwriting_system classmethodrHrrrrrBls   rBc@seZdZdZeZddZdS)JapaneserCcCs ||SrG)r()r3r)rrrmake_doc|szJapanese.make_docN)r?r@rAlangrBrJrOrrrrrNxsrNcCs ttfSrG)rNtuple)instancerrrpickle_japanesesrS) __future__rrr collectionsrrKrrLrattrsr languager tokensr compatr utilr rrrr/r0rJrBrNrSpickle__all__rrrrs&