U á€C^Tã@sXddlmZddlZddlmZddd„Zd d „Zdd d „Zd d„Zdd„Z dd„Z dS)é)Úunicode_literalsNé)Ú iob_to_biluoé FcKsªg}g}t||d}d}d} t|ƒD]h\} \} } | d\} }|sTt| ddƒ} d}| t| | ƒ¡t|ƒ|dkr$t|| ƒ}| |¡g}q$|r¦t|| ƒ}| |¡|S)aG Convert conllu files into JSON format for use with train cli. use_morphology parameter enables appending morphology to tags, which is useful for languages such as Spanish, where UD tags are not so rich. Extract NER tags if available and convert them so that they follow BILUO and the Wikipedia scheme )Úuse_morphologyFréT)Ú read_conllxÚ enumerateÚis_nerÚappendÚgenerate_sentenceÚlenÚ create_doc)Ú input_dataZn_sentsrÚlangÚ_ZdocsÚ sentencesZ conll_tuplesZchecked_for_nerÚ has_ner_tagsÚiZraw_textÚtokensÚsentenceZbracketsÚdoc©rúC/tmp/pip-install-6_kvzl1k/spacy/spacy/cli/converters/conllu2json.pyÚ conllu2json s&       rcCs(t d|¡}|rdS|dkr dSdSdS)za Check the 10th column of the first token to determine if the file contains NER tags ú([A-Z_]+)-([A-Z_]+)TÚOFN)ÚreÚmatch)ÚtagÚ tag_matchrrrr +s  r c csbd}| ¡ d¡D]H}| ¡ d¡}|r|d d¡rD| d¡q*g}|D]Î}| d¡}|\ } } } } } }}}}}d| ksLd| kr„qLz~t| ƒd} |d kr¦t|ƒdn| }|d kr¶d n|}| d krÆ| n| } |rÚ| d |n| } |ræ|nd}| | | | |||f¡WqLt|ƒ‚YqLXqLdd„t|ŽDƒ}d|gggfV|d7}|dkr||krq^qdS)Nrz Ú ú#ú ú-Ú.éÚ0ÚrootÚROOTrÚ__rcSsg|] }t|ƒ‘qSr)Úlist)Ú.0ÚtrrrÚ Rszread_conllx..)ÚstripÚsplitÚ startswithÚpopÚintr ÚprintÚzip)rrÚnrÚsentÚlinesrÚlineÚpartsÚid_ÚwordZlemmaÚposrZmorphÚheadÚdepZ_1ÚiobZtuplesrrrr9s8     rcCs„g}|D]v}t d|¡}|rt| d¡}| d¡}|dkr>d}n*|dkrLd}n|dkrh|dkrh|dkrhd }|d |}| |¡q|S) zý Simplify tags obtained from the dataset in order to follow Wikipedia scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while 'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to 'MISC'. rr&éZGPE_LOCZLOCZGPE_ORGZORGZPERZMISCr$)rrÚgroupr )r@Znew_iobrr ÚprefixÚsuffixrrrÚ simplify_tagsYs     rEcCs¤|\}}}}}}i}g} |r,t|ƒ}t|ƒ} t|ƒD]b\} } i} | | d<|| | d<|| | d<|| | | d<|| | d<|rŒ| | | d<|  | ¡q4| |d<|S)NÚidZorthrr>r?Znerr)rErr r )r7rr;r<rr>r?r@rrZbiluorrFÚtokenrrrr qs$     r cCs2i}i}||d<g|d<||d<|d |¡|S)NrFZ paragraphsr)r )rrFrZ paragraphrrrr†sr)rFN)Fr) Ú __future__rrZgoldrrr rrEr rrrrrÚs   "