U C^@spddlmZddlmZddlmZddlmZddlm Z ddl m Z dd d Z dd dZ ddZddZd S))unicode_literals)Printer) iob_to_biluo) MultiLanguage)Doc) load_model FNc Kst|d}d}d|kr(|r(|dd}||krB|rB|dd}d|krl||krl|rlt||t|||}d|kr||kr|rt|dd||d }d|kr||kr|dkr|s|d |nt||t|||||d }d|kr|d ||kr|d g}||D]} | } | s*qg} | dD]} | } | sNq8d d| dD} tt dd| D} t | dkrt d| d}| d}t | dkr| d}ndgt |}t |}| dddt |||Diq8| t |d| igdg} q|S)a$ Convert files in the CoNLL-2003 NER format and similar whitespace-separated columns into JSON format for use with train cli. The first column is the tokens, the final column is the IOB tags. If an additional second column is present, the second column is the tags. Sentences are separated with whitespace and documents can be separated using the line "-DOCSTART- -X- O O". Sample format: -DOCSTART- -X- O O I O like O London B-GPE and O New B-GPE York I-GPE City I-GPE . O )no_printz-DOCSTART- -X- O O zNSentence boundaries found, automatic sentence segmentation with `-s` disabled.FzNDocument delimiters found, automatic document segmentation with `-n` disabled.r)modelmsgzzNo sentence boundaries found to use with option `-n {}`. Use `-s` to automatically segment sentences or `-n 0` to disable.zJNo sentence boundaries found. Use `-s` to automatically segment sentences.zWNo document delimiters found. Use `-n` to automatically group sentences into documents.cSsg|]}|r|qS)strip.0linerrF/tmp/pip-install-6_kvzl1k/spacy/spacy/cli/converters/conll_ner2json.py asz"conll_ner2json.. cSsg|] }|qSr)splitrrrrrbszThe token-per-line NER file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert-tokenscSsg|]\}}}|||dqS))ZorthtagZnerr)rwrentrrrrrs sentences)idZ paragraphs)rwarn n_sents_info segment_docssegment_sents_and_docsformatrrlistziplen ValueErrorrappend) input_datan_sentsZ seg_sentsr r kwargsr doc_delimiterZ output_docsdocZ output_docsentlinescolswordsZiob_entstagsZ biluo_entsrrrconll_ner2json s          r6cCsd}|r4t|}d|jkr4|d||d}|sR|dt}|d}|d}dd|D}t |j |d} || g} d } t | D]H\} } | j r|r| |d kr| || d | d 7} | || qd| S) Nparserz1Segmenting sentences with parser from model '{}'.zhSegmenting sentences with sentencizer. (Use `-b model` for improved parser-based sentence segmentation.) sentencizerrcSsg|]}|dqS)r)rrrrrrrsz*segment_sents_and_docs..)r4rr r)rZ pipe_namesinfor&Zget_piperZ create_piperrrZvocab enumerateZ is_sent_startr+join)r0r-r/r rr8Znlpr2r4ZnlpdocZlines_with_segsZ sent_countitokenrrrr%s4     r%csZd}||fddtdtD}d}|D]}|||7}|||7}q6|S)Nr csg|]}||qSrr)rr<r-Zsentsrrrsz segment_docs..rr )rranger)r;)r,r-r/Zsent_delimiterZdocsr0rr>rr$s   r$cCs&|d||dkr"|ddS)Nz,Grouping every {} sentences into a document.rz^To generate better training data, you may want to group sentences into documents with `-n 10`.)r9r&r")rr-rrrr#s r#)r FNF)NN) __future__rZwasabirZgoldrZlang.xxrZ tokens.docrutilrr6r%r$r#rrrrs       s