U C^mL @s2dZddlmZddlZddlmZddlZddlZddlZddl Zddl m Z ddl m Z mZddlmZddl mZmZmZdd lmZdd lmZdd lmZdd lmZddlZdd lmZddlmZddlmZz ddl Z Wne!k rdZ YnXe"dZ#ddZ$dKddZ%ddZ&ddZ'dLddZ(ddZ)dMd d!Z*d"d#Z+d$d%Z,d&d'Z-dNd(d)Z.d*d+Z/d,d-Z0Gd.d/d/e1Z2Gd0d1d1e1Z3Gd2d3d3e1Z4ej5d4d5defd6d5defd7d5de6fd8d9d:efd;d9de7fd?d@dAe7fdBd9dCefdDdOdFdGZ8dHdIZ9e:dJkr.e;e8dS)PzTrain for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes .conllu format for development data, allowing the official scorer to be used. )unicode_literalsN)Path)conll17_ud_eval)TokenDoc) GoldParse) compounding minibatchminibatch_by_words) projectivize)Matcher)displacy) defaultdict)lang)zh)jaz\s+cCsdd|dDS)NcSsg|]}td|qS) )space_resubstrip).0parr2/tmp/pip-install-6_kvzl1k/spacy/bin/ud/ud_train.py +szsplit_text..z )splittextrrr split_text*srTFc Cs|s|stdt|}t|}g} g} tt||D]>\} \} } g}| D]}tt}|D]\ }}}}}}}}}}d|krqbd|krqbt|d}|dkrt|dn|}|d ||d ||d t ||dd  d ||d  ||d  |d krdn||d |dkqbdgt |d|d<t |d |d \|d <|d <|r| t|j|d|dd| t| d f|| d jdk st| ||rP|rPt ||krPt|d|\}}|jdk stg}| || ||rPt | |krP| | fSqP|r^|r^t|d|\}}| || ||r:t | |kr:| | fSq:| | fS)a(Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True, include Doc objects created using nlp.make_doc and then aligned against the gold-standard sequences. If oracle_segments=True, include Doc objects created from the gold-standard segments. At least one must be True.z8At least one of raw_text or oracle_segments must be True.-0wordstags morphologyzPOS_%sheadsdepsrootROOTspaces_entities)r#r+N) ValueErrorrread read_conllu enumerateziprlistintappend_parse_morph_stringaddlenr rvocabrr%AssertionError _make_gold)nlpZ conllu_file text_fileraw_textoracle_segmentsmax_doc_lengthlimitZ paragraphsconlludocsgoldsZdoc_idrZcd sent_annotscssentid_wordZlemmapostagmorphheaddepr,Z space_afterdocgoldrrr read_data.s\          rQcCst|dkrtSg}dddd}|dD]B}|d\}}|||}|dd }|d ||fq(t|S) Nr,onetwothree)123|=,rz%s_%s)setrgetr5lower)Z morph_stringoutput replacementsZfeaturekeyvaluerrrr6ns  r6cCsg}g}g}|D]}|dr2|r,||g}q|dr@qq|s\|rV||g}q|t|dt|ddkrtt|tq|r|||r|||S)Nz# newdoc# r& ) startswithr5rr3rr8printreprr.)file_rCrGrOlinerrrr0zs.       r0c s ttg}|D]f}dfdd|dDdD]}|||q8|d|dgt|ddqtdtd kst|dkrd d dtdd D}||} d t |f}||_ t t|j D]$} t|krd|j | <d|j| <q||fS) Nr'c3s|]}td|VqdS)r#Nr8)rrMZflatrr sz_make_gold..)r#r$r(r%r-r+TFr#r!r+css|]\}}|d|VqdS)rNr)rrIspacerrrrms)rr3extendr5r8r:joinr2Zmake_docpopr sent_startsranger'randomlabels) r<rrEZ drop_depsrsrGfieldrOrPirrlrr;s,         r;c Cs\g}t||D]H\}}|j}t|j\}}}} } } |||| | | fgfg} ||| fq|S)z]Get out the annoying 'tuples' format used by begin_training, given the GoldParse objects.)r2rZ orig_annotr5) rCrDZtuplesrOrPridsr#r$r'rvZiobsentsrrrgolds_to_gold_tuplessr{c Cs:|jddrg}|jdd@}t|D]0}|D]&}dd|D} |t|j| dq2q*W5QRX|jD]\} } t| |}qln4|jddd } t | } t| | }W5QRX|jd dd}t ||W5QRX|jddd@}t |}|jddd}t |}W5QRXt ||}W5QRX||fS) Nr&z.conlluutf8encodingcSsg|] }|dqSr!r)rrirrrrszevaluate..)r#rw)partsendswithopenr0r5rr9pipeliner3piperr/ write_conllurZ load_conlluevaluate)r<Ztext_locZgold_locZsys_locrArCrhZ conllu_docZ conllu_sentr#name componentr=Ztextsout_fileZ gold_fileZgold_udsys_fileZsys_udscoresrrrrs( "  rc sTtdstjdtdtds0tjdddtdsHtjdddt|dj}|ddd d d gt|D]\}g}jr|}fd d |D}t } @}|D]4}t t |j |j } | |s|||| qW5QRX|dj|dtjD]8\} } |dj|| d|dj| jdt| D]\} } | jj| djks|| jj| djkr&| djd| djD]}t|j|jj|j|jq| D]}t|j|jj|j|jq| dj| djdD]}t|j|jj|j|jqtd| j|| j| dqL|dqqtdS)Nget_conllu_linesmethod begins_fusedFdefault inside_fusedrZSUBTOKZsubtok+)ZDEPopcs"g|]\}}}||dqSrr)rr,startendrOrrrsz write_conllu..z# newdoc id = {i} rxz# sent_id = {i}.{j} )rxjz# text = {text} rr&rdz)Invalid parse: head outside sentence (%s) )rZ has_extension set_extensionget_token_conllur r9r7r1Z is_parsedr[Z retokenizertrr intersectionmergeupdatewriteformatrzrrMrxrfdep_r.r,r)rCrhZmergerrxmatchesZspansZ seen_tokensZ retokenizerspanZ span_tokensrrGktokenrIrrrrsJ      (  rc Cs|dd|dd|dd|djd|djd|djd|d jd|d jd|d jdd }d ddd d ddddg }|dkrtd|dd}t|j|f|dS)Nparserrj morphologizertaggerZWordsdZ SentencesZXPOSZUASZLASZFeats) Zdep_lossZ morph_lossZtag_lossr#rzr$ZuasZlasrLZEpochzP.LosszM.LossZTAGZMORPHZSENTZWORDrrc) z{:d}z{dep_loss:.1f}z{morph_loss:.1f}z {las:.1f}z {uas:.1f}z {tags:.1f}z {morph:.1f}z {sents:.1f}z {words:.1f})r\f1rfrqr)itnlossesZ ud_scoresfieldsheadertplrrrprint_progress s           rc CsH|jjrPd}||jjr$|d7}q d|||f}||jddddddddg }ng}|jj|jkrhd}n||jj|jd}t|j}g}dddd}|D]J} | d s| d s| dd\} } | | | } | d | | fq|sd}n d |}t|d|j|j|j|j|t||jddg } | d | d|S)Nr!z%d-%dr,rrUrVrW)rRrSrTbeginrz%s=%srXrcr)r,rZnborrrrMrxr3rLrerr\r5titlerqstrZlemma_Zpos_Ztag_rr]) rrxnrHlinesrMfeaturesZfeat_strr_Zfeatr`rarrrrr*s:     rcCs`|dd}t|}|jrR|s*tdt||rR|jt||d||j d<|S)Nr,rzNconfig asks for vectors, but no vectors directory set on command line (use -v)r9treebank) rspacyZblankvectorsr.rexistsr9Z from_diskmeta)corpusconfigrrr<rrrload_nlpOs  rcs||jdddid||d||d|jrJ|jd|jr\|jdD]$}|jD]}|dk rj|j|qjq`t dk r|d krt d |j fd d ||j |j |jd }|jrt||j|S)NrZset_morphologyF)rrrrKZ sent_startr&ztorch.cuda.FloatTensorcs tSN)r{rrCrDrrmz%initialize_pipeline..)devicesubword_features conv_depth bilstm_depth)Zadd_pipeZ create_pipe multitask_tagrZadd_multitask_objectivemultitask_sentr$rZ add_labeltorchZset_default_tensor_typeZbegin_trainingrrrpretrained_tok2vec_load_pretrained_tok2vec)r<rCrDrrrPrK optimizerrrrinitialize_pipeline^s.      rc Csjt|jddd}|}W5QRXg}|jD]4\}}t|dr0t|jdr0|j|||q0|S)zLoad pretrained weights for the 'token-to-vector' part of the component models, which is typically a CNN. See 'spacy pretrain'. Experimental. rbr|r}modeltok2vec) rrr/rhasattrrr from_bytesr5)r<locrhZ weights_dataZloadedrrrrrrxs  rc@s$eZdZdd d Zedd dZdS)ConfigNrdFrrT皙?cCsD|dk r |dkrd}|dkr d}tD]\}}t|||q*dS)NT)localsitemssetattr)selfrr@rrZ multitask_depZmultitask_vectorsrnr_epochmin_batch_sizemax_batch_sizebatch_by_wordsdropoutrr vectors_dirrr`rarrr__init__szConfig.__init__c CsBt|jddd}t|}W5QRX|dk r8||d<|f|S)Nrr|r}r)rrjsonload)clsrrrhcfgrrrrs z Config.load)NrdFFFNrrrrTrrTNN)N)__name__ __module__ __qualname__r classmethodrrrrrrs& rc@seZdZddZdS)DatasetcCs||_||_d|_d|_|jD]@}|jd}||krJ|drJ||_q"||kr"|dr"||_q"|jdkrd}t|j||d|jdkrd}|jjd dd dd|_ dS) Nr&rBtxtz0Could not find .txt file in {path} for {section})sectionpathr rr,) rrrBriterdirrrIOErrorrrr)rrr file_pathrmsgrrrrs    zDataset.__init__Nrrrrrrrrrsrc@seZdZddZdS) TreebankPathscKs.t||d|_t||d|_|jj|_dS)Ntraindev)rrrr)rZud_pathrrrrrrszTreebankPaths.__init__Nrrrrrrsrz%Path to Universal Dependencies corpus positionalz)Directory to write the development parsesz:UD corpus to train and evaluate on, e.g. UD_Spanish-AnCoraz"Path to json formatted config fileoptionCz Size limitrzUse GPUgzUse oracle segmentsflagGz9Path to directory with pretrained vectors, named e.g. en/v)ud_dir parses_dirrrrA gpu_deviceuse_oracle_segmentsrr&c Csddl}tjdtdtjdddtjdddtjdtjj j _ dtj j j _|dk rltj||d}n t|d}t||} ||s||td |d | jt| j||d } t| | jjjd d | jjjd d |j|d\} } t| | | ||} t|j|jd}tddd}t |j!D]p}t| | jjjd d | jjjd d |j||| d\} } t"t#| | }t$%||j&rt'||d}n t(||d}i}t)dd| D}|j|dd\}|D]P}t#|\}}|*t)dd|Dt+|| j,j-d<| j*||| |j.|dqW5QRX||dj/|d}| 0| j1B|r^t2| | j3j| j3j|\}}nt2| | j3j| j3j|\}}W5QRXt4|||qdS)NrrrrFrr)rzTrain and evaluatez using lang)rr|r})r@rAgjt?rg?)r@rAr?r>)sizecss|]}t|VqdSrrkrrOrrrrmszmain..)totalZleavecss|]}t|VqdSrrkrrrrrmsZbeam_update_prob)ZsgdZdroprzepoch-{i}.conllur)5tqdmrrrrutilZfix_random_seedrrChineseZDefaultsZ use_jiebarJapaneseZ use_janomerrrrmkdirrfrrQrrBrrr@rrrrrtrr3r2rushufflerr r sumrnextrrrrZ use_paramsZaveragesrrr)rrrrrArrrr pathsr<rCrDrZ batch_sizesZ beam_probrxZXsZbatchesrZ n_train_wordsZpbarbatchZ batch_docsZ batch_goldZout_pathZ parsed_docsrrrrmains                rc CsVd||djd<tdjddd(}tj|ddd d d }||W5QRXdS) NzBatch %drrz/tmp/parses.htmlrr|r}rNT)stylepage) user_datarrr renderr)rxZ to_renderrhhtmlrrr_render_parses4sr__main__)TFNN)rj)N)N)Nrr&NF)<__doc__ __future__rZplacpathlibrrerrZ spacy.utilZbin.udrZ spacy.tokensrrZ spacy.goldrrr r Zspacy.syntax.nonprojr Z spacy.matcherr r collectionsrrurZ spacy.langrrr ImportErrorcompilerrrQr6r0r;r{rrrrrrrobjectrrr annotationsrr4rrrcallrrrrs               @  ! *!% %       U