B Anð]™Eã@søddlmZmZmZddlZddlZddlmZddlZddl m Z ddl m Z ddlZddlZddlmZddlZddlZddlmZdd lmZmZmZmZdd lmZdd lmZdd lmZej d dde!fdddefdddefdddefdddefddde!fddde!fddde!fddde"fddde"fddd e"fd!dd"e!fd#dd$efd%dd&efd'dd(e!fd)dd*e!fd+dd,e#fd-dd.e!fd/d0d1e$fd2d0d3e$fd4d0d5e$fd6d0d7e$fd8dPd@dA„ƒZ%ej&dBdC„ƒZ'dDdE„Z(dFdG„Z)dHdI„Z*dJdK„Z+dLdM„Z,dQdNdO„Z-dS)Ré)Úunicode_literalsÚdivisionÚprint_functionN)ÚPath)ÚModel)Ú default_timer)ÚPrinteré)Úcreate_default_optimizer)ÚPROBÚIS_OOVÚCLUSTERÚLANG)Ú GoldCorpus)Úutil)ÚaboutzModel languageÚ positionalz"Output directory to store model inz(Location of JSON-formatted training dataz+Location of JSON-formatted development dataz2Path to jsonl file with unlabelled text documents.ÚoptionÚrtz"Name of model to update (optional)Úbz,Comma-separated names of pipeline componentsÚpzModel to load vectors fromÚvzNumber of iterationsÚnzNumber of examplesÚnszUse GPUÚgz Model versionÚVz*Optional path to meta.json to use as base.ÚmzkPath to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.Zt2vz7Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'Úptz4Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'Úetz*Amount of corruption for data augmentationÚnlz!Beam widths to evaluate, e.g. 4,8ÚbwzUse gold preprocessingÚflagÚGz,Make parser learn gold-standard tokenizationÚTz"Display more information for debugZVVz$Run data diagnostics before trainingÚD)ÚlangÚ output_pathÚ train_pathÚdev_pathÚraw_textÚ base_modelÚpipelineÚvectorsÚn_iterÚ n_examplesÚuse_gpuÚversionÚ meta_pathÚ init_tok2vecÚparser_multitasksÚentity_multitasksÚ noise_levelÚeval_beam_widthsÚ gold_preprocÚ learn_tokensÚverboseÚdebugútagger,parser,nerééÿÿÿÿú0.0.0ÚçFcCsHtƒ}t ¡t |¡t |¡}t |¡}t | ¡} |dk rLtt |¡ƒ}|rX| ¡sh|j d|dd|rt| ¡s„|j d|dd| dk r¤|  ¡s¤|j d| dd| r²t  | ¡ni}| ¡rÜdd„|  ¡DƒrÜ|  d d ¡| ¡sì|  ¡t t d d ¡t d d ¡t dd¡¡}t t dd¡t dd¡t dd¡¡}|sFdg}n0dd„| d¡Dƒ}d|krn| d¡| ¡|dgk}dd„ˆ d¡Dƒ‰| d ˆ¡¡|r4| d |¡¡t |¡‰ˆj|krè|j d ˆj|¡dd‡fdd„ˆjDƒ}ˆj|ŽxlˆD]"}|ˆjkr ˆ ˆ |¡¡q Wn@| d |¡¡t |¡}|ƒ‰xˆD]}ˆ ˆ |¡¡qZW|rŠˆ ˆ d¡¡|rª| d |¡¡tˆ|ƒd |fd!|fg}x\|D]T\}} | rÀ|ˆkrè|  d" |¡¡ˆ |¡}x|  d¡D]}!| |!¡qþWqÀW| d# | ¡¡t ||| d$‰ˆ !¡}"|rPt"t#j$ƒ}#nˆj%‡fd%d&„| d'}#dˆ_&| dk rt'ˆ| ƒ}$| d( |$¡¡d)d*d+d,d-d.d/d0d1d2d3g }%d4d5d5d6d6d6d6d6d6d6d6g }&|râ|% (dd7¡|& (dd6¡|&t)d8d„|%Dƒƒd9d:œ}'t*d;ƒ|j+|%f|'Ž|j+dd?})|r‚t. /|¡tj0‡fd@dA„|DƒdBdC}*d>}+t1j1|"dDdE¶},i}-xªtj2|)|dCD]˜}.|.s¸qªt3|.Ž\}/}0ˆj4|/|0|#t5|ƒ|-dF|rþtt5|*ƒƒ}1ˆj6|1|#|-dGt7t8j9 :dHd>¡ƒs*|, 4t;dIdA„|/Dƒƒ¡|+t;dJdA„|/Dƒƒ7}+qªWWdQRXˆ <|#j=¡nt dD¡|dK|(}2ˆ >|2¡t ?|2¡}3x:|D]0}4x*|3j@D] \}5}6tA|6dLƒr |4|6jBdM<q WtˆjC|3|dNƒ}7t;dOdA„|7Dƒƒ}8tDƒ}9|3 E|7|¡}:tDƒ};| d>krd}<|8|;|9}=nl|8|;|9}t H|>|:jI¡ˆj|dR<ˆj|dS<dTtJjK|dU<|4dkrî|8|=|WWdˆ <|#j=¡|da}Aˆ >|A¡WdQRX| Tdb|A¡| Udc¡tV||ˆjƒ}BWdQRX| Tdd|B¡XdS)ez« Train or update a spaCy model. Requires data to be formatted in spaCy's JSON format. To convert data from other formats, use the `spacy convert` command. NzTraining data not foundé)ÚexitszDevelopment data not foundzCan't find model meta.jsoncSsg|]}| ¡r|‘qS©)Úis_dir)Ú.0rrCrCúr/home/app_decipher_dev_19-4/dev/decipher-analysis/serverless-application/helper/df_spacy/python/spacy/cli/train.pyú qsztrain..zOutput directory is not emptyzÍThis can lead to unintended side effects when saving the model. Please use an empty directory or a different path instead. If the specified output path doesn't exist, the directory will be created for you.Z dropout_fromgš™™™™™É?Z dropout_toÚ dropout_decaygZ batch_fromgY@Zbatch_tog@@Zbatch_compoundgj¼t“ð?cSsg|] }t|ƒ‘qSrC)Úint)rEr rCrCrFrGŽsú,cSsg|] }| ¡‘qSrC)Ústrip)rErrCrCrFrG—szTraining pipeline: {}zStarting with base model '{}'zQModel language ('{}') doesn't match language specified as `lang` argument ('{}') csg|]}|ˆkr|‘qSrCrC)rEÚpipe)r+rCrFrG¢szStarting with blank model '{}'Zmerge_subtokenszLoading vector from model '{}'ÚparserÚnerz:Can't use multitask objective without '{}' in the pipelinez"Counting training words (limit={}))ÚlimitcsˆjS)N)Z train_tuplesrC)ÚcorpusrCrFÚÌóztrain..)Údevicez!Loaded pretrained tok2vec for: {}ZItnzDep LosszNER LossZUASzNER PzNER RzNER FzTag %zToken %zCPU WPSzGPU WPSéé ézBeam W.cSsg|]}d‘qS)ÚrrC)rEÚirCrCrFrGÛsr )ÚwidthsÚalignsÚspacingr?cSsg|] }d|‘qS)ú-rC)rEÚwidthrCrCrFrGßsrYr)r5r7Ú max_lengthc3s|]}ˆ |d¡VqdS)ÚtextN)Zmake_doc)rEr)ÚnlprCrFú èsztrain..é)ÚsizeF)ÚtotalÚleave)ÚsgdÚdropÚlosses)rfrhÚ LOG_FRIENDLYcss|]}t|ƒVqdS)N)Úlen)rEÚdocrCrCrFraþscss|]}t|ƒVqdS)N)rj)rErkrCrCrFraÿszmodel%dÚcfgÚ beam_width)r7css|]}t|dƒVqdS)rN)rj)rEZdoc_goldrCrCrFra sÚcpuz accuracy.jsonr%r+z>=%sÚ spacy_version)ÚnwordsrnÚgpuÚspeedÚaccuracyZ beam_accuracyZ beam_speed)r]r,ÚkeysÚnamer,rur0z meta.json)rmÚcpu_wpsÚgpu_wpsz model-finalzSaved model to output directoryzCreating best model...zCreated best model)WrrÚfix_random_seedÚ set_env_logÚ ensure_pathÚlistÚsrslyÚ read_jsonlÚexistsÚfailÚ read_jsonÚiterdirÚwarnÚmkdirÚdecayingÚenv_optÚ compoundingÚsplitÚappendÚsortr_ÚformatÚ load_modelr%Ú pipe_namesZ disable_pipesÚadd_pipeÚ create_pipeÚget_lang_classÚ _load_vectorsZget_pipeZadd_multitask_objectiverZ count_trainr rÚopsÚbegin_trainingZ _optimizerÚ_load_pretrained_tok2vecÚinsertÚtupleÚprintÚrowÚrangeÚ train_docsÚrandomÚshuffleÚ minibatchÚtqdmÚminibatch_by_wordsÚzipÚupdateÚnextZrehearserIÚosÚenvironÚgetÚsumÚ use_paramsÚaveragesÚto_diskÚload_model_from_pathr+ÚhasattrrlÚdev_docsÚtimerÚevaluateÚ use_devicerMÚ write_jsonÚscoresrÚ __version__Ú setdefaultÚvocabÚvectors_lengthrjr,Ún_keysruÚ _get_progressÚgoodÚloadingÚ_collate_best_model)Cr%r&r'r(r)r*r+r,r-r.r/r0r1r2r3r4r5r6r7r8r9r:ÚmsgÚmetaZ dropout_ratesZ batch_sizesZhas_beam_widthsZ other_pipesrLZlang_clsZmultitask_optionsZ pipe_nameZ multitasksZ objectiveZ n_train_wordsÚ optimizerÚ componentsZrow_headZ row_widthsZ row_settingsrXr™Z raw_batchesZ words_seenÚpbarrhÚbatchÚdocsZgoldsZ raw_batchZepoch_model_pathZ nlp_loadedrmruÚ componentr«rpZ start_timeZscorerZend_timerwrvZacc_locZmeta_locZprogressZfinal_model_pathZbest_model_pathrC)rPr`r+rFÚtrainsrI                               &                        (  rÂccs2ttj dd¡ƒrdVntj|dd}|VdS)NrirF)rdre)rIr¢r£r¤r)rdr¾rCrCrFÚ_create_progress_barUsrÃcCsztj||jdxd|jD]Z}i}x>|jj ¡D].\}}|ttttfkr.||j ƒ||jj |<q.W|j f|Žd|_ qWdS)N)r³F) rr‹r³Zlex_attr_gettersÚitemsr r r rZorth_ÚstringsÚ set_attrsZis_oov)r`r,ÚlexÚvaluesÚattrÚfuncrCrCrFr^s  rc Csf| d¡}| ¡}WdQRXg}x>|jD]4\}}t|dƒr*t|jdƒr*|j |¡| |¡q*W|S)z—Load pre-trained weights for the 'token-to-vector' part of the component models, which is typically a CNN. See 'spacy pretrain'. Experimental. ÚrbNÚmodelÚtok2vec)ÚopenÚreadr+rªrÌrÍÚ from_bytesrˆ)r`ÚlocÚfile_Z weights_dataZloadedrurÁrCrCrFr“ks  r“c Cs´i}x|D]}t||ƒ||<q W|d}t |d|¡xf| ¡D]Z\}}t ||¡t ||||¡t |d¡}x t|ƒD]}|||d|<q„WqBWt |d|¡|S)Nz model-bestz model-finalz accuracy.jsonrsz meta.json) Ú _find_bestÚshutilÚcopytreerÄÚrmtreer|r€Ú _get_metricsr¯) r»r&r½ZbestsrÁZ best_destZbest_component_srcÚaccsÚmetricrCrCrFr¹ys r¹csvg}xX| ¡D]L}| ¡r|jddkrt |d¡‰‡fdd„t|ƒDƒ}| ||f¡qW|rnt|ƒdSdSdS)Nr=z model-finalz accuracy.jsoncsg|]}ˆ |d¡‘qS)g)r¤)rErÙ)rØrCrFrGŽsz_find_best..rA)rrDÚpartsr|r€r×rˆÚmax)Zexperiment_dirrÁZ accuraciesZ epoch_modelr°rC)rØrFrÓ‰s rÓcCs(|dkr dS|dkrdS|dkr$dSdS)NrM)ZlasÚuasÚ token_accÚtagger)Útags_accrN)Úents_fÚents_pÚents_r)rÝrC)rÁrCrCrFr×–sr×c Csþi}xdD] }d||<q W| dd¡|d<| dd¡|d<| dd¡|d<| |¡||d <|pbd|d <|d  |d¡d  |d¡d  |d ¡d  |d ¡d  |d¡d  |d¡d  |d¡d  |d¡d |d ¡d |d ¡g }|dk rú| d|¡|S)N) Údep_lossÚtag_lossrÜrßrÝrárâràrvrwgrMrãrNZner_lossrÞrärvrwz{:.3f}rÜrárâràrßrÝz{:.0f}rA)r¤r rŠr”) ÚitnrhZ dev_scoresrmrvrwr°ÚcolÚresultrCrCrFr¶ s.             r¶)NNr;Nr<rr=r>NNr?r?r@r?FFFF)Nr@r@).Ú __future__rrrÚplacr¢ÚpathlibrrZthinc.neural._classes.modelrZtimeitrr¬rÔr|ÚwasabirÚ contextlibršZ_mlr Úattrsr r r rÚgoldrr?rrÚ annotationsÚstrrIÚfloatÚboolrÂÚcontextmanagerrÃrr“r¹rÓr×r¶rCrCrCrFÚsš                          }