U á€C^ákã@s>ddlmZmZmZddlZddlZddlmZddlm Z ddl m Z ddl Z ddlZddlmZddlZddlZddlmZdd lmZmZmZmZdd lmZdd lmZdd lmZdd lm Z ej!ddde"fdddefdddefdddefdddefddde"fddde"fddde"fddde#fddde#fd dd!e#fd"dd#e#fd$dd%e"fd&dd'efd(dd)efd*dd+e"fd,dd-e"fd.dd/e$fd0dd1e$fd2dd3e"fd4d5d6e%fd7d5d8e%fd9d5d:e%fd;dde"fd?d5d@e%fdAd5dBe%fdCd`dLdM„ƒZ&dNdO„Z'ej(dPdQ„ƒZ)dRdS„Z*dTdU„Z+dVdW„Z,dXdY„Z-dZd[„Z.d\d]„Z/dad^d_„Z0dS)bé)Úunicode_literalsÚdivisionÚprint_functionN)ÚPath)ÚModel)Ú default_timer)Úmsgé)Úcreate_default_optimizer)ÚPROBÚIS_OOVÚCLUSTERÚLANG)Ú GoldCorpus)Úpath2str)Úutil)ÚaboutzModel languageÚ positionalz"Output directory to store model inz(Location of JSON-formatted training dataz+Location of JSON-formatted development dataz2Path to jsonl file with unlabelled text documents.ÚoptionÚrtz"Name of model to update (optional)Úbz,Comma-separated names of pipeline componentsÚpzModel to load vectors fromÚvzNumber of iterationsÚnzBMaximum number of training epochs without dev accuracy improvementÚnezNumber of examplesÚnszUse GPUÚgz Model versionÚVz*Optional path to meta.json to use as base.ÚmzkPath to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.Zt2vz7Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'Úptz4Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'Úetz*Amount of corruption for data augmentationÚnlz5Amount of orthography variation for data augmentationZovlz!Beam widths to evaluate, e.g. 4,8ÚbwzUse gold preprocessingÚflagÚGz,Make parser learn gold-standard tokenizationÚTz6Textcat classes aren't mutually exclusive (multilabel)ZTMLzTextcat model architectureÚtaz9Textcat positive label for binary classes with two labelsÚtplz"Display more information for debugZVVz$Run data diagnostics before trainingÚD)ÚlangÚ output_pathÚ train_pathÚdev_pathÚraw_textÚ base_modelÚpipelineÚvectorsÚn_iterÚn_early_stoppingÚ n_examplesÚuse_gpuÚversionÚ meta_pathÚ init_tok2vecÚparser_multitasksÚentity_multitasksÚ noise_levelÚorth_variant_levelÚeval_beam_widthsÚ gold_preprocÚ learn_tokensÚtextcat_multilabelÚ textcat_archÚtextcat_positive_labelÚverboseÚdebugútagger,parser,nerééÿÿÿÿú0.0.0ÚçFÚbowcVs¶ ddl}t ¡t |¡t |¡}t |¡}t | ¡} t |¡}|dk rXtt |¡ƒ}|rd| ¡stt j d|dd|r€| ¡st j d|dd| dk r°|  ¡s°t j d| dd| r¾t  | ¡ni}| ¡rèdd „|  ¡Dƒrèt   d d ¡| ¡sø| ¡t t d d ¡t dd ¡t dd¡¡}t t dd¡t dd¡t dd¡¡}|sRdg}n0dd „| d¡Dƒ}d|krz| d¡| ¡|dgk}dd „ˆ d¡Dƒ‰t  d ˆ¡¡|rÌt  d |¡¡t |¡‰ˆj|krôt j d ˆj|¡ddˆ ‡fdd „ˆjDƒ¡ˆD]¶} | ˆjkrj| dkr6d|i}!n| d krP| ||d!œ}!ni}!ˆ ˆj| |!d"¡n\| d krˆ d ¡j}"|"d#|"d$|"d%d!œ}#| ||d!œ}!|#|!krt j d& |#|!¡ddqnrt  d' |¡¡t |¡}$|$ƒ‰ˆD]L} | dkrd|i}!n| d kr"| ||d!œ}!ni}!ˆ ˆj| |!d"¡qð|r^t  d( |¡¡t ˆ|ƒd|fd)|fg}%|%D]P\}&}'|'rr|&ˆkršt   d* |&¡¡ˆ |&¡} |' d¡D]}(|  !|(¡q®qrt  d+ | ¡¡t"||| d,‰ˆ #¡})|rüt$t%j&ƒ}*nˆj'‡fd-d.„| d/}*dˆ_(|dk rˆ d ¡jd1},|rz||,krzt j d2 |¡dd|r¢t*|,ƒd3kr¢t j d4 |¡ddˆj+ˆ||dd5d6}-t,ƒ}.|rd7}/|-D]8\}0}1|. -|1j. /¡¡t|1j. 0¡ƒ 1d8¡dkrÊd5}/qÊ|/s|st   d9¡|s„|-D]^\}0}1|. -|1j. /¡¡t|1j. 0¡ƒ 1d8¡dkr$|s$t   d:¡d7ˆ d ¡jd#<d5}q„q$|r²t,|,ƒ|.kr²t j d; |,t|.ƒ¡dd|rÐt  d< d= 2|,¡¡¡nn|röt*|,ƒd3kröt  d> |¡¡nHt*|,ƒdkr4t*|,ƒd3krt   d?¡t  d@ d= 2|,¡¡¡n t   dA¡t3ˆ| |ƒ\}2}3dBd „|2Dƒ}4|4t4dCd „|2Dƒƒd3dDœ}5t5dEƒt j6|2f|5Žt j6dFd „|5dGDƒf|5Žzžd}8d}9t=|ƒD]†}:ˆj+ˆ|||dd5dL}-|rt> ?|¡tj@‡fdMdN„|DƒdOdP};d}<|j|)d7dQ²}=i}>tjA|-|dPD]˜}?|?s8q*tB|?Ž\}@}Aˆj-|@|A|*tC|ƒ|>dR|r~ttC|;ƒƒ}BˆjD|B|*|>dStEtFjG HdTd¡ƒsª|= -tIdUdN„|@Dƒƒ¡||KjR|3| rT|End|N|Mdm}Q|:dk r¸d ˆk r¸|KjR Hdni¡}R|R ]¡D],\}S}T|T Hdod¡dk rŠt   dp |S¡¡ qŠt j6|Qf|5Žq | dk r8t^|ƒ}U|U|9k rð|8d7}8nd}8|U}9|8| k r8t  dq |:|8¡¡t  dr |9|U¡¡W5QR£ qFW5QRXq¼W5ˆ 7|*j8¡|dH}6ˆ 9|6¡W5QRXt  :dI|6¡t  ;dJ¡t<||ˆjƒ}7W5QRXt  :dK|7¡XdS)sz« Train or update a spaCy model. Requires data to be formatted in spaCy's JSON format. To convert data from other formats, use the `spacy convert` command. rNzTraining data not foundé)ZexitszDevelopment data not foundzCan't find model meta.jsoncSsg|]}| ¡r|‘qS©)Úis_dir©Ú.0rrLrLú2/tmp/pip-install-6_kvzl1k/spacy/spacy/cli/train.pyÚ msztrain..zOutput directory is not emptyzÍThis can lead to unintended side effects when saving the model. Please use an empty directory or a different path instead. If the specified output path doesn't exist, the directory will be created for you.Z dropout_fromgš™™™™™É?Z dropout_toZ dropout_decayrIZ batch_fromgY@Zbatch_tog@@Zbatch_compoundgj¼t“ð?cSsg|] }t|ƒ‘qSrL)Úint)rOr"rLrLrPrQŠsú,cSsg|] }| ¡‘qSrL)ÚstriprNrLrLrPrQ“szTraining pipeline: {}zStarting with base model '{}'zQModel language ('{}') doesn't match language specified as `lang` argument ('{}') csg|]}|ˆkr|‘qSrLrLrN)r/rLrPrQžsÚparserr>Útextcat)Úexclusive_classesÚ architectureÚpositive_label)ÚconfigrWrXrYztThe base textcat model configuration doesnot match the provided training options. Existing cfg: {}, provided cfg: {}zStarting with blank model '{}'zLoading vector from model '{}'Únerz:Can't use multitask objective without '{}' in the pipelinez"Counting training words (limit={}))ÚlimitcsˆjS©N)Z train_tuplesrL)ÚcorpusrLrPÚîóztrain..)Zdevicez!Loaded pretrained tok2vec for: {}ÚlabelszTThe textcat_positive_label (tpl) '{}' does not match any label in the training data.r zŽA textcat_positive_label (tpl) '{}' was provided for training data that does not appear to be a binary classification problem with two labels.T)r:r=Ú max_lengthÚignore_misalignedFgð?z¬The textcat training instances look like they have mutually-exclusive classes. Remove the flag '--textcat-multilabel' to train a classifier with mutually-exclusive classes.zºSome textcat training instances do not have exactly one positive label. Modifying training options to include the flag '--textcat-multilabel' for classes that are not mutually exclusive.znCannot extend textcat model using data with different labels. Base model labels: {}, training data labels: {}.zMTextcat evaluation score: ROC AUC score macro-averaged across the labels '{}'z, z5Textcat evaluation score: F1-score for the label '{}'z“If the textcat component is a binary classifier with exclusive classes, provide '--textcat_positive_label' for an evaluation on the positive class.zHTextcat evaluation score: F1-score macro-averaged across the labels '{}'zOUnsupported textcat configuration. Use `spacy debug-data` for more information.cSsg|] }t|ƒ‘qSrL©Úlen)rOÚwrLrLrPrQMscSsg|]}d‘qS)ÚrrL)rOÚirLrLrPrQNs)ÚwidthsZalignsÚspacingrHcSsg|] }d|‘qS)ú-rL)rOÚwidthrLrLrPrQRsriú model-finalzSaved model to output directoryzCreating best model...zCreated best model)r:r;r=rbrcc3s|]}ˆ |d¡VqdS)ÚtextN)Zmake_doc)rOr)ÚnlprLrPÚ bsztrain..é)Úsize©ÚtotalZleave)ÚsgdZdropÚlosses)rurvÚ LOG_FRIENDLYcss|]}t|ƒVqdSr]rd©rOÚdocrLrLrPrpxscss|]}t|ƒVqdSr]rdrxrLrLrPrpyszmodel%dÚcfgÚ beam_width)r=rccss|]}t|dƒVqdS)rNrd)rOZdoc_goldrLrLrPrpŠs)rBÚcpuú accuracy.jsonr)r/z>=%sZ spacy_version)Únwordsr|ZgpuÚspeedÚaccuracyZ beam_accuracyZ beam_speed)rlr0ÚkeysÚnamer0r‚r5ú meta.json)r{Úcpu_wpsÚgpu_wpsÚtextcats_per_catZ roc_auc_scorezGTextcat ROC AUC score is undefined due to only one value in label '{}'.z%Early stopping, best iteration is: {}z+Best score = {}; Final iteration score = {})_ÚtqdmrZfix_random_seedZ set_env_logZ ensure_pathÚlistÚsrslyZ read_jsonlÚexistsrÚfailÚ read_jsonÚiterdirÚwarnÚmkdirZdecayingZenv_optZ compoundingÚsplitÚappendÚsortrnÚformatÚ load_modelr)Z disable_pipesZ pipe_namesZadd_pipeZ create_pipeZget_piperzZget_lang_classÚ _load_vectorsZadd_multitask_objectiverZ count_trainr rÚopsZbegin_trainingZ _optimizerÚ_load_pretrained_tok2vecreÚ train_docsÚsetÚupdateZcatsrÚvaluesÚcountÚjoinÚ_configure_training_outputÚtupleÚprintÚrowZ use_paramsZaveragesZto_diskZgoodZloadingÚ_collate_best_modelÚrangeÚrandomÚshuffleZ minibatchZminibatch_by_wordsÚzipÚnextZrehearserRÚosÚenvironÚgetÚsumZload_model_from_pathr/ÚhasattrÚdev_docsÚtimerÚevaluateZ use_deviceÚ write_jsonÚscoresrÚ __version__Ú setdefaultÚvocabZvectors_lengthr0Zn_keysr‚ÚmetaÚ _get_progressÚitemsÚ_score_for_model)Vr)r*r+r,r-r.r/r0r1r2r3r4r5r6r7r8r9r:r;r<r=r>r?r@rArBrCr‡rµZ dropout_ratesZ batch_sizesÚhas_beam_widthsÚpipeZpipe_cfgZ textcat_cfgZbase_cfgZlang_clsZmultitask_optionsZ pipe_nameZ multitasksZ objectiveZ n_train_wordsZ optimizerÚ componentsZtextcat_labelsr˜Z train_labelsZmultilabel_foundrnÚgoldÚrow_headÚ output_statsZ row_widthsZ row_settingsZfinal_model_pathZbest_model_pathZiter_since_bestZ best_scorerhZ raw_batchesZ words_seenÚpbarrvÚbatchZdocsZgoldsZ raw_batchZepoch_model_pathZ nlp_loadedr{r‚Ú componentr­r~Ú start_timeZscorerÚend_timer…r„Zacc_locZmeta_locÚprogressr†ÚcatZ cat_scoreZ current_scorerL)r^ror/rPÚtrainsÈC       þ   ý   ý     ÿý    ý  ýý ýú     ý   ÿÿ     ÿýþüû  ÿ  ÿ þüÿÿÿÿÿÿÿÿú ÿ û $      ýÿ     ýÿ   ý    ý ü     ù ÿÿ    ÿÿÿÿ"  rÆcCsžtƒ}|d}|d}d|kr,| |d¡d|krN| |d|dd¡d |krx| |d |d |d d ¡d|krŽ| |d¡t|ƒt|ƒS)zS Returns mean score between tasks in pipeline that can be used for early stopping. r/r€ÚtaggerÚtags_accrUÚuasÚlasr r[Úents_pÚents_rÚents_férVÚ textcat_score)rˆr‘r«re)rµZmean_accZpipesÚaccrLrLrPr¸õs"r¸ccs:ddl}ttj dd¡ƒr"dVn|j|dd}|VdS)NrrwFrs)r‡rRr¨r©rª)rtr‡r¿rLrLrPÚ_create_progress_bars rÑcCsrtj||jd|jD]V}i}|jj ¡D].\}}|ttttfkr*||j ƒ||jj |<q*|j f|Žd|_ qdS)N)r´F) rr”r´Zlex_attr_gettersr·r r r rZorth_ÚstringsÚ set_attrsZis_oov)ror0Úlexr›ÚattrÚfuncrLrLrPr•s  r•c Csb| d¡}| ¡}W5QRXg}|jD]4\}}t|dƒr(t|jdƒr(|j |¡| |¡q(|S)z–Load pretrained weights for the 'token-to-vector' part of the component models, which is typically a CNN. See 'spacy pretrain'. Experimental. ÚrbÚmodelÚtok2vec)ÚopenÚreadr/r¬rØrÙÚ from_bytesr‘)roÚlocÚfile_Z weights_dataZloadedr‚rÁrLrLrPr—s   r—c Cs¼i}|D]}t||ƒ||<q|d}t t|dƒt|ƒ¡| ¡D]b\}}t t||ƒ¡t t||ƒt||ƒ¡t |d¡}t|ƒD]}|||d|<qqDt  |d|¡|S)Nz model-bestrmr}r€rƒ) Ú _find_bestÚshutilÚcopytreerr·Úrmtreer‰rŒÚ _get_metricsr°) rµr*r»ZbestsrÁZ best_destZbest_component_srcÚaccsÚmetricrLrLrPr¢,s  ÿ r¢csrg}| ¡D]L}| ¡r |jddkr t |d¡‰‡fdd„t|ƒDƒ}| ||f¡q |rjt|ƒdSdSdS)NrFrmr}csg|]}ˆ |d¡‘qS)rI)rª)rOrå©rärLrPrQCsz_find_best..rK)rrMÚpartsr‰rŒrãr‘Úmax)Zexperiment_dirrÁZ accuraciesZ epoch_modelr±rLrærPrß>s  rßcCs(|dkr dS|dkrdS|dkr$dSdS)NrU)rÊrÉÚ token_accrÇ)rÈr[)rÍrËrÌ)rérL)rÁrLrLrPrãKsrãcCs dg}g}|D]¦}|dkr8| ddg¡| ddg¡q|dkrb| dd d g¡| d d d g¡q|dkr| ddddg¡| ddddg¡q|dkr| ddg¡| ddg¡q| ddg¡| ddg¡|d krò| d!g¡| d"g¡|r| d#d$¡||fS)%NZItnrÇz Tag Loss z Tag % Útag_lossrÈrUz Dep Loss z UAS z LAS Údep_lossrÉrÊr[z NER Loss zNER P zNER R zNER F Úner_lossrËrÌrÍrVz Textcat LossZTextcatÚ textcat_lossrÏzToken %zCPU WPSrér„rzGPU WPSr…rKzBeam W.)ÚextendÚinsert)r/r4r¹r½r¾rºrLrLrPržUs.   ržc CsÔi}|D] }d||<q| dd¡|d<| dd¡|d<| dd¡|d<| dd¡|d <||d <|pdd|d <| |¡g} |D]*}d } | d ¡r’d} |  |  ||¡¡q||dg} |  | ¡|dk rÐ|  d|¡| S)NrIrUrër[rìrÇrêrVrír„r…z{:.3f}Z_wpsz{:.0f}rK)rªršÚendswithr‘r“rîrï) ÚitnrvZ dev_scoresr¾r{r„r…r±ÚstatZformatted_scoresÚ format_specÚresultrLrLrPr¶qs*       r¶)NNrDNrENrrFrGNNrHrHrIrIrHFFFrJNFF)NrIrI)1Ú __future__rrrZplacr¨ÚpathlibrZthinc.neural._classes.modelrZtimeitrr®ràr‰ZwasabirÚ contextlibr¤Z_mlr Úattrsr r r rr¼rÚcompatrrHrrÚ annotationsÚstrrRÚfloatÚboolrÆr¸ÚcontextmanagerrÑr•r—r¢rßrãržr¶rLrLrLrPÚsª                                    ä$å B    ÿ