U C^9@sddlmZmZddlZddlZddlZddlZddlZddlm Z ddl m Z ddl m Z mZddlmZddlmZddlmZddlZd d lmZd d lmZd d lmZmZd d lmZm Z m!Z!m"Z"d dlm#Z#m$Z$d dl%m&Z&ddl'm(Z(ej)ddde*fdddde*fddde+fddde+fddde+fddde+fdd d!e,fd"dd#e+fd$dd%e+fd&dd'e+fd(dd)e*fd*d+dd,e-fd-dd.e+fd/dd0e+fd1dd2e+fd3dd4e+fd5dd6e+fd7dd8e+fd9dd:e fd;dd rJr=)rrLrLrLrL)ZwidthsZaligns#z# Wordsz Total LossZLosszw/sFc s|rdnd}jzd||fd}|jW5QRXjjj|d}dd}|t |dW5QRXW5QRXdS) Nz.tempz model%d%s.binwb)nr_wordloss epoch_lossepochz log.jsonla ) Z use_paramsZaveragesopenwritetok2vecto_bytesrQrRrSsrslyZ json_dumps)rTis_tempZ is_temp_strfile_logmodel optimizerr(Ztracker5/tmp/pip-install-6_kvzl1k/spacy/spacy/cli/pretrain.py _save_models zpretrain.._save_modelcss|]}|dfVqdSNrb).0textrbrbrc szpretrain..)sizecSsg|] \}}|qSrbrb)rfrg_rbrbrc szpretrain..)r5r6) objectivedropi)r\zSkipped {count} empty values)countzSuccessfully finished pretrain)rM)F)2dictlocals isinstancerstrrZfix_random_seedr torchZset_default_tensor_typer infoexistsmkdirZgoodr[ write_jsonfailZloadinglistZ read_jsonlrandomshufflergformatZ load_modelvocabvectorsnamecreate_pretraining_modelrrresearchintgroupropsProgressTrackerdividerrowrange enumerateZminibatch_by_words make_docs make_updateupdatewords_per_epochrSwarn))r&r'r(r)r*r/r,r.r-r+r0r1r2r3r8r4r5r6r7r9r:r;configkeyZhas_gpurtZtextsnlprG componentsZ model_nameZ row_settingsrdZ skip_counterrTZbatch_idbatchdocsrorRprogressrbr_rcpretrains_             $          rrnL2c Cs:|j||d\}}t|j|||\}}|||dt|S)zPerform an update over a single batch of documents. docs (iterable): A batch of `Doc` objects. drop (float): The dropout rate. optimizer (callable): An optimizer. RETURNS loss: A float for the loss. )rm)Zsgd)Z begin_updateget_vectors_lossrfloat) r`rrarmrlZ predictionsZbackproprRZ gradientsrbrbrcrs rc Csg}d}|D]}t|ts2ttjjt||dd|kr`|d}|sP|d7}q t|j|d}n.rrr r@)r1) rr~rdatasumrrr ZE142r})rrZ predictionrlidstargetZd_targetrRrbrbrcr+s rcCsp|jjjjd}tttdddt|dd}t|t}t||}t |j|}||_ ||_ | | dg|S)a0Define a network for the pretraining. We simply add an output layer onto the tok2vec input model. The tok2vec input model needs to be a model that takes a batch of Doc objects (as a list), and returns a list of arrays. Each array in the output needs to have one row per token in the doc. ri,r>)piecesrn)Z drop_factorzGive it a doc to infer shapes)r~rrshaperLNrrrrrY output_layerZbegin_trainingr)rrYZ output_sizerr`rbrbrcrCs    rc@seZdZdddZddZdS)r@BcCs:d|_d|_d|_t|_||_t|_d|_d|_ dS)Nrnr) rR prev_lossrQrrrItime last_time last_updaterS)selfrIrbrbrc__init__[s zProgressTracker.__init__c Cs|j|7_|j|7_tdd|D}|j||7<|j|7_|j|j}||jkr|t|j}|j|_t|_|j|j }||jt |jddt |ddt |f}t |j|_ |SdSdS)Ncss|]}t|VqdSre)rrrbrbrcrhhsz)ProgressTracker.update..rJ)r)rK) rRrSrrrQrrIrrr _smart_roundrr) rrTrRrZwords_in_batchZwords_since_updateZwpsZ loss_per_wordstatusrbrbrcres(       zProgressTracker.updateN)r)__name__ __module__ __qualname__rrrbrbrbrcrZs rrJcCsVttt|}||d}|dkr0tt|St||}dt|d}||SdS)z=Round large numbers as integers, smaller numbers as decimals.rz%.fN)rrsrmin)figurer)Z max_decimalZn_digitsZ n_decimalZ format_strrbrbrcr~s   r)r<r=rr>rFrr?r@FrArBrCrDrErNNN)rnr)r)rJr=)6 __future__rrZplacr{rrr collectionsrpathlibrZ thinc.v2vrrZ thinc.miscrrZthinc.neural.utilr Zwasabir r[errorsr rr attrsrrZ_mlrrrrrrrOrZtrainr annotationsrsrboolrrrrrrobjectrrrbrbrbrcs                       < +  $