# coding: utf8
from __future__ import print_function, unicode_literals

import plac
import random
import numpy
import time
from collections import Counter
from pathlib import Path
from thinc.v2v import Affine, Maxout
from thinc.misc import LayerNorm as LN
from thinc.neural.util import prefer_gpu
from wasabi import Printer
import srsly

from ..tokens import Doc
from ..attrs import ID, HEAD
from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
from .._ml import masked_language_model
from .. import util


@plac.annotations(
    texts_loc=("Path to jsonl file with texts to learn from", "positional", None, str),
    vectors_model=("Name or path to vectors model to learn from", "positional", None, str),
    output_dir=("Directory to write models each epoch", "positional", None, str),
    width=("Width of CNN layers", "option", "cw", int),
    depth=("Depth of CNN layers", "option", "cd", int),
    embed_rows=("Embedding rows", "option", "er", int),
    use_vectors=("Whether to use the static vectors as input features", "flag", "uv"),
    dropout=("Dropout", "option", "d", float),
    batch_size=("Number of words per training batch", "option", "bs", int),
    max_length=("Max words per example.", "option", "xw", int),
    min_length=("Min words per example.", "option", "nw", int),
    seed=("Seed for random number generators", "option", "s", int),
    nr_iter=("Number of iterations to pretrain", "option", "i", int),
)
def pretrain(
    texts_loc,
    vectors_model,
    output_dir,
    width=96,
    depth=4,
    embed_rows=2000,
    use_vectors=False,
    dropout=0.2,
    nr_iter=1000,
    batch_size=3000,
    max_length=500,
    min_length=5,
    seed=0,
):
    """
    Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
    using an approximate language-modelling objective. Specifically, we load
    pre-trained vectors, and train a component like a CNN, BiLSTM, etc. to
    predict vectors which match the pre-trained ones. The weights are saved
    to a directory after each epoch. You can then pass a path to one of these
    pre-trained weights files to the 'spacy train' command.

    This technique may be especially helpful if you have little labelled data.
    However, it's still quite experimental, so your mileage may vary.

    To load the weights back in during 'spacy train', you need to ensure all
    settings are the same between pretraining and training. The API and errors
    around this need some improvement.
    """
    config = dict(locals())
    msg = Printer()
    util.fix_random_seed(seed)

    has_gpu = prefer_gpu()
    msg.info("Using GPU" if has_gpu else "Not using GPU")

    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
        msg.good("Created output directory")
    srsly.write_json(output_dir / "config.json", config)
    msg.good("Saved settings to config.json")

    # Load texts from file or stdin
    if texts_loc != "-":  # reading from a file
        texts_loc = Path(texts_loc)
        if not texts_loc.exists():
            msg.fail("Input text file doesn't exist", texts_loc, exits=1)
        with msg.loading("Loading input texts..."):
            texts = list(srsly.read_jsonl(texts_loc))
        msg.good("Loaded input texts")
        random.shuffle(texts)
    else:  # reading from stdin
        msg.text("Reading input text from stdin...")
        texts = srsly.read_jsonl("-")  # text generator

    with msg.loading("Loading model '{}'...".format(vectors_model)):
        nlp = util.load_model(vectors_model)
    msg.good("Loaded model '{}'".format(vectors_model))
    pretrained_vectors = None if not use_vectors else nlp.vocab.vectors.name
    model = create_pretraining_model(
        nlp,
        Tok2Vec(
            width,
            embed_rows,
            conv_depth=depth,
            pretrained_vectors=pretrained_vectors,
            bilstm_depth=0,  # Requires PyTorch. Experimental.
            cnn_maxout_pieces=3,  # You can try setting this higher
            subword_features=True,  # Set to False for languages like Chinese
        ),
    )
    optimizer = create_default_optimizer(model.ops)
    tracker = ProgressTracker(frequency=10000)
    msg.divider("Pre-training tok2vec layer")
    row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
    msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
    for epoch in range(nr_iter):
        for batch in util.minibatch_by_words(
            ((text, None) for text in texts), size=batch_size
        ):
            docs = make_docs(
                nlp,
                [text for (text, _) in batch],
                max_length=max_length,
                min_length=min_length,
            )
            loss = make_update(model, docs, optimizer, drop=dropout)
            progress = tracker.update(epoch, loss, docs)
            if progress:
                msg.row(progress, **row_settings)
                if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7:
                    break
        with model.use_params(optimizer.averages):
            with (output_dir / ("model%d.bin" % epoch)).open("wb") as file_:
                file_.write(model.tok2vec.to_bytes())
            log = {
                "nr_word": tracker.nr_word,
                "loss": tracker.loss,
                "epoch_loss": tracker.epoch_loss,
                "epoch": epoch,
            }
            with (output_dir / "log.jsonl").open("a") as file_:
                file_.write(srsly.json_dumps(log) + "\n")
        tracker.epoch_loss = 0.0
        if texts_loc != "-":
            # Reshuffle the texts if they were loaded from a file
            random.shuffle(texts)
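# Usage sketch (illustrative only): pretrain() is normally reached through the
# spaCy CLI, e.g. "python -m spacy pretrain texts.jsonl en_vectors_web_lg
# ./pretrain_output". The corpus path and output directory below are
# placeholders, and calling the function directly like this is an assumption
# rather than the documented entry point:
#
#     pretrain(
#         "texts.jsonl",        # one JSON object per line; see make_docs() below
#         "en_vectors_web_lg",  # any installed model/package that has word vectors
#         "pretrain_output",
#         nr_iter=10,
#     )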
def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
    """Perform an update over a single batch of documents.

    docs (iterable): A batch of `Doc` objects.
    drop (float): The dropout rate.
    optimizer (callable): An optimizer.
    RETURNS loss: A float for the loss.
    """
    predictions, backprop = model.begin_update(docs, drop=drop)
    loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
    backprop(gradients, sgd=optimizer)
    # Return a plain float rather than a device array.
    return float(loss)


def make_docs(nlp, batch, min_length, max_length):
    docs = []
    for record in batch:
        text = record["text"]
        if "tokens" in record:
            # Use the provided tokenization instead of the model's tokenizer.
            doc = Doc(nlp.vocab, words=record["tokens"])
        else:
            doc = nlp.make_doc(text)
        if "heads" in record:
            heads = record["heads"]
            heads = numpy.asarray(heads, dtype="uint64")
            heads = heads.reshape((len(doc), 1))
            doc = doc.from_array([HEAD], heads)
        if len(doc) >= min_length and len(doc) < max_length:
            docs.append(doc)
    return docs


def get_vectors_loss(ops, docs, prediction, objective="L2"):
    """Compute a mean-squared error loss between the documents' vectors and
    the prediction.

    Note that this is ripe for customization! We could compute the vectors
    in some other way, e.g. with an LSTM language model, or use some other
    type of objective.
    """
    # Rather than stacking token.vector values, fetch the index into the
    # vectors table for each token and look them all up at once, which avoids
    # copying data (especially on GPU).
    ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs])
    target = docs[0].vocab.vectors.data[ids]
    if objective == "L2":
        d_scores = prediction - target
        loss = (d_scores ** 2).sum()
    else:
        raise NotImplementedError(objective)
    return loss, d_scores
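# Input format sketch: each line of the JSONL corpus is parsed into one
# `record` dict for make_docs() above. "text" is always read; "tokens"
# (a pre-tokenized word list) and "heads" (one integer per token, written into
# the HEAD attribute) are optional. The sentences below are made up purely for
# illustration:
#
#     {"text": "This is a sentence."}
#     {"text": "Pre-tokenized input.", "tokens": ["Pre", "-", "tokenized", "input", "."]}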
def create_pretraining_model(nlp, tok2vec):
    """Define a network for the pretraining. We simply add an output layer onto
    the tok2vec input model. The tok2vec input model needs to be a model that
    takes a batch of Doc objects (as a list), and returns a list of arrays.
    Each array in the output needs to have one row per token in the doc.
    """
    output_size = nlp.vocab.vectors.data.shape[1]
    output_layer = chain(
        LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
    )
    # The pipeline components (parser, tagger etc.) apply a flatten step after
    # their tok2vec layer. Including it here keeps the shapes identical, so the
    # saved weights can be loaded back into those components cleanly.
    tok2vec = chain(tok2vec, flatten)
    model = chain(tok2vec, output_layer)
    model = masked_language_model(nlp.vocab, model)
    model.tok2vec = tok2vec
    model.output_layer = output_layer
    model.begin_training([nlp.make_doc("Give it a doc to infer shapes")])
    return model


class ProgressTracker(object):
    def __init__(self, frequency=1000000):
        self.loss = 0.0
        self.prev_loss = 0.0
        self.nr_word = 0
        self.words_per_epoch = Counter()
        self.frequency = frequency
        self.last_time = time.time()
        self.last_update = 0
        self.epoch_loss = 0.0

    def update(self, epoch, loss, docs):
        self.loss += loss
        self.epoch_loss += loss
        words_in_batch = sum(len(doc) for doc in docs)
        self.words_per_epoch[epoch] += words_in_batch
        self.nr_word += words_in_batch
        words_since_update = self.nr_word - self.last_update
        if words_since_update >= self.frequency:
            wps = words_since_update / (time.time() - self.last_time)
            self.last_update = self.nr_word
            self.last_time = time.time()
            loss_per_word = self.loss - self.prev_loss
            status = (
                epoch,
                self.nr_word,
                _smart_round(self.loss, width=10),
                _smart_round(loss_per_word, width=6),
                int(wps),
            )
            self.prev_loss = float(self.loss)
            return status
        else:
            return None


def _smart_round(figure, width=10, max_decimal=4):
    """Round large numbers as integers, smaller numbers as decimals."""
    n_digits = len(str(int(figure)))
    n_decimal = width - (n_digits + 1)
    if n_decimal <= 0:
        return str(int(figure))
    else:
        n_decimal = min(n_decimal, max_decimal)
        format_str = "%." + str(n_decimal) + "f"
        return format_str % figure
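# Loading sketch (an assumption, not part of this module): one way the
# model%d.bin files written by pretrain() could be loaded back onto pipeline
# components whose thinc model exposes a `tok2vec` sub-model. The flatten step
# in create_pretraining_model() exists so those shapes line up; 'spacy train'
# offers its own option for this, so treat the helper below as a rough
# illustration only.
def _example_load_pretrained_tok2vec(nlp, weights_loc):
    with Path(weights_loc).open("rb") as file_:
        weights_data = file_.read()
    loaded = []
    for name, component in nlp.pipeline:
        if hasattr(component, "model") and hasattr(component.model, "tok2vec"):
            component.model.tok2vec.from_bytes(weights_data)
            loaded.append(name)
    return loaded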