U C^$' @s~ddlmZddlZddlZddlZddlmZddlmZddl m Z ddl Z ddl Z ddl Z ddlZddlmZddlmZdd lmZmZmZdd lmZmZz ddlZWnek rdZYnXd Zejd d defdd defdddefdddefdddefdddefdddefdddefdddefd d3d d!Z d"d#Z!d$d%Z"d4d&d'Z#d5d(d)Z$d*d+Z%d6d/d0Z&d1d2Z'dS)7)unicode_literalsN) literal_eval)Path) PreshCounter)msg)Vectors)ErrorsWarnings user_warning) ensure_pathget_lang_classizModel language positionalzModel output directoryz"Location of words frequencies fileoptionfz+Location of JSONL-formatted attributes filejz(Optional location of brown clusters datacz(Optional vectors file in Word2Vec formatvz&Optional number of vectors to prune toVz?Optional name for the word vectors, e.g. en_core_web_lg.vectorsZvnz Optional name for the model metamn) lang output_dir freqs_loc jsonl_loc clusters_loc vectors_loc prune_vectors vectors_name model_namec  Cs"|dk rZ|dk s|dk rFdg} |r,| d|r:| dtddt|}t|} n:t|}t|}|dk r|stjd|dd t||} t d t || |d } W5QRXt d |dk rt | |||t | jj} t | j} t d d| | |s|| || S)z Create a new model from raw data, like word frequencies, Brown clusters and word vectors. If vectors are provided in Word2Vec format, they can be either a .txt or zipped as a .zip or .tar.gz. Nz-jz-fz-czIncompatible argumentsaThe -f and -c arguments are deprecated, and not compatible with the -j argument, which should specify the same information. Either merge the frequencies and clusters data into the JSONL-formatted file (recommended), or use only the -f and -c files, without the other lexical attributes.z!Can't find words frequencies fileZexitszCreating model...)namezSuccessfully created modelzSucessfully compiled vocabz{} entries, {} vectors)appendrwarnr srslyZ read_jsonlexistsfailread_attrs_from_deprecatedloading create_modelgood add_vectorslenvocabvectorsformatmkdirZto_disk)rrrrrrrrrsettings lex_attrsnlpZ vec_added lex_addedr67/tmp/pip-install-6_kvzl1k/spacy/spacy/cli/init_model.py init_modelsB            r8cCst|}tt|r&tt|dS|jddrPddtt|dDS|jddrt t|}| }||d}d d|DS|jdd d Sd S) z%Handle .gz, .tar.gz or unzipped fileszr:gzrgzcss|]}|dVqdSutf8Ndecode.0liner6r6r7 lszopen_file..rziprcss|]}|dVqdSr:r<r>r6r6r7rAqsr;)encodingN) r tarfile is_tarfilestropenpartsendswithgzipzipfileZipFilenamelist)loczip_filenamesfile_r6r6r7 open_filefsrSc Csddlm}|dk rBtdt|\}}W5QRXtdn it}}|rztdt|}W5QRXtdni}g}t|ddd d }t |r|t |D]P\}\} } | || d } | |krt || ddd d | d<nd| d<| | q|S)NrtqdmzCounting frequencies...zCounted frequencieszReading clusters...z Read clusterscSs|dS)Nr r6)itemr6r6r7z,read_attrs_from_deprecated..T)keyreverse)orthidprobrrcluster) rUrr) read_freqsr+DEFAULT_OOV_PROB read_clusterssorteditemsr- enumerateintr#) rrrUprobs_clustersr3Z sorted_probsiwordr]attrsr6r6r7r(vs*         r(c Cst|}|}|jD] }d|_qd}|D]>}d|kr6q(|j|d}|jf|d|_|d7}|d7}q(t|jrtdd|jDd}nt}|jj d|i|r||j d <|S) Nrr2r[Fr css|] }|jVqdSN)r])r?lexr6r6r7rAszcreate_model..oov_probr") r r.rank set_attrsis_oovr-minr`cfgupdatemeta) rr3r"Z lang_classr4lexemer5rkrnr6r6r7r*s(     r*c CsBt|}|r`|jddr`tt|dd|j_|jD] }|j r<|jjj |j |j dq|j|dS)Nrz.npzrb)data)rowzReading vectors from {}zLoaded vectors from {})NNF)rxkeysz%s_model.vectorsrr/r"r )r rIrJrnumpyloadrHr.r/roaddr[rr)r0 read_vectorsr+rqrur"r) r4rrr"rm vectors_dataZ vector_keysrjrvr6r6r7r,s0      r,c Csddlm}t|}tddt|D}tj|dd}g}t||D]t\}}|}| d|j d}| d} t ||j dkrt jtjj||d dd tj|dd ||<|| qL||fS) NrrTcss|]}t|VqdSrl)re)r?sizer6r6r7rAszread_vectors..r)shapedtype r )line_numrOr!)r)rUrStuplenextsplitr{zerosrdrstriprsplitrpopr-rr'r ZE094r0Zasarrayr#) rrUrrrZ vectors_keysrir@piecesrjr6r6r7r~s   r~d2c CsXddlm}t}d}|N}t|D]>\}} | dd\} } } t| } ||d| || 7}q(W5QRX|t |} i}|}||D]} | dd\} } } t| } t| } | |kr| |krt | |krz t | }Wn"t k rt d| }YnX|t| }t || ||<qW5QRXt |d| }||fS)NrrT rr z'%s')rUrrHrdrrreincZsmoothmathlogr-r SyntaxErrorZsmoother)r max_lengthZ min_doc_freqZmin_freqrUcountstotalrrir@freqZdoc_freqrYZ log_totalrfrjZ smooth_countrnr6r6r7r_s4      r_c Csddlm}i}tdkr"ttj|p}||D]`}z$|\}}}tdk rZt|}Wntk rtYq4YnXt |dkr|||<q4d||<q4W5QRXt | D]P\}}| |kr||| <| |kr||| <||kr|||<q|S)NrrT0)rUftfyr r ZW004rHrZfix_text ValueErrorrelistrclowertitleupper)rrUrhrr@r^rjrr6r6r7ras.            ra)NNNNrNN)N)N)rrr)( __future__rZplacrr{astrpathlibrZpreshed.counterrrErKrLr%Zwasabirr/rerrorsr r r utilr r r ImportErrorr` annotationsrGrer8rSr(r*r,r~r_rar6r6r6r7sb                 9