B An]@sddlmZmZddlZddlZddlZddlZddlmZddl m Z ddl m Z m Z ddl mZddlZddlmZdd lmZdd lmZdd lmZmZmZmZdd lmZmZmZdd lmZmZm Z ddlm!Z!ddl"m#Z#m$Z$ddl%m&Z&ddl'm(Z(ddl)m*Z*m+Z+ddl,m-Z-ddl.m/Z/m0Z0ddl.m1Z1ddl2m3Z3ddl4m5Z5ddl6m7Z7m8Z8ddl9m:Z:m;Z;mZ>ddl=m?Z?Gddde@ZAGddde@ZBd d!ZCGd"d#d#eDZEd$d%ZFdS)&)absolute_importunicode_literalsN) OrderedDict)contextmanager)copydeepcopy)Model) Tokenizer)Vocab) Lemmatizer)DependencyParser TensorizerTaggerEntityRecognizer)SimilarityHookTextCategorizerSentenceSegmenter)merge_noun_chunksmerge_entitiesmerge_subtokens) EntityRuler)izip basestring_) GoldParse)Scorer)link_vectors_to_modelscreate_default_optimizer)IS_STOP)TOKENIZER_PREFIXESTOKENIZER_SUFFIXES)TOKENIZER_INFIXES) TOKEN_MATCH)TAG_MAP) LEX_ATTRSis_stop)ErrorsWarningsdeprecation_warning)util)aboutc@seZdZedddZedddZedddZdd d gZeZ e e Z e e Ze eZeeZiZeZiZiZiZiZiZeZiZd d d d ZdS) BaseDefaultsNcCst|j|j|j|jS)N)r lemma_index lemma_exc lemma_rules lemma_lookup)clsnlpr2q/home/app_decipher_dev_19-4/dev/decipher-analysis/serverless-application/helper/df_spacy/python/spacy/language.pycreate_lemmatizer%szBaseDefaults.create_lemmatizerc Csz||}t|j}tjt|jd|t<t||j |d}x<|j D].\}}x$| D]\}}|j |||qVWqDW|S)N)Zstops)lex_attr_getterstag_map lemmatizer)r4dictr5 functoolspartialr% stop_wordsrr r6 morph_rulesitems morphologyadd_special_case) r0r1r7r5vocabtag_strexcorth_strattrsr2r2r3 create_vocab+s  zBaseDefaults.create_vocabcCs|j}|j}|jr t|jjnd}|jr8t|jjnd}|jrPt |jj nd}|dk rb|j n| |}t ||||||dS)N)rules prefix_search suffix_searchinfix_finditer token_match)tokenizer_exceptionsrJprefixesr)compile_prefix_regexsearchsuffixescompile_suffix_regexinfixescompile_infix_regexfinditerr@rEr )r0r1rFrJrGrHrIr@r2r2r3create_tokenizer;szBaseDefaults.create_tokenizertaggerparsernerltrT) directionhas_case has_letters)N)N)N) __name__ __module__ __qualname__ classmethodr4rErT pipe_namesr"rJtuplerrLr rOr!rQr8r#r6rKsetr;r.r-r,r/r<r$r5syntax_iteratorswriting_systemr2r2r2r3r+$s,    r+c @seZdZdZeZdZddddddddddd dd dd dd dd dddddd ZdddifddZe ddZ e ddZ e j ddZ e ddZ e ddZe ddZe d d!Ze d"d#Ze d$d%Zd&d'Zefd(d)ZdWd*d+Zd,d-Zd.d/Zd0d1Zd2d3Zgdfd4d5Zd6d7Zd8d9ZdXd;d<ZdYd=d>Zd?d@Z dZdAdBZ!d[dCdDZ"d\dGdHZ#e$dIdJZ%dEdKdLgdEdfdMdNZ&e'dfdOdPZ(e'dfdQdRZ)e'dfdSdTZ*e'dfdUdVZ+dS)]Languagea[A text-processing pipeline. Usually you'll load this once per process, and pass the instance around your application. Defaults (class): Settings, data and factory methods for creating the `nlp` object and processing pipeline. lang (unicode): Two-letter language ID, i.e. ISO code. DOCS: https://spacy.io/api/language NcCs |j|S)N)DefaultsrT)r1r2r2r3szLanguage.cKst|jf|S)N)rr@)r1cfgr2r2r3rgtrhcKst|jf|S)N)rr@)r1rir2r2r3rgurhcKst|jf|S)N)r r@)r1rir2r2r3rgvrhcKst|jf|S)N)rr@)r1rir2r2r3rgwrhcKst|jf|S)N)rr@)r1rir2r2r3rgxrhcKst|jf|S)N)rr@)r1rir2r2r3rgyrhcKst|jf|S)N)rr@)r1rir2r2r3rgzrhcKstS)N)r)r1rir2r2r3rg{rhcKstS)N)r)r1rir2r2r3rg|rhcKstS)N)r)r1rir2r2r3rg}rhcKs t|f|S)N)r)r1rir2r2r3rg~rh) tokenizer tensorizerrUrVrW similaritytextcat sentencizerrrrZ entity_rulerTi@BcKstd}|j|t||_d|_|dkrl|jj}||f| di}|j j dkrl| di d|j _ ||_ |dkr|jj }||f| di}||_g|_||_d|_dS)a0Initialise a Language object. vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via `Language.Defaults.create_vocab`. make_doc (callable): A function that takes text and returns a `Doc` object. Usually a `Tokenizer`. meta (dict): Custom meta data for the Language class. Is written to by models to add model meta data. max_length (int) : Maximum number of characters in a single text. The current v2 models may run out memory on extremely long texts, due to large internal allocations. You should segment these texts into meaningful units, e.g. paragraphs, subsections etc, before passing them to spaCy. Default maximum length is 1,000,000 characters (1mb). As a rule of thumb, if all pipeline components are enabled, spaCy's default models currently requires roughly 1GB of temporary memory per 100,000 characters in one text. RETURNS (Language): The newly constructed object. Zspacy_factoriesNTr@vectorsnamerj)r)get_entry_points factoriesupdater8_meta_pathrfrEgetrorpr@rTrjpipeline max_length _optimizer)selfr@make_docrxmetakwargsZuser_factoriesfactoryr2r2r3__init__s"    zLanguage.__init__cCs|jS)N)ru)rzr2r2r3pathsz Language.pathcCs|jd|jj|jdd|jdd|jddtj|jdd |jd d |jd d |jd d |jd d |jjt|jj |jj j |jj j d|jd<|j |jd<|jS)Nlangrpmodelversionz0.0.0 spacy_versionz>={} descriptionauthoremailurllicense)widthrokeysrprorw) rt setdefaultr@rformatr* __version__vectors_lengthlenron_keysrpr`)rzr2r2r3r|s  z Language.metacCs ||_dS)N)rt)rzvaluer2r2r3r|scCs |dS)Nrk)get_pipe)rzr2r2r3rkszLanguage.tensorizercCs |dS)NrU)r)rzr2r2r3rUszLanguage.taggercCs |dS)NrV)r)rzr2r2r3rVszLanguage.parsercCs |dS)NrW)r)rzr2r2r3entityszLanguage.entitycCs |dS)Nmatcher)r)rzr2r2r3rszLanguage.matchercCsdd|jDS)zwGet names of available pipeline components. RETURNS (list): List of component name strings, in order. cSsg|] \}}|qSr2r2).0 pipe_name_r2r2r3 sz'Language.pipe_names..)rw)rzr2r2r3r`szLanguage.pipe_namescCs:x|jD]\}}||kr|SqWttjj||jddS)zGet a pipeline component for a given component name. name (unicode): Name of pipeline component to get. RETURNS (callable): The pipeline component. DOCS: https://spacy.io/api/language#get_pipe )rpoptsN)rwKeyErrorr&E001rr`)rzrpr componentr2r2r3rszLanguage.get_pipecCsN||jkr8|dkr&ttjj|dnttjj|d|j|}||f|S)a0Create a pipeline component from a factory. name (unicode): Factory name to look up in `Language.factories`. config (dict): Configuration parameters to initialise component. RETURNS (callable): Pipeline component. DOCS: https://spacy.io/api/language#create_pipe Zsbd)rp)rrrr&E108rE002)rzrpconfigr~r2r2r3 create_pipes  zLanguage.create_pipec Cst|dsLtjjt||d}t|trD||jkrD|tjj|d7}t ||dkrt|drf|j }n:t|drx|j }n(t|drt|j dr|j j }nt|}||j krt tjj||j dtt|t|t|t|gd krt tj||f}|s t|||gs|j|n|r0|jd |nt|rZ||j krZ|j|j ||nJ|r||j kr|j|j |d |nt tjj|p||j ddS) aAdd a component to the processing pipeline. Valid components are callables that take a `Doc` object, modify it and return it. Only one of before/after/first/last can be set. Default behaviour is "last". component (callable): The pipeline component. name (unicode): Name of pipeline component. Overwrites existing component.name attribute if available. If no name is set and the component exposes no name attribute, component.__name__ is used. An error is raised if a name already exists in the pipeline. before (unicode): Component name to insert component directly before. after (unicode): Component name to insert component directly after. first (bool): Insert component first / not first in the pipeline. last (bool): Insert component last / not last in the pipeline. DOCS: https://spacy.io/api/language#add_pipe __call__)rrp)rNrpr\ __class__)rprrr )hasattrr&E003rrepr isinstancerrrE004 ValueErrorrpr\rr`E007sumboolE006anyrwappendinsertindexr) rzrrpbeforeafterfirstlastmsgpiper2r2r3add_pipes:       $ zLanguage.add_pipecCs ||jkS)a$Check if a component name is present in the pipeline. Equivalent to `name in nlp.pipe_names`. name (unicode): Name of the component. RETURNS (bool): Whether a component of the name exists in the pipeline. DOCS: https://spacy.io/api/language#has_pipe )r`)rzrpr2r2r3has_pipe6s zLanguage.has_pipecCs:||jkr ttjj||jd||f|j|j|<dS)zReplace a component in the pipeline. name (unicode): Name of the component to replace. component (callable): Pipeline component. DOCS: https://spacy.io/api/language#replace_pipe )rprN)r`rr&rrrwr)rzrprr2r2r3 replace_pipeAs zLanguage.replace_pipecCsh||jkr ttjj||jd||jkr@ttjj||jd|j|}||j|df|j|<dS)zRename a pipeline component. old_name (unicode): Name of the component to rename. new_name (unicode): New name of the component. DOCS: https://spacy.io/api/language#rename_pipe )rprr N)r`rr&rrrrrw)rzold_namenew_nameir2r2r3 rename_pipeMs    zLanguage.rename_pipecCs4||jkr ttjj||jd|j|j|S)zRemove a component from the pipeline. name (unicode): Name of the component to remove. RETURNS (tuple): A `(name, component)` tuple of the removed component. DOCS: https://spacy.io/api/language#remove_pipe )rpr)r`rr&rrrwpopr)rzrpr2r2r3 remove_pipe\s zLanguage.remove_pipecCst||jkr(ttjjt||jd||}|dkr>i}xl|jD]b\}}||krXqFt|dszttj jt ||d||f| |i}|dkrFttj j|dqFW|S)aApply the pipeline to some text. The text can span multiple sentences, and can contain arbtrary whitespace. Alignment into the original string is preserved. text (unicode): The text to be processed. disable (list): Names of the pipeline components to disable. component_cfg (dict): An optional dictionary with extra keyword arguments for specific components. RETURNS (Doc): A container for accessing the annotations. DOCS: https://spacy.io/api/language#call )lengthrxNr)rrp)rp) rrxrr&E088rr{rwrrtypervE005)rztextdisable component_cfgdocrpprocr2r2r3rhs   zLanguage.__call__cGst|f|S)a^Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a DisabledPipes object is returned, that has a `.restore()` method you can use to undo your changes. DOCS: https://spacy.io/api/language#disable_pipes ) DisabledPipes)rznamesr2r2r3 disable_pipesszLanguage.disable_pipescCs ||S)N)rj)rzrr2r2r3r{szLanguage.make_doccst|t|kr,ttjjt|t|dt|dkr.get_gradsrsdrop)sgdlosses)r)N)r IndexErrorr&E009rryrropsziprrr{rralphab1b2listrwrandomshufflerrvrrsr=)rzdocsgoldsrrrrZ gold_objsZdoc_objsrgoldrpipesrprr}rrrr2)rr3rssJ               zLanguage.updatecs"t|dkrdS|dkr4|jdkr.ttj|_|j}t|}x,t|D] \}}t|trF| |||<qFWt|j }t ||dkri}idfdd }|j |_ |j|_|j|_xh|D]`\} } t| dsqi| j|f||d|| ix&D]\} \} } || | | dqWqW|S) aMake a "rehearsal" update to the models in the pipeline, to prevent forgetting. Rehearsal updates run an initial copy of the model over some data, and update the model so its current predictions are more like the initial ones. This is useful for keeping a pre-trained model on-track, even if you're updating it with a smaller set of examples. docs (iterable): A batch of `Doc` objects. drop (float): The droput rate. sgd (callable): An optimizer. RETURNS (dict): Results from the update. EXAMPLE: >>> raw_text_batches = minibatch(raw_texts) >>> for labelled_batch in minibatch(zip(train_docs, train_golds)): >>> docs, golds = zip(*train_docs) >>> nlp.update(docs, golds) >>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)] >>> nlp.rehearse(raw_batch) rNcs||f|<dS)Nr2)rrr)rr2r3rsz$Language.rehearse..get_gradsrehearse)rr)r)N)rryrrrr enumeraterrr{rwrrrrrrrrvr=)rzrrrrrrrrrprrrrr2)rr3rs6        zLanguage.rehearseccsHx&|jD]\}}t|dr||}qWx|D]\}}||fVq.WdS)a,Can be called before training to pre-process gold data. By default, it handles nonprojectivity and adds missing tags to the tag map. docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects. YIELDS (tuple): Tuples of preprocessed `Doc` and `GoldParse` objects. preprocess_goldN)rwrr)rz docs_goldsrprrrr2r2r3rs  zLanguage.preprocess_goldc Ks@|dkrdd}nBx@|D]6\}}x,|D]$\}}x|dD]}|j|}q:Wq(WqW|dddkrt|d|jjjjddkrtj |jjj|jj_t |j|jjjjdr|jjj |d<|dkrt tj}||_ |dkri}xN|jD]D\} } t| d r|| i} | || j|f|j|j d | qW|j S) aAllocate models, pre-process training data and acquire a trainer and optimizer. Used as a contextmanager. get_gold_tuples (function): Function returning gold data component_cfg (dict): Config parameters for specific components. **cfg: Config parameters. RETURNS: An optimizer. DOCS: https://spacy.io/api/language#begin_training NcSsgS)Nr2r2r2r2r3rgrhz)Language.begin_training..r devicerpretrained_vectorsbegin_training)rwr)r@rvr)use_gpurodatashaperrasarrayrrprryrwrrsr) rzZget_gold_tuplesrrrirZannots_bracketsZannotswordrprr}r2r2r3r s8        zLanguage.begin_trainingcKs|dddkrJt|d|jjjjddkrJtj |jjj|jj_t |j|jjjjdrr|jjj |d<|dkrt tj}||_ x(|jD]\}}t|drt|j|_qW|j S)aContinue training a pre-trained model. Create and return an optimizer, and initialize "rehearsal" for any pipeline component that has a .rehearse() method. Rehearsal is used to prevent models from "forgetting" their initialised "knowledge". To perform rehearsal, collect samples of text you want the models to retain performance on, and call nlp.rehearse() with a batch of Doc objects. rrrr rN_rehearsal_model)rvr)rr@rorrrrrrrprryrwrrrr)rzrrirprr2r2r3resume_training4s    zLanguage.resume_trainingFc s|dkrt}|dkri}t|\}}t|}t|}xX|jD]N\}||id|tds~fdd|D}q>j|f}q>WxJt||D]<\} } |rt| |did||j | | fqW|S)N batch_sizerc3s|]}|fVqdS)Nr2)rr)r}rr2r3 Zsz$Language.evaluate..scorerverbose) rrrrwrvrrrprintscore) rzrrrrrrrrprrr2)r}rr3evaluateLs(      zLanguage.evaluatec +s~fdd|jD}x.|D]&}y t|Wqtk r>YqXqWdVx.|D]&}y t|WqPtk rtYqPXqPWdS)aReplace weights of models in the pipeline with those provided in the params dictionary. Can be used as a contextmanager, in which case, models go back to their original weights after the block. params (dict): A dictionary of parameters keyed by model ID. **cfg: Config parameters. EXAMPLE: >>> with nlp.use_params(optimizer.averages): >>> nlp.to_disk('/tmp/checkpoint') cs$g|]\}}t|dr|qS) use_params)rr)rrpr)paramsr2r3rssz'Language.use_params..N)rwnext StopIteration)rzrricontextscontextr2)rr3res       zLanguage.use_paramsric#s|dkrttj|rxt|\}} dd|D}dd| D} j||||d} x t| | D]\} } | | fVq^WdSfdd|D} |dkri}xZjD]P\}}||krq||i}| d|t |d r|j| f|} qt || |} qWt }t }d}d }x| D]} | V|r|| |d krH|| |d 7}n`t|d kr||}}|dkrxtjj}n,jj|\}}j||j|d }qWdS) aProcess texts as a stream, and yield `Doc` objects in order. texts (iterator): A sequence of texts to process. as_tuples (bool): If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False. batch_size (int): The number of texts to buffer. disable (list): Names of the pipeline components to disable. cleanup (bool): If True, unneeded strings are freed to control memory use. Experimental. component_cfg (dict): An optional dictionary with extra keyword arguments for specific components. YIELDS (Doc): Documents in the order of the original text. DOCS: https://spacy.io/api/language#pipe rcss|]}|dVqdS)rNr2)rtcr2r2r3rsz Language.pipe..css|]}|dVqdS)r Nr2)rrr2r2r3rs)rrrNc3s|]}|VqdS)N)r{)rr)rzr2r3rsrrri'r )r(r'W016 itertoolsteerrrwrvrr_pipeweakrefWeakSetaddrrr@strings_cleanup_stale_strings _reset_cacherj)rztexts as_tuples n_threadsrrcleanuprZ text_context1Z text_context2rrrrrprr}Z recent_refsZold_refsZoriginal_strings_dataZnr_seenrr r2)rzr3rsZ             z Language.pipecs|dk rttj|}t|}t}fdd|d<fdd|d<xDjD]:\}}t|dsbqN||krlqNt|dsxqN|fd d||<qNWfd d|d <t|||dS) a]Save the current state to a directory. If a model is loaded, this will include the model. path (unicode or Path): Path to a directory, which will be created if it doesn't exist. exclude (list): Names of components or serialization fields to exclude. DOCS: https://spacy.io/api/language#to_disk Ncsjj|dgdS)Nr@)exclude)rjto_disk)p)rzr2r3rgrhz"Language.to_disk..rjcs|dtjS)Nw)openwritesrsly json_dumpsr|)r)rzr2r3rgrhz meta.jsonrprcSs|j|dgdS)Nr@)r)r)rrr2r2r3rgrhcs j|S)N)r@r)r)rzr2r3rgrhr@) r(r'W014r) ensure_pathrrwrr)rzrrr serializersrprr2)rzr3rs"     zLanguage.to_diskcs|dk rttj|}t|}t}fdd|d<fdd|d<fdd|d<x8jD].\}}||krpq^t|d s|q^|fd d||<q^W|dsd|krt |dg}t ||||_ S) aLoads state from a directory. Modifies the object in place and returns it. If the saved `Language` object contains a model, the model will be loaded. path (unicode or Path): A path to a directory. exclude (list): Names of components or serialization fields to exclude. RETURNS (Language): The modified `Language` object. DOCS: https://spacy.io/api/language#from_disk Ncsjt|S)N)r|rsr read_json)r)rzr2r3rgrhz$Language.from_disk..z meta.jsoncsj|otS)N)r@ from_disk_fix_pretrained_vectors_name)r)rzr2r3rgrhr@csjj|dgdS)Nr@)r)rjr)r)rzr2r3rgrhrjrcSs|j|dgdS)Nr@)r)r)rrr2r2r3rgrh) r(r'rr)rrrwrexistsrrru)rzrrr deserializersrprr2)rzr3rs&    zLanguage.from_diskc s|dk rttj|}t}fdd|d<fdd|d<fdd|d<x8jD].\}}||krfqTt|d srqT|fd d||<qTWt|||}t||S) aSerialize the current state to a binary string. exclude (list): Names of components or serialization fields to exclude. RETURNS (bytes): The serialized form of the `Language` object. DOCS: https://spacy.io/api/language#to_bytes Ncs jS)N)r@to_bytesr2)rzr2r3rg"rhz#Language.to_bytes..r@csjjdgdS)Nr@)r)rjr#r2)rzr2r3rg#rhrjcs tjS)N)rrr|r2)rzr2r3rg$rhz meta.jsonr#cSs|jdgdS)Nr@)r)r#)rr2r2r3rg*rh) r(r'rrrwrr)get_serialization_excluder#)rzrrr}rrprr2)rzr3r#s  zLanguage.to_bytesc s|dk rttj|}t}fdd|d<fdd|d<fdd|d<x8jD].\}}||krfqTt|d srqT|fd d||<qTWt|||}t|||S) aLoad state from a binary string. bytes_data (bytes): The data to load from. exclude (list): Names of components or serialization fields to exclude. RETURNS (Language): The `Language` object. DOCS: https://spacy.io/api/language#from_bytes Ncsjt|S)N)r|rsr json_loads)b)rzr2r3rg;rhz%Language.from_bytes..z meta.jsoncsj|otS)N)r@ from_bytesr )r&)rzr2r3rg<rhr@csjj|dgdS)Nr@)r)rjr')r&)rzr2r3rg=rhrjr'cSs|j|dgdS)Nr@)r)r')r&rr2r2r3rgCrh) r(r'rrrwrr)r$r')rz bytes_datarrr}r"rprr2)rzr3r'.s   zLanguage.from_bytes)NNNNN)rNNN)NNN)NNN)N)FrNN),r\r]r^__doc__r+rfrrrrpropertyrr|setterrkrUrVrrr`rr8rrrrrrrrr{rsrrrrrrrrrarrr#r'r2r2r2r3redsh  '          4     5 4 +   # M recCsd|jkr0|jddr0|jdd|jj_nX|jjjsFd|jj_nBd|jkr~d|jkr~d|jd|jdf}||jj_n ttj|jjjdkrt |jx@|j D]6\}}t |dsq|j di|jjj|j dd<qWdS) Nrorprz %s_%s.vectorsrriZdeprecation_fixes vectors_name)r|rvr@rorpsizerr&E092rrwrrir)r1r,rprr2r2r3r Is      r c@s0eZdZdZddZddZddZdd Zd S) rz)Manager for temporary pipeline disabling.cs>|_||_tj|_t||fdd|DdS)Nc3s|]}|VqdS)N)r)rrp)r1r2r3risz)DisabledPipes.__init__..)r1rrrworiginal_pipelinerrextend)rzr1rr2)r1r3ras   zDisabledPipes.__init__cCs|S)Nr2)rzr2r2r3 __enter__kszDisabledPipes.__enter__cGs |dS)N)restore)rzargsr2r2r3__exit__nszDisabledPipes.__exit__csTjjj}j_fdd|D}|rD|j_ttjj|dgdd<dS)zARestore the pipeline to its state when DisabledPipes was created.cs g|]\}}j|s|qSr2)r1r)rrpr)rzr2r3rtsz)DisabledPipes.restore..)rN)r1rwr/rr&E008r)rzcurrent unexpectedr2)rzr3r2qs zDisabledPipes.restoreN)r\r]r^r)rr1r4r2r2r2r2r3r^s  rccsLt|}xdD]}||kr||qWx|D]}||f|}|Vq.WdS)N)rr)r8r)funcrr}argrr2r2r3r|s   r)G __future__rrrrr r9 collectionsr contextlibrrrZ thinc.neuralrrrjr r@r r7r rwr rrrrrrrrrrcompatrrrrrr_mlrrrDrZlang.punctuationrr r!Zlang.tokenizer_exceptionsr"Z lang.tag_mapr#Zlang.lex_attrsr$r%errorsr&r'r(rr)r*objectr+rer rrrr2r2r2r3sN               @j