U C^!5@sddlmZddlmZmZddlZddlmZddlm Z ddl m Z ddl m Z mZmZdd lmZdd lmZmZd Zed d ddgdGdddeZdS))unicode_literals) defaultdict OrderedDictN) component)Errors) basestring_) ensure_pathto_disk from_disk)Span)Matcher PhraseMatcherz||Z entity_rulerzdoc.entsztoken.ent_typez token.ent_iob)Zassignsc@seZdZdZd"ddZeddZdd Zd d Zd d Z e ddZ e ddZ e ddZ ddZddZddZddZddZddZd d!ZdS)# EntityRuleraThe EntityRuler lets you add spans to the `Doc.ents` using token-based rules or exact phrase matches. It can be combined with the statistical `EntityRecognizer` to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. After initialization, the component is typically added to the pipeline using `nlp.add_pipe`. DOCS: https://spacy.io/api/entityruler USAGE: https://spacy.io/usage/rule-based-matching#entityruler NFcKs||_|dd|_tt|_tt|_t|j|d|_ |dk rl| dkrPd}||_ t |j|j |d|_ nd|_ t |j|d|_ |dt|_tt|_|d }|dk r||dS) aInitialize the entitiy ruler. If patterns are supplied here, they need to be a list of dictionaries with a `"label"` and `"pattern"` key. A pattern can either be a token pattern (list) or a phrase pattern (string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`. nlp (Language): The shared nlp object to pass the vocab to the matchers and process phrase patterns. phrase_matcher_attr (int / unicode): Token attribute to match on, passed to the internal PhraseMatcher as `attr` validate (bool): Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate` patterns (iterable): Optional patterns to load in. overwrite_ents (bool): If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. **cfg: Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. RETURNS (EntityRuler): The newly constructed object. DOCS: https://spacy.io/api/entityruler#init Zoverwrite_entsF)validateNZTEXTZORTH)attrr ent_id_seppatterns)nlpget overwriterlisttoken_patternsphrase_patternsr vocabmatcherupperphrase_matcher_attrrphrase_matcherDEFAULT_ENT_ID_SEPrdict_ent_ids add_patterns)selfrrrcfgrr%=/tmp/pip-install-6_kvzl1k/spacy/spacy/pipeline/entityruler.py__init__s*      zEntityRuler.__init__cKs ||f|SNr%)clsrr$r%r%r&from_nlpHszEntityRuler.from_nlpcCs8tdd|jD}tdd|jD}||S)z5The number of all patterns added to the entity ruler.css|]}t|VqdSr(len.0pr%r%r& Nsz&EntityRuler.__len__..css|]}t|VqdSr(r+r-r%r%r&r0Os)sumrvaluesr)r#Zn_token_patternsZn_phrase_patternsr%r%r&__len__LszEntityRuler.__len__cCs||jkp||jkS)z+Whether a label is present in the patterns.)rr)r#labelr%r%r& __contains__RszEntityRuler.__contains__c s(t||t||}tdd|D}dd}t||dd}t|j}g}t}|D]\}tdd|Dr|jsq\|kr\d |kr\||jkr|j|\}} t ||d } | r| D] } | | _ qnt ||d } | | fd d|D}| t q\|||_|S) zFind matches in document and add them as entities. doc (Doc): The Doc object in the pipeline. RETURNS (Doc): The Doc with added entities, if available. DOCS: https://spacy.io/api/entityruler#call cSs$g|]\}}}||kr|||fqSr%r%)r.Zm_idstartendr%r%r& `sz(EntityRuler.__call__..cSs|d|d|dfS)Nrr%)mr%r%r&bz&EntityRuler.__call__..T)keyreversecss|] }|jVqdSr()Zent_type)r.tr%r%r&r0hsz'EntityRuler.__call__..r9)r4cs$g|]}|jkr|jks|qSr%)r6r7)r.er7r6r%r&r8us )rrrsetsortedZentsanyrr!r Zent_id_appendupdaterange) r#docmatchesZ get_sort_keyentitiesZ new_entitiesZ seen_tokensZmatch_idr4ent_idspantokenr%rAr&__call__Vs6        zEntityRuler.__call__cCs&t|j}||jt|S)zAll labels present in the match patterns. RETURNS (set): The string labels. DOCS: https://spacy.io/api/entityruler#labels )rBrkeysrFrtuple)r#Z all_labelsr%r%r&labels|szEntityRuler.labelscCs<t}|jD]&}|j|kr ||\}}||q t|S)zAll entity ids present in the match patterns `id` properties. RETURNS (set): The string entity ids. DOCS: https://spacy.io/api/entityruler#ent_ids )rBrQr _split_labeladdrP)r#Z all_ent_idsl_rKr%r%r&ent_idss    zEntityRuler.ent_idscCsg}|jD]@\}}|D]2}||\}}||d}|rB||d<||qq|jD]B\}}|D]4}||\}}||jd}|r||d<||qfqZ|S)zGet all patterns that were added to the entity ruler. RETURNS (list): The original patterns, one dictionary per pattern. DOCS: https://spacy.io/api/entityruler#patterns )r4patternid)ritemsrRrErtext)r#Z all_patternsr4rrW ent_labelrKr/r%r%r&rs   zEntityRuler.patternsc CsTz2|jj|j}dd|jj|ddD}Wntk rJg}YnX|j||D]}|d}d|kr|}|||d}|j|}||df|j |<|d}t |t r|j | ||q^t |tr|j| |q^ttjj|dq^|jD]\}}|j||q|j D]\}}|j||q,W5QRXdS) aAdd patterns to the entitiy ruler. A pattern can either be a token pattern (list of dicts) or a phrase pattern (string). For example: {'label': 'ORG', 'pattern': 'Apple'} {'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]} patterns (list): The patterns to add. DOCS: https://spacy.io/api/entityruler#add_patterns cSsg|]}|qSr%r%)r.piper%r%r&r8sz,EntityRuler.add_patterns..r9Nr4rXrW)rW)rZ pipe_namesindexname ValueErrorZ disable_pipes _create_labelrZ_normalize_keyr! isinstancerrrErrrZE097formatrYrSr) r#rZ current_indexZsubsequent_pipesentryr4r[r=rWr%r%r&r"s2      zEntityRuler.add_patternscCs.|j|kr||jd\}}n|}d}||fS)zSplit Entity label into ent_label and ent_id if it contains self.ent_id_sep RETURNS (tuple): ent_label, ent_id r9N)rrsplit)r#r4r[rKr%r%r&rRs  zEntityRuler._split_labelcCst|trd||j|}|S)zJoin Entity label with ent_id if the pattern has an `id` attribute RETURNS (str): The ent_label joined with configured `ent_id_sep` z{}{}{})rarrbr)r#r4rKr%r%r&r`s zEntityRuler._create_labelcKs~t|}t|trp||d||dd|_|dd|_|jdk r`t|j j |jd|_ |dt |_ n |||S)aLoad the entity ruler from a bytestring. patterns_bytes (bytes): The bytestring to load. **kwargs: Other config paramters, mostly for consistency. RETURNS (EntityRuler): The loaded entity ruler. DOCS: https://spacy.io/api/entityruler#from_bytes rrFrNrr)srslyZ msgpack_loadsrar r"rrrrrrrrr)r#Zpatterns_byteskwargsr$r%r%r& from_bytess    zEntityRuler.from_bytescKs2td|jfd|jfd|jfd|jff}t|S)zSerialize the entity ruler patterns to a bytestring. RETURNS (bytes): The serialized patterns. DOCS: https://spacy.io/api/entityruler#to_bytes rrrr)rrrrrrfZ msgpack_dumps)r#rgserialr%r%r&to_bytesszEntityRuler.to_bytesc st|}|d}|r0t|}|nidfddi}dfddi}t||idd_d _ d t _ j d k rt j jj d _t||iS) aqLoad the entity ruler from a file. Expects a file containing newline-delimited JSON (JSONL) with one entry per line. path (unicode / Path): The JSONL file to load. **kwargs: Other config paramters, mostly for consistency. RETURNS (EntityRuler): The loaded entity ruler. DOCS: https://spacy.io/api/entityruler#from_disk .jsonlrcst|dSNrk)r"rf read_jsonl with_suffixr/r#r%r&r;(sz'EntityRuler.from_disk..r$cst|Sr()rFrf read_jsonror$r%r&r;,r<rFrrNre)r rnis_filerfrmr"r rrrrrrrrr)r#pathrgZdepr_patterns_pathrZdeserializers_patternsZdeserializers_cfgr%r$r#r&r s.          zEntityRuler.from_diskc s^t|}jjjdfddfddd}|jdkrNt|jn t||idS)a/Save the entity ruler patterns to a directory. The patterns will be saved as newline-delimited JSON (JSONL). path (unicode / Path): The JSONL file to save. **kwargs: Other config paramters, mostly for consistency. DOCS: https://spacy.io/api/entityruler#to_disk )rrrcst|djSrl)rf write_jsonlrnrrorpr%r&r;Jsz%EntityRuler.to_disk..cs t|Sr()rf write_jsonrorrr%r&r;Mr<)rr$rkN) r rrrsuffixrfrvrr )r#rtrgZ serializersr%rur&r :s    zEntityRuler.to_disk)NF)__name__ __module__ __qualname____doc__r' classmethodr*r3r5rNpropertyrQrVrr"rRr`rhrjr r r%r%r%r&rs( + &   '  $r) __future__r collectionsrrrflanguagererrorsrcompatrutilr r r tokensr rr rrobjectrr%r%r%r&s