# spacy/pipeline/entityruler.py
from __future__ import unicode_literals

from collections import defaultdict
import srsly

from ..errors import Errors
from ..compat import basestring_
from ..util import ensure_path
from ..tokens import Span
from ..matcher import Matcher, PhraseMatcher


class EntityRuler(object):
    """The EntityRuler lets you add spans to the `Doc.ents` using token-based
    rules or exact phrase matches. It can be combined with the statistical
    `EntityRecognizer` to boost accuracy, or used on its own to implement a
    purely rule-based entity recognition system. After initialization, the
    component is typically added to the pipeline using `nlp.add_pipe`.

    DOCS: https://spacy.io/api/entityruler
    USAGE: https://spacy.io/usage/rule-based-matching#entityruler
    """

    name = "entity_ruler"

    def __init__(self, nlp, **cfg):
        """Initialize the entity ruler. If patterns are supplied here, they
        need to be a list of dictionaries with a `"label"` and `"pattern"`
        key. A pattern can either be a token pattern (list) or a phrase pattern
        (string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`.

        nlp (Language): The shared nlp object to pass the vocab to the matchers
            and process phrase patterns.
        patterns (iterable): Optional patterns to load in.
        overwrite_ents (bool): If existing entities are present, e.g. entities
            added by the model, overwrite them by matches if necessary.
        **cfg: Other config parameters. If pipeline component is loaded as part
            of a model pipeline, this will include all keyword arguments passed
            to `spacy.load`.
        RETURNS (EntityRuler): The newly constructed object.

        DOCS: https://spacy.io/api/entityruler#init
        """
        self.nlp = nlp
        self.overwrite = cfg.get("overwrite_ents", False)
        # Patterns are stored per label so they can be serialized later.
        self.token_patterns = defaultdict(list)
        self.phrase_patterns = defaultdict(list)
        self.matcher = Matcher(nlp.vocab)
        self.phrase_matcher = PhraseMatcher(nlp.vocab)
        patterns = cfg.get("patterns")
        if patterns is not None:
            self.add_patterns(patterns)

    def __len__(self):
        """The number of all patterns added to the entity ruler."""
        n_token_patterns = sum(len(p) for p in self.token_patterns.values())
        n_phrase_patterns = sum(len(p) for p in self.phrase_patterns.values())
        return n_token_patterns + n_phrase_patterns

    def __contains__(self, label):
        """Whether a label is present in the patterns."""
        return label in self.token_patterns or label in self.phrase_patterns

    def __call__(self, doc):
        """Find matches in document and add them as entities.

        doc (Doc): The Doc object in the pipeline.
        RETURNS (Doc): The Doc with added entities, if available.

        DOCS: https://spacy.io/api/entityruler#call
        """
        matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc))
        # Deduplicate and drop empty matches.
        matches = set(
            [(m_id, start, end) for m_id, start, end in matches if start != end]
        )
        # Prefer longer matches when resolving overlaps.
        get_sort_key = lambda m: (m[2] - m[1], m[1])
        matches = sorted(matches, key=get_sort_key, reverse=True)
        entities = list(doc.ents)
        new_entities = []
        seen_tokens = set()
        for match_id, start, end in matches:
            # Keep existing entities unless overwriting is enabled.
            if any(t.ent_type for t in doc[start:end]) and not self.overwrite:
                continue
            # `end` is exclusive, so the last token of the match is `end - 1`.
            if start not in seen_tokens and end - 1 not in seen_tokens:
                new_entities.append(Span(doc, start, end, label=match_id))
                # Remove existing entities that overlap the new span.
                entities = [
                    e for e in entities if not (e.start < end and e.end > start)
                ]
                seen_tokens.update(range(start, end))
        doc.ents = entities + new_entities
        return doc

    @property
    def labels(self):
        """All labels present in the match patterns.

        RETURNS (tuple): The string labels.

        DOCS: https://spacy.io/api/entityruler#labels
        """
        all_labels = set(self.token_patterns.keys())
        all_labels.update(self.phrase_patterns.keys())
        return tuple(all_labels)

    @property
    def patterns(self):
        """Get all patterns that were added to the entity ruler.

        RETURNS (list): The original patterns, one dictionary per pattern.

        DOCS: https://spacy.io/api/entityruler#patterns
        """
        all_patterns = []
        for label, patterns in self.token_patterns.items():
            for pattern in patterns:
                all_patterns.append({"label": label, "pattern": pattern})
        for label, patterns in self.phrase_patterns.items():
            for pattern in patterns:
                # Phrase patterns are stored as Doc objects; export their text.
                all_patterns.append({"label": label, "pattern": pattern.text})
        return all_patterns

    def add_patterns(self, patterns):
        """Add patterns to the entity ruler. A pattern can either be a token
        pattern (list of dicts) or a phrase pattern (string). For example:
        {'label': 'ORG', 'pattern': 'Apple'}
        {'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}

        patterns (list): The patterns to add.

        DOCS: https://spacy.io/api/entityruler#add_patterns
        """
        for entry in patterns:
            label = entry["label"]
            pattern = entry["pattern"]
            if isinstance(pattern, basestring_):
                self.phrase_patterns[label].append(self.nlp(pattern))
            elif isinstance(pattern, list):
                self.token_patterns[label].append(pattern)
            else:
                raise ValueError(Errors.E097.format(pattern=pattern))
        for label, patterns in self.token_patterns.items():
            self.matcher.add(label, None, *patterns)
        for label, patterns in self.phrase_patterns.items():
            self.phrase_matcher.add(label, None, *patterns)

    def from_bytes(self, patterns_bytes, **kwargs):
        """Load the entity ruler from a bytestring.

        patterns_bytes (bytes): The bytestring to load.
        **kwargs: Other config parameters, mostly for consistency.
        RETURNS (EntityRuler): The loaded entity ruler.

        DOCS: https://spacy.io/api/entityruler#from_bytes
        """
        patterns = srsly.msgpack_loads(patterns_bytes)
        self.add_patterns(patterns)
        return self

    def to_bytes(self, **kwargs):
        """Serialize the entity ruler patterns to a bytestring.

        RETURNS (bytes): The serialized patterns.

        DOCS: https://spacy.io/api/entityruler#to_bytes
        """
        return srsly.msgpack_dumps(self.patterns)

    def from_disk(self, path, **kwargs):
        """Load the entity ruler from a file. Expects a file containing
        newline-delimited JSON (JSONL) with one entry per line.

        path (unicode / Path): The JSONL file to load.
        **kwargs: Other config parameters, mostly for consistency.
        RETURNS (EntityRuler): The loaded entity ruler.

        DOCS: https://spacy.io/api/entityruler#from_disk
        """
        path = ensure_path(path)
        path = path.with_suffix(".jsonl")
        patterns = srsly.read_jsonl(path)
        self.add_patterns(patterns)
        return self

    def to_disk(self, path, **kwargs):
        """Save the entity ruler patterns to a file. The patterns will be
        saved as newline-delimited JSON (JSONL).

        path (unicode / Path): The JSONL file to save.
        **kwargs: Other config parameters, mostly for consistency.

        DOCS: https://spacy.io/api/entityruler
        """
        path = ensure_path(path)
        path = path.with_suffix(".jsonl")
        srsly.write_jsonl(path, self.patterns)