"""
Multi-Word Expression Tokenizer

A ``MWETokenizer`` takes a string which has already been divided into tokens
and retokenizes it, merging multi-word expressions into single tokens, using
a lexicon of MWEs:

    >>> from nltk.tokenize import MWETokenizer

    >>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
    >>> tokenizer.add_mwe(('in', 'spite', 'of'))

    >>> tokenizer.tokenize('Testing testing testing one two three'.split())
    ['Testing', 'testing', 'testing', 'one', 'two', 'three']

    >>> tokenizer.tokenize('This is a test in spite'.split())
    ['This', 'is', 'a', 'test', 'in', 'spite']

    >>> tokenizer.tokenize('In a little or a little bit or a lot in spite of'.split())
    ['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']
"""

from nltk.tokenize.api import TokenizerI


class MWETokenizer(TokenizerI):
    """
    A tokenizer that processes tokenized text and merges multi-word
    expressions into single tokens:

        >>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
        >>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
        ['An', "hors+d'oeuvre", 'tonight,', 'sir?']

    :type mwes: list(list(str))
    :param mwes: A sequence of multi-word expressions to be merged, where
        each MWE is a sequence of strings.
    :type separator: str
    :param separator: String that should be inserted between words in a
        multi-word expression token.
    """

    def __init__(self, mwes=None, separator='_'):
        if not mwes:
            mwes = []
        self._mwes = dict()
        self._separator = separator
        for mwe in mwes:
            self.add_mwe(mwe)

    def add_mwe(self, mwe, _trie=None):
        """
        Add a multi-word expression to the lexicon (stored as a word trie).

        We represent the trie as a dict of dicts:

            >>> tokenizer = MWETokenizer([('a', 'b'), ('a', 'b', 'c'), ('a', 'x')])
            >>> tokenizer._mwes
            {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}

        The key True marks the end of a valid MWE.
        """
        if _trie is None:
            _trie = self._mwes
        if mwe:
            # Descend one level per word, creating trie nodes as needed.
            if mwe[0] not in _trie:
                _trie[mwe[0]] = dict()
            self.add_mwe(mwe[1:], _trie=_trie[mwe[0]])
        else:
            # Mark the end of a complete MWE.
            _trie[True] = None

    def tokenize(self, text):
        i = 0
        n = len(text)
        result = []
        while i < n:
            if text[i] in self._mwes:
                # Possible MWE match: follow the trie as far as it goes.
                j = i
                trie = self._mwes
                while j < n and text[j] in trie:
                    trie = trie[text[j]]
                    j = j + 1
                if True in trie:
                    # Success: merge the matched span into a single token.
                    result.append(self._separator.join(text[i:j]))
                    i = j
                else:
                    # No complete MWE ended here; emit the word and move on.
                    result.append(text[i])
                    i += 1
            else:
                result.append(text[i])
                i += 1
        return result
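
# A minimal usage sketch, not part of the original module: it simply
# exercises the tokenizer the same way the doctests above do, so running
# this file directly prints the merged output. The MWEs and the sample
# sentence are taken from the doctests; everything here is illustrative.
if __name__ == '__main__':
    tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
    tokenizer.add_mwe(('in', 'spite', 'of'))
    print(tokenizer.tokenize('In a little or a little bit or a lot in spite of'.split()))
    # Expected: ['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']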