"""
Penn Treebank Tokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in
Penn Treebank. This implementation is a port of the tokenizer sed
script written by Robert McIntyre and available at
http://www.cis.upenn.edu/~treebank/tokenizer.sed.
"""

import re

from nltk.tokenize.api import TokenizerI


class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in
    Penn Treebank. This is the method that is invoked by ``word_tokenize()``.
    It assumes that the text has already been segmented into sentences,
    e.g. using ``sent_tokenize()``.

    This tokenizer performs the following steps:

    - split standard contractions, e.g. ``don't`` -> ``do n't`` and
      ``they'll`` -> ``they 'll``
    - treat most punctuation characters as separate tokens
    - split off commas and single quotes, when followed by whitespace
    - separate periods that appear at the end of line

        >>> from nltk.tokenize import TreebankWordTokenizer
        >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
        >>> TreebankWordTokenizer().tokenize(s)
        ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
        >>> s = "They'll save and invest more."
        >>> TreebankWordTokenizer().tokenize(s)
        ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
        >>> s = "hi, my name can't hello,"
        >>> TreebankWordTokenizer().tokenize(s)
        ['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
    """

    # Contraction patterns capture the two (or three) parts of a fused word
    # so that the replacement can re-emit them as separate tokens.
    CONTRACTIONS2 = [re.compile(r"(?i)\b(can)(not)\b"),
                     re.compile(r"(?i)\b(d)('ye)\b"),
                     re.compile(r"(?i)\b(gim)(me)\b"),
                     re.compile(r"(?i)\b(gon)(na)\b"),
                     re.compile(r"(?i)\b(got)(ta)\b"),
                     re.compile(r"(?i)\b(lem)(me)\b"),
                     re.compile(r"(?i)\b(mor)('n)\b"),
                     re.compile(r"(?i)\b(wan)(na) ")]
    CONTRACTIONS3 = [re.compile(r"(?i) ('t)(is)\b"),
                     re.compile(r"(?i) ('t)(was)\b")]
    CONTRACTIONS4 = [re.compile(r"(?i)\b(whad)(dd)(ya)\b"),
                     re.compile(r"(?i)\b(wha)(t)(cha)\b")]

    def tokenize(self, text):
        # starting quotes
        text = re.sub(r'^\"', r'``', text)
        text = re.sub(r'(``)', r' \1 ', text)
        text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)

        # punctuation
        text = re.sub(r'([:,])([^\d])', r' \1 \2', text)
        text = re.sub(r'([:,])$', r' \1 ', text)
        text = re.sub(r'\.\.\.', r' ... ', text)
        text = re.sub(r'[;@#$%&]', r' \g<0> ', text)
        text = re.sub(r'([^\.])(\.)([\]\)}>"\']*)\s*$', r'\1 \2\3 ', text)
        text = re.sub(r'[?!]', r' \g<0> ', text)

        text = re.sub(r"([^'])' ", r"\1 ' ", text)

        # parens, brackets, etc.
        text = re.sub(r'[\]\[\(\)\{\}\<\>]', r' \g<0> ', text)
        text = re.sub(r'--', r' -- ', text)

        # add extra space to make things easier
        text = " " + text + " "

        # ending quotes
        text = re.sub(r'"', " '' ", text)
        text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)

        text = re.sub(r"([^' ])('[sS]|'[mM]|'[dD]|') ", r"\1 \2 ", text)
        text = re.sub(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) ", r"\1 \2 ",
                      text)

        for regexp in self.CONTRACTIONS2:
            text = regexp.sub(r' \1 \2 ', text)
        for regexp in self.CONTRACTIONS3:
            text = regexp.sub(r' \1 \2 ', text)

        # Note: CONTRACTIONS4 is defined above but not applied in tokenize().

        return text.split()
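As a standalone illustration of the contraction-splitting step, the sketch below (an assumption: a minimal excerpt, not the full tokenizer) applies two of the ``CONTRACTIONS2``-style patterns and shows how the `` \1 \2 `` replacement turns a fused word into two whitespace-separable tokens. The helper name ``split_contractions`` is hypothetical.

```python
import re

# Two patterns in the style of CONTRACTIONS2; capture groups mark the
# two halves of each fused word.
CONTRACTIONS2 = [re.compile(r"(?i)\b(can)(not)\b"),
                 re.compile(r"(?i)\b(gon)(na)\b")]

def split_contractions(text):
    # Hypothetical helper: surround each captured half with spaces,
    # as tokenize() does, then split on whitespace.
    for regexp in CONTRACTIONS2:
        text = regexp.sub(r' \1 \2 ', text)
    return text.split()

print(split_contractions("I cannot do that"))
# -> ['I', 'can', 'not', 'do', 'that']
```

Because the replacement only inserts spaces, the final ``text.split()`` in ``tokenize()`` is what actually produces the separate tokens.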