r"""
Simple Tokenizers

These tokenizers divide strings into substrings using the string
``split()`` method.
When tokenizing using a particular delimiter string, use
the string ``split()`` method directly, as this is more efficient.

The simple tokenizers are *not* available as separate functions;
instead, you should just use the string ``split()`` method directly:

    >>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
    >>> s.split()
    ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
    'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
    >>> s.split(' ')
    ['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
    'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
    >>> s.split('\n')
    ['Good muffins cost $3.88', 'in New York.  Please buy me',
    'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the
standard ``TokenizerI`` interface, and so can be used with any code
that expects a tokenizer.  For example, these tokenizers can be used
to specify the tokenization conventions when building a `CorpusReader`.
"""

from __future__ import unicode_literals
from nltk.tokenize.api import TokenizerI, StringTokenizer
from nltk.tokenize.util import string_span_tokenize, regexp_span_tokenize


class SpaceTokenizer(StringTokenizer):
    r"""Tokenize a string using the space character as a delimiter,
    which is the same as ``s.split(' ')``.

        >>> from nltk.tokenize import SpaceTokenizer
        >>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
        >>> SpaceTokenizer().tokenize(s)
        ['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
        'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
    """

    _string = ' '


class TabTokenizer(StringTokenizer):
    r"""Tokenize a string using the tab character as a delimiter,
    the same as ``s.split('\t')``.

        >>> from nltk.tokenize import TabTokenizer
        >>> TabTokenizer().tokenize('a\tb c\n\t d')
        ['a', 'b c\n', ' d']
    """

    _string = '\t'


class CharTokenizer(StringTokenizer):
    """Tokenize a string into individual characters.  If this functionality
    is ever required directly, use ``for char in string``.
    """

    def tokenize(self, s):
        return list(s)

    def span_tokenize(self, s):
        # Each character c at index i occupies the span (i, i+1).
        for i, j in enumerate(range(1, len(s) + 1)):
            yield i, j


class LineTokenizer(TokenizerI):
    r"""Tokenize a string into its lines, optionally discarding blank lines.
    This is similar to ``s.split('\n')``.

        >>> from nltk.tokenize import LineTokenizer
        >>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
        >>> LineTokenizer(blanklines='keep').tokenize(s)
        ['Good muffins cost $3.88', 'in New York.  Please buy me',
        'two of them.', '', 'Thanks.']
        >>> # same as [l for l in s.split('\n') if l.strip()]:
        >>> LineTokenizer(blanklines='discard').tokenize(s)
        ['Good muffins cost $3.88', 'in New York.  Please buy me',
        'two of them.', 'Thanks.']

    :param blanklines: Indicates how blank lines should be handled.  Valid values are:

        - ``discard``: strip blank lines out of the token list before returning it.
          A line is considered blank if it contains only whitespace characters.
        - ``keep``: leave all blank lines in the token list.
        - ``discard-eof``: if the string ends with a newline, then do not generate
          a corresponding token ``''`` after that newline.
    """

    def __init__(self, blanklines='discard'):
        valid_blanklines = ('discard', 'keep', 'discard-eof')
        if blanklines not in valid_blanklines:
            raise ValueError('Blank lines must be one of: %s'
                             % ' '.join(valid_blanklines))

        self._blanklines = blanklines

    def tokenize(self, s):
        lines = s.splitlines()
        # If requested, strip off blank lines.
        if self._blanklines == 'discard':
            lines = [l for l in lines if l.rstrip()]
        elif self._blanklines == 'discard-eof':
            if lines and not lines[-1].strip():
                lines.pop()
        return lines

    # discard-eof not implemented
    def span_tokenize(self, s):
        if self._blanklines == 'keep':
            for span in string_span_tokenize(s, r'\n'):
                yield span
        else:
            for span in regexp_span_tokenize(s, r'\n(\s+\n)*'):
                yield span

######################################################################
# { Tokenization Functions
######################################################################

def line_tokenize(text, blanklines='discard'):
    return LineTokenizer(blanklines).tokenize(text)
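

# Illustrative usage sketch (an addition for demonstration, not part of the
# original module): running this file directly exercises the tokenizers
# defined above on the same sample string used in the docstrings.
if __name__ == '__main__':
    demo_text = ("Good muffins cost $3.88\nin New York.  "
                 "Please buy me\ntwo of them.\n\nThanks.")

    # SpaceTokenizer splits on single spaces, so embedded newlines stay
    # inside tokens (same result as demo_text.split(' ')).
    print(SpaceTokenizer().tokenize(demo_text))

    # LineTokenizer with blanklines='discard' drops the empty line
    # before 'Thanks.'.
    print(LineTokenizer(blanklines='discard').tokenize(demo_text))

    # line_tokenize() is a thin convenience wrapper around LineTokenizer.
    print(line_tokenize(demo_text, blanklines='keep'))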