ó <¿CVc@s«dZddlZddlZddlmZddlTddlTddlTde fd„ƒYZ de e fd„ƒYZ d e fd „ƒYZ d e fd „ƒYZdS( s; A reader for corpora that consist of plaintext documents. iÿÿÿÿN(t string_types(t*tPlaintextCorpusReadercBsƒeZdZeZeƒejjdƒe dd„Z d d„Z d d„Z d d„Zd d„Zd„Zd „Zd „ZRS( sÆ Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specificed as parameters to the constructor. This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the ``CorpusView`` class variable. stokenizers/punkt/english.pickletutf8cCs5tj||||ƒ||_||_||_dS(sÝ Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage: >>> root = '/usr/local/share/nltk_data/corpora/webtext/' >>> reader = PlaintextCorpusReader(root, '.*\.txt') # doctest: +SKIP :param root: The root directory for this corpus. :param fileids: A list or regexp specifying the fileids in this corpus. :param word_tokenizer: Tokenizer for breaking sentences or paragraphs into words. :param sent_tokenizer: Tokenizer for breaking paragraphs into words. :param para_block_reader: The block reader used to divide the corpus into paragraph blocks. N(t CorpusReadert__init__t_word_tokenizert_sent_tokenizert_para_block_reader(tselftroottfileidstword_tokenizertsent_tokenizertpara_block_readertencoding((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR(s  cCs_|dkr|j}nt|tƒr3|g}ntg|D]}|j|ƒjƒ^q=ƒS(sT :return: the given file(s) as a single string. :rtype: str N(tNonet_fileidst isinstanceRtconcattopentread(R R tf((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pytrawCs   cCsJtg|j|ttƒD]*\}}}|j||jd|ƒ^qƒS(s~ :return: the given file(s) as a list of words and punctuation symbols. :rtype: list(str) R(RtabspathstTruet CorpusViewt_read_word_block(R R tpathtenctfileid((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pytwordsLscCsh|jdkrtdƒ‚ntg|j|ttƒD]*\}}}|j||jd|ƒ^q7ƒS(s² :return: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. :rtype: list(list(str)) s%No sentence tokenizer for this corpusRN(RRt ValueErrorRRRRt_read_sent_block(R R RRR((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pytsentsVscCsh|jdkrtdƒ‚ntg|j|ttƒD]*\}}}|j||jd|ƒ^q7ƒS(sÜ :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. :rtype: list(list(list(str))) s%No sentence tokenizer for this corpusRN(RRR RRRRt_read_para_block(R R RRR((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pytparasdscCs@g}x3tdƒD]%}|j|jj|jƒƒƒqW|S(Ni(trangetextendRttokenizetreadline(R tstreamRti((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyRrs#cCs\g}xO|j|ƒD]>}|jg|jj|ƒD]}|jj|ƒ^q5ƒqW|S(N(RR&RR'R(R R)R"tparatsent((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR!xs  3cCs\g}xO|j|ƒD]>}|jg|jj|ƒD]}|jj|ƒ^q5ƒqW|S(N(RtappendRR'R(R R)R$R+R,((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR#s  3N(t__name__t __module__t__doc__tStreamBackedCorpusViewRtWordPunctTokenizertnltktdatat LazyLoadertread_blankline_blockRRRRR"R$RR!R#(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyRs       t CategorizedPlaintextCorpusReadercBs\eZdZd„Zd„Zddd„Zddd„Zddd„Zddd„Z RS(sy A reader for plaintext corpora whose documents are divided into categories based on their file identifiers. cOs'tj||ƒtj|||ŽdS(s Initialize the corpus reader. Categorization arguments (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to the ``CategorizedCorpusReader`` constructor. The remaining arguments are passed to the ``PlaintextCorpusReader`` constructor. N(tCategorizedCorpusReaderRR(R targstkwargs((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyRscCsH|dk r'|dk r'tdƒ‚n|dk r@|j|ƒS|SdS(Ns'Specify fileids or categories, not both(RR R (R R t categories((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyt_resolve—s   cCstj||j||ƒƒS(N(RRR<(R R R;((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyRžscCstj||j||ƒƒS(N(RRR<(R R R;((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR¡scCstj||j||ƒƒS(N(RR"R<(R R R;((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR"¤scCstj||j||ƒƒS(N(RR$R<(R R R;((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR$§sN( R.R/R0RR<RRRR"R$(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyR7‡s t*PortugueseCategorizedPlaintextCorpusReadercBseZd„ZRS(cOs=tj||ƒtjjdƒ|d²s     (R0tcodecst nltk.dataR3t nltk.compatRt nltk.tokenizetnltk.corpus.reader.utiltnltk.corpus.reader.apiRRR8R7R=R>(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/plaintext.pyt s     p$