ó <¿CVc@s¸dZddlZddlZddlZddlmZddlmZddlm Z ddl Tddl m Z ddl TddlTdefd „ƒYZd efd „ƒYZdS( sN A reader for corpora that contain chunked (and optionally tagged) documents. iÿÿÿÿN(tBracketParseCorpusReader(tcompat(tTree(t*(t tagstr2treetChunkedCorpusReadercBsÈeZdZdeeddeƒeddd„Zdd„Z dd„Z dd„Z dd „Z ddd „Z ddd „Zddd „Zddd „Zddd„Zddd„Zd„ZRS(s& Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using ``nltk.chunk.tagstr2tree``. ts tgapstutf8c Cs/tj||||ƒ||||f|_dS(s’ :param root: The root directory for this corpus. :param fileids: A list or regexp specifying the fileids in this corpus. N(t CorpusReadert__init__t_cv_args( tselftroottfileidst extensiont str2chunktreetsent_tokenizertpara_block_readertencodingttagset((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyR $s cCsb|dkr|j}nt|tjƒr6|g}ntg|D]}|j|ƒjƒ^q@ƒS(sT :return: the given file(s) as a single string. :rtype: str N(tNonet_fileidst isinstanceRt string_typestconcattopentread(R Rtf((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pytraw2s   c CsJtg|j|tƒD]-\}}t||dddd|jŒ^qƒS(s~ :return: the given file(s) as a list of words and punctuation symbols. :rtype: list(str) i(RtabspathstTruetChunkedCorpusViewR (R RRtenc((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pytwords;sc CsJtg|j|tƒD]-\}}t||dddd|jŒ^qƒS(s² :return: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. :rtype: list(list(str)) ii(RRRR R (R RRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pytsentsDsc CsJtg|j|tƒD]-\}}t||dddd|jŒ^qƒS(sÜ :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. :rtype: list(list(list(str))) ii(RRRR R (R RRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pytparasNsc CsPtg|j|tƒD]3\}}t||ddddd||jŒ^qƒS(s¾ :return: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples ``(word,tag)``. :rtype: list(tuple(str,str)) iit target_tagset(RRRR R (R RRRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt tagged_wordsXsc CsPtg|j|tƒD]3\}}t||ddddd||jŒ^qƒS(s­ :return: the given file(s) as a list of sentences, each encoded as a list of ``(word,tag)`` tuples. :rtype: list(list(tuple(str,str))) iiR%(RRRR R (R RRRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt tagged_sentsbsc CsPtg|j|tƒD]3\}}t||ddddd||jŒ^qƒS(sð :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of ``(word,tag)`` tuples. :rtype: list(list(list(tuple(str,str)))) iiR%(RRRR R (R RRRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt tagged_paraslsc CsPtg|j|tƒD]3\}}t||ddddd||jŒ^qƒS(sv :return: the given file(s) as a list of tagged words and chunks. Words are encoded as ``(word, tag)`` tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over ``(word,tag)`` tuples or word strings. :rtype: list(tuple(str,str) and Tree) iiR%(RRRR R (R RRRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt chunked_wordsvs c CsPtg|j|tƒD]3\}}t||ddddd||jŒ^qƒS(s6 :return: the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as ``(word, tag)`` tuples (if the corpus has tags) or word strings (if the corpus has no tags). :rtype: list(Tree) iiR%(RRRR R (R RRRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt chunked_sents‚s c CsPtg|j|tƒD]3\}}t||ddddd||jŒ^qƒS(so :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as ``(word, tag)`` tuples (if the corpus has tags) or word strings (if the corpus has no tags). :rtype: list(list(Tree)) iR%(RRRR R (R RRRR!((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt chunked_parasŽs cCs#gt|ƒD]}t|ƒ^q S(N(tread_blankline_blockR(R tstreamtt((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt _read_blockšsN(t__name__t __module__t__doc__RtRegexpTokenizerRR,RR RR"R#R$R&R'R(R)R*R+R/(((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyRs"        R cBs)eZddd„Zd„Zd„ZRS(c Csktj||d|ƒ||_||_||_||_||_||_| |_| |_ | |_ dS(NR( tStreamBackedCorpusViewR t_taggedt_group_by_sentt_group_by_parat_chunkedt_str2chunktreet_sent_tokenizert_para_block_readert_source_tagsett_target_tagset( R tfileidRttaggedt group_by_sentt group_by_paratchunkedRRRt source_tagsetR%((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyR žs        cCsçg}xÚ|j|ƒD]É}g}x”|jj|ƒD]€}|j|d|jd|jƒ}|jsw|j|ƒ}n|js|j ƒ}n|j r¨|j |ƒq5|j |ƒq5W|j rÒ|j |ƒq|j |ƒqW|S(NRCR%(R;R:ttokenizeR9R<R=R5t_untagR8tleavesR6tappendtextendR7(R R-tblocktpara_strtparatsent_strtsent((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/chunked.pyt read_block¬s"     cCslxet|ƒD]W\}}t|tƒr8|j|ƒq t|tƒrX|d|| s    „