ó <¿CVc@sËdZddlZddlmZddlmZddlTddlTej dƒZ ej dƒZ ej dƒZ ej d ƒZ d efd „ƒYZd eefd „ƒYZdefd„ƒYZdS(sO Corpus reader for corpora that consist of parenthesis-delineated parse trees. iÿÿÿÿN(tTree(tmap_tag(t*s\((\d+) ([^\s()]+) ([^\s()]+)\)s\(([^\s()]+) ([^\s()]+)\)s\([^\s()]+ ([^\s()]+)\)s \s*\(\s*\(tBracketParseCorpusReadercBsSeZdZd ddd d„Zd„Zd„Zd„Zd d„Zd„Z RS( sÛ Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the "combined" section of the Penn Treebank, e.g. "(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))". tunindented_parentutf8cCs5tj||||ƒ||_||_||_dS(sÁ :param root: The root directory for this corpus. :param fileids: A list or regexp specifying the fileids in this corpus. :param comment_char: The character which can appear at the start of a line to indicate that the rest of the line is a comment. :param detect_blocks: The method that is used to find blocks in the corpus; can be 'unindented_paren' (every unindented parenthesis starts a new parse) or 'sexpr' (brackets are matched). :param tagset: The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods. N(t CorpusReadert__init__t _comment_chart_detect_blockst_tagset(tselftroottfileidst comment_chart detect_blockstencodingttagset((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR!s  cCsº|jdkr"t|d|jƒS|jdkr;t|ƒS|jdkr¤t|ddƒ}|jr g|D]+}tjdtj|jƒd|ƒ^ql}n|Sd s¶td ƒ‚dS( NtsexprRt blanklineRtstart_res^\(s (?m)^%s.*tisbad block type( R tread_sexpr_blockRtread_blankline_blocktread_regexp_blocktretsubtescapetAssertionError(R tstreamttoksttok((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyt _read_block6s  8cCsStj|ƒr%|jƒdd!}ntjdd|ƒ}tjdd|ƒ}|S(Niiÿÿÿÿs\((.)\)s(\1 \1)s"\(([^\s()]+) ([^\s()]+) [^\s()]+\)s(\1 \2)(tEMPTY_BRACKETStmatchtstripRR(R tt((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyt _normalizeGs cCsàytj|j|ƒƒSWn¿tk rÛ}tjjdƒ|jd krµxgtddƒD]S}y9t|j|d|ƒƒ}tjjd|ƒ|SWq[tk r­q[Xq[Wntjjdƒtd|j |ƒƒSXdS( Ns(Bad tree detected; trying to recover... smismatched parensiit)s( Recovered by adding %d close paren(s) s' Recovered by returning a flat parse. tS(smismatched parens( Rt fromstringR%t ValueErrortsyststderrtwritetargstranget_tag(R R$tetntv((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyt_parseRs  cCs‡gtj|j|ƒƒD]\}}||f^q}|rƒ||jkrƒg|D]'\}}|t|j||ƒf^qS}n|S(N(tTAGWORDtfindallR%R R(R R$Rtptwt tagged_sent((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR/fs77cCstj|j|ƒƒS(N(tWORDR5R%(R R$((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyt_wordlsN( t__name__t __module__t__doc__tNoneRR R%R3R/R:(((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRs    t#CategorizedBracketParseCorpusReadercBs¿eZdZd„Zd„Zd d d„Zd d d„Zd d d„Zd d d„Z d d d d„Z d d d d„Z d d d d „Z d d d „Z d d d „Zd d d „ZRS(sª A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider cOs'tj||ƒtj|||ŽdS(st Initialize the corpus reader. Categorization arguments (C{cat_pattern}, C{cat_map}, and C{cat_file}) are passed to the L{CategorizedCorpusReader constructor }. The remaining arguments are passed to the L{BracketParseCorpusReader constructor }. N(tCategorizedCorpusReaderRR(R R-tkwargs((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRvs cCsH|dk r'|dk r'tdƒ‚n|dk r@|j|ƒS|SdS(Ns'Specify fileids or categories, not both(R>R)R (R R t categories((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyt_resolve‚s   cCstj||j||ƒƒS(N(RtrawRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRD‰scCstj||j||ƒƒS(N(RtwordsRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyREŒscCstj||j||ƒƒS(N(RtsentsRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRFscCstj||j||ƒƒS(N(RtparasRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRG’scCstj||j||ƒ|ƒS(N(Rt tagged_wordsRC(R R RBR((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRH•scCstj||j||ƒ|ƒS(N(Rt tagged_sentsRC(R R RBR((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRI˜scCstj||j||ƒ|ƒS(N(Rt tagged_parasRC(R R RBR((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRJ›scCstj||j||ƒƒS(N(Rt parsed_wordsRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRKžscCstj||j||ƒƒS(N(Rt parsed_sentsRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRL¡scCstj||j||ƒƒS(N(Rt parsed_parasRC(R R RB((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRM¤sN(R;R<R=RRCR>RDRERFRGRHRIRJRKRLRM(((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR?os tAlpinoCorpusReadercBs>eZdZddd„Zed„Zdd„Zd„ZRS(sÆ Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse Unfortunately this puts punctuation and some other words out of the sentence order in the xml element tree. This is no good for tag_ and word_ _tag and _word will be overridden to use a non-default new parameter 'ordered' to the overridden _normalize function. The _parse function can then remain untouched. s ISO-8859-1c Cs)tj||dddd|d|ƒdS(Ns alpino\.xmlRRRR(RR(R R RR((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR²scCsŸ|d dkrdStjdd|ƒ}|rGtjdd|ƒ}ntjdd |ƒ}tjd d |ƒ}tjd d|ƒ}tjd d|ƒ}|S(s¸Normalize the xml sentence element in t. The sentence elements , although embedded in a few overall xml elements, are seperated by blank lines. That's how the reader can deliver them one at a time. Each sentence has a few category subnodes that are of no use to us. The remaining word nodes may or may not appear in the proper order. Each word node has attributes, among which: - begin : the position of the word in the sentence - pos : Part of Speech: the Tag - word : the actual word The return value is a string with all xml elementes replaced by clauses: either a cat clause with nested clauses, or a word clause. The order of the bracket clauses closely follows the xml. If ordered == True, the word clauses include an order sequence number. If ordered == False, the word clauses only have pos and word parts. i s s(\1s> s (\1 \2 \3)s- s(\1 \2)s R&s.*s(RR(R R$tordered((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR%¸scCsÎgtj|j|dtƒƒD]$\}}}t|ƒ||f^q}|jƒ|r¢||jkr¢g|D]*\}}}|t|j||ƒf^qo}n(g|D]\}}}||f^q©}|S(NRO(t SORTTAGWRDR5R%tTruetinttsortR R(R R$RtoR6R7R8((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR/Ös I :(cCs,|j|ƒ}g|D]\}}|^qS(s(Return a correctly ordered list if words(R/(R R$R8R7R6((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyR:ßsN( R;R<R=R>RtFalseR%R/R:(((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyRN¨s    (R=R*t nltk.treeRtnltk.tagRtnltk.corpus.reader.utiltnltk.corpus.reader.apiRtcompileRPR4R9R!tSyntaxCorpusReaderRR@R?RN(((sr/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bracket_parse.pyt s   U8