ó <¿CVc@s„dZddlmZddlmZmZmZdefd„ƒYZd d„Z de fd„ƒYZ d efd „ƒYZ d S( sACorpus reader for the XML version of the British National Corpus.iÿÿÿÿ(tconcat(tXMLCorpusReadert XMLCorpusViewt ElementTreetBNCCorpusReadercBs‰eZdZed„Zdeed„Zdeeed„Zdeed„Z deeed„Z deeeed„Z d„Z RS( s6Corpus reader for the XML version of the British National Corpus. For access to the complete XML data structure, use the ``xml()`` method. For access to simple word lists and tagged word lists, use ``words()``, ``sents()``, ``tagged_words()``, and ``tagged_sents()``. You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554 If you extracted the archive to a directory called `BNC`, then you can instantiate the reader as:: BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml') cCs tj|||ƒ||_dS(N(Rt__init__t_lazy(tselftroottfileidstlazy((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyRscCs|j|td||ƒS(sT :return: the given file(s) as a list of words and punctuation symbols. :rtype: list(str) :param strip_space: If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens. :param stem: If true, then use word stems instead of word strings. N(t_viewstFalsetNone(RR t strip_spacetstem((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pytwords#s cCs+|r dnd}|j|t|||ƒS(s  :return: the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples ``(word,tag)``. :rtype: list(tuple(str,str)) :param c5: If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used. :param strip_space: If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens. :param stem: If true, then use word stems instead of word strings. tc5tpos(R R (RR RRRttag((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyt tagged_words/s cCs|j|td||ƒS(sˆ :return: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. :rtype: list(list(str)) :param strip_space: If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens. :param stem: If true, then use word stems instead of word strings. N(R tTrueR (RR RR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pytsents?s c Cs7|r dnd}|j|dtd|d|d|ƒS(s :return: the given file(s) as a list of sentences, each encoded as a list of ``(word,tag)`` tuples. :rtype: list(list(tuple(str,str))) :param c5: If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used. :param strip_space: If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens. :param stem: If true, then use word stems instead of word strings. RRtsentRRR(R R(RR RRRR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyt tagged_sentsLs c CsP|jrtn|j}tg|j|ƒD]}||||||ƒ^q+ƒS(sPA helper function that instantiates BNCWordViews or the list of words/sentences.(Rt BNCWordViewt_wordsRtabspaths(RR RRRRtftfileid((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR [sc CsJg}tj|ƒjƒ}x|jdƒD]}g} xÃt|ƒD]µ} | j} | sbd} n|sn|r}| jƒ} n|r˜| jd| ƒ} n|dkr¼| | jdƒf} n0|dkrì| | jd| jdƒƒf} n| j| ƒqDW|r#|jt |j d| ƒƒq+|j | ƒq+Wd|ksFt ‚|S(sÐ Helper used to implement the view methods -- returns a list of words or a list of sentences, optionally tagged. :param fileid: The name of the underlying file. :param bracket_sent: If true, include sentence bracketing. :param tag: The name of the tagset to use, or None for no tags. :param strip_space: If true, strip spaces from word tokens. :param stem: If true, then substitute stems for words. s.//stthwRRtnN(Rtparsetgetroottfindallt_all_xmlwords_inttexttstriptgettappendt BNCSentencetattribtextendR tAssertionError( RRt bracket_sentRRRtresulttxmldoctxmlsentRtxmlwordtword((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR`s,      $ N( t__name__t __module__t__doc__RRR R RRRRR R(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyRs   cCsV|dkrg}nx:|D]2}|jdkrA|j|ƒqt||ƒqW|S(Ntctw(R6R7(R RR(R$(teltR.tchild((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR$†s   R)cBseZdZd„ZRS(s‹ A list of words, augmented by an attribute ``num`` used to record the sentence identifier (the ``n`` attribute from the XML). cCs||_tj||ƒdS(N(tnumtlistR(RR:titems((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR–s (R3R4R5R(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR)‘sRc Bs_eZdZeddddddddgƒZd „Zd „Zd „Zd „Zd „Z RS(sN A stream backed corpus view specialized for use with the BNC corpus. tpbtgaptvocalteventtuncleartshifttpausetaligncCs±|rd}nd}||_||_||_||_d|_d|_d|_d|_t j |||ƒ|j ƒ|j |j d|jƒ|jƒidd6|_dS(sG :param fileid: The name of the underlying file. :param sent: If true, include sentence bracketing. :param tag: The name of the tagset to use, or None for no tags. :param strip_space: If true, strip spaces from word tokens. :param stem: If true, then substitute stems for words. s.*/ss.*/s/(.*/)?(c|w)s .*/teiHeader$iN((t_sentt_tagt _strip_spacet_stemR ttitletauthorteditortrespsRRt_opent read_blockt_streamt handle_headertcloset _tag_context(RRRRRRttagspec((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR©s            cCsÔ|jdƒ}|r4djd„|Dƒƒ|_n|jdƒ}|rhdjd„|Dƒƒ|_n|jdƒ}|rœdjd„|Dƒƒ|_n|jdƒ}|rÐd jd „|Dƒƒ|_ndS( NstitleStmt/titles css|]}|jjƒVqdS(N(R%R&(t.0RI((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pys ÍsstitleStmt/authorcss|]}|jjƒVqdS(N(R%R&(RTRJ((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pys ÑsstitleStmt/editorcss|]}|jjƒVqdS(N(R%R&(RTRK((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pys ÕsstitleStmt/respStmts css(|]}djd„|DƒƒVqdS(s css|]}|jjƒVqdS(N(R%R&(RTtresp_elt((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pys ÛsN(tjoin(RTtresp((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pys Ús(R#RVRIRJRKRL(RR8tcontextttitlestauthorsteditorsRL((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyRPÉscCs'|jr|j|ƒS|j|ƒSdS(N(REt handle_sentt handle_word(RR8RX((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyt handle_eltßs  cCsµ|j}|sd}n|js*|jr9|jƒ}n|jrW|jd|ƒ}n|jdkr~||jdƒf}n3|jdkr±||jd|jdƒƒf}n|S(NRRRR(R%RGRHR&R'RF(RR8R2((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR]ås   $cCs³g}x–|D]Ž}|jd krK|g|D]}|j|ƒ^q,7}q |jd krs|j|j|ƒƒq |j|jkr td|jƒ‚q q Wt|jd|ƒS( NtmwthitcorrttruncR7R6sUnexpected element %sR (R_shiRastrunc(R7R6(RR]R(ttags_to_ignoret ValueErrorR)R*(RR8RR9R7((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR\ós )( R3R4R5tsetRcRRPR^R]R\(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyR›s!   N( R5tnltk.corpus.reader.utilRtnltk.corpus.reader.xmldocsRRRRR R$R;R)R(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/bnc.pyts x