ó <¿CVc@sêdZddlmZmZddlZyddlmZWn!ek r_ddlmZnXddl m Z ddl m Z ddl mZdd lmZdd lmZdd lTd efd „ƒYZdefd„ƒYZdS(u‚ Corpus reader for corpora whose documents are xml files. (note -- not named 'xml' to avoid conflicting w/ standard xml package) iÿÿÿÿ(tprint_functiontunicode_literalsN(t cElementTree(t ElementTree(tcompat(tSeekableUnicodeStreamReader(tWordPunctTokenizer(tElementWrapper(t CorpusReader(t*tXMLCorpusReadercBs>eZdZed„Zdd„Zdd„Zdd„ZRS(u Corpus reader for corpora whose documents are xml files. Note that the ``XMLCorpusReader`` constructor does not take an ``encoding`` argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info. cCs ||_tj|||ƒdS(N(t _wrap_etreeRt__init__(tselftroottfileidst wrap_etree((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyR %s cCs’|dkr1t|jƒdkr1|jd}nt|tjƒsRtdƒ‚ntj|j |ƒj ƒƒj ƒ}|j rŽt |ƒ}n|S(Niiu(Expected a single file identifier string(tNonetlent_fileidst isinstanceRt string_typest TypeErrorRtparsetabspathtopentgetrootR R(R tfileidtelt((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pytxml)s!$ c Cs£|j|ƒ}|j|ƒ}tƒ}|jƒ}g}xc|D][}|j}|dk r@t|tƒr||j|ƒ}n|j |ƒ} |j | ƒq@q@W|S(uE Returns all of the words and punctuation symbols in the specified file that were in text nodes -- ie, tags are ignored. Like the xml() method, fileid can only specify one file. :return: the given file's text nodes as a list of words and punctuation symbols :rtype: list(str) N( RtencodingRt getiteratorttextRRtbytestdecodettokenizetextend( R RRRtword_tokenizertiteratortouttnodeR ttoks((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pytwords7s      cCsb|dkr|j}nt|tjƒr6|g}ntg|D]}|j|ƒjƒ^q@ƒS(N(RRRRRtconcatRtread(R Rtf((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pytrawPs   N( t__name__t __module__t__doc__tFalseR RRR*R.(((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyR s    t XMLCorpusViewcBs—eZdZeZdZd d„Zd„Zd„Z e j de j e j BƒZe j dƒZe j de j e j BƒZd„Zd d d „ZRS( um A corpus view that selects out specified elements from an XML file, and provides a flat list-like interface for accessing them. (Note: ``XMLCorpusView`` is not used by ``XMLCorpusReader`` itself, but may be used by subclasses of ``XMLCorpusReader``.) Every XML corpus view has a "tag specification", indicating what XML elements should be included in the view; and each (non-nested) element that matches this specification corresponds to one item in the view. Tag specifications are regular expressions over tag paths, where a tag path is a list of element tag names, separated by '/', indicating the ancestry of the element. Some examples: - ``'foo'``: A top-level element whose tag is ``foo``. - ``'foo/bar'``: An element whose tag is ``bar`` and whose parent is a top-level element whose tag is ``foo``. - ``'.*/foo'``: An element whose tag is ``foo``, appearing anywhere in the xml tree. - ``'.*/(foo|bar)'``: An wlement whose tag is ``foo`` or ``bar``, appearing anywhere in the xml tree. The view items are generated from the selected XML elements via the method ``handle_elt()``. By default, this method returns the element as-is (i.e., as an ElementTree object); but it can be overridden, either via subclassing or via the ``elt_handler`` constructor parameter. icCsa|r||_ntj|dƒ|_idd6|_|j|ƒ}tj||d|ƒdS(uW Create a new corpus view based on a specified XML file. Note that the ``XMLCorpusView`` constructor does not take an ``encoding`` argument, because the unicode encoding is specified by the XML files themselves. :type tagspec: str :param tagspec: A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view. :param elt_handler: A function used to transform each element to a value for the view. If no handler is specified, then ``self.handle_elt()`` is called, which returns the element as an ElementTree object. The signature of elt_handler is:: elt_handler(elt, tagspec) -> value u\ZiRN((t handle_elttretcompilet_tagspect _tag_contextt_detect_encodingtStreamBackedCorpusViewR (R Rttagspect elt_handlerR((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyR zs  cCst|tƒr$|jƒjƒ}n$t|dƒ}|jƒ}WdQX|jtjƒr^dS|jtjƒrtdS|jtjƒrŠdS|jtj ƒr dS|jtj ƒr¶dSt j d|ƒ}|rá|j dƒjƒSt j d |ƒ}|r |j dƒjƒSdS( Nurbu utf-16-beu utf-16-leu utf-32-beu utf-32-leuutf-8s!\s*<\?xml\b.*\bencoding="([^"]+)"is!\s*<\?xml\b.*\bencoding='([^']+)'(Rt PathPointerRtreadlinet startswithtcodecst BOM_UTF16_BEt BOM_UTF16_LEt BOM_UTF32_BEt BOM_UTF32_LEtBOM_UTF8R5tmatchtgroupR"(R Rtstinfiletm((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyR9s*cCs|S(u Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the ``elt_handler`` constructor argument, this method simply returns ``elt``. :return: The view value corresponding to ``elt``. :type elt: ElementTree :param elt: The element that should be converted. :type context: str :param context: A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string ``'foo/bar/baz'`` indicates that the element is a ``baz`` element whose parent is a ``bar`` element and whose grandparent is a top-level ``foo`` element. ((R Rtcontext((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyR4¶su; [^<]* ( (() | # comment () | # doctype decl (<[^!>][^>]*>)) # tag or PI [^<]*)* \Zu<\s*/?\s*([^\s>]+)u6 # Include these so we can skip them: (?P )| (?P )| (?P <\?.*?\?> )| (?P ]*(\[[^\]]*])?\s*>)| # These are the ones we actually care about: (?P <\s*[^>/\?!\s][^>]*/\s*> )| (?P <\s*[^>/\?!\s][^>]*> )| (?P <\s*/[^>/\?!\s][^>]*> )cCs_d}t|tƒr$|jƒ}nx4trZ|j|jƒ}||7}|jj|ƒr_|Stj d|ƒj dƒdkr¿|jƒt |ƒtj d|ƒj ƒ}t d|ƒ‚n|sÔt dƒ‚n|jdƒ}|dkr'|jj|| ƒrWt|tƒr1|j|ƒ|j|ƒn|jt |ƒ| dƒ|| Sq'q'Wd S( u{ Read a string from the given stream that does not contain any un-closed tags. In particular, this function first reads a block from the stream of size ``self._BLOCK_SIZE``. It then checks if that block contains an un-closed tag. If it does, then this function either backtracks to the last '<', or reads another block. uu[<>]iu>uUnexpected ">" near char %su&Unexpected end of file: tag not closeduiÿÿÿÿuUnmatched tag <%s>...u EMPTY_ELT_TAGiu i$u (backtrack)uasciiuxmlcharrefreplaceN( RR7R4tlistR8tgetRLtAssertionErrorRRR\RRt _XML_PIECEtfinditert_DEBUGtprinttjoinRGt _XML_TAG_NAMERFtappendR5tstartRRQtpopRTRUttupleRt fromstringtencode(R RVR;R<RKteltst elt_startt elt_depthtelt_textRXt xml_fragmenttpiecetnameRZR((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyt read_blocks„    *!   !   ! #(        "N(R/R0R1R2RbRNRR R9R4R5R6tDOTALLtVERBOSEROReR`R\Rs(((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyR3Vs #    0(R1t __future__RRR@t xml.etreeRRt ImportErrortnltkRt nltk.dataRt nltk.tokenizeRtnltk.internalsRtnltk.corpus.reader.apiRtnltk.corpus.reader.utilR R:R3(((sl/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/xmldocs.pyt s   9