ó <¿CVc@s dZddlZddlmZddlmZddlmZmZe Z yddl m Z e Z WnMek r·yddlj jZ Wq¸ek r³ddlj jZ q¸XnXddlZd„Zddd „ƒYZd dd „ƒYZd efd „ƒYZdS(s9 A reader for corpora whose documents are in MTE format. iÿÿÿÿN(treduce(tcompat(tconcattTaggedCorpusReader(tetreecCs-tr|j|d|ƒS|j||ƒSdS(Nt namespaces(t lxmlAvailabletxpathtfindall(troottpathtns((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRst MTEFileReadercBs!eZdZidd6dd6ZdZdZd„Zed„ƒZed „ƒZ ed „ƒZ ed „ƒZ ed d „ƒZ ed„ƒZ ed d„ƒZed„ƒZed d„ƒZd„Zd„Zd„Zd„Zd d„Zd„Zd d„Zd„Zd d„ZRS(s© Class for loading the content of the multext-east corpus. It parses the xml files and does some tag-filtering depending on the given method parameters. shttp://www.tei-c.org/ns/1.0tteis$http://www.w3.org/XML/1998/namespacetxmls{http://www.tei-c.org/ns/1.0}s&{http://www.w3.org/XML/1998/namespace}cCs5tj|ƒ}t|jƒd|jƒd|_dS(Ns./tei:text/tei:bodyi(RtparseRtgetrootR t_MTEFileReader__root(tselft file_pathttree((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt__init__&scCsUgt|d|jƒD];}|j|jdksH|j|jdkr|j^qS(Ns.//*twtc(RR ttagttag_nsttext(Rt text_rootR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt_words*scCs/gt|d|jƒD]}tj|ƒ^qS(Ns.//tei:s(RR R R(RRts((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt_sents/scCs/gt|d|jƒD]}tj|ƒ^qS(Ns.//tei:p(RR R R(RRtp((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt_paras3scCs6gt|d|jƒD]}|j|jdf^qS(Ns.//tei:wtlemma(RR Rtattrib(RRR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt _lemma_words7stcCsÄ|dks|dkrNgt|d|jƒD]}|j|jdf^q.Stjdtjdd|ƒdƒ}gt|d|jƒD]2}|j|jdƒrŠ|j|jdf^qŠSdS(NR$s.//tei:wtanat^t-t.s.*$( tNoneRR RR"tretcompiletsubtmatch(RRttagsR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt _tagged_words;s 6&cCs/gt|d|jƒD]}tj|ƒ^qS(Ns.//tei:s(RR R R#(RRR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt _lemma_sentsEscCsWggt|d|jƒD]}tj||ƒ^qD]}t|ƒdkr5|^q5S(Ns.//tei:si(RR R R/tlen(RRR.Rtt((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt _tagged_sentsIscCs/gt|d|jƒD]}tj|ƒ^qS(Ns.//tei:p(RR R R0(RRR((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt _lemma_parasNscCsWggt|d|jƒD]}tj||ƒ^qD]}t|ƒdkr5|^q5S(Ns.//tei:pi(RR R R3R1(RRR.RR2((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt _tagged_parasRscCstj|jƒS(N(R RR(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pytwordsWscCstj|jƒS(N(R RR(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pytsentsZscCstj|jƒS(N(R R R(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pytparas]scCstj|jƒS(N(R R#R(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt lemma_words`scCstj|j|ƒS(N(R R/R(RR.((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt tagged_wordscscCstj|jƒS(N(R R0R(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt lemma_sentsfscCstj|jƒS(N(R R3R(RR.((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt tagged_sentsiscCstj|jƒS(N(R R4R(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt lemma_paraslscCstj|jƒS(N(R R5R(RR.((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt tagged_parasos(t__name__t __module__t__doc__R Rtxml_nsRt classmethodRRR R#R/R0R3R4R5R6R7R8R9R:R;R<R=R>(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR s4          tMTETagConvertercBsweZdZi dd6dd6dd6dd6d d 6d d 6d d6dd6dd6dd6dd6dd6Zed„ƒZRS(su Class for converting msd tags to universal tags, more conversion options are currently not implemented. tADJtAtADPtStADVtRtCONJtCtDETtDtNOUNtNtNUMtMtPRTtQtPRONtPtVERBtVR(tXR'cCsG|ddks|dn|d}|tjkr<d}ntj|S(sö This function converts the annotation from the Multex-East to the universal tagset as described in Chapter 5 of the NLTK-Book Unknown Tags will be mapped to X. Punctuation marks are not supported in MSD tags, so it#iR'(RDtmapping_msd_universal(Rt indicator((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pytmsd_to_universal}s$ (R?R@RAR[t staticmethodR](((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRDrs tMTECorpusReadercBs¼eZdZdddd„Zd„Zd„Zdd„Zdd„Zdd„Z dd„Z dd „Z dd dd „Z dd „Z dd dd „Zdd„Zdd dd„ZRS(sæ Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset tutf8cCstj||||ƒdS(s/ Construct a new MTECorpusreader for a set of documents located at the given root directory. Example usage: >>> root = '/...path to corpus.../' >>> reader = MTECorpusReader(root, 'oana-*.xml', 'utf8') # doctest: +SKIP :param root: The root directory for this corpus. (default points to location in multext config file) :param fileids: A list or regexp specifying the fileids in this corpus. (default is oana-en.xml) :param enconding: The encoding of the given files (default is utf8) N(RR(RR tfileidstencoding((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR“s csr|dkrˆj}nt|tjƒr6|g}nt‡fd†|ƒ}td„|ƒ}|sndGHn|S(Ncs |ˆjkS(N(t_fileids(tx(R(sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt¥scSs |dkS(Ns oana-bg.xmls oana-mk.xml(s oana-bg.xmls oana-mk.xml((Rd((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe§ss$No valid multext-east file specified(R)Rct isinstanceRt string_typestfilter(RRa((Rsh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyt __fileids¡s  cCs|jdƒjƒS(s‰ Prints some information about this corpus. :return: the content of the attached README file :rtype: str s 00README.txt(topentread(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pytreadme¬scCs5tg|j|ƒD]}|j|ƒjƒ^qƒS(sœ :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a single string. :rtype: str (Rt_MTECorpusReader__fileidsRjRk(RRatf((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pytraw´scCsMtd„g|j|ƒD]*}ttjj|j|ƒƒjƒ^qgƒS(sº :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a list of words and punctuation symbols. :rtype: list(str) cSs||S(N((tatb((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReÂs(RRmR tosR tjoint_rootR6(RRaRn((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR6¼scCsMtd„g|j|ƒD]*}ttjj|j|ƒƒjƒ^qgƒS(sò :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings :rtype: list(list(str)) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReËs(RRmR RrR RsRtR7(RRaRn((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR7ÄscCsMtd„g|j|ƒD]*}ttjj|j|ƒƒjƒ^qgƒS(s :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word string :rtype: list(list(list(str))) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReÔs(RRmR RrR RsRtR8(RRaRn((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR8ÍscCsMtd„g|j|ƒD]*}ttjj|j|ƒƒjƒ^qgƒS(s :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a list of words, the corresponding lemmas and punctuation symbols, encoded as tuples (word, lemma) :rtype: list(tuple(str,str)) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReÝs(RRmR RrR RsRtR9(RRaRn((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR9ÖstmsdcCsŠtd„g|j|ƒD]0}ttjj|j|ƒƒjd|ƒ^qgƒ}|dkrqtd„|ƒS|dkr|SdGHdS(s8 :param fileids: A list specifying the fileids that should be used. :param tagset: The tagset that should be used in the returned object, either "universal" or "msd", "msd" is the default :param tags: An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag :return: the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag) :rtype: list(tuple(str, str)) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReêsR.t universalcSs|dtj|dƒfS(Nii(RDR](twt((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReìsRusUnknown tagset specified.N( RRmR RrR RsRtR:tmap(RRattagsetR.RnR6((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR:ßs U  cCsMtd„g|j|ƒD]*}ttjj|j|ƒƒjƒ^qgƒS(s? :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a list of sentences or utterances, each encoded as a list of tuples of the word and the corresponding lemma (word, lemma) :rtype: list(list(tuple(str, str))) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyReús(RRmR RrR RsRtR;(RRaRn((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR;òscCsŠtd„g|j|ƒD]0}ttjj|j|ƒƒjd|ƒ^qgƒ}|dkrqtd„|ƒS|dkr|SdGHdS(sE :param fileids: A list specifying the fileids that should be used. :param tagset: The tagset that should be used in the returned object, either "universal" or "msd", "msd" is the default :param tags: An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag :return: the given file(s) as a list of sentences or utterances, each each encoded as a list of (word,tag) tuples :rtype: list(list(tuple(str, str))) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyResR.RvcSstd„|ƒS(NcSs|dtj|dƒfS(Nii(RDR](Rw((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe s(Rx(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe sRusUnknown tagset specified.N( RRmR RrR RsRtR<Rx(RRaRyR.RnR7((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR<ýs U  cCsMtd„g|j|ƒD]*}ttjj|j|ƒƒjƒ^qgƒS(sj :param fileids: A list specifying the fileids that should be used. :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of tuples of the word and the corresponding lemma (word, lemma) :rtype: list(List(List(tuple(str, str)))) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRes(RRmR RrR RsRtR=(RRaRn((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR=scCsŠtd„g|j|ƒD]0}ttjj|j|ƒƒjd|ƒ^qgƒ}|dkrqtd„|ƒS|dkr|SdGHdS(s| :param fileids: A list specifying the fileids that should be used. :param tagset: The tagset that should be used in the returned object, either "universal" or "msd", "msd" is the default :param tags: An MSD Tag that is used to filter all parts of the used corpus that are not more precise or at least equal to the given tag :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word,tag) tuples :rtype: list(list(list(tuple(str, str)))) cSs||S(N((RpRq((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe&sR.RvcSstd„|ƒS(NcSstd„|ƒS(NcSs|dtj|dƒfS(Ni(RDR](Rw((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe(s(Rx(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe(s(Rx(R((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyRe(sRusUnknown tagset specified.N( RRmR RrR RsRtR>Rx(RRaRyR.RnR8((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR>s U  N(R?R@RAR)RRmRlRoR6R7R8R9R:R;R<R=R>(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyR_Œs      (((RARrt functoolsRtnltkRtnltk.corpus.readerRRtFalseRtlxmlRtTruet ImportErrortxml.etree.cElementTreet cElementTreetxml.etree.ElementTreet ElementTreeR*RR RDR_(((sh/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/corpus/reader/mte.pyts$      V