"""
This module brings together a variety of NLTK functionality for
text analysis, and provides simple, interactive interfaces.
Functionality includes: concordancing, collocation discovery,
regular expression search over tokenized strings, and
distributional similarity.
"""
from __future__ import print_function, division, unicode_literals

from math import log
from collections import defaultdict
from functools import reduce
from itertools import islice
import re

from nltk.probability import FreqDist, LidstoneProbDist
from nltk.probability import ConditionalFreqDist as CFD
from nltk.util import tokenwrap, LazyConcatenation
from nltk.metrics import f_measure, BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.compat import python_2_unicode_compatible, text_type, Counter


class ContextIndex(object):
    """
    A bidirectional index between words and their 'contexts' in a text.
    The context of a word is usually defined to be the words that occur
    in a fixed window around the word; but other definitions may also
    be used by providing a custom context function.
    """
    @staticmethod
    def _default_context(tokens, i):
        """One left token and one right token, normalized to lowercase"""
        left = (tokens[i-1].lower() if i != 0 else '*START*')
        right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
        return (left, right)

    def __init__(self, tokens, context_func=None, filter=None, key=lambda x: x):
        self._key = key
        self._tokens = tokens
        if context_func:
            self._context_func = context_func
        else:
            self._context_func = self._default_context
        if filter:
            tokens = [t for t in tokens if filter(t)]
        self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                                     for i, w in enumerate(tokens))
        self._context_to_words = CFD((self._context_func(tokens, i), self._key(w))
                                     for i, w in enumerate(tokens))

    def tokens(self):
        """
        :rtype: list(str)
        :return: The document that this context index was created from.
        """
        return self._tokens

    def word_similarity_dict(self, word):
        """
        Return a dictionary mapping from words to 'similarity scores,'
        indicating how often these two words occur in the same context.
        """
        word = self._key(word)
        word_contexts = set(self._word_to_contexts[word])

        scores = {}
        for w, w_contexts in self._word_to_contexts.items():
            scores[w] = f_measure(word_contexts, set(w_contexts))

        return scores

    def similar_words(self, word, n=20):
        scores = defaultdict(int)
        for c in self._word_to_contexts[self._key(word)]:
            for w in self._context_to_words[c]:
                if w != word:
                    scores[w] += (self._context_to_words[c][word] *
                                  self._context_to_words[c][w])
        return sorted(scores, key=scores.get, reverse=True)[:n]

    def common_contexts(self, words, fail_on_unknown=False):
        """
        Find contexts where the specified words can all appear; and
        return a frequency distribution mapping each context to the
        number of times that context was used.

        :param words: The words used to seed the similarity search
        :type words: str
        :param fail_on_unknown: If true, then raise a value error if
            any of the given words do not occur at all in the index.
        """
        words = [self._key(w) for w in words]
        contexts = [set(self._word_to_contexts[w]) for w in words]

        empty = [words[i] for i in range(len(words)) if not contexts[i]]
        common = reduce(set.intersection, contexts)
        if empty and fail_on_unknown:
            raise ValueError("The following word(s) were not found:",
                             " ".join(words))
        elif not common:
            # nothing in common -- just return an empty freqdist.
            return FreqDist()
        else:
            fd = FreqDist(c for w in words
                          for c in self._word_to_contexts[w]
                          if c in common)
            return fd
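# A minimal usage sketch for ``ContextIndex`` (illustrative only -- the token
# list below is hypothetical, not taken from any corpus):
#
#     >>> tokens = ['The', 'cat', 'sat', 'and', 'the', 'dog', 'sat', '.']
#     >>> ci = ContextIndex(tokens, key=lambda s: s.lower())
#     >>> ci.similar_words('cat')        # 'dog' shares the ('the', 'sat') context
#     >>> ci.common_contexts(['cat', 'dog'])  # FreqDist({('the', 'sat'): 2})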
@python_2_unicode_compatible
class ConcordanceIndex(object):
    """
    An index that can be used to look up the offset locations at which
    a given word occurs in a document.
    """
    def __init__(self, tokens, key=lambda x: x):
        """
        Construct a new concordance index.

        :param tokens: The document (list of tokens) that this
            concordance index was created from.  This list can be used
            to access the context of a given word occurrence.
        :param key: A function that maps each token to a normalized
            version that will be used as a key in the index.  E.g., if
            you use ``key=lambda s: s.lower()``, then the index will be
            case-insensitive.
        """
        self._tokens = tokens
        self._key = key
        self._offsets = defaultdict(list)

        # Initialize the index (self._offsets)
        for index, word in enumerate(tokens):
            word = self._key(word)
            self._offsets[word].append(index)

    def tokens(self):
        """
        :rtype: list(str)
        :return: The document that this concordance index was created from.
        """
        return self._tokens

    def offsets(self, word):
        """
        :rtype: list(int)
        :return: A list of the offset positions at which the given
            word occurs.  If a key function was specified for the
            index, then given word's key will be looked up.
        """
        word = self._key(word)
        return self._offsets[word]

    def __repr__(self):
        return '<ConcordanceIndex for %d tokens (%d types)>' % (
            len(self._tokens), len(self._offsets))

    def print_concordance(self, word, width=75, lines=25):
        """
        Print a concordance for ``word`` with the specified context window.

        :param word: The target word
        :type word: str
        :param width: The width of each line, in characters (default=75)
        :type width: int
        :param lines: The number of lines to display (default=25)
        :type lines: int
        """
        half_width = (width - len(word) - 2) // 2
        context = width // 4  # approx number of words of context

        offsets = self.offsets(word)
        if offsets:
            lines = min(lines, len(offsets))
            print("Displaying %s of %s matches:" % (lines, len(offsets)))
            for i in offsets:
                if lines <= 0:
                    break
                left = (' ' * half_width +
                        ' '.join(self._tokens[i-context:i]))
                right = ' '.join(self._tokens[i+1:i+context])
                left = left[-half_width:]
                right = right[:half_width]
                print(left, self._tokens[i], right)
                lines -= 1
        else:
            print("No matches")


class TokenSearcher(object):
    """
    A class that makes it easier to use regular expressions to search
    over tokenized strings.  The tokenized string is converted to a
    string where tokens are marked with angle brackets -- e.g.,
    ``'<the><window><is><still><open>'``.  The regular expression
    passed to the ``findall()`` method is modified to treat angle
    brackets as non-capturing parentheses, in addition to matching the
    token boundaries; and to have ``'.'`` not match the angle brackets.
    """
    def __init__(self, tokens):
        self._raw = ''.join('<' + w + '>' for w in tokens)

    def findall(self, regexp):
        """
        Find instances of the regular expression in the text.
        The text is a list of tokens, and a regexp pattern to match
        a single token must be surrounded by angle brackets.  E.g.

        >>> from nltk.text import TokenSearcher
        >>> print('hack'); from nltk.book import text1, text5, text9
        hack...
        >>> text5.findall("<.*><.*><bro>")
        you rule bro; telling you bro; u twizted bro
        >>> text1.findall("<a>(<.*>)<man>")
        monied; nervous; dangerous; white; white; white; pious; queer; good;
        mature; white; Cape; great; wise; wise; butterless; white; fiendish;
        pale; furious; better; certain; complete; dismasted; younger; brave;
        brave; brave; brave
        >>> text9.findall("<th.*>{3,}")
        thread through those; the thought that; that the thing; the thing
        that; that that thing; through these than through; them that the;
        through the thick; them that they; thought that the

        :param regexp: A regular expression
        :type regexp: str
        """
        # preprocess the regular expression
        regexp = re.sub(r'\s', '', regexp)
        regexp = re.sub(r'<', '(?:<(?:', regexp)
        regexp = re.sub(r'>', ')>)', regexp)
        regexp = re.sub(r'(?<!\\)\.', '[^>]', regexp)

        # perform the search
        hits = re.findall(regexp, self._raw)

        # sanity check: each hit must span whole tokens
        for h in hits:
            if not h.startswith('<') and h.endswith('>'):
                raise ValueError('Bad regexp for TokenSearcher.findall')

        # postprocess the output: strip the outer brackets and split
        # each hit back into a list of tokens
        hits = [h[1:-1].split('><') for h in hits]
        return hits
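# A worked sketch of the pattern rewriting done by ``TokenSearcher.findall``
# above: the four ``re.sub`` calls turn
#
#     <a>(<.*>)<man>
#
# into
#
#     (?:<(?:a)>)((?:<(?:[^>]*)>))(?:<(?:man)>)
#
# Each '<'/'>' pair becomes a pair of non-capturing groups anchored at the
# literal token brackets, and '.' is narrowed to [^>] so that it can never
# match across token boundaries.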
@python_2_unicode_compatible
class Text(object):
    """
    A wrapper around a sequence of simple (string) tokens, which is
    intended to support initial exploration of texts (via the
    interactive console).  Its methods perform a variety of analyses
    on the text's contexts (e.g., counting, concordancing, collocation
    discovery), and display the results.  If you wish to write a
    program which makes use of these analyses, then you should bypass
    the ``Text`` class, and use the appropriate analysis function or
    class directly instead.

    A ``Text`` is typically initialized from a given document or
    corpus.  E.g.:

    >>> import nltk.corpus
    >>> from nltk.text import Text
    >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

    """
    # Copy the token sequence into a list when the Text is constructed.
    _COPY_TOKENS = True

    def __init__(self, tokens, name=None):
        """
        Create a Text object.

        :param tokens: The source text.
        :type tokens: sequence of str
        """
        if self._COPY_TOKENS:
            tokens = list(tokens)
        self.tokens = tokens

        if name:
            self.name = name
        elif ']' in tokens[:20]:
            end = tokens[:20].index(']')
            self.name = " ".join(text_type(tok) for tok in tokens[1:end])
        else:
            self.name = " ".join(text_type(tok) for tok in tokens[:8]) + "..."

    #////////////////////////////////////////////////////////////
    # Support item & slice access
    #////////////////////////////////////////////////////////////

    def __getitem__(self, i):
        if isinstance(i, slice):
            return self.tokens[i.start:i.stop]
        else:
            return self.tokens[i]

    def __len__(self):
        return len(self.tokens)

    #////////////////////////////////////////////////////////////
    # Interactive console methods
    #////////////////////////////////////////////////////////////

    def concordance(self, word, width=79, lines=25):
        """
        Print a concordance for ``word`` with the specified context window.
        Word matching is not case-sensitive.

        :seealso: ``ConcordanceIndex``
        """
        if '_concordance_index' not in self.__dict__:
            self._concordance_index = ConcordanceIndex(self.tokens,
                                                       key=lambda s: s.lower())

        self._concordance_index.print_concordance(word, width, lines)

    def collocations(self, num=20, window_size=2):
        """
        Print collocations derived from the text, ignoring stopwords.

        :seealso: find_collocations
        :param num: The maximum number of collocations to print.
        :type num: int
        :param window_size: The number of tokens spanned by a collocation
            (default=2)
        :type window_size: int
        """
        if not ('_collocations' in self.__dict__ and
                self._num == num and
                self._window_size == window_size):
            self._num = num
            self._window_size = window_size

            from nltk.corpus import stopwords
            ignored_words = stopwords.words('english')
            finder = BigramCollocationFinder.from_words(self.tokens, window_size)
            finder.apply_freq_filter(2)
            finder.apply_word_filter(lambda w: len(w) < 3 or
                                     w.lower() in ignored_words)
            bigram_measures = BigramAssocMeasures()
            self._collocations = finder.nbest(bigram_measures.likelihood_ratio,
                                              num)
        colloc_strings = [w1 + ' ' + w2 for w1, w2 in self._collocations]
        print(tokenwrap(colloc_strings, separator="; "))

    def count(self, word):
        """
        Count the number of times this word appears in the text.
        """
        return self.tokens.count(word)

    def index(self, word):
        """
        Find the index of the first occurrence of the word in the text.
        """
        return self.tokens.index(word)

    def readability(self, method):
        raise NotImplementedError
    def similar(self, word, num=20):
        """
        Distributional similarity: find other words which appear in the
        same contexts as the specified word; list most similar words first.

        :param word: The word used to seed the similarity search
        :type word: str
        :param num: The number of words to generate (default=20)
        :type num: int
        :seealso: ContextIndex.similar_words()
        """
        if '_word_context_index' not in self.__dict__:
            self._word_context_index = ContextIndex(self.tokens,
                                                    filter=lambda x: x.isalpha(),
                                                    key=lambda s: s.lower())

        word = word.lower()
        wci = self._word_context_index._word_to_contexts
        if word in wci.conditions():
            contexts = set(wci[word])
            fd = Counter(w for w in wci.conditions()
                         for c in wci[w]
                         if c in contexts and not w == word)
            words = [w for w, _ in fd.most_common(num)]
            print(tokenwrap(words))
        else:
            print("No matches")

    def common_contexts(self, words, num=20):
        """
        Find contexts where the specified words appear; list
        most frequent common contexts first.

        :param words: The words used to seed the similarity search
        :type words: list(str)
        :param num: The number of contexts to display (default=20)
        :type num: int
        :seealso: ContextIndex.common_contexts()
        """
        if '_word_context_index' not in self.__dict__:
            self._word_context_index = ContextIndex(self.tokens,
                                                    key=lambda s: s.lower())

        try:
            fd = self._word_context_index.common_contexts(words, True)
            if not fd:
                print("No common contexts were found")
            else:
                ranked_contexts = [w for w, _ in fd.most_common(num)]
                print(tokenwrap(w1 + "_" + w2 for w1, w2 in ranked_contexts))

        except ValueError as e:
            print(e)

    def dispersion_plot(self, words):
        """
        Produce a plot showing the distribution of the words through the text.
        Requires pylab to be installed.

        :param words: The words to be plotted
        :type words: list(str)
        :seealso: nltk.draw.dispersion_plot()
        """
        from nltk.draw import dispersion_plot
        dispersion_plot(self, words)

    def plot(self, *args):
        """
        See documentation for FreqDist.plot()

        :seealso: nltk.prob.FreqDist.plot()
        """
        self.vocab().plot(*args)

    def vocab(self):
        """
        :seealso: nltk.prob.FreqDist
        """
        if "_vocab" not in self.__dict__:
            self._vocab = FreqDist(self)
        return self._vocab

    def findall(self, regexp):
        """
        Find instances of the regular expression in the text.
        The text is a list of tokens, and a regexp pattern to match
        a single token must be surrounded by angle brackets.  E.g.

        >>> print('hack'); from nltk.book import text1, text5, text9
        hack...
        >>> text5.findall("<.*><.*><bro>")
        you rule bro; telling you bro; u twizted bro
        >>> text1.findall("<a>(<.*>)<man>")
        monied; nervous; dangerous; white; white; white; pious; queer; good;
        mature; white; Cape; great; wise; wise; butterless; white; fiendish;
        pale; furious; better; certain; complete; dismasted; younger; brave;
        brave; brave; brave
        >>> text9.findall("<th.*>{3,}")
        thread through those; the thought that; that the thing; the thing
        that; that that thing; through these than through; them that the;
        through the thick; them that they; thought that the

        :param regexp: A regular expression
        :type regexp: str
        """
        if "_token_searcher" not in self.__dict__:
            self._token_searcher = TokenSearcher(self)

        hits = self._token_searcher.findall(regexp)
        hits = [' '.join(h) for h in hits]
        print(tokenwrap(hits, "; "))
    #////////////////////////////////////////////////////////////
    # Helper Methods
    #////////////////////////////////////////////////////////////

    _CONTEXT_RE = re.compile(r'\w+|[\.\!\?]')

    def _context(self, tokens, i):
        """
        One left & one right token, both case-normalized.  Skip over
        non-sentence-final punctuation.  Used by the ``ContextIndex``
        that is created for ``similar()`` and ``common_contexts()``.
        """
        # Left context
        j = i - 1
        while j >= 0 and not self._CONTEXT_RE.match(tokens[j]):
            j -= 1
        left = (tokens[j] if j != 0 else '*START*')

        # Right context
        j = i + 1
        while j < len(tokens) and not self._CONTEXT_RE.match(tokens[j]):
            j += 1
        right = (tokens[j] if j != len(tokens) else '*END*')

        return (left, right)

    #////////////////////////////////////////////////////////////
    # String Display
    #////////////////////////////////////////////////////////////

    def __str__(self):
        return '<Text: %s>' % self.name

    def __repr__(self):
        return '<Text: %s>' % self.name


class TextCollection(Text):
    """A collection of texts, which can be loaded with list of texts, or
    with a corpus consisting of one or more texts, and which supports
    counting, concordancing, collocation discovery, etc.  Initialize a
    TextCollection as follows:

    >>> import nltk.corpus
    >>> from nltk.text import TextCollection
    >>> print('hack'); from nltk.book import text1, text2, text3
    hack...
    >>> gutenberg = TextCollection(nltk.corpus.gutenberg)
    >>> mytexts = TextCollection([text1, text2, text3])

    Iterating over a TextCollection produces all the tokens of all the
    texts in order.
    """
    def __init__(self, source):
        if hasattr(source, 'words'):  # bridge to the text corpus reader
            source = [source.words(f) for f in source.fileids()]

        self._texts = source
        Text.__init__(self, LazyConcatenation(source))
        self._idf_cache = {}

    def tf(self, term, text):
        """ The frequency of the term in text. """
        return text.count(term) / len(text)

    def idf(self, term):
        """ The number of texts in the corpus divided by the
        number of texts that the term appears in.
        If a term does not appear in the corpus, 0.0 is returned. """
        # idf values are cached for performance.
        idf = self._idf_cache.get(term)
        if idf is None:
            matches = len([True for text in self._texts if term in text])
            idf = (log(len(self._texts) / matches) if matches else 0.0)
            self._idf_cache[term] = idf
        return idf

    def tf_idf(self, term, text):
        return self.tf(term, text) * self.idf(term)


def demo():
    from nltk.corpus import brown
    text = Text(brown.words(categories='news'))
    print(text)
    print()
    print("Concordance:")
    text.concordance('news')
    print()
    print("Distributionally similar words:")
    text.similar('news')
    print()
    print("Collocations:")
    text.collocations()
    print()
    print("Dispersion plot:")
    text.dispersion_plot(['news', 'report', 'said', 'announced'])
    print()
    print("Vocabulary plot:")
    text.plot(50)
    print()
    print("Indexing:")
    print("text[3]:", text[3])
    print("text[3:5]:", text[3:5])
    print("text.vocab()['news']:", text.vocab()['news'])


if __name__ == '__main__':
    demo()

__all__ = ["ContextIndex",
           "ConcordanceIndex",
           "TokenSearcher",
           "Text",
           "TextCollection"]
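# A tf-idf usage sketch.  ``_tf_idf_sketch`` is a hypothetical helper added
# for illustration (it is not part of the module API above), and it assumes
# the Gutenberg corpus has been fetched with nltk.download('gutenberg').
def _tf_idf_sketch():
    from nltk.corpus import gutenberg

    corpus = TextCollection(gutenberg)  # one token stream per corpus file
    moby = Text(gutenberg.words('melville-moby_dick.txt'))

    print(corpus.tf('whale', moby))      # term count / document length
    print(corpus.idf('whale'))           # log(#texts / #texts containing term)
    print(corpus.tf_idf('whale', moby))  # product of the two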