"""
This module brings together a variety of NLTK functionality for
text analysis, and provides simple, interactive interfaces.
Functionality includes: concordancing, collocation discovery,
regular expression search over tokenized strings, and
distributional similarity.
"""
from __future__ import print_function, division, unicode_literals

from math import log
from collections import defaultdict
from functools import reduce
from itertools import islice
import re

from nltk.probability import FreqDist, LidstoneProbDist
from nltk.probability import ConditionalFreqDist as CFD
from nltk.util import tokenwrap, LazyConcatenation
from nltk.metrics import f_measure, BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.compat import python_2_unicode_compatible, text_type, Counter


class ContextIndex(object):
    """
    A bidirectional index between words and their 'contexts' in a text.
    The context of a word is usually defined to be the words that occur
    in a fixed window around the word; but other definitions may also
    be used by providing a custom context function.
    """
    @staticmethod
    def _default_context(tokens, i):
        """One left token and one right token, normalized to lowercase"""
        left = (tokens[i-1].lower() if i != 0 else '*START*')
        right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
        return (left, right)

    def __init__(self, tokens, context_func=None, filter=None, key=lambda x: x):
        self._key = key
        self._tokens = tokens
        if context_func:
            self._context_func = context_func
        else:
            self._context_func = self._default_context
        if filter:
            tokens = [t for t in tokens if filter(t)]
        self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                                     for i, w in enumerate(tokens))
        self._context_to_words = CFD((self._context_func(tokens, i), self._key(w))
                                     for i, w in enumerate(tokens))

    def tokens(self):
        """
        :rtype: list(str)
        :return: The document that this context index was created from.
        """
        return self._tokens

    def word_similarity_dict(self, word):
        """
        Return a dictionary mapping from words to 'similarity scores,'
        indicating how often these two words occur in the same context.
        """
        word = self._key(word)
        word_contexts = set(self._word_to_contexts[word])

        scores = {}
        for w, w_contexts in self._word_to_contexts.items():
            scores[w] = f_measure(word_contexts, set(w_contexts))

        return scores

    def similar_words(self, word, n=20):
        scores = defaultdict(int)
        for c in self._word_to_contexts[self._key(word)]:
            for w in self._context_to_words[c]:
                if w != word:
                    scores[w] += (self._context_to_words[c][word] *
                                  self._context_to_words[c][w])
        return sorted(scores, key=scores.get, reverse=True)[:n]

    def common_contexts(self, words, fail_on_unknown=False):
        """
        Find contexts where the specified words can all appear; and
        return a frequency distribution mapping each context to the
        number of times that context was used.

        :param words: The words used to seed the similarity search
        :type words: str
        :param fail_on_unknown: If true, then raise a value error if
            any of the given words do not occur at all in the index.
        """
        words = [self._key(w) for w in words]
        contexts = [set(self._word_to_contexts[w]) for w in words]

        empty = [words[i] for i in range(len(words)) if not contexts[i]]
        common = reduce(set.intersection, contexts)
        if empty and fail_on_unknown:
            raise ValueError("The following word(s) were not found:",
                             " ".join(words))
        elif not common:
            # nothing in common -- just return an empty freqdist.
            return FreqDist()
        else:
            fd = FreqDist(c for w in words
                          for c in self._word_to_contexts[w]
                          if c in common)
            return fd
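# A minimal usage sketch for ``ContextIndex`` (illustrative only -- the token
# list below is hypothetical, not taken from any corpus):
#
#     >>> tokens = ['The', 'cat', 'sat', 'and', 'the', 'dog', 'sat', '.']
#     >>> ci = ContextIndex(tokens, key=lambda s: s.lower())
#     >>> ci.similar_words('cat')        # 'dog' shares the ('the', 'sat') context
#     >>> ci.common_contexts(['cat', 'dog'])  # FreqDist({('the', 'sat'): 2})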
@python_2_unicode_compatible
class ConcordanceIndex(object):
    """
    An index that can be used to look up the offset locations at which
    a given word occurs in a document.
    """
    def __init__(self, tokens, key=lambda x: x):
        """
        Construct a new concordance index.

        :param tokens: The document (list of tokens) that this
            concordance index was created from.  This list can be used
            to access the context of a given word occurrence.
        :param key: A function that maps each token to a normalized
            version that will be used as a key in the index.  E.g., if
            you use ``key=lambda s: s.lower()``, then the index will be
            case-insensitive.
        """
        self._tokens = tokens
        self._key = key
        self._offsets = defaultdict(list)

        # Initialize the index (self._offsets)
        for index, word in enumerate(tokens):
            word = self._key(word)
            self._offsets[word].append(index)

    def tokens(self):
        """
        :rtype: list(str)
        :return: The document that this concordance index was created from.
        """
        return self._tokens

    def offsets(self, word):
        """
        :rtype: list(int)
        :return: A list of the offset positions at which the given
            word occurs.  If a key function was specified for the
            index, then given word's key will be looked up.
        """
        word = self._key(word)
        return self._offsets[word]

    def __repr__(self):
        return '<ConcordanceIndex for %d tokens (%d types)>' % (
            len(self._tokens), len(self._offsets))

    def print_concordance(self, word, width=75, lines=25):
        """
        Print a concordance for ``word`` with the specified context window.

        :param word: The target word
        :type word: str
        :param width: The width of each line, in characters (default=75)
        :type width: int
        :param lines: The number of lines to display (default=25)
        :type lines: int
        """
        half_width = (width - len(word) - 2) // 2
        context = width // 4  # approx number of words of context

        offsets = self.offsets(word)
        if offsets:
            lines = min(lines, len(offsets))
            print("Displaying %s of %s matches:" % (lines, len(offsets)))
            for i in offsets:
                if lines <= 0:
                    break
                left = (' ' * half_width +
                        ' '.join(self._tokens[i-context:i]))
                right = ' '.join(self._tokens[i+1:i+context])
                left = left[-half_width:]
                right = right[:half_width]
                print(left, self._tokens[i], right)
                lines -= 1
        else:
            print("No matches")


class TokenSearcher(object):
    """
    A class that makes it easier to use regular expressions to search
    over tokenized strings.  The tokenized string is converted to a
    string where tokens are marked with angle brackets -- e.g.,
    ``'<the><window><is><still><open>'``.  The regular expression
    passed to the ``findall()`` method is modified to treat angle
    brackets as non-capturing parentheses, in addition to matching the
    token boundaries; and to have ``'.'`` not match the angle brackets.
    """
    def __init__(self, tokens):
        self._raw = ''.join('<' + w + '>' for w in tokens)

    def findall(self, regexp):
        """
        Find instances of the regular expression in the text.
        The text is a list of tokens, and a regexp pattern to match
        a single token must be surrounded by angle brackets.  E.g.

        >>> from nltk.text import TokenSearcher
        >>> print('hack'); from nltk.book import text1, text5, text9
        hack...
        >>> text5.findall("<.*><.*><bro>")
        you rule bro; telling you bro; u twizted bro
        >>> text1.findall("<a>(<.*>)<man>")
        monied; nervous; dangerous; white; white; white; pious; queer; good;
        mature; white; Cape; great; wise; wise; butterless; white; fiendish;
        pale; furious; better; certain; complete; dismasted; younger; brave;
        brave; brave; brave
        >>> text9.findall("<th.*>{3,}")
        thread through those; the thought that; that the thing; the thing
        that; that that thing; through these than through; them that the;
        through the thick; them that they; thought that the

        :param regexp: A regular expression
        :type regexp: str
        """
        # preprocess the regular expression
        regexp = re.sub(r'\s', '', regexp)
        regexp = re.sub(r'<', '(?:<(?:', regexp)
        regexp = re.sub(r'>', ')>)', regexp)
        regexp = re.sub(r'(?<!\\)\.', '[^>]', regexp)

        # perform the search
        hits = re.findall(regexp, self._raw)

        # sanity check: each hit must span whole tokens
        for h in hits:
            if not h.startswith('<') and h.endswith('>'):
                raise ValueError('Bad regexp for TokenSearcher.findall')

        # postprocess the output: strip the outer brackets and split
        # each hit back into a list of tokens
        hits = [h[1:-1].split('><') for h in hits]
        return hits
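# A worked sketch of the pattern rewriting done by ``TokenSearcher.findall``
# above: the four ``re.sub`` calls turn
#
#     <a>(<.*>)<man>
#
# into
#
#     (?:<(?:a)>)((?:<(?:[^>]*)>))(?:<(?:man)>)
#
# Each '<'/'>' pair becomes a pair of non-capturing groups anchored at the
# literal token brackets, and '.' is narrowed to [^>] so that it can never
# match across token boundaries.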
@python_2_unicode_compatible
class Text(object):
    """
    A wrapper around a sequence of simple (string) tokens, which is
    intended to support initial exploration of texts (via the
    interactive console).  Its methods perform a variety of analyses
    on the text's contexts (e.g., counting, concordancing, collocation
    discovery), and display the results.  If you wish to write a
    program which makes use of these analyses, then you should bypass
    the ``Text`` class, and use the appropriate analysis function or
    class directly instead.

    A ``Text`` is typically initialized from a given document or
    corpus.  E.g.:

    >>> import nltk.corpus
    >>> from nltk.text import Text
    >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

    """
    # Copy the token sequence into a list when the Text is constructed.
    _COPY_TOKENS = True

    def __init__(self, tokens, name=None):
        """
        Create a Text object.

        :param tokens: The source text.
        :type tokens: sequence of str
        """
        if self._COPY_TOKENS:
            tokens = list(tokens)
        self.tokens = tokens

        if name:
            self.name = name
        elif ']' in tokens[:20]:
            end = tokens[:20].index(']')
            self.name = " ".join(text_type(tok) for tok in tokens[1:end])
        else:
            self.name = " ".join(text_type(tok) for tok in tokens[:8]) + "..."

    #////////////////////////////////////////////////////////////
    # Support item & slice access
    #////////////////////////////////////////////////////////////

    def __getitem__(self, i):
        if isinstance(i, slice):
            return self.tokens[i.start:i.stop]
        else:
            return self.tokens[i]

    def __len__(self):
        return len(self.tokens)

    #////////////////////////////////////////////////////////////
    # Interactive console methods
    #////////////////////////////////////////////////////////////

    def concordance(self, word, width=79, lines=25):
        """
        Print a concordance for ``word`` with the specified context window.
        Word matching is not case-sensitive.

        :seealso: ``ConcordanceIndex``
        """
        if '_concordance_index' not in self.__dict__:
            self._concordance_index = ConcordanceIndex(self.tokens,
                                                       key=lambda s: s.lower())

        self._concordance_index.print_concordance(word, width, lines)

    def collocations(self, num=20, window_size=2):
        """
        Print collocations derived from the text, ignoring stopwords.

        :seealso: find_collocations
        :param num: The maximum number of collocations to print.
        :type num: int
        :param window_size: The number of tokens spanned by a collocation
            (default=2)
        :type window_size: int
        """
        if not ('_collocations' in self.__dict__ and
                self._num == num and
                self._window_size == window_size):
            self._num = num
            self._window_size = window_size

            from nltk.corpus import stopwords
            ignored_words = stopwords.words('english')
            finder = BigramCollocationFinder.from_words(self.tokens, window_size)
            finder.apply_freq_filter(2)
            finder.apply_word_filter(lambda w: len(w) < 3 or
                                     w.lower() in ignored_words)
            bigram_measures = BigramAssocMeasures()
            self._collocations = finder.nbest(bigram_measures.likelihood_ratio,
                                              num)
        colloc_strings = [w1 + ' ' + w2 for w1, w2 in self._collocations]
        print(tokenwrap(colloc_strings, separator="; "))

    def count(self, word):
        """
        Count the number of times this word appears in the text.
        """
        return self.tokens.count(word)

    def index(self, word):
        """
        Find the index of the first occurrence of the word in the text.
        """
        return self.tokens.index(word)

    def readability(self, method):
        raise NotImplementedError
    def similar(self, word, num=20):
        """
        Distributional similarity: find other words which appear in the
        same contexts as the specified word; list most similar words first.

        :param word: The word used to seed the similarity search
        :type word: str
        :param num: The number of words to generate (default=20)
        :type num: int
        :seealso: ContextIndex.similar_words()
        """
        if '_word_context_index' not in self.__dict__:
            self._word_context_index = ContextIndex(self.tokens,
                                                    filter=lambda x: x.isalpha(),
                                                    key=lambda s: s.lower())

        word = word.lower()
        wci = self._word_context_index._word_to_contexts
        if word in wci.conditions():
            contexts = set(wci[word])
            fd = Counter(w for w in wci.conditions()
                         for c in wci[w]
                         if c in contexts and not w == word)
            words = [w for w, _ in fd.most_common(num)]
            print(tokenwrap(words))
        else:
            print("No matches")

    def common_contexts(self, words, num=20):
        """
        Find contexts where the specified words appear; list
        most frequent common contexts first.

        :param words: The words used to seed the similarity search
        :type words: list(str)
        :param num: The number of contexts to display (default=20)
        :type num: int
        :seealso: ContextIndex.common_contexts()
        """
        if '_word_context_index' not in self.__dict__:
            self._word_context_index = ContextIndex(self.tokens,
                                                    key=lambda s: s.lower())

        try:
            fd = self._word_context_index.common_contexts(words, True)
            if not fd:
                print("No common contexts were found")
            else:
                ranked_contexts = [w for w, _ in fd.most_common(num)]
                print(tokenwrap(w1 + "_" + w2 for w1, w2 in ranked_contexts))

        except ValueError as e:
            print(e)

    def dispersion_plot(self, words):
        """
        Produce a plot showing the distribution of the words through the text.
        Requires pylab to be installed.

        :param words: The words to be plotted
        :type words: list(str)
        :seealso: nltk.draw.dispersion_plot()
        """
        from nltk.draw import dispersion_plot
        dispersion_plot(self, words)

    def plot(self, *args):
        """
        See documentation for FreqDist.plot()

        :seealso: nltk.prob.FreqDist.plot()
        """
        self.vocab().plot(*args)

    def vocab(self):
        """
        :seealso: nltk.prob.FreqDist
        """
        if "_vocab" not in self.__dict__:
            self._vocab = FreqDist(self)
        return self._vocab

    def findall(self, regexp):
        """
        Find instances of the regular expression in the text.
        The text is a list of tokens, and a regexp pattern to match
        a single token must be surrounded by angle brackets.  E.g.

        >>> print('hack'); from nltk.book import text1, text5, text9
        hack...
        >>> text5.findall("<.*><.*><bro>")
        you rule bro; telling you bro; u twizted bro
        >>> text1.findall("<a>(<.*>)<man>")
        monied; nervous; dangerous; white; white; white; pious; queer; good;
        mature; white; Cape; great; wise; wise; butterless; white; fiendish;
        pale; furious; better; certain; complete; dismasted; younger; brave;
        brave; brave; brave
        >>> text9.findall("<th.*>{3,}")
        thread through those; the thought that; that the thing; the thing
        that; that that thing; through these than through; them that the;
        through the thick; them that they; thought that the

        :param regexp: A regular expression
        :type regexp: str
        """
        if "_token_searcher" not in self.__dict__:
            self._token_searcher = TokenSearcher(self)

        hits = self._token_searcher.findall(regexp)
        hits = [' '.join(h) for h in hits]
        print(tokenwrap(hits, "; "))
    #////////////////////////////////////////////////////////////
    # Helper Methods
    #////////////////////////////////////////////////////////////

    _CONTEXT_RE = re.compile(r'\w+|[\.\!\?]')

    def _context(self, tokens, i):
        """
        One left & one right token, both case-normalized.  Skip over
        non-sentence-final punctuation.  Used by the ``ContextIndex``
        that is created for ``similar()`` and ``common_contexts()``.
        """
        # Left context
        j = i - 1
        while j >= 0 and not self._CONTEXT_RE.match(tokens[j]):
            j -= 1
        left = (tokens[j] if j != 0 else '*START*')

        # Right context
        j = i + 1
        while j < len(tokens) and not self._CONTEXT_RE.match(tokens[j]):
            j += 1
        right = (tokens[j] if j != len(tokens) else '*END*')

        return (left, right)

    #////////////////////////////////////////////////////////////
    # String Display
    #////////////////////////////////////////////////////////////

    def __str__(self):
        return '<Text: %s>' % self.name

    def __repr__(self):
        return '<Text: %s>' % self.name


class TextCollection(Text):
    """A collection of texts, which can be loaded with list of texts, or
    with a corpus consisting of one or more texts, and which supports
    counting, concordancing, collocation discovery, etc.  Initialize a
    TextCollection as follows:

    >>> import nltk.corpus
    >>> from nltk.text import TextCollection
    >>> print('hack'); from nltk.book import text1, text2, text3
    hack...
    >>> gutenberg = TextCollection(nltk.corpus.gutenberg)
    >>> mytexts = TextCollection([text1, text2, text3])

    Iterating over a TextCollection produces all the tokens of all the
    texts in order.
    """
    def __init__(self, source):
        if hasattr(source, 'words'):  # bridge to the text corpus reader
            source = [source.words(f) for f in source.fileids()]

        self._texts = source
        Text.__init__(self, LazyConcatenation(source))
        self._idf_cache = {}

    def tf(self, term, text):
        """ The frequency of the term in text. """
        return text.count(term) / len(text)

    def idf(self, term):
        """ The number of texts in the corpus divided by the
        number of texts that the term appears in.
        If a term does not appear in the corpus, 0.0 is returned. """
        # idf values are cached for performance.
        idf = self._idf_cache.get(term)
        if idf is None:
            matches = len([True for text in self._texts if term in text])
            idf = (log(len(self._texts) / matches) if matches else 0.0)
            self._idf_cache[term] = idf
        return idf

    def tf_idf(self, term, text):
        return self.tf(term, text) * self.idf(term)


def demo():
    from nltk.corpus import brown
    text = Text(brown.words(categories='news'))
    print(text)
    print()
    print("Concordance:")
    text.concordance('news')
    print()
    print("Distributionally similar words:")
    text.similar('news')
    print()
    print("Collocations:")
    text.collocations()
    print()
    print("Dispersion plot:")
    text.dispersion_plot(['news', 'report', 'said', 'announced'])
    print()
    print("Vocabulary plot:")
    text.plot(50)
    print()
    print("Indexing:")
    print("text[3]:", text[3])
    print("text[3:5]:", text[3:5])
    print("text.vocab()['news']:", text.vocab()['news'])


if __name__ == '__main__':
    demo()

__all__ = ["ContextIndex",
           "ConcordanceIndex",
           "TokenSearcher",
           "Text",
           "TextCollection"]
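# A tf-idf usage sketch.  ``_tf_idf_sketch`` is a hypothetical helper added
# for illustration (it is not part of the module API above), and it assumes
# the Gutenberg corpus has been fetched with nltk.download('gutenberg').
def _tf_idf_sketch():
    from nltk.corpus import gutenberg

    corpus = TextCollection(gutenberg)  # one token stream per corpus file
    moby = Text(gutenberg.words('melville-moby_dick.txt'))

    print(corpus.tf('whale', moby))      # term count / document length
    print(corpus.idf('whale'))           # log(#texts / #texts containing term)
    print(corpus.tf_idf('whale', moby))  # product of the two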