ó <¿CVc@s dZddlmZddlZddlmZddlmZddl m Z ddl m Z m Z mZddlmZmZd efd „ƒYZd efd „ƒYZd efd„ƒYZdefd„ƒYZddd„ZedkrddlZddl m Z yedejdƒZWnek rGdZnXyedejdƒZWnek r|dZnXeeeƒnd d dgZ dS(sß Tools to identify collocations --- words that often appear consecutively --- within corpora. They may also be used to find other associations between word occurrences. See Manning and Schutze ch. 5 at http://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl package at http://ngram.sourceforge.net Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then requiring filtering to only retain useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation. The ``BigramCollocationFinder`` and ``TrigramCollocationFinder`` classes provide these functionalities, dependent on being provided a function which scores a ngram given appropriate frequency counts. A number of standard association measures are provided in bigram_measures and trigram_measures. iÿÿÿÿ(tprint_functionN(t iteritems(tFreqDist(tngrams(tContingencyMeasurestBigramAssocMeasurestTrigramAssocMeasures(tranks_from_scorestspearman_correlationtAbstractCollocationFindercBs›eZdZd„Zeeedd„ƒZed„ƒZe d„ƒZ d„d„Z d„Z d„Z d „Zd „Zd „Zd „Zd „ZRS(s€ An abstract base class for collocation finders whose purpose is to collect collocation candidate frequencies, filter and rank them. As a minimum, collocation finders require the frequencies of each word in a corpus, and the joint frequency of word tuples. This data should be provided through nltk.probability.FreqDist objects or an identical interface. cCs%||_|jƒ|_||_dS(N(tword_fdtNtngram_fd(tselfR R ((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyt__init__6s csa|f|d‰|r7tjj‡fd†|DƒƒS|r]tjj‡fd†|DƒƒSdS(sU Pad the document with the place holder according to the window_size ic3s!|]}tj|ˆƒVqdS(N(t _itertoolstchain(t.0tdoc(tpadding(sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pys Bsc3s!|]}tjˆ|ƒVqdS(N(RR(RR(R(sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pys DsN(RRt from_iterable(tclst documentst window_sizetpad_leftt pad_rightt pad_symbol((Rsc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyt_build_new_documents;s  cCs"|j|j||jdtƒƒS(s‚Constructs a collocation finder given a collection of documents, each of which is a list (or iterable) of tokens. R(t from_wordsRt default_wstTrue(RR((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pytfrom_documentsFscs-t‡‡fd†ttˆƒdƒDƒƒS(Nc3s&|]}tˆ||ˆ!ƒVqdS(N(ttuple(Rti(tntwords(sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pys Psi(Rtrangetlen(R#R"((R"R#sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyt_ngram_freqdistNscCstS(N(tFalse(tngramtfreq((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pytRscCsRtƒ}x9t|jƒD](\}}|||ƒs|||js(tany(R.R1(R+(sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyR*jsN(R-(R R+((R+sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pytapply_word_filterfsccsDx=|jD]2}|j||Œ}|dk r ||fVq q WdS(sbGenerates of (ngram, score) pairs as determined by the scoring function provided. N(R t score_ngramtNone(R tscore_fnttuptscore((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyt _score_ngramsls cCst|j|ƒdd„ƒS(s‘Returns a sequence of (ngram, score) pairs ordered from highest to lowest score, as determined by the scoring function provided. tkeycSs|d |dfS(Nii((tt((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyR*ys(tsortedR;(R R8((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyt score_ngramsuscCs*g|j|ƒ| D]\}}|^qS(s;Returns the top n ngrams when scored by the given function.(R?(R R8R"tpts((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pytnbest{sccs9x2|j|ƒD]!\}}||kr0|VqPqWdS(s}Returns a sequence of ngrams, ordered by decreasing score, whose scores each exceed the given minimum score. N(R?(R R8t min_scoreR(R:((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyt above_scores N(t__name__t __module__t__doc__Rt classmethodR'R7RRt staticmethodR&R-R0R2R5R;R?RBRD(((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyR +s        tBigramCollocationFindercBs;eZdZdZdd„Zedd„ƒZd„ZRS(s»A tool for the finding and ranking of bigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly. icCs tj|||ƒ||_dS(s…Construct a BigramCollocationFinder, given FreqDists for appearances of words and (possibly non-contiguous) bigrams. N(R RR(R R t bigram_fdR((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyR‘scCsÆtƒ}tƒ}|dkr-tdƒ‚nxƒt||dtƒD]l}|d}|dkreqCn||cd7 2, count non-contiguous bigrams, in the style of Church and Hanks's (1990) association ratio. isSpecify window_size at least 2RiiRN(Rt ValueErrorRRR7(RR#Rtwfdtbfdtwindowtw1tw2((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyR˜s      !cCsa|j}|j||f|jd}|s1dS|j|}|j|}||||f|ƒS(s¹Returns the score for a given bigram using the given scoring function. Following Church and Hanks (1990), counts are scaled by a factor of 1/(window_size - 1). gð?N(R R RR (R R8RPRQtn_alltn_iitn_ixtn_xi((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyR6®s   (RERFRGRRRHRR6(((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyRJŠs  tTrigramCollocationFindercBsAeZdZdZd„Zedd„ƒZd„Zd„ZRS(s¼A tool for the finding and ranking of trigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly. icCs)tj|||ƒ||_||_dS(s¥Construct a TrigramCollocationFinder, given FreqDists for appearances of words, bigrams, two words with any word between them, and trigrams. N(R Rt wildcard_fdRK(R R RKRWt trigram_fd((sc/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/collocations.pyRÃs c Cs.|dkrtdƒ‚ntƒ}tƒ}tƒ}tƒ}xÙt||dtƒD]Â}|d}|dkrwqUnxtj|ddƒD]…\} } ||cd7<| dkr¼qŽn||| fcd7<| dkräqŽn||| fcd7<||| | fcd7s4 _2FR