ó <żCVc@sŽdZddlmZd„Zd„Zd„Zd„Zd„Zd„Zd „Z d e fd „ƒYZ d „Z e d krŠe ƒndS(săCounts Paice's performance statistics for evaluating stemming algorithms. What is required: - A dictionary of words grouped by their real lemmas - A dictionary of words grouped by stems from a stemming algorithm When these are given, Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-rate relative to truncation (ERRT) are counted. References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42--50. i˙˙˙˙(tsqrtcCs5tƒ}x%|D]}|jt||ƒƒqW|S(s Get original set of words used for analysis. :param lemmas: A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. :type lemmas: dict :return: Set of words that exist as values in the dictionary :rtype: set (tsettupdate(tlemmastwordstlemma((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pytget_words_from_dictionarys  cCsdi}xW|D]O}|| }y||j|gƒWq tk r[t|gƒ||–sgii(gggg(tsumR R'( RR tntgdmttgdnttgumttgwmtRt lemmacountR"R#((Rsd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt _calculate†s  cCsŚy||}Wntk r'd}nXy||}Wntk rOd}nXy||}Wn8tk r˜|dkr‰tdƒ}q™tdƒ}nX|||fS(s¸Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). :param gumt, gdmt, gwmt, gdnt: Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). :type gumt, gdmt, gwmt, gdnt: float :return: Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). :rtype: tuple gtnanR(RR(R-R+R.R,tuitoitsw((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt_indexesŻs      tPaicecBsGeZdZd„Zd„Zd„Zdd„Zd„Zd„ZRS(s7Class for storing lemmas, stems and evaluation metrics.cCsh||_||_g|_d\|_|_|_|_d\|_|_ |_ d|_ |j ƒdS(sE :param lemmas: A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. :param stems: A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. :type lemmas: dict :type stems: dict N(NNNN(NNN( RR tcoordstNoneR-R+R.R,R2R3R4terrtR(tselfRR ((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt__init__Ös    cCsăd|jg}|jd|jƒ|jd|jƒ|jd|jƒ|jd|jƒ|jd|jƒ|jd|jƒ|jd|jƒd j g|j D]}d |^qŹƒ}|jd |ƒd j |ƒS( Ns)Global Unachieved Merge Total (GUMT): %s s&Global Desired Merge Total (GDMT): %s s'Global Wrongly-Merged Total (GWMT): %s s*Global Desired Non-merge Total (GDNT): %s s&Understemming Index (GUMT / GDMT): %s s%Overstemming Index (GWMT / GDNT): %s sStemming Weight (OI / UI): %s s.Error-Rate Relative to Truncation (ERRT): %s t s(%s, %s)sTruncation line: %st( R-tappendR+R.R,R2R3R4R9tjoinR7(R:ttexttitemR((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt__str__çs)c CsVt||ƒ}t|j|ƒ\}}}}t||||ƒd \}} || fS(s[Count (UI, OI) when stemming is done by truncating words at 'cutlength'. :param words: Words used for the analysis :param cutlength: Words are stemmed by cutting them at this length :type words: set or list :type cutlength: int :return: Understemming and overstemming indexes :rtype: tuple i(R R0RR5( R:RRt truncatedR-R+R.R,R2R3((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt_get_truncation_indexesôs icCsňt|jƒ}td„|Dƒƒ}g}xŔ||krí|j||ƒ}||krh|j|ƒn|dkrx|St|ƒdkrŕ|ddkrŕt|dƒ}t|dƒ}||jkoÔ|knrŕ|Sn|d7}q.W|S( sîCount (UI, OI) pairs for truncation points until we find the segment where (ui, oi) crosses the truncation line. :param cutlength: Optional parameter to start counting from (ui, oi) coordinates gotten by stemming at this length. Useful for speeding up the calculations when you know the approximate location of the intersection. :type cutlength: int :return: List of coordinate pairs that define the truncation line :rtype: list css|]}t|ƒVqdS(N(R (R(R ((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pys sgiiiţ˙˙˙i˙˙˙˙i(gg(RRtmaxRDR>R RR4(R:RRt maxlengthR7tpairt derivative1t derivative2((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt_get_truncation_coordinatess   "cCs×|jƒ|_d|jkrM|j|jfd kr@tdƒStdƒSn|j|jfd kridStd |j|jff|jdƒ}t|jd|jdƒ}t|dd|ddƒ}||S( sCount Error-Rate Relative to Truncation (ERRT). :return: ERRT, length of the line from origo to (UI, OI) divided by the length of the line from origo to the point defined by the same line when extended until the truncation line. :rtype: float gRR1iiţ˙˙˙ii(gg(gg(gg(ii(RJR7R2R3RRR(R:t intersectiontoptot((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyt_errt+s    cCsst|j|jƒ\|_|_|_|_t|j|j|j|jƒ\|_|_ |_ |j ƒ|_ dS(s7Update statistics after lemmas and stems have been set.N( R0RR R-R+R.R,R5R2R3R4RNR9(R:((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyRKs-3( t__name__t __module__t__doc__R;RBRDRJRNR(((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyR6Ôs   ' cCsŸiddgd6ddgd6dddgd6}idgd6dgd6dddgd6dgd6dgd6}dGHx0t|ƒD]"}d |d j||ƒfGHq€WdGHd GHx0t|ƒD]"}d |d j||ƒfGHq˝WdGHt||ƒ}|GHdGHidgd6dgd6dgd6ddgd6dgd6dgd6}d GHx0t|ƒD]"}d |d j||ƒfGHqXWdGH||_|jƒ|GHd S(sDemonstration of the module.tkneeltknelttrangetrangedtringtrangtrungsWords grouped by their lemmas:s%s => %sR<s+Same words grouped by a stemming algorithm:s/Counting stats after changing stemming results:N(((((tsortedR?R6R R(RR RR tp((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pytdemoRs@                t__main__N(RQtmathRRR RRR'R0R5tobjectR6R[RO(((sd/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/paice.pyts      ) %~ *