ó <æCVc@sŒdZyddlZWnek r)nXddlmZded„Zd„Zd„Z ddd dd „Z ddd „Z d „Z dS( s  Text Segmentation Metrics 1. Windowdiff Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics 28, 19-36 2. Generalized Hamming Distance Bookstein A., Kulyukin V.A., Raita T. Generalized Hamming Distance Information Retrieval 5, 2002, pp 353-375 Baseline implementation in C++ http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html Study describing benefits of Generalized Hamming Distance Versus WindowDiff for evaluating text segmentation tasks Begsten, Y. Quel indice pour mesurer l'efficacite en segmentation de textes ? TALN 2009 3. Pk text segmentation metric Beeferman D., Berger A., Lafferty J. (1999) Statistical Models for Text Segmentation Machine Learning, 34, 177-210 i’’’’N(txranget1cCsēt|ƒt|ƒkr'tdƒ‚n|t|ƒkrHtdƒ‚nd}x€tt|ƒ|dƒD]d}t||||!j|ƒ||||!j|ƒƒ}|rŗ||7}qi|td|ƒ7}qiW|t|ƒ|dS(sW Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation. >>> s1 = "000100000010" >>> s2 = "000010000100" >>> s3 = "100000010000" >>> '%.2f' % windowdiff(s1, s1, 3) '0.00' >>> '%.2f' % windowdiff(s1, s2, 3) '0.30' >>> '%.2f' % windowdiff(s2, s3, 3) '0.80' :param seg1: a segmentation :type seg1: str or list :param seg2: a segmentation :type seg2: str or list :param k: window width :type k: int :param boundary: boundary value :type boundary: str or int or bool :param weighted: use the weighted variant of windowdiff :type weighted: boolean :rtype: float s!Segmentations have unequal lengthsCWindow width k should be smaller or equal than segmentation lengthsiigš?(tlent ValueErrortrangetabstcounttmin(tseg1tseg2tktboundarytweightedtwdtitndiff((sk/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/segmentation.pyt windowdiff3s!8 cCs_tj||fƒ}|tj|ƒ|ddd…f<|tj|ƒ|dd…df<|S(Ni(tnptemptytarange(tnrowstncolstins_costtdel_costtmat((sk/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/segmentation.pyt _init_matbs##c CsßxŲt|ƒD]Ź\}}x»t|ƒD]­\}} |t|| ƒ|||f} || krs|||f} n?|| krš||||df} n|||d|f} t| | ƒ||d|df|^q>} t|ƒ} t| ƒ} | dkr–| dkr–dS| dkr¶| dkr¶| |S| dkrÖ| dkrÖ| |St| d| d||ƒ} t| | ||||ƒ| dS(sb Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations. A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation. Recommended parameter values are a shift_cost_coeff of 2. Associated with a ins_cost, and del_cost equal to the mean segment length in the reference segmentation. >>> # Same examples as Kulyukin C++ implementation >>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5) 0.5 >>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5) 2.0 >>> ghd('011', '110', 1.0, 1.0, 0.5) 1.0 >>> ghd('1', '0', 1.0, 1.0, 0.5) 1.0 >>> ghd('111', '000', 1.0, 1.0, 0.5) 3.0 >>> ghd('000', '111', 1.0, 2.0, 0.5) 6.0 :param ref: the reference segmentation :type ref: str or list :param hyp: the hypothetical segmentation :type hyp: str or list :param ins_cost: insertion cost :type ins_cost: float :param del_cost: deletion cost :type del_cost: float :param shift_cost_coeff: constant used to compute the cost of a shift. shift cost = shift_cost_coeff * |i - j| where i and j are the positions indicating the shift :type shift_cost_coeff: float :param boundary: boundary value :type boundary: str or int or bool :rtype: float igii’’’’(i’’’’i’’’’(RRRR#( trefthypRRRR Rtvaltref_idxthyp_idxt nref_boundt nhyp_boundR((sk/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/segmentation.pytghdys.11  cCsŅ|dkr8ttt|ƒ|j|ƒdƒƒ}nd}x{tt|ƒ|dƒD]_}||||!j|ƒdk}||||!j|ƒdk}||krY|d7}qYqYW|t|ƒ|dS(sł Compute the Pk metric for a pair of segmentations A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation. >>> '%.2f' % pk('0100'*100, '1'*400, 2) '0.50' >>> '%.2f' % pk('0100'*100, '0'*400, 2) '0.50' >>> '%.2f' % pk('0100'*100, '0100'*100, 2) '0.00' :param ref: the reference segmentation :type ref: str or list :param hyp: the segmentation to evaluate :type hyp: str or list :param k: window size, if None, set to half of the average reference segment length :type boundary: str or int or bool :param boundary: boundary value :type boundary: str or int or bool :rtype: float g@iigš?N(tNonetinttroundRRR(R$R%R R terrRtrth((sk/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/segmentation.pytpk»s ,!   cCsDddlm}yddl}Wntk r?|dƒ‚nXdS(Ni’’’’(tSkipTests/numpy is required for nltk.metrics.segmentation(tnoseR3tnumpyt ImportError(tmoduleR3R5((sk/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/segmentation.pyt setup_moduleįs  (t__doc__R5RR6t nltk.compatRtFalseRRR#R+R,R2R8(((sk/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/metrics/segmentation.pyt*s /  B&