Text Segmentation Metrics

1. WindowDiff

Pevzner, L., and Hearst, M., A Critique and Improvement of
an Evaluation Metric for Text Segmentation,
Computational Linguistics 28, 2002, pp 19-36

2. Generalized Hamming Distance

Bookstein A., Kulyukin V.A., Raita T.
Generalized Hamming Distance
Information Retrieval 5, 2002, pp 353-375

Baseline implementation in C++
http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html

Study describing the benefits of Generalized Hamming Distance over
WindowDiff for evaluating text segmentation tasks
Bestgen, Y.  Quel indice pour mesurer l'efficacite en segmentation de
textes ?  TALN 2009

3. Pk text segmentation metric

Beeferman D., Berger A., Lafferty J. (1999) Statistical Models for Text Segmentation
Machine Learning, 34, 177-210
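
Metric 1 above, WindowDiff, slides a window of width k over two
equal-length segmentations and counts the windows in which the two
disagree on the number of boundaries. The block below is a minimal
sketch of that idea, not the library code: the name windowdiff_sketch
and the default boundary value "1" are assumptions made for
illustration.

```python
def windowdiff_sketch(seg1, seg2, k, boundary="1", weighted=False):
    """Illustrative WindowDiff score (hypothetical helper, not the library API)."""
    if len(seg1) != len(seg2):
        raise ValueError("Segmentations have unequal lengths")
    if k > len(seg1):
        raise ValueError("Window width k must not exceed the segmentation length")

    n_windows = len(seg1) - k + 1
    total = 0
    for i in range(n_windows):
        # Difference in boundary counts inside the current window.
        ndiff = abs(seg1[i:i + k].count(boundary) - seg2[i:i + k].count(boundary))
        # Unweighted: each disagreeing window counts once.
        # Weighted: each window counts by the size of the disagreement.
        total += ndiff if weighted else min(1, ndiff)
    return total / n_windows


s1 = "000100000010"
s2 = "000010000100"
print('%.2f' % windowdiff_sketch(s1, s2, 3))  # prints 0.30
```

A score of 0 means the two segmentations agree in every window; with
the unweighted variant the score cannot exceed 1.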

Compute the windowdiff score for a pair of segmentations.  A
segmentation is any sequence over a vocabulary of two items
(e.g. "0", "1"), where the specified boundary value is used to
mark the edge of a segmentation.

>>> s1 = "000100000010"
>>> s2 = "000010000100"
>>> s3 = "100000010000"
>>> '%.2f' % windowdiff(s1, s1, 3)
'0.00'
>>> '%.2f' % windowdiff(s1, s2, 3)
'0.30'
>>> '%.2f' % windowdiff(s2, s3, 3)
'0.80'

:param seg1: a segmentation
:type seg1: str or list
:param seg2: a segmentation
:type seg2: str or list
:param k: window width
:type k: int
:param boundary: boundary value
:type boundary: str or int or bool
:param weighted: use the weighted variant of windowdiff
:type weighted: boolean
:rtype: float

Compute the Generalized Hamming Distance for a reference and a hypothetical
segmentation, corresponding to the cost related to the transformation
of the hypothetical segmentation into the reference segmentation
through boundary insertion, deletion and shift operations.

A segmentation is any sequence over a vocabulary of two items
(e.g. "0", "1"), where the specified boundary value is used to
mark the edge of a segmentation.

Recommended parameter values are a shift_cost_coeff of 2.

>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0

:param ref: the reference segmentation
:type ref: str or list
:param hyp: the hypothetical segmentation
:type hyp: str or list
:param ins_cost: insertion cost
:type ins_cost: float
:param del_cost: deletion cost
:type del_cost: float
:param shift_cost_coeff: constant used to compute the cost of a shift.
    shift cost = shift_cost_coeff * |i - j| where i and j are the positions
    indicating the shift
:type shift_cost_coeff: float
:param boundary: boundary value
:type boundary: str or int or bool
:rtype: float
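
The insertion, deletion and shift costs described above can be combined
in a small dynamic-programming table over the boundary positions of the
two segmentations. The block below is a minimal sketch of one way to do
this, not the library code: the name ghd_sketch, the use of plain
nested lists, and the default boundary value "1" are assumptions made
for illustration.

```python
def ghd_sketch(ref, hyp, ins_cost, del_cost, shift_cost_coeff, boundary="1"):
    """Illustrative Generalized Hamming Distance (hypothetical helper)."""
    # Positions of the boundaries in each segmentation.
    ref_idx = [i for i, val in enumerate(ref) if val == boundary]
    hyp_idx = [i for i, val in enumerate(hyp) if val == boundary]

    # Degenerate cases: only insertions or only deletions are possible.
    if not ref_idx and not hyp_idx:
        return 0.0
    if not hyp_idx:
        return len(ref_idx) * ins_cost
    if not ref_idx:
        return len(hyp_idx) * del_cost

    # mat[i][j]: cost of turning the first i hypothesis boundaries
    # into the first j reference boundaries.
    nrows, ncols = len(hyp_idx) + 1, len(ref_idx) + 1
    mat = [[0.0] * ncols for _ in range(nrows)]
    for j in range(ncols):
        mat[0][j] = ins_cost * j   # insert j reference boundaries
    for i in range(nrows):
        mat[i][0] = del_cost * i   # delete i hypothesis boundaries

    for i, h in enumerate(hyp_idx):
        for j, r in enumerate(ref_idx):
            # Shift the hypothesis boundary at h onto the reference boundary at r.
            shift = shift_cost_coeff * abs(h - r) + mat[i][j]
            if h == r:
                cost = mat[i][j]                  # already aligned
            elif h > r:
                cost = del_cost + mat[i][j + 1]   # delete the hypothesis boundary
            else:
                cost = ins_cost + mat[i + 1][j]   # insert the reference boundary
            # This sketch picks insertion or deletion by relative position
            # and compares that single option against the shift.
            mat[i + 1][j + 1] = min(cost, shift)
    return mat[-1][-1]


print(ghd_sketch('1100100000', '1100010000', 1.0, 1.0, 0.5))  # prints 0.5
```

In the printed example the only mismatch is one boundary displaced by a
single position, so the cheapest transformation is one shift costing
0.5 * 1.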