# -*- coding: utf-8 -*-
from collections import defaultdict


def grow_diag_final_and(srclen, trglen, e2f, f2e):
    """
    Symmetrizes the source-to-target and target-to-source word alignment
    outputs using the grow-diag-final-and (GDFA) algorithm
    (Koehn et al., 2005).

    Step 1: Find the intersection of the bidirectional alignments.

    Step 2: Search for additional neighboring alignment points to add,
            given these criteria: (i) the neighboring alignment points
            are not in the intersection and (ii) the neighboring
            alignment points are in the union.

    Step 3: Add all other alignment points that are not in the
            intersection and not among the neighboring points that met
            the criteria, but are in the original forward/backward
            alignment outputs.

    >>> forw = ('0-0 2-1 9-2 21-3 10-4 7-5 11-6 9-7 12-8 1-9 3-10 '
    ...         '4-11 17-12 17-13 25-14 13-15 24-16 11-17 28-18')
    >>> back = ('0-0 1-9 2-9 3-10 4-11 5-12 6-6 7-5 8-6 9-7 10-4 '
    ...         '11-6 12-8 13-12 15-12 17-13 18-13 19-12 20-13 '
    ...         '21-3 22-12 23-14 24-17 25-15 26-17 27-18 28-18')
    >>> srctext = ("この よう な ハロー 白色 わい 星 の Ｌ 関数 "
    ...            "は Ｌ と 共 に 不連続 に 増加 する こと が "
    ...            "期待 さ れる こと を 示し た 。")
    >>> trgtext = ("Therefore , we expect that the luminosity function "
    ...            "of such halo white dwarfs increases discontinuously "
    ...            "with the luminosity .")
    >>> srclen = len(srctext.split())
    >>> trglen = len(trgtext.split())
    >>> gdfa = grow_diag_final_and(srclen, trglen, forw, back)
    >>> gdfa == set([(28, 18), (6, 6), (24, 17), (2, 1), (15, 12), (13, 12),
    ...         (2, 9), (3, 10), (26, 17), (25, 15), (8, 6), (9, 7), (20,
    ...         13), (18, 13), (0, 0), (10, 4), (13, 15), (23, 14), (7, 5),
    ...         (25, 14), (1, 9), (17, 13), (4, 11), (11, 17), (9, 2), (22,
    ...         12), (27, 18), (24, 16), (21, 3), (19, 12), (17, 12), (5,
    ...         12), (11, 6), (12, 8)])
    True

    References:
    Koehn, P., A. Axelrod, A. Birch, C. Callison, M. Osborne, and D. Talbot.
    2005. Edinburgh System Description for the 2005 IWSLT Speech
    Translation Evaluation. In MT Eval Workshop.

    :type srclen: int
    :param srclen: the number of tokens in the source language
    :type trglen: int
    :param trglen: the number of tokens in the target language
    :type e2f: str
    :param e2f: the forward word alignment outputs from source-to-target
                language (in pharaoh output format)
    :type f2e: str
    :param f2e: the backward word alignment outputs from target-to-source
                language (in pharaoh output format)
    :rtype: set(tuple(int))
    :return: the symmetrized alignment points from the GDFA algorithm
    """
    # Convert the pharaoh text format ("i-j i-j ...") into lists of tuples.
    e2f = [tuple(map(int, a.split('-'))) for a in e2f.split()]
    f2e = [tuple(map(int, a.split('-'))) for a in f2e.split()]

    # Offsets of the eight horizontal, vertical and diagonal neighbors.
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    alignment = set(e2f).intersection(set(f2e))  # Step 1: the intersection.
    union = set(e2f).union(set(f2e))

    # Records the source ('e') and target ('f') indices aligned so far.
    aligned = defaultdict(set)
    for i, j in alignment:
        aligned['e'].add(i)
        aligned['f'].add(j)

    def grow_diag():
        """
        Search for the neighbor points and add them to the intersected
        alignment points if the criteria are met.
        """
        prev_len = len(alignment) - 1
        while prev_len < len(alignment):
            for e in range(srclen):
                for f in range(trglen):
                    if (e, f) in alignment:
                        for neighbor in neighbors:
                            neighbor = tuple(i + j for i, j
                                             in zip((e, f), neighbor))
                            e_new, f_new = neighbor
                            if (e_new not in aligned
                                    and f_new not in aligned
                                    and neighbor in union):
                                alignment.add(neighbor)
                                aligned['e'].add(e_new)
                                aligned['f'].add(f_new)
                                prev_len += 1

    def final_and(a):
        """
        Add the remaining points that are not in the intersection and not
        in the neighboring alignments, but are in the original *e2f* and
        *f2e* alignments.
        """
        for e_new in range(srclen):
            for f_new in range(trglen):
                if (e_new not in aligned
                        and f_new not in aligned
                        and (e_new, f_new) in a):
                    alignment.add((e_new, f_new))
                    aligned['e'].add(e_new)
                    aligned['f'].add(f_new)

    grow_diag()       # Step 2
    final_and(e2f)    # Step 3, forward alignments
    final_and(f2e)    # Step 3, backward alignments
    return alignment