ó <¿CVc@ s²dZddlmZddlmZddlmZddlmZddlmZddlm Z ddl m Z dd l Z d efd „ƒYZ d e fd „ƒYZd S(s‘ Lexical translation model that considers word order. IBM Model 2 improves on Model 1 by accounting for word order. An alignment probability is introduced, a(i | j,l,m), which predicts a source word position, given its aligned target word's position. The EM algorithm used in Model 2 is: E step - In the training data, collect counts, weighted by prior probabilities. (a) count how many times a source language word is translated into a target language word (b) count how many times a particular position in the source sentence is aligned to a particular position in the target sentence M step - Estimate new probabilities based on the counts from the E step Notations: i: Position in the source sentence Valid values are 0 (for NULL), 1, 2, ..., length of source sentence j: Position in the target sentence Valid values are 1, 2, ..., length of target sentence l: Number of words in the source sentence, excluding NULL m: Number of words in the target sentence s: A word in the source language t: A word in the target language References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York. Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311. iÿÿÿÿ(tdivision(t defaultdict(t AlignedSent(t Alignment(tIBMModel(t IBMModel1(tCountsNt IBMModel2cB sbeZdZd d„Zd„Zd„Zd„Zd„Zd„Z d„Z d„Z d „Z RS( sY Lexical translation model that considers word order >>> bitext = [] >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big'])) >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house'])) >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book'])) >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book'])) >>> ibm2 = IBMModel2(bitext, 5) >>> print(round(ibm2.translation_table['buch']['book'], 3)) 1.0 >>> print(round(ibm2.translation_table['das']['book'], 3)) 0.0 >>> print(round(ibm2.translation_table['buch'][None], 3)) 0.0 >>> print(round(ibm2.translation_table['ja'][None], 3)) 0.0 >>> print(ibm2.alignment_table[1][1][2][2]) 0.938... >>> print(round(ibm2.alignment_table[1][2][2][2], 3)) 0.0 >>> print(round(ibm2.alignment_table[2][2][4][5], 3)) 1.0 >>> test_sentence = bitext[2] >>> test_sentence.words ['das', 'buch', 'ist', 'ja', 'klein'] >>> test_sentence.mots ['the', 'book', 'is', 'small'] >>> test_sentence.alignment Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)]) cC s£tt|ƒj|ƒ|dkrQt|d|ƒ}|j|_|j|ƒn|d|_|d|_x$td|ƒD]}|j |ƒq{W|j |ƒdS(s™ Train on ``sentence_aligned_corpus`` and create a lexical translation model and an alignment model. Translation direction is from ``AlignedSent.mots`` to ``AlignedSent.words``. :param sentence_aligned_corpus: Sentence-aligned parallel corpus :type sentence_aligned_corpus: list(AlignedSent) :param iterations: Number of iterations to run training algorithm :type iterations: int :param probability_tables: Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: ``translation_table``, ``alignment_table``. See ``IBMModel`` for the type and purpose of these tables. :type probability_tables: dict[str]: object ittranslation_tabletalignment_tableiN( tsuperRt__init__tNoneRRtset_uniform_probabilitiesR trangettraint_IBMModel2__align_all(tselftsentence_aligned_corpust iterationstprobability_tablestibm1tn((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyR cs    c C sútƒ}xê|D]â}t|jƒ}t|jƒ}||f|kr|j||fƒdt|dƒ}|tjkrštj dt |ƒdƒnxUt d|dƒD]=}x4t d|dƒD]}||j ||||Òsii(RRRR+(RR2R3talignment_prob_for_tR'R5R&((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyR*Äs c C sZt|ƒd}t|ƒd}||}||}|j|||j||||S(sz Probability that position j in ``trg_sentence`` is aligned to position i in the ``src_sentence`` i(RRR ( RR&R'R2R3R#R$R6R5((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyR+Ús   c C s´d}t|jƒd}t|jƒd}xut|jƒD]d\}}|dkrZq<n|j|}|j|}||j|||j||||9}q<Wt|tj ƒS(sc Probability of target sentence and an alignment given the source sentence gð?ii( RR2R3t enumerateR9RR R<RR( Rtalignment_infotprobR#R$R'R&ttrg_wordtsrc_word((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pytprob_t_a_given_sås   cC s"x|D]}|j|ƒqWdS(N(t_IBMModel2__align(RR0t sentence_pair((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyt __align_alløs c C sg}t|jƒ}t|jƒ}xßt|jƒD]Î\}}|j|d|jd|d||}t|tj ƒ}d}xht|jƒD]W\} } |j|| |j| d|d||} | |kr”| }| }q”q”W|j ||fƒq4Wt |ƒ|_ dS(s Determines the best word alignment for one sentence pair from the corpus that the model was trained on. The best alignment will be set in ``sentence_pair`` when the method returns. In contrast with the internal implementation of IBM models, the word indices in the ``Alignment`` are zero- indexed, not one-indexed. :param sentence_pair: A sentence in the source language and its counterpart sentence in the target language :type sentence_pair: AlignedSent iiN( RRRRCRR R R<RRtappendRR9( RRJtbest_alignmentR#R$R'RFt best_probtbest_alignment_pointR&RGt align_prob((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyt__alignüs "  N( t__name__t __module__t__doc__R R R RR/R*R+RHRRI(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyR;s& )     R)cB s)eZdZd„Zd„Zd„ZRS(so Data object to store counts of various parameters during training. Includes counts for alignment. cC s;tt|ƒjƒtd„ƒ|_td„ƒ|_dS(NcS s td„ƒS(NcS s td„ƒS(NcS s td„ƒS(NcS sdS(Ng((((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA*s(R(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA)s(R(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA)s(R(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA)scS s td„ƒS(NcS s td„ƒS(NcS sdS(Ng((((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA,s(R(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA,s(R(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyRA,s(R R)R RR9R;(R((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm2.pyR &s cC s.|j||c|7<|j|c|7|j||||c|7<|j|||c|7/s æ