# -*- coding: utf-8 -*-
"""
Lexical translation model that ignores word order.

In IBM Model 1, word order is ignored for simplicity. Thus, the
following two alignments are equally likely.

Source: je mange du jambon
Target: i eat some ham
Alignment: (1,1) (2,2) (3,3) (4,4)

Source: je mange du jambon
Target: some ham eat i
Alignment: (1,4) (2,3) (3,2) (4,1)

The EM algorithm used in Model 1 is:
E step - In the training data, count how many times a source language
         word is translated into a target language word, weighted by
         the prior probability of the translation.

M step - Estimate the new probability of translation based on the
         counts from the Expectation step.

Notations:
i: Position in the source sentence
    Valid values are 0 (for NULL), 1, 2, ..., length of source sentence
j: Position in the target sentence
    Valid values are 1, 2, ..., length of target sentence
s: A word in the source language
t: A word in the target language

References:
Philipp Koehn. 2010. Statistical Machine Translation.
Cambridge University Press, New York.

Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
Robert L. Mercer. 1993. The Mathematics of Statistical Machine
Translation: Parameter Estimation. Computational Linguistics, 19 (2),
263-311.
"""

from __future__ import division
from collections import defaultdict
from nltk.translate import AlignedSent
from nltk.translate import Alignment
from nltk.translate import IBMModel
from nltk.translate.ibm_model import Counts
import warnings


class IBMModel1(IBMModel):
    """
    Lexical translation model that ignores word order

    >>> bitext = []
    >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']))
    >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
    >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
    >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
    >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))

    >>> ibm1 = IBMModel1(bitext, 5)

    >>> print(ibm1.translation_table['buch']['book'])
    0.889...
    >>> print(ibm1.translation_table['das']['book'])
    0.061...
    >>> print(ibm1.translation_table['buch'][None])
    0.113...
    >>> print(ibm1.translation_table['ja'][None])
    0.072...

    >>> test_sentence = bitext[2]
    >>> test_sentence.words
    ['das', 'buch', 'ist', 'ja', 'klein']
    >>> test_sentence.mots
    ['the', 'book', 'is', 'small']
    >>> test_sentence.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)])
    """

    def __init__(self, sentence_aligned_corpus, iterations,
                 probability_tables=None):
        """
        Train on ``sentence_aligned_corpus`` and create a lexical
        translation model.

        Translation direction is from ``AlignedSent.mots`` to
        ``AlignedSent.words``.

        :param sentence_aligned_corpus: Sentence-aligned parallel corpus
        :type sentence_aligned_corpus: list(AlignedSent)

        :param iterations: Number of iterations to run training algorithm
        :type iterations: int

        :param probability_tables: Optional. Use this to pass in custom
            probability values. If not specified, probabilities will be
            set to a uniform distribution, or some other sensible value.
            If specified, the following entry must be present:
            ``translation_table``.
            See ``IBMModel`` for the type and purpose of this table.
        :type probability_tables: dict[str]: object
        """
        super(IBMModel1, self).__init__(sentence_aligned_corpus)

        if probability_tables is None:
            self.set_uniform_probabilities(sentence_aligned_corpus)
        else:
            # Set user-defined probabilities
            self.translation_table = probability_tables['translation_table']

        for n in range(0, iterations):
            self.train(sentence_aligned_corpus)

        self.__align_all(sentence_aligned_corpus)

    def set_uniform_probabilities(self, sentence_aligned_corpus):
        initial_prob = 1 / len(self.trg_vocab)
        if initial_prob < IBMModel.MIN_PROB:
            warnings.warn("Target language vocabulary is too large (" +
                          str(len(self.trg_vocab)) + " words). "
                          "Results may be less accurate.")

        for t in self.trg_vocab:
            self.translation_table[t] = defaultdict(lambda: initial_prob)

    def train(self, parallel_corpus):
        counts = Counts()
        for aligned_sentence in parallel_corpus:
            trg_sentence = aligned_sentence.words
            src_sentence = [None] + aligned_sentence.mots  # 0 is NULL

            # E step (a): Compute normalization factors to weigh counts
            total_count = self.prob_all_alignments(src_sentence, trg_sentence)

            # E step (b): Collect counts, weighted by prior probabilities
            for t in trg_sentence:
                for s in src_sentence:
                    count = self.prob_alignment_point(s, t)
                    normalized_count = count / total_count[t]
                    counts.t_given_s[t][s] += normalized_count
                    counts.any_t_given_s[s] += normalized_count

        # M step: Update probabilities with maximum likelihood estimate
        self.maximize_lexical_translation_probabilities(counts)
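
    # NOTE: This copy of the module is truncated after ``train``. The
    # methods below are a sketch of the remaining pieces that
    # ``__init__`` and ``train`` call (``prob_all_alignments``,
    # ``prob_alignment_point``, ``prob_t_a_given_s``, and the private
    # ``__align_all``/``__align`` pair); their bodies are assumptions
    # reconstructed to be consistent with those call sites and with the
    # doctest in the class docstring, not verbatim source.

    def prob_all_alignments(self, src_sentence, trg_sentence):
        """
        Computes the probability of all possible word alignments,
        expressed as a marginal distribution over target words t.

        Each entry in the return value represents the contribution to
        the total alignment probability by the target word t.

        :return: Probability of t for all s in ``src_sentence``
        :rtype: dict(str): float
        """
        alignment_prob_for_t = defaultdict(lambda: 0.0)
        for t in trg_sentence:
            for s in src_sentence:
                alignment_prob_for_t[t] += self.prob_alignment_point(s, t)
        return alignment_prob_for_t

    def prob_alignment_point(self, s, t):
        """
        Probability that word ``t`` in the target sentence is aligned to
        word ``s`` in the source sentence.
        """
        return self.translation_table[t][s]

    def prob_t_a_given_s(self, alignment_info):
        """
        Probability of target sentence and an alignment given the
        source sentence.
        """
        prob = 1.0
        for j, i in enumerate(alignment_info.alignment):
            if j == 0:
                continue  # skip the dummy zeroeth element
            trg_word = alignment_info.trg_sentence[j]
            src_word = alignment_info.src_sentence[i]
            prob *= self.translation_table[trg_word][src_word]
        return max(prob, IBMModel.MIN_PROB)

    def __align_all(self, parallel_corpus):
        for sentence_pair in parallel_corpus:
            self.__align(sentence_pair)

    def __align(self, sentence_pair):
        """
        Determines the best word alignment for one sentence pair from
        the corpus that the model was trained on.

        The best alignment will be set in ``sentence_pair`` when the
        method returns. In contrast with the internal implementation of
        IBM models, the word indices in the ``Alignment`` are zero-
        indexed, not one-indexed.

        :param sentence_pair: A sentence in the source language and its
            counterpart sentence in the target language
        :type sentence_pair: AlignedSent
        """
        best_alignment = []
        for j, trg_word in enumerate(sentence_pair.words):
            # Initialize trg_word to align with the NULL token
            best_prob = max(self.translation_table[trg_word][None],
                            IBMModel.MIN_PROB)
            best_alignment_point = None
            for i, src_word in enumerate(sentence_pair.mots):
                align_prob = self.translation_table[trg_word][src_word]
                if align_prob >= best_prob:  # prefer newer word in case of tie
                    best_prob = align_prob
                    best_alignment_point = i
            best_alignment.append((j, best_alignment_point))
        sentence_pair.alignment = Alignment(best_alignment)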