ó <¿CVc@ sÒdZddlmZddlmZddlmZddlmZddlm Z ddlm Z ddlm Z dd l m Z dd l mZdd lZd e fd „ƒYZde fd„ƒYZd S(s Translation model that reorders output words based on their type and distance from other related words in the output sentence. IBM Model 4 improves the distortion model of Model 3, motivated by the observation that certain words tend to be re-ordered in a predictable way relative to one another. For example, in English usually has its order flipped as in French. Model 4 requires words in the source and target vocabularies to be categorized into classes. This can be linguistically driven, like parts of speech (adjective, nouns, prepositions, etc). Word classes can also be obtained by statistical methods. The original IBM Model 4 uses an information theoretic approach to group words into 50 classes for each vocabulary. Terminology: Cept: A source word with non-zero fertility i.e. aligned to one or more target words. Tablet: The set of target word(s) aligned to a cept. Head of cept: The first word of the tablet of that cept. Center of cept: The average position of the words in that cept's tablet. If the value is not an integer, the ceiling is taken. For example, for a tablet with words in positions 2, 5, 6 in the target sentence, the center of the corresponding cept is ceil((2 + 5 + 6) / 3) = 5 Displacement: For a head word, defined as (position of head word - position of previous cept's center). Can be positive or negative. For a non-head word, defined as (position of non-head word - position of previous word in the same tablet). Always positive, because successive words in a tablet are assumed to appear to the right of the previous word. In contrast to Model 3 which reorders words in a tablet independently of other words, Model 4 distinguishes between three cases. (1) Words generated by NULL are distributed uniformly. (2) For a head word t, its position is modeled by the probability d_head(displacement | word_class_s(s),word_class_t(t)), where s is the previous cept, and word_class_s and word_class_t maps s and t to a source and target language word class respectively. (3) For a non-head word t, its position is modeled by the probability d_non_head(displacement | word_class_t(t)) The EM algorithm used in Model 4 is: E step - In the training data, collect counts, weighted by prior probabilities. (a) count how many times a source language word is translated into a target language word (b) for a particular word class, count how many times a head word is located at a particular displacement from the previous cept's center (c) for a particular word class, count how many times a non-head word is located at a particular displacement from the previous target word (d) count how many times a source word is aligned to phi number of target words (e) count how many times NULL is aligned to a target word M step - Estimate new probabilities based on the counts from the E step Like Model 3, there are too many possible alignments to consider. Thus, a hill climbing approach is used to sample good candidates. Notations: i: Position in the source sentence Valid values are 0 (for NULL), 1, 2, ..., length of source sentence j: Position in the target sentence Valid values are 1, 2, ..., length of target sentence l: Number of words in the source sentence, excluding NULL m: Number of words in the target sentence s: A word in the source language t: A word in the target language phi: Fertility, the number of target words produced by a source word p1: Probability that a target word produced by a source word is accompanied by another target word that is aligned to NULL p0: 1 - p1 dj: Displacement, Δj References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York. Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311. iÿÿÿÿ(tdivision(t defaultdict(t factorial(t AlignedSent(t Alignment(tIBMModel(t IBMModel3(tCounts(tlongest_target_sentence_lengthNt IBMModel4cB sVeZdZdd„Zd„Zd„Zd„Zd„Zd„Z e d„ƒZ RS( s~ Translation model that reorders output words based on their type and their distance from other related words in the output sentence >>> bitext = [] >>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big'])) >>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small'])) >>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small'])) >>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house'])) >>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book'])) >>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book'])) >>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book'])) >>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize'])) >>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 } >>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 } >>> ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes) >>> print(round(ibm4.translation_table['buch']['book'], 3)) 1.0 >>> print(round(ibm4.translation_table['das']['book'], 3)) 0.0 >>> print(round(ibm4.translation_table['ja'][None], 3)) 1.0 >>> print(round(ibm4.head_distortion_table[1][0][1], 3)) 1.0 >>> print(round(ibm4.head_distortion_table[2][0][1], 3)) 0.0 >>> print(round(ibm4.non_head_distortion_table[3][6], 3)) 0.5 >>> print(round(ibm4.fertility_table[2]['summarize'], 3)) 1.0 >>> print(round(ibm4.fertility_table[1]['book'], 3)) 1.0 >>> print(ibm4.p1) 0.033... >>> test_sentence = bitext[2] >>> test_sentence.words ['das', 'buch', 'ist', 'ja', 'klein'] >>> test_sentence.mots ['the', 'book', 'is', 'small'] >>> test_sentence.alignment Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)]) cC stt|ƒj|ƒ|jƒ||_||_|dkrt||ƒ}|j|_|j |_ |j |_ |j |_ |j |ƒnN|d|_|d|_ |d|_ |d|_ |d|_ |d|_x$td|ƒD]}|j|ƒqëWdS( sæ Train on ``sentence_aligned_corpus`` and create a lexical translation model, distortion models, a fertility model, and a model for generating NULL-aligned words. Translation direction is from ``AlignedSent.mots`` to ``AlignedSent.words``. :param sentence_aligned_corpus: Sentence-aligned parallel corpus :type sentence_aligned_corpus: list(AlignedSent) :param iterations: Number of iterations to run training algorithm :type iterations: int :param source_word_classes: Lookup table that maps a source word to its word class, the latter represented by an integer id :type source_word_classes: dict[str]: int :param target_word_classes: Lookup table that maps a target word to its word class, the latter represented by an integer id :type target_word_classes: dict[str]: int :param probability_tables: Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: ``translation_table``, ``alignment_table``, ``fertility_table``, ``p1``, ``head_distortion_table``, ``non_head_distortion_table``. See ``IBMModel`` and ``IBMModel4`` for the type and purpose of these tables. :type probability_tables: dict[str]: object ttranslation_tabletalignment_tabletfertility_tabletp1thead_distortion_tabletnon_head_distortion_tableiN(tsuperR t__init__treset_probabilitiest src_classest trg_classestNoneRR R R R tset_uniform_probabilitiesRRtrangettrain(tselftsentence_aligned_corpust iterationstsource_word_classesttarget_word_classestprobability_tablestibm3tn((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pyR¨s*#              c sGttˆƒjƒt‡fd†ƒˆ_t‡fd†ƒˆ_dS(Nc st‡fd†ƒS(Nc st‡fd†ƒS(Nc sˆjS(N(tMIN_PROB((R(se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pytés(R((R(se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pyR"és(R((R(se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pyR"ésc st‡fd†ƒS(Nc sˆjS(N(R!((R(se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pyR"ñs(R((R(se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pyR"ñs(RR RRRR(R((Rse/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm4.pyRæs c sùt|ƒ}|dkr$tj‰ntdƒd|d‰ˆtjkritjdt|ƒdƒnx‰td|ƒD]x}t‡fd†ƒ|j |fs ÿ8