ó <¿CVc@s•dZddlmZddlmZddlmZddlmZd„Z de fd„ƒYZ d e fd „ƒYZ d e fd „ƒYZ d S(s² Common methods and classes for all IBM models. See ``IBMModel1``, ``IBMModel2``, ``IBMModel3``, ``IBMModel4``, and ``IBMModel5`` for specific implementations. The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus. The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables. Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention. Each target word is aligned to exactly one source word or the NULL token. References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York. Peter E Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19 (2), 263-311. iÿÿÿÿ(t insort_left(t defaultdict(tdeepcopy(tceilcCs9d}x,|D]$}t|jƒ}t||ƒ}q W|S(sî :param sentence_aligned_corpus: Parallel corpus under consideration :type sentence_aligned_corpus: list(AlignedSent) :return: Number of words in the longest target language sentence of ``sentence_aligned_corpus`` i(tlentwordstmax(tsentence_aligned_corpustmax_mtaligned_sentencetm((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pytlongest_target_sentence_length/s  tIBMModelcBs•eZdZdZd„Zd„Zd„Zd„Zd„Zddd„Z dd „Z dd „Z d „Z d „Zd „Zd„Zd„ZRS(s0 Abstract base class for all IBM models gê-™—q=cCs|j|ƒ|jƒdS(N(t init_vocabtreset_probabilities(tselfR((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyt__init__Hs csItd„ƒˆ_td„ƒˆ_t‡fd†ƒˆ_dˆ_dS(NcSs td„ƒS(NcSstjS(N(R tMIN_PROB(((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pytNs(R(((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyRNscSs td„ƒS(NcSs td„ƒS(NcSs td„ƒS(NcSstjS(N(R R(((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyRVs(R(((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyRUs(R(((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyRUs(R(((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyRUscst‡fd†ƒS(NcsˆjS(N(R((R(sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyR^s(R((R(sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyR^sgà?(Rttranslation_tabletalignment_tabletfertility_tabletp1(R((Rsj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyRLs cCsdS(s… Initialize probability tables to a uniform distribution Derived classes should implement this accordingly. N((RR((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pytset_uniform_probabilitieslscCsftƒ}tƒ}x.|D]&}|j|jƒ|j|jƒqW|jdƒ||_||_dS(N(tsettupdateRtmotstaddtNonet src_vocabt trg_vocab(RRRRR ((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyR ts      c Cstƒ}t|jƒ}t|jƒ}|j|ƒ}|j|ƒ}|j|j|ƒƒ|}xštd|dƒD]…}x|td|dƒD]g} |j||| ƒ}|j||ƒ}|j||ƒ} |j| ƒ|j |j kr|}qqWquW||fS(s® Sample the most probable alignments from the entire alignment space First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to a higher IBM Model. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point. Hill climbing may be stuck in a local maxima, hence the pegging and trying out of different alignments. :param sentence_pair: Source and target language sentence pair to generate a sample of alignments from :type sentence_pair: AlignedSent :return: A set of best alignments represented by their ``AlignmentInfo`` and the best alignment of the set for convenience :rtype: set(AlignmentInfo), AlignmentInfo ii( RRRRtbest_model2_alignmentt hillclimbRt neighboringtrangetscore( Rt sentence_pairtsampled_alignmentstlR tinitial_alignmenttpotential_alignmenttbest_alignmenttjtit neighbors((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pytsample‡s"  icCskdg|j}dg|j}t|ƒd}t|ƒd}dg|d}gt|dƒD] } g^qb} xÏtd|dƒD]º} | |kr£|} n„d} tj} || }xhtd|dƒD]S} || }|j|||j| | ||}|| krÐ|} | } qÐqÐW| || <| | j | ƒqˆWt t |ƒt |ƒt |ƒ| ƒS(sT Finds the best alignment according to IBM Model 2 Used as a starting point for hill climbing in Models 3 and above, because it is easier to compute than the best alignments in higher models :param sentence_pair: Source and target language sentence pair to be word-aligned :type sentence_pair: AlignedSent :param j_pegged: If specified, the alignment point of j_pegged will be fixed to i_pegged :type j_pegged: int :param i_pegged: Alignment point to j_pegged :type i_pegged: int tUNUSEDiiN( RRRRR"R RRRtappendt AlignmentInfottuple(RR$tj_peggedti_peggedt src_sentencet trg_sentenceR&R t alignmentR+tceptsR*tbest_itmax_alignment_probtttstalignment_prob((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyR¶s.#        cCsŒ|}|j|ƒ}xgtr~|}xD|j||ƒD]0}|j|ƒ}||kr7|}|}q7q7W||krPqqW||_|S(s, Starting from the alignment in ``alignment_info``, look at neighboring alignments iteratively for the best one There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum. :param j_pegged: If specified, the search will be constrained to alignments where ``j_pegged`` remains unchanged :type j_pegged: int :return: The best alignment found from hill climbing :rtype: AlignmentInfo (tprob_t_a_given_stTrueR!R#(Rtalignment_infoR2R6tmax_probabilityt old_alignmenttneighbor_alignmenttneighbor_probability((sj/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/ibm_model.pyR ês     cCs$tƒ}t|jƒd}t|jƒd}|j}|j}xÄtd|dƒD]¯}||krUxštd|dƒD]‚} t|ƒ} t|ƒ} ||} | | |'s ÿ;€