ó <¿CVc@sxdZddlZddlmZddlmZdefd„ƒYZdefd„ƒYZd efd „ƒYZ dS( s& A decoder that uses stacks to implement phrase-based translation. In phrase-based translation, the source sentence is segmented into phrases of one or more words, and translations for those phrases are used to build the target sentence. Hypothesis data structures are used to keep track of the source words translated so far and the partial output. A hypothesis can be expanded by selecting an untranslated phrase, looking up its translation in a phrase table, and appending that translation to the partial output. Translation is complete when a hypothesis covers all source words. The search space is huge because the source sentence can be segmented in different ways, the source phrases can be selected in any order, and there could be multiple translations for the same source phrase in the phrase table. To make decoding tractable, stacks are used to limit the number of candidate hypotheses by doing histogram and/or threshold pruning. Hypotheses with the same number of words translated are placed in the same stack. In histogram pruning, each stack has a size limit, and the hypothesis with the lowest score is removed when the stack is full. In threshold pruning, hypotheses that score below a certain threshold of the best hypothesis in that stack are removed. Hypothesis scoring can include various factors such as phrase translation probability, language model probability, length of translation, cost of remaining words to be translated, and so on. References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York. iÿÿÿÿN(t defaultdict(tlogt StackDecodercBs†eZdZd„Zed„ƒZejd„ƒZd„Zd„Zd„Z d„Z d„Z d „Z d „Z ed „ƒZRS( s• Phrase-based stack decoder for machine translation >>> from nltk.translate import PhraseTable >>> phrase_table = PhraseTable() >>> phrase_table.add(('niemand',), ('nobody',), log(0.8)) >>> phrase_table.add(('niemand',), ('no', 'one'), log(0.2)) >>> phrase_table.add(('erwartet',), ('expects',), log(0.8)) >>> phrase_table.add(('erwartet',), ('expecting',), log(0.2)) >>> phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1)) >>> phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8)) >>> phrase_table.add(('!',), ('!',), log(0.8)) >>> # nltk.model should be used here once it is implemented >>> from collections import defaultdict >>> language_prob = defaultdict(lambda: -999.0) >>> language_prob[('nobody',)] = log(0.5) >>> language_prob[('expects',)] = log(0.4) >>> language_prob[('the', 'spanish', 'inquisition')] = log(0.2) >>> language_prob[('!',)] = log(0.1) >>> language_model = type('',(object,),{'probability_change': lambda self, context, phrase: language_prob[phrase], 'probability': lambda self, phrase: language_prob[phrase]})() >>> stack_decoder = StackDecoder(phrase_table, language_model) >>> stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!']) ['nobody', 'expects', 'the', 'spanish', 'inquisition', '!'] cCsD||_||_d|_d|_d|_d|_|jƒdS(sG :param phrase_table: Table of translations for source language phrases and the log probabilities for those translations. :type phrase_table: PhraseTable :param language_model: Target language model. Must define a ``probability_change`` method that calculates the change in log probability of a sentence, if a given string is appended to it. This interface is experimental and will likely be replaced with nltk.model once it is implemented. :type language_model: object gidgà?N(t phrase_tabletlanguage_modelt word_penaltytbeam_thresholdt stack_sizet _StackDecoder__distortion_factort%_StackDecoder__compute_log_distortion(tselfRR((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt__init__Os      cCs|jS(s float: Amount of reordering of source phrases. Lower values favour monotone translation, suitable when word order is similar for both source and target languages. Value between 0.0 and 1.0. Default 0.5. (R(R ((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pytdistortion_factoryscCs||_|jƒdS(N(RR (R td((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR ƒs cCs7|jdkr!tdƒ|_nt|jƒ|_dS(Ngg•Ö&è .>(RRt$_StackDecoder__log_distortion_factor(R ((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt__compute_log_distortionˆsc CsŸt|ƒ}t|ƒ}gtd|dƒD]}t|j|jƒ^q,}tƒ}|dj|ƒ|j|ƒ}|j |ƒ}xâ|D]Ú} xÑ| D]É} t j || ƒ} x®| D]¦} || d| d!} xˆ|j j | ƒD]t}|j| || ƒ}td|d| d|jd| ƒ}|j|||ƒ|_|jƒ}||j|ƒqæWq¸Wq™WqŒW||s…tjdƒgS||jƒ}|jƒS(s¦ :param src_sentence: Sentence to be translated :type src_sentence: list(str) :return: Translated sentence :rtype: list(str) iit raw_scoretsrc_phrase_spant trg_phrasetprevioussYUnable to translate all words. The source sentence contains words not in the phrase table(ttupletlentranget_StackRRt _Hypothesistpushtfind_all_src_phrasestcompute_future_scoresRt valid_phrasesRttranslations_fortexpansion_scoreRt future_scorettotal_translated_wordstwarningstwarntbestttranslation_so_far(R t src_sentencetsentencetsentence_lengtht_tstackstempty_hypothesist all_phrasestfuture_score_tabletstackt hypothesistpossible_expansionsRt src_phrasettranslation_optionRtnew_hypothesist total_wordstbest_hypothesis((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt translates@  2           !  cCs•t|ƒ}g|D] }g^q}xitd|ƒD]X}xOt|d|dƒD]6}|||!}||jkrS||j|ƒqSqSWq5W|S(s@ Finds all subsequences in src_sentence that have a phrase translation in the translation table :type src_sentence: tuple(str) :return: Subsequences that have a phrase translation, represented as a table of lists of end positions. For example, if result[2] is [5, 6, 9], then there are three phrases starting from position 2 in ``src_sentence``, ending at positions 5, 6, and 9 exclusive. The list of ending positions are in ascending order. :rtype: list(list(int)) ii(RRRtappend(R R%R'R(tphrase_indiceststarttendtpotential_phrase((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRÁs  c Cs"td„ƒ}x tdt|ƒdƒD]ñ}xètdt|ƒ|dƒD]É}||}|||!}||jkr¹|jj|ƒdj}||jj|ƒ7}||||ès(R(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR<èsii(RRRRRtlog_probRt probability( R R%tscorest seq_lengthR8R9tphrasetscoretmidtcombined_score((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRÙs  $    cCs>d}x1|j|ƒD] }|||d|d7}qW|S(ss Determines the approximate score for translating the untranslated words in ``hypothesis`` gii(tuntranslated_spans(R R.R,R'RBtspan((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRýscCsf|j}||j7}||jj||jƒ7}||j||ƒ7}||jt|jƒ8}|S(s¹ Calculate the score of expanding ``hypothesis`` with ``translation_option`` :param hypothesis: Hypothesis being expanded :type hypothesis: _Hypothesis :param translation_option: Information about the proposed expansion :type translation_option: PhraseTableEntry :param src_phrase_span: Word position span of the source phrase :type src_phrase_span: tuple(int, int) (RR=Rtprobability_changeRtdistortion_scoreRR(R R.R1RRB((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRs   cCs?|js dS|d}|jd}||}t|ƒ|jS(Ngii(RtabsR(R R.tnext_src_phrase_spantnext_src_phrase_starttprev_src_phrase_endtdistortion_distance((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRHs     cCs™|jt|ƒƒ}g}xw|D]o}|d}|d}xR||krx5||D])}||krlPn|j||fƒqVW|d7}q?Wq"W|S(s Extract phrases from ``all_phrases_from`` that contains words that have not been translated by ``hypothesis`` :param all_phrases_from: Phrases represented by their spans, in the same format as the return value of ``find_all_src_phrases`` :type all_phrases_from: list(list(int)) :type hypothesis: _Hypothesis :return: A list of phrases, represented by their spans, that cover untranslated positions. :rtype: list(tuple(int, int)) ii(RERR6(tall_phrases_fromR.RERtavailable_spanR8t available_endt phrase_end((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR's    (t__name__t __module__t__doc__R tpropertyR tsetterR R5RRRRRHt staticmethodR(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR2s *   1  $  RcBs\eZdZdd d d dd„Zd„Zd„Zd„Zd„Zd„Z d„Z RS( s0 Partial solution to a translation. Records the word positions of the phrase being translated, its translation, raw score, and the cost of the untranslated parts of the sentence. When the next phrase is selected to build upon the partial solution, a new _Hypothesis object is created, with a back pointer to the previous hypothesis. To find out which words have been translated so far, look at the ``src_phrase_span`` in the hypothesis chain. Similarly, the translation output can be found by traversing up the chain. gcCs1||_||_||_||_||_dS(sÔ :param raw_score: Likelihood of hypothesis so far. Higher is better. Does not account for untranslated words. :type raw_score: float :param src_phrase_span: Span of word positions covered by the source phrase in this hypothesis expansion. For example, (2, 5) means that the phrase is from the second word up to, but not including the fifth word in the source sentence. :type src_phrase_span: tuple(int) :param trg_phrase: Translation of the source phrase in this hypothesis expansion :type trg_phrase: tuple(str) :param previous: Previous hypothesis before expansion to this one :type previous: _Hypothesis :param future_score: Approximate score for translating the remaining words not covered by this hypothesis. Higher means that the remaining words are easier to translate. :type future_score: float N(RRRRR(R RRRRR((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR Xs     cCs|j|jS(sd Overall score of hypothesis after accounting for local and global features (RR(R ((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRBwscCsp|jƒ}|jƒ|j|ƒg}d}x:|D]2}||kr^|j||fƒn|d}q6W|S(s. Starting from each untranslated word, find the longest continuous span of untranslated positions :param sentence_length: Length of source sentence being translated by the hypothesis :type sentence_length: int :rtype: list(tuple(int, int)) ii(ttranslated_positionstsortR6(R R'RXRER8R9((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRE~s     cCsVg}|}xC|jdk rQ|j}|jt|d|dƒƒ|j}qW|S(s’ List of positions in the source sentence of words already translated. The list is not sorted. :rtype: list(int) iiN(RtNoneRtextendR(R RXtcurrent_hypothesisttranslated_span((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRX—s  cCst|jƒƒS(N(RRX(R ((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR §scCsg}|j||ƒ|S(N(t_Hypothesis__build_translation(R t translation((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR$ªscCs:|jdkrdS|j|j|ƒ|j|jƒdS(N(RRZR^R[R(R R.toutput((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt__build_translation¯s((N( RRRSRTRZR RBRERXR R$R^(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRJs       RcBsJeZdZddd„Zd„Zd„Zd„Zd„Zd„ZRS( s+ Collection of _Hypothesis objects idgcCsC||_g|_|dkr0tdƒ|_nt|ƒ|_dS(sè :param beam_threshold: Hypotheses that score less than this factor of the best hypothesis are discarded from the stack. Value must be between 0.0 and 1.0. :type beam_threshold: float gs-infN(tmax_sizetitemsR;t_Stack__log_beam_thresholdR(R RbR((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR ºs    cCsf|jj|ƒ|jjdd„dtƒx)t|jƒ|jkrW|jjƒq/W|jƒdS(s Add ``hypothesis`` to the stack. Removes lowest scoring hypothesis if the stack is full. After insertion, hypotheses that score less than ``beam_threshold`` times the score of the best hypothesis are removed. tkeycSs |jƒS(N(RB(th((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR<ÒstreverseN(RcR6RYtTrueRRbtpoptthreshold_prune(R R.((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRÉs cCsh|js dS|jdjƒ|j}x:t|jƒD])}|jƒ|kr_|jjƒq7Pq7WdS(Ni(RcRBRdtreversedRi(R t thresholdR.((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyRj×s cCs|jr|jdSdS(se :return: Hypothesis with the highest score in the stack :rtype: _Hypothesis iN(RcRZ(R ((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR#âs  cCs t|jƒS(N(titerRc(R ((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt__iter__ëscCs ||jkS(N(Rc(R R.((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt __contains__îs( RRRSRTR RRjR#RnRo(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyR¶s  ( RTR!t collectionsRtmathRtobjectRRR(((sn/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/translate/stack_decoder.pyt+s ÿl