B _oa*@szddlZddlmZddlmZmZddlmZddlm Z m Z m Z m Z ddl mZddlmZmZmZdd lmZdd lmZdd lmZmZmZmZmZee ed d dZee edddZeee ed ddZ eee ed ddZ!d)e ee"e edddZ#ee ee$dddZ%ee edddZ&e eed d!d"Z'ed#d$d*ee$e eed&d'd(Z(dS)+N)IncrementalDecoder)Counter OrderedDict) lru_cache)DictListOptionalTuple) FREQUENCIES)KO_NAMESTOO_SMALL_SEQUENCEZH_NAMES) is_suspiciously_successive_range)CoherenceMatches)is_accentuatedis_latinis_multi_byte_encodingis_unicode_range_secondary unicode_range) iana_namereturncst|rtdtd|j}|dd}idxltddD]^}|t|g}|r@t |}|dkrjq@t |d kr@|krd|<|d 7<d 7q@Wt fd d DS) zF Return associated unicode ranges in a single byte code page. z.Function not supported on multi-byte code pagez encodings.{}ignore)errorsr@NFr cs g|]}|dkr|qS)g333333?).0character_range)character_count seen_rangesr2sz*encoding_unicode_range..) rIOError importlib import_moduleformatrrangedecodebytesrrsorted)rdecoderpichunkrr)rr r!encoding_unicode_ranges(    r/) primary_rangercCsDg}x:tD].\}}x$|D]}t||kr||PqWqW|S)z> Return inferred languages used with a unicode range. )r itemsrappend)r0 languageslanguage characters characterrrr!unicode_range_languages9s    r7cCs>t|}d}x|D]}d|kr|}PqW|dkr6dgSt|S)z Single-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. NZLatinz Latin Based)r/r7)rZunicode_rangesr0Zspecified_rangerrr!encoding_languagesHs r8cCsb|ds&|ds&|ds&|dkr,dgS|ds>|tkrFddgS|d sX|tkr^d gSgS) z Multi-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. Zshift_ iso2022_jpZeuc_jcp932JapanesegbChinesezClassical Chinese iso2022_krKorean) startswithrr )rrrr!mb_encoding_languages\s   rAF)r5ignore_non_latinrc sg}d}xD]}t|rd}PqWxtD]\}}d}d}x8|D]0} |dkr\t| r\d}|dkrDt| dkrDd}qDW|r|dkrq.|dkr|rq.t|} tfdd|D} | | } | dkr.||| fq.Wt|dddd}d d|DS) zE Return associated languages associated to given characters. FTcsg|]}|kr|qSrr)rc)r5rr!r"sz&alphabet_languages..g?cSs|dS)Nr r)xrrr!z$alphabet_languages..)keyreversecSsg|] }|dqS)rr)rZcompatible_languagerrr!r"s)rr r1rlenr2r*) r5rBr3Zsource_have_accentsr6r4Zlanguage_charactersZtarget_have_accentsZtarget_pure_latinZlanguage_characterrZcharacter_match_countratior)r5r!alphabet_languagesqs4    rK)r4ordered_charactersrcs6|tkrtd|d}x |D]}|t|kr6q"t|dt||}t|t||d}|d|||||dfdd|Dd}fdd|Dd}t|dkr|dkr|d 7}q"t|dkr|dkr|d 7}q"|t|d ks|t|d kr"|d 7}q"q"W|t|S) aN Determine if a ordered characters list (by occurrence from most appearance to rarest) match a particular language. The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit). Beware that is function is not strict on the match in order to ease the detection. (Meaning close match is 1.) z{} not availablerNcsg|] }|kqSrr)re)characters_beforerr!r"sz1characters_popularity_compare..Tcsg|] }|kqSrr)rrM)characters_afterrr!r"sr g?)r ValueErrorr&indexcountrI)r4rLZcharacter_approved_countr6Zcharacters_before_sourceZcharacters_after_sourceZbefore_match_countZafter_match_countr)rOrNr!characters_popularity_compares: rT)decoded_sequencercCst}x|D]}|dkrq t|}|dkr0q d}x |D]}t||dkr:|}Pq:W|dkrb|}||krx|||<q |||7<q Wt|S)a Given a decoded text sequence, return a list of str. Unicode range / alphabet separation. Ex. a text containing English/Latin with a bit a Hebrew will return two items in the resulting list; One containing the latin letters and the other hebrew. FN)risalpharrlowerlistvalues)rUZlayersr6rZlayer_target_rangeZdiscovered_rangerrr!alpha_unicode_splits(    rZ)resultsrc Cst}g}xD|D]<}x6|D].}|\}}||kr:|g||<q|||qWqWx4|D],}||tt||t||dfqVWt|ddddS)z This function merge results previously given by the function coherence_ratio. The return type is the same as coherence_ratio. rPcSs|dS)Nr r)rDrrr!rE rFz(merge_coherence_ratios..T)rGrH)rr2roundsumrIr*)r[Zper_language_ratiosmergeresultZ sub_resultr4rJrrr!merge_coherence_ratioss"      r`i)maxsize皙?)rU threshold lg_inclusionrcCsg}g}d}d}|dk r"|d}d|kr8d}|dxt|D]}t|}|} tdd| D} | tkrrqBd d| D} xZ|pt| |D]H} t| | } | |krqn| d kr|d 7}| | t | d f|d krPqWqBWt |ddddS)z Detect ANY language that can be identified in given sequence. The sequence will be analysed by layers. A layer = Character extraction by alphabets/ranges. FrN,z Latin BasedTcSsg|] \}}|qSrr)rrCorrr!r"=sz#coherence_ratio..cSsg|] \}}|qSrr)rrCrfrrr!r"Bsg?r rPcSs|dS)Nr r)rDrrr!rEUrFz!coherence_ratio..)rGrH) splitremoverZr most_commonr]r rKrTr2r\r*)rUrcrdr[Zlg_inclusion_listrBZsufficient_match_countZlayerZsequence_frequenciesrjrZpopular_character_orderedr4rJrrr!coherence_ratio#s8     rk)F)rbN))r$codecsr collectionsrr functoolsrtypingrrrr Zassetsr Zconstantr r rZmdrmodelsrutilsrrrrrstrr/r7r8rAboolrKfloatrTrZr`rkrrrr!s0      % /:'