B /Wb*@sddlZddlmZddlmZddlmZddlmZm Z m Z m Z ddl m Z ddlmZmZmZmZdd lmZdd lmZdd lmZmZmZmZmZee ed d dZee edddZeee ed ddZ eee ed ddZ!eedee e"e"fdddZ#d,e ee"e edddZ$ee ee%ddd Z&ee ed!d"d#Z'e eed$d%d&Z(ed'dd-ee%e eed)d*d+Z)dS).N)IncrementalDecoder)Counter) lru_cache)DictListOptionalTuple) FREQUENCIES)KO_NAMESLANGUAGE_SUPPORTED_COUNTTOO_SMALL_SEQUENCEZH_NAMES) is_suspiciously_successive_range)CoherenceMatches)is_accentuatedis_latinis_multi_byte_encodingis_unicode_range_secondary unicode_range) iana_namereturncst|rtdtd|j}|dd}idxltddD]^}|t|g}|r@t |}|dkrjq@t |d kr@|krd|<|d 7<d 7q@Wt fd d DS) zF Return associated unicode ranges in a single byte code page. z.Function not supported on multi-byte code pagez encodings.{}ignore)errorsr@NFr cs g|]}|dkr|qS)g333333?).0character_range)character_count seen_rangesr2sz*encoding_unicode_range..) rIOError importlib import_moduleformatrrangedecodebytesrrsorted)rdecoderpichunkrr)rr r!encoding_unicode_ranges(    r/) primary_rangercCsDg}x:tD].\}}x$|D]}t||kr||PqWqW|S)z> Return inferred languages used with a unicode range. )r itemsrappend)r0 languageslanguage characters characterrrr!unicode_range_languages9s    r7cCs>t|}d}x|D]}d|kr|}PqW|dkr6dgSt|S)z Single-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. NZLatinz Latin Based)r/r7)rZunicode_rangesr0Zspecified_rangerrr!encoding_languagesHs r8cCsb|ds&|ds&|ds&|dkr,dgS|ds>|tkrFddgS|d sX|tkr^d gSgS) z Multi-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. Zshift_ iso2022_jpZeuc_jcp932JapanesegbChinesezClassical Chinese iso2022_krKorean) startswithrr )rrrr!mb_encoding_languages\s   rA)maxsize)r4rcCsFd}d}x4t|D](}|s&t|r&d}|rt|dkrd}qW||fS)zg Determine main aspects from a supported language if it contains accents and if is pure Latin. FT)r rr)r4target_have_accentstarget_pure_latinr6rrr!get_target_featuresqs rEF)r5ignore_non_latinrc sg}tddD}xxtD]l\}}t|\}}|rB|dkrBq |dkrP|rPq t|}tfdd|D} | |} | dkr ||| fq Wt|ddd d }d d|DS) zE Return associated languages associated to given characters. css|]}t|VqdS)N)r)rr6rrr! sz%alphabet_languages..Fcsg|]}|kr|qSrr)rc)r5rr!r"sz&alphabet_languages..g?cSs|dS)Nr r)xrrr!z$alphabet_languages..T)keyreversecSsg|] }|dqS)rr)rZcompatible_languagerrr!r"s)anyr r1rElenr2r*) r5rFr3Zsource_have_accentsr4Zlanguage_charactersrCrDrZcharacter_match_countratior)r5r!alphabet_languagess    rQ)r4ordered_charactersrc Cs4|tkrtd|d}tt|}x|D]}||kr.cSs|dS)Nr r)rIrrr!rJrKz(merge_coherence_ratios..T)rLrM)r2r*)r^resultZ sub_resultr4rPmerger)rar!merge_coherence_ratioss    rdi皙?)rX threshold lg_inclusionrcCsg}d}d}|dk r|dng}d|kr8d}|dxt|D]}t|}|} tdd| D} | tkrrqBd d | D} xZ|pt| |D]H} t| | } | |krqn| d kr|d 7}| | t | d f|dkrPqWqBWt |ddddS)z Detect ANY language that can be identified in given sequence. The sequence will be analysed by layers. A layer = Character extraction by alphabets/ranges. FrN,z Latin BasedTcss|]\}}|VqdS)Nr)rrHorrr!rG9sz"coherence_ratio..cSsg|] \}}|qSrr)rrHrirrr!r">sz#coherence_ratio..g?r rScSs|dS)Nr r)rIrrr!rJQrKz!coherence_ratio..)rLrM) splitremover]r most_commonr`r rQrWr2r_r*)rXrfrgr^rFZsufficient_match_countZlg_inclusion_listZlayerZsequence_frequenciesrmrZpopular_character_orderedr4rPrrr!coherence_ratio"s4    rn)F)reN)*r$codecsr collectionsr functoolsrtypingrrrrZassetsr Zconstantr r r rZmdrmodelsrutilsrrrrrstrr/r7r8rAboolrErQfloatrWr]rdrnrrrr!s4       % #7'