a *NaD@s8ddlmZddlmZmZddlmZddlmZm Z m Z m Z m Z m Z mZmZmZmZmZmZmZmZmZGdddZGdd d eZGd d d eZGd d d eZGdddeZGdddeZGdddeZGdddeZGdddeZee ee e!dddZ"eddd#e e#e!e#dd d!Z$d"S)$) lru_cache)ListOptional)UNICODE_SECONDARY_RANGE_KEYWORD)is_accentuatedis_asciiis_case_variableis_cjk is_emoticon is_hangul is_hiragana is_katakanais_latinis_punctuation is_separator is_symbolis_thai remove_accent unicode_rangec@sPeZdZdZeedddZeddddZddd d Ze e dd d Z dS) MessDetectorPluginzy Base abstract class used for mess detection plugins. All detectors MUST extend and implement given methods.  characterreturncCstdS)z@ Determine if given character should be fed in. NNotImplementedErrorselfrrp/private/var/folders/js/6pj4vh5d4zd0k6bxv74qrbhr0000gr/T/pip-target-22xwyzbs/lib/python/charset_normalizer/md.pyeligibleszMessDetectorPlugin.eligibleNcCstdS)z The main routine to be executed upon character. Insert the logic in witch the text would be considered chaotic. Nrrrrrfeed$szMessDetectorPlugin.feedrcCstdS)zB Permit to reset the plugin to the initial state. Nrrrrrreset+szMessDetectorPlugin.resetcCstdS)z Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.; No restriction gt 0. Nrr#rrrratio1szMessDetectorPlugin.ratio) __name__ __module__ __qualname____doc__strboolr r!r$propertyfloatr%rrrrrs rc@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS) TooManySymbolOrPunctuationPluginNr"cCs"d|_d|_d|_d|_d|_dS)NrF)_punctuation_count _symbol_count_character_count_last_printable_charZ_frenzy_symbol_in_wordr#rrr__init__;s z)TooManySymbolOrPunctuationPlugin.__init__rcCs|SN isprintablerrrrr Csz)TooManySymbolOrPunctuationPlugin.eligiblecCsp|jd7_||jkrf|dvrft|r8|jd7_n.|durft|rft|durf|jd7_||_dS)Nr<>=:/&;{}[],|"-F)r1r2rr/isdigitrr r0rrrrr!Fs  z%TooManySymbolOrPunctuationPlugin.feedcCsd|_d|_d|_dSNr)r/r1r0r#rrrr$esz&TooManySymbolOrPunctuationPlugin.resetcCs0|jdkrdS|j|j|j}|dkr,|SdS)Nr333333?)r1r/r0)rZratio_of_punctuationrrrr%js   z&TooManySymbolOrPunctuationPlugin.ratio r&r'r(r3r*r+r r!r$r,r-r%rrrrr.:s r.c@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS)TooManyAccentuatedPluginNr"cCsd|_d|_dSrIr1_accentuated_countr#rrrr3wsz!TooManyAccentuatedPlugin.__init__rcCs|Sr4)isalpharrrrr {sz!TooManyAccentuatedPlugin.eligiblecCs(|jd7_t|r$|jd7_dSNr)r1rrOrrrrr!~szTooManyAccentuatedPlugin.feedcCsd|_d|_dSrIrNr#rrrr$szTooManyAccentuatedPlugin.resetcCs*|jdkrdS|j|j}|dkr&|SdS)NrrJgffffff?rN)rZratio_of_accentuationrrrr%s   zTooManyAccentuatedPlugin.ratiorLrrrrrMvs rMc@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS)UnprintablePluginNr"cCsd|_d|_dSrI)_unprintable_countr1r#rrrr3szUnprintablePlugin.__init__rcCsdSNTrrrrrr szUnprintablePlugin.eligiblecCsL|dvr:|dur:|dur:t|dkr:|jd7_|jd7_dS)N>    Fr)r6isspaceordrSr1rrrrr!s   zUnprintablePlugin.feedcCs d|_dSrI)rSr#rrrr$szUnprintablePlugin.resetcCs|jdkrdS|jd|jS)NrrJ)r1rSr#rrrr%s zUnprintablePlugin.ratiorLrrrrrRs  rRc@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS)SuspiciousDuplicateAccentPluginNr"cCsd|_d|_d|_dSrI_successive_countr1_last_latin_characterr#rrrr3sz(SuspiciousDuplicateAccentPlugin.__init__rcCs|ot|Sr4)rPrrrrrr sz(SuspiciousDuplicateAccentPlugin.eligiblecCst|jd7_|jdurjt|rjt|jrj|rJ|jrJ|jd7_t|t|jkrj|jd7_||_dSrQ)r1r`risupperr_rrrrrr!s z$SuspiciousDuplicateAccentPlugin.feedcCsd|_d|_d|_dSrIr^r#rrrr$sz%SuspiciousDuplicateAccentPlugin.resetcCs|jdkrdS|jd|jS)NrrJrG)r1r_r#rrrr%s z%SuspiciousDuplicateAccentPlugin.ratiorLrrrrr]s  r]c@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS)SuspiciousRangeNr"cCsd|_d|_d|_dSrI)"_suspicious_successive_range_countr1_last_printable_seenr#rrrr3szSuspiciousRange.__init__rcCs|Sr4r5rrrrr szSuspiciousRange.eligiblecCsx|jd7_|s&t|s&|dvr0d|_dS|jdurD||_dSt|j}t|}t||rn|jd7_||_dS)Nrr7)r1rZrrdr is_suspiciously_successive_rangerc)rrunicode_range_aunicode_range_brrrr!s*  zSuspiciousRange.feedcCsd|_d|_d|_dSrI)r1rcrdr#rrrr$szSuspiciousRange.resetcCs.|jdkrdS|jd|j}|dkr*dS|S)NrrJrGg?)r1rc)rZratio_of_suspicious_range_usagerrrr% s zSuspiciousRange.ratiorLrrrrrbs *rbc@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS)SuperWeirdWordPluginNr"cCs4d|_d|_d|_d|_d|_d|_d|_d|_dS)NrF) _word_count_bad_word_count_is_current_word_bad_foreign_long_watchr1_bad_character_count_buffer_buffer_accent_countr#rrrr3szSuperWeirdWordPlugin.__init__rcCsdSrTrrrrrr 'szSuperWeirdWordPlugin.eligiblecCs|rd|j|g|_t|r0|jd7_|jdurt|durt|durt|durt |durt |durt |durd|_dS|jsdS| st |st|rV|jrV|jd7_t|j}|j|7_|dkr|j|dkrd|_|dkr|jrd|_|jrB|jd7_|jt|j7_d|_d|_d|_d|_n6|d vr|durt|rd|_|j|7_dS) NrirFTrKr>rFr:r9r8)rPjoinrorrprmrr r rr rrZrrrjlenr1rlrkrnrHr)rrZ buffer_lengthrrrr!*sh         zSuperWeirdWordPlugin.feedcCs.d|_d|_d|_d|_d|_d|_d|_dS)NriFr)rorlrmrkrjr1rnr#rrrr$YszSuperWeirdWordPlugin.resetcCs|jdkrdS|j|jS)N rJ)rjrnr1r#rrrr%bs zSuperWeirdWordPlugin.ratiorLrrrrrhs  / rhc@s^eZdZdZddddZeedddZeddd d Zddd d Z e e dd dZ dS)CjkInvalidStopPluginu GB(Chinese) based encoding often render the stop incorrectly when the content does not fit and can be easily detected. Searching for the overuse of '丅' and '丄'. Nr"cCsd|_d|_dSrI_wrong_stop_count_cjk_character_countr#rrrr3pszCjkInvalidStopPlugin.__init__rcCsdSrTrrrrrr tszCjkInvalidStopPlugin.eligiblecCs4|dvr|jd7_dSt|r0|jd7_dS)N)u丅u丄r)rxr ryrrrrr!ws zCjkInvalidStopPlugin.feedcCsd|_d|_dSrIrwr#rrrr$~szCjkInvalidStopPlugin.resetcCs|jdkrdS|j|jS)NrJ)ryrxr#rrrr%s zCjkInvalidStopPlugin.ratio) r&r'r(r)r3r*r+r r!r$r,r-r%rrrrrvjsrvc@sZeZdZddddZeedddZedddd Zddd d Ze e dd d Z dS)ArchaicUpperLowerPluginNr"cCs.d|_d|_d|_d|_d|_d|_d|_dS)NFrT)_buf_character_count_since_last_sep_successive_upper_lower_count#_successive_upper_lower_count_finalr1_last_alpha_seen_current_ascii_onlyr#rrrr3sz ArchaicUpperLowerPlugin.__init__rcCsdSrTrrrrrr sz ArchaicUpperLowerPlugin.eligiblecCs$|ot|}|du}|r|jdkr|jdkrV|durV|jdurV|j|j7_d|_d|_d|_d|_|j d7_ d|_dS|jdurt |durd|_|jdur| r|j s| r|j r|jdur|jd7_d|_qd|_nd|_|j d7_ |jd7_||_dS)NFr@rTrG) rPr r}rHrrr~rr|r1rraislower)rrZ is_concernedZ chunk_seprrrr!sF   zArchaicUpperLowerPlugin.feedcCs.d|_d|_d|_d|_d|_d|_d|_dS)NrFT)r1r}r~rrr|rr#rrrr$szArchaicUpperLowerPlugin.resetcCs|jdkrdS|j|jS)NrrJ)r1rr#rrrr%s zArchaicUpperLowerPlugin.ratiorLrrrrr{s  * r{)rfrgrcCsL|dus|durdS||kr dSd|vr4d|vr4dSd|vsDd|vrHdS|d|d}}|D]}|tvrpqb||vrbdSqb|dvr|dvrdS|dvs|dvrd|vsd|vrdSd |vsd |vrd|vsd|vrdS|d ks|d krdSd|vsd|vs|dvrH|dvrHd |vs,d |vr0dSd |vsDd |vrHdSdS) za Determine if two Unicode range seen next to each other can be considered as suspicious. NTFZLatinZ Emoticons )KatakanaHiraganaCJKZHangulz Basic LatinZ PunctuationZForms)splitr)rfrgZkeywords_range_aZkeywords_range_belrrrresLrei)maxsize皙?F)decoded_sequencemaximum_thresholddebugrc Csg}tD]}||q t|}d}|dkr8d}n|dkrFd}nd}t|td|D]d\}} |D]} | |rf| |qf| dkr| |dks| |dkrZtd d |D}||krZqqZ|r|D]} t | j | j qt |d S) zw Compute a mess ratio given a decoded bytes sequence. The maximum threshold does stop the computation earlier. rJi irrrcSsg|] }|jqSr)r%).0dtrrr 1zmess_ratio..) r__subclasses__appendrtzipranger r!sumprint __class__r%round) rrrZ detectorsZmd_classlengthZmean_mess_ratioZ!intermediary_mean_mess_ratio_calcrindexdetectorrrrr mess_ratios6     rN)rF)% functoolsrtypingrrZconstantrutilsrrr r r r r rrrrrrrrrr.rMrRr]rbrhrvr{r*r+rer-rrrrrs*  D"<$GPM  =