B `6@sddlmZddlmZmZddlmZddlmZm Z m Z m Z m Z m Z mZmZGdddZGdddeZGd d d eZGd d d eZGd ddeZGdddeZGdddeZGdddeZGdddeZeeeeedddZeddd"eeeeddd Zd!S)#) lru_cache)OptionalList)UNICODE_SECONDARY_RANGE_KEYWORD)is_punctuation is_symbol unicode_rangeis_accentuatedis_latin remove_accent is_separatoris_cjkc@sPeZdZdZeedddZeddddZddd d Ze e dd d Z dS) MessDetectorPluginzy Base abstract class used for mess detection plugins. All detectors MUST extend and implement given methods. ) characterreturncCstdS)z@ Determine if given character should be fed in. N)NotImplementedError)selfrrp/private/var/folders/7j/8686xlfs15q3tgljmghtvg0r0000gn/T/pip-target-isidps9b/lib/python/charset_normalizer/md.pyeligibleszMessDetectorPlugin.eligibleNcCstdS)z The main routine to be executed upon character. Insert the logic in witch the text would be considered chaotic. N)r)rrrrrfeedszMessDetectorPlugin.feed)rcCstdS)zB Permit to reset the plugin to the initial state. N)r)rrrrresetszMessDetectorPlugin.resetcCstdS)z Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.; No restriction gt 0. N)r)rrrrratio"szMessDetectorPlugin.ratio) __name__ __module__ __qualname____doc__strboolrrrpropertyfloatrrrrrr s rc@sTeZdZddZeedddZeddddZdd d d Ze e d d d Z dS) TooManySymbolOrPunctuationPlugincCs"d|_d|_d|_d|_d|_dS)NrF)_punctuation_count _symbol_count_character_count_last_printable_charZ_frenzy_symbol_in_word)rrrr__init__-s z)TooManySymbolOrPunctuationPlugin.__init__)rrcCs|S)N) isprintable)rrrrrr5sz)TooManySymbolOrPunctuationPlugin.eligibleNcCsd|jd7_||jkrZ|dkrZt|r8|jd7_n"|dkrZt|rZ|jd7_||_dS)N) <>=:/&;{}[]F)r$r%rr"isdigitrr#)rrrrrr8sz%TooManySymbolOrPunctuationPlugin.feed)rcCsd|_d|_d|_dS)Nr)r"r$r#)rrrrrCsz&TooManySymbolOrPunctuationPlugin.resetcCs0|jdkrdS|j|j|j}|dkr,|SdS)Nrgg333333?)r$r"r#)rZratio_of_punctuationrrrrHs z&TooManySymbolOrPunctuationPlugin.ratio) rrrr&rrrrrrr rrrrrr!+s  r!c@sTeZdZddZeedddZeddddZdd d d Ze e d d d Z dS)TooManyAccentuatedPlugincCsd|_d|_dS)Nr)r$_accentuated_count)rrrrr&Tsz!TooManyAccentuatedPlugin.__init__)rrcCs|S)N)isalpha)rrrrrrXsz!TooManyAccentuatedPlugin.eligibleNcCs(|jd7_t|r$|jd7_dS)Nr()r$r r7)rrrrrr[szTooManyAccentuatedPlugin.feed)rcCsd|_d|_dS)Nr)r$r7)rrrrraszTooManyAccentuatedPlugin.resetcCs*|jdkrdS|j|j}|dkr&|SdS)Nrggffffff?)r$r7)rZratio_of_accentuationrrrres  zTooManyAccentuatedPlugin.ratio) rrrr&rrrrrrr rrrrrr6Rs r6c@sTeZdZddZeedddZeddddZdd d d Ze e d d d Z dS)UnprintablePlugincCsd|_d|_dS)Nr)_unprintable_countr$)rrrrr&oszUnprintablePlugin.__init__)rrcCsdS)NTr)rrrrrrsszUnprintablePlugin.eligibleNcCs4|dkr"|dkr"|jd7_|jd7_dS)N>   Fr()r'r:r$)rrrrrrvszUnprintablePlugin.feed)rcCs d|_dS)Nr)r:)rrrrr{szUnprintablePlugin.resetcCs|jdkrdS|jd|jS)Nrg)r$r:)rrrrr~s zUnprintablePlugin.ratio) rrrr&rrrrrrr rrrrrr9ms r9c@sTeZdZddZeedddZeddddZdd d d Ze e d d d Z dS)SuspiciousDuplicateAccentPlugincCsd|_d|_d|_dS)Nr)_successive_countr$_last_latin_character)rrrrr&sz(SuspiciousDuplicateAccentPlugin.__init__)rrcCst|S)N)r )rrrrrrsz(SuspiciousDuplicateAccentPlugin.eligibleNcCsF|jdk rdS|sVt|sVt|r|jr|jd7_t |j}|j |7_ |dkr|j|dkrd|_ |j r|j d7_ |j t |j7_ d|_ d|_d|_n6|dkr|dkrt|rd|_ |j|7_dS) NrJr(g333333?TFr>r*-r)r+)r8joinrOr rPrErr rKlenr$rMrLrNr5r)rrZ buffer_lengthrrrrs, "zSuperWeirdWordPlugin.feed)rcCs(d|_d|_d|_d|_d|_d|_dS)NrJFr)rOrMrLrKr$rN)rrrrrs zSuperWeirdWordPlugin.resetcCs|jdkrdS|j|jS)Ng)rKrNr$)rrrrrs zSuperWeirdWordPlugin.ratio) rrrr&rrrrrrr rrrrrrIs  rIc@sXeZdZdZddZeedddZedddd Zdd d d Z e e d d dZ dS)CjkInvalidStopPluginu GB(Chinese) based encoding often render the stop incorrectly when the content does not fit and can be easily detected. Searching for the overuse of '丅' and '丄'. cCsd|_d|_dS)Nr)_wrong_stop_count_cjk_character_count)rrrrr&szCjkInvalidStopPlugin.__init__)rrcCsdS)NTr)rrrrrrszCjkInvalidStopPlugin.eligibleNcCs4|dkr|jd7_dSt|r0|jd7_dS)N)u丅u丄r()rWr rX)rrrrrrs zCjkInvalidStopPlugin.feed)rcCsd|_d|_dS)Nr)rWrX)rrrrr$szCjkInvalidStopPlugin.resetcCs|jdkrdS|j|jS)NrUg)rXrW)rrrrr(s zCjkInvalidStopPlugin.ratio) rrrrr&rrrrrrr rrrrrrVsrVc@sTeZdZddZeedddZeddddZdd d d Ze e d d d Z dS)ArchaicUpperLowerPlugincCsd|_d|_d|_d|_dS)NFr)_buf_successive_upper_lower_countr$_last_alpha_seen)rrrrr&1sz ArchaicUpperLowerPlugin.__init__)rrcCs|p|S)N)rEr8)rrrrrr8sz ArchaicUpperLowerPlugin.eligibleNcCsn|jdk rV|r|js.|rP|jrP|jdkrH|jd7_qVd|_nd|_|jd7_||_dS)NTr(F)r\isupperislowerrZr[r$)rrrrrr;s $ zArchaicUpperLowerPlugin.feed)rcCsd|_d|_d|_dS)Nr)r$r[r\)rrrrrHszArchaicUpperLowerPlugin.resetcCs|jdkrdS|jd|jS)Nrgr4)r$r[)rrrrrMs zArchaicUpperLowerPlugin.ratio) rrrr&rrrrrrr rrrrrrY/s  rY)rGrHrcCsN|dks|dkrdS||kr dSd|kr4d|kr4dSd|ksDd|krHdS|d|d}}x"|D]}|tkrrqd||krddSqdW|dkr|dkrdS|dks|dkrd|ksd|krdSd |ksd |krd|ksd|krdS|d ks|d krdSd|ksd|ks|dkrJ|dkrJd |ks.d |kr2dSd |ksFd |krJdSdS) za Determine if two Unicode range seen next to each other can be considered as suspicious. NTFZLatinZ Emoticons )KatakanaHiraganaCJKZHangulz Basic LatinZ PunctuationZForms)splitr)rGrHZkeywords_range_aZkeywords_range_belrrrrFUs< (rFi)maxsize皙?F)decoded_sequencemaximum_thresholddebugrc Csg}xtD]}||qWt|}d}|dkrszmess_ratio..) r__subclasses__appendrTziprangerrsumprint __class__rround) rgrhriZ detectorsZmd_classlengthZmean_mess_ratioZ!intermediary_mean_mess_ratio_calcrindexdetectorrnrrr mess_ratios8      r|N)rfF) functoolsrtypingrrZcharset_normalizer.constantrZcharset_normalizer.utilsrrrr r r r r rr!r6r9r?rBrIrVrYrrrFr r|rrrrs  ("'/<&0