a *NaJE @sddlmZmZddlmZmZmZmZzddlm Z Wne yNe Z Yn0ddl Z ddl mZmZmZmZddlmZmZmZddlmZdd lmZmZdd lmZmZmZmZm Z m!Z!e "d Z#e#$e j%e &Z'e'(e )d e#*e'de+e,e,e-ee ee e.e.ed ddZ/dee,e,e-ee ee e.e.ed ddZ0d e e,e,e-ee ee e.e.ed ddZ1d!e e,e,e-ee ee e.edddZ2dS)")basenamesplitext)BinaryIOListOptionalSet)PathLikeN)coherence_ratioencoding_languagesmb_encoding_languagesmerge_coherence_ratios)IANA_SUPPORTEDTOO_BIG_SEQUENCETOO_SMALL_SEQUENCE) mess_ratio) CharsetMatchCharsetMatches)any_specified_encoding iana_nameidentify_sig_or_bom is_cp_similaris_multi_byte_encodingshould_strip_sig_or_bomZcharset_normalizerz)%(asctime)s | %(levelname)s | %(message)s皙?TF) sequencessteps chunk_size threshold cp_isolation cp_exclusionpreemptive_behaviourexplainreturnc1 Cspt|ttfs tdt||s2ttj n ttj t |}|dkrpt dt t|dddgdgS|durt d d |d d |D}ng}|durt d d |dd |D}ng}|||krt d|||d}|}|dkr|||krt||}t |tk} t |tk} | rDt d|n| rZtd|g} |durpt|nd} | dur| | td| t} g}g}d}d}d}d}d}t }t|\}}|dur| |tdt ||| dd| vr | d| tD]}|r.||vr.q|rB||vrBq|| vrPq| |d}||k}|ort|}|dvr|durtd|qz t|}Wn*ttfytd|YqYn0zr| r|durt|dur|dtdn|t |td|dn&t|dur*|n|t |d|d}Wnt y}zDt d|t||||s~|d7}WYd}~qWYd}~n:d}~0t!y|||s|d7}YqYn0d}|D]}t"||rd}qq|rt d||qt#|dur dnt ||t||} |oP|duoPt ||k}!|!rdtd|tt | d}"|"d krd }"d}#g}$g}%| D]:}&||&|&|}'|r|dur||'}'|'j$|d!d"}(|r||&dkr|||&d#kr||d$krd$n|})|r||(d|)|vr|t#|&|&dd%D]T}*||*|&|}'|rR|durR||'}'|'j$|d!d"}(|(d|)|vr&q|q&|$|(|%t%|(||%d%|kr|#d7}#|#|"ks|r|durqАq|%rt&|%t |%}+nd}+|+|ks|#|"kr~|||s|d7}t d&||#t'|+d'd(d)|dd| fvrt|||dg|},|| krf|,}n|dkrv|,}n|,}qtd*|t'|+d'd(d)|st(|}-nt)|}-|-rtd+|t|-g}.|$D],}(t*|(d,|-rd-|-nd}/|.|/qt+|.}0|0r"td.|0||t|||+||0||| ddfvrr|+d,krrtd/|t ||gS||krtd0|t ||gS|d%j,rtd1|||j-qt |dkrl|s|s|rt d2|rt d3|j.||nd|r|dus<|r2|r2|j/|j/ks<|durRt d4||n|rlt d5|||S)6aD Given a raw bytes sequence, return the best possibles charset usable to render str objects. If there is no results, it is a strong indicator that the source is binary/not text. By default, the process will extract 5 blocs of 512o each to assess the mess and coherence of a given sequence. And will give up a particular code page after 20% of measured mess. Those criteria are customizable at will. The preemptive behavior DOES NOT replace the traditional detection workflow, it prioritize a particular code page but never take it for granted. Can improve the performance. You may want to focus your attention to some code page or/and not others, use cp_isolation and cp_exclusion for that purpose. This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32. z4Expected object of type bytes or bytearray, got: {0}rzXGiven content is empty, stopping the process very early, returning empty utf_8 str matchutf_8gFNz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, cSsg|]}t|dqSFr.0cpr-q/private/var/folders/js/6pj4vh5d4zd0k6bxv74qrbhr0000gr/T/pip-target-22xwyzbs/lib/python/charset_normalizer/api.py Xzfrom_bytes..zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.cSsg|]}t|dqSr(r)r*r-r-r.r/br0z^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).Tz@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.ascii>utf_16utf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecodergA)encodingz9Code page %s does not fit given bytes sequence at ALL. %szW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.ignore)errorszc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.d)ndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g?,z We detected language {} using {}z0%s is most likely the one. Stopping the process.z[%s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.z:Using %s code page we detected the following languages: %szONothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.z#%s will be used as a fallback matchz&utf_8 will be used as a fallback matchz&ascii will be used as a fallback match)0 isinstance bytearraybytes TypeErrorformattypeloggersetLevelloggingCRITICALINFOlenwarningrrjoinintrrinforappendsetrraddrrModuleNotFoundError ImportErrordebugstrUnicodeDecodeError LookupErrorrrangedecodersumroundr r r r languagesZ _languagesr4 fingerprint)1rrrr r!r"r#r$lengthZis_too_small_sequenceZis_too_large_sequenceZprioritized_encodingsZspecified_encodingZtestedZtested_but_hard_failureZtested_but_soft_failureZfallback_asciiZ fallback_u8Zfallback_specifiedZsingle_byte_hard_failure_countZsingle_byte_soft_failure_countresultsZ sig_encodingZ sig_payloadZ encoding_ianaZdecoded_payloadZbom_or_sig_availableZstrip_sig_or_bomZis_multi_byte_decodereZsimilar_soft_failure_testZencoding_soft_failedZr_Zmulti_byte_bonusZmax_chunk_gave_upZearly_stop_countZ md_chunksZ md_ratiosiZ cut_sequencechunkZchunk_partial_size_chkjZmean_mess_ratioZfallback_entryZtarget_languagesZ cd_ratiosZchunk_languagesZcd_ratios_mergedr-r-r. from_bytes%sh                   "                          re) fprrr r!r"r#r$r%c Cst||||||||S)z Same thing than the function from_bytes but using a file pointer that is already ready. Will not close the file pointer. )reread)rfrrr r!r"r#r$r-r-r.from_fpsrh) pathrrr r!r"r#r$r%c CsDt|d&}t||||||||WdS1s60YdS)z Same thing than the function from_bytes but with one extra step. Opening and reading given file path in binary mode. Can raise IOError. rbN)openrh) rirrr r!r"r#r$rfr-r-r. from_paths rl)rirrr r!r"r#r%c Cst|||||||}t|}tt|} t|dkrBtd||} | dd| j7<t dt | |d | d} | | Wdn1s0Y| S)zi Take a (text-based) file path and try to create another file next to it, this time using UTF-8. rz;Unable to normalize "{}", no encoding charset seems to fit.-z{}r'wbN)rlrlistrrKIOErrorrDbestr4rkrVreplacerMwriteoutput) rirrr r!r"r#r`filenameZtarget_extensionsresultrfr-r-r. normalizes2    ,rw)rrrNNTF)rrrNNTF)rrrNNTF)rrrNNT)3os.pathrrtypingrrrrosrrTrVrHZcdr r r r ZconstantrrrZmdrmodelsrrutilsrrrrrr getLoggerrFrGDEBUG StreamHandlerhandler setFormatter Formatter addHandlerrBrNfloatboolrerhrlrwr-r-r-r.s       $