B /WbJ @spddlZddlmZddlmZmZddlmZmZm Z m Z ddl m Z m Z mZmZddlmZmZmZmZddlmZdd lmZmZdd lmZmZmZmZmZm Z m!Z!e"d Z#e$Z%e%&e'd de(e)e)e*ee+ee+e,e,ed ddZ-dee)e)e*ee+ee+e,e,ed ddZ.d ee)e)e*ee+ee+e,e,ed ddZ/d!ee)e)e*ee+ee+e,edddZ0dS)"N)PathLike)basenamesplitext)BinaryIOListOptionalSet)coherence_ratioencoding_languagesmb_encoding_languagesmerge_coherence_ratios)IANA_SUPPORTEDTOO_BIG_SEQUENCETOO_SMALL_SEQUENCETRACE) mess_ratio) CharsetMatchCharsetMatches)any_specified_encodingcut_sequence_chunks iana_nameidentify_sig_or_bom is_cp_similaris_multi_byte_encodingshould_strip_sig_or_bomZcharset_normalizerz)%(asctime)s | %(levelname)s | %(message)s皙?TF) sequencessteps chunk_size threshold cp_isolation cp_exclusionpreemptive_behaviourexplainreturnc- Cst|ttfs tdt||r>tj}tt t t t |} | dkrt d|rvtt t |prtjtt|dddgdgS|dk rtt d d |d d |D}ng}|dk rtt d d |dd |D}ng}| ||krtt d||| d}| }|dkr:| ||kr:t| |}t |tk} t |tk} | rltt d| n| rtt d| g} |rt|nd} | dk r| | tt d| t}g}g}d}d}d}t}t|\}}|dk r| |tt dt ||| dd| kr.| dx| tD]r}|rT||krTq:|rh||krhq:||krvq:||d}||k}|ot|}|dkr|stt d|q:y t|}Wn,t t!fk rtt d|w:YnXyr| r@|dkr@t"|dkr$|dtdn|t |td|dn&t"|dkrP|n|t |d|d}WnVt#t$fk r}z2t|t$stt d|t"|||w:Wdd}~XYnXd}x |D]}t%||rd}PqW|rtt d||q:t&|s dnt || t| |}|o<|dk oZszfrom_bytes..zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.cSsg|]}t|dqS)F)r)r*r+r,r,r-r.esz^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).z@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.ascii>utf_32utf_16z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecodergA)encodingz9Code page %s does not fit given bytes sequence at ALL. %sTzW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.zaLazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %sgj@strict)errorsz^LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %szc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.d)ndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g?,z We detected language {} using {}z.Encoding detection: %s is most likely the one.zoEncoding detection: %s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.zONothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.z7Encoding detection: %s will be used as a fallback matchz:Encoding detection: utf_8 will be used as a fallback matchz:Encoding detection: ascii will be used as a fallback matchz]Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.z=Encoding detection: Unable to determine any suitable charset.)4 isinstance bytearraybytes TypeErrorformattypeloggerlevel addHandlerexplain_handlersetLevelrlendebug removeHandlerloggingWARNINGrrlogjoinintrrrappendsetrraddrrModuleNotFoundError ImportErrorstrUnicodeDecodeError LookupErrorrrangemaxrrdecodesumroundr r r r r2 fingerprintbest)-rr r!r"r#r$r%r&Zprevious_logger_levellengthZis_too_small_sequenceZis_too_large_sequenceZprioritized_encodingsZspecified_encodingZtestedZtested_but_hard_failureZtested_but_soft_failureZfallback_asciiZ fallback_u8Zfallback_specifiedresultsZ sig_encodingZ sig_payloadZ encoding_ianaZdecoded_payloadZbom_or_sig_availableZstrip_sig_or_bomZis_multi_byte_decodereZsimilar_soft_failure_testZencoding_soft_failedZr_Zmulti_byte_bonusZmax_chunk_gave_upZearly_stop_countZlazy_str_hard_failureZ md_chunksZ md_ratioschunkZmean_mess_ratioZfallback_entryZtarget_languagesZ cd_ratiosZchunk_languagesZcd_ratios_mergedr,r,r- from_bytes"sD                                                              rb) fpr r!r"r#r$r%r&r'c Cst||||||||S)z Same thing than the function from_bytes but using a file pointer that is already ready. Will not close the file pointer. )rbread)rcr r!r"r#r$r%r&r,r,r-from_fpsre) pathr r!r"r#r$r%r&r'c Cs,t|d}t||||||||SQRXdS)z Same thing than the function from_bytes but with one extra step. Opening and reading given file path in binary mode. Can raise IOError. rbN)openre) rfr r!r"r#r$r%r&rcr,r,r- from_paths ri)rfr r!r"r#r$r%r'c Cst|||||||}t|}tt|} t|dkrBtd||} | dd| j7<t dt | |d | d} | | WdQRX| S)zi Take a (text-based) file path and try to create another file next to it, this time using UTF-8. rz;Unable to normalize "{}", no encoding charset seems to fit.-z{}r)wbN)rirlistrrGIOErrorr@r]r2rhrTreplacerMwriteoutput) rfr r!r"r#r$r%r_filenameZtarget_extensionsresultrcr,r,r- normalizes*    rs)rrrNNTF)rrrNNTF)rrrNNTF)rrrNNT)1rJosros.pathrrtypingrrrrZcdr r r r ZconstantrrrrZmdrmodelsrrutilsrrrrrrr getLoggerrB StreamHandlerrE setFormatter Formatterr>rNfloatrTboolrbrerirsr,r,r,r-s\  $  >