B °¤ï`,7ã @s´ddlmZmZddlmZmZmZmZmZyddl m Z Wn e k r\ee dfZ YnXddl mZmZmZddlmZddlmZmZddlmZdd lZdd lmZmZmZmZmZmZdd l m!Z!m"Z"m#Z#m$Z$e %d ¡Z&e& 'ej(¡e )¡Z*e* +e ,d ¡¡e& -e*¡de.e/e/e0ee ee e1e1edœ dd„Z2d ee/e/e0ee ee e1e1edœ dd„Z3d!e e/e/e0ee ee e1e1edœ dd„Z4d"e e/e/e0ee ee e1edœdd„Z5d S)#é)ÚsplitextÚbasename)ÚListÚBinaryIOÚOptionalÚSetÚUnion)ÚPathLikezos.PathLike[str])ÚTOO_SMALL_SEQUENCEÚTOO_BIG_SEQUENCEÚIANA_SUPPORTED)Ú mess_ratio)ÚCharsetMatchesÚ CharsetMatch)ÚwarnN)Úany_specified_encodingÚis_multi_byte_encodingÚidentify_sig_or_bomÚshould_strip_sig_or_bomÚ is_cp_similarÚ iana_name)Úcoherence_ratioÚencoding_languagesÚmb_encoding_languagesÚmerge_coherence_ratiosZcharset_normalizerz)%(asctime)s | %(levelname)s | %(message)sééçš™™™™™É?TF) Ú sequencesÚstepsÚ chunk_sizeÚ thresholdÚ cp_isolationÚ cp_exclusionÚpreemptive_behaviourÚexplainÚreturnc+ Csx|st tj¡n t tj¡t|ƒ}|dkrPt d¡tt|dddgdƒgƒS|dk rzt dd   |¡¡d d „|Dƒ}ng}|dk r¨t d d   |¡¡d d „|Dƒ}ng}|||krÐt d|||¡d}|}|dkrð|||krðt ||ƒ}t|ƒt k} t|ƒt k} | rt d |¡ƒg} |dkr2t|ƒnd} | dk rV|  | ¡t d| ¡tƒ} g}g}d}d}tƒ}t|ƒ\}}|dk r¤|  |¡t dt|ƒ|¡|  d¡d| krÂ|  d¡x®| tD] }|rè||krèqÎ|rü||krüqÎ|| kr qÎ|  |¡d}||k}|o,t|ƒ}|dkrR|dkrRt d|¡qÎy t|ƒ}Wn*ttfk rˆt d|¡wÎYnXyr| rÔ|dkrÔt|dkr¸|dt dƒ…n|t|ƒt dƒ…|dn&t|dkrä|n|t|ƒd…|d}Wn‚tk rN}z2t d|t|ƒ¡| |¡|s:|d7}wÎWdd}~XYn2tk r~| |¡|sv|d7}wÎYnXd}x |D]}t||ƒrŠd}PqŠW|r¾t d||¡qÎt|dkrÎdnt|ƒ|t ||ƒƒ}|oþ|dk oþt|ƒ|k}|rt d|¡t t|ƒdƒ}|dkr0d}d} g}!g}"x|D]ˆ}#||#|#|…}$|rn|dkrn||$}$|$j |dd }%|! |%¡|" t!|%|ƒ¡|"d!|kr¬| d7} | |ksÆ|rB|dkrBPqBW|"ræt"|"ƒt|"ƒ}&nd}&|&|ksþ| |kr6| |¡|s|d7}t d"|| t#|&d#d$d%¡qÎt d&|t#|&d#d$d%¡|s`t$|ƒ}'nt%|ƒ}'|'r„t d' |t|'ƒ¡¡g}(x4|!D],}%t&|%d(|'r¨d)  |'¡ndƒ})|( |)¡qŽWt'|(ƒ}*|*rÞt d* |*|¡¡| t|||&||*|ƒ¡|| ddgkr*|&d(kr*t d+|¡t||gƒS||krNt d,|¡t||gƒS|d!j(rÎt d-||d!j)¡qÎW|S).aD Given a raw bytes sequence, return the best possibles charset usable to render str objects. If there is no results, it is a strong indicator that the source is binary/not text. By default, the process will extract 5 blocs of 512o each to assess the mess and coherence of a given sequence. And will give up a particular code page after 20% of measured mess. Those criteria are customizable at will. The preemptive behavior DOES NOT replace the traditional detection workflow, it prioritize a particular code page but never take it for granted. Can improve the performance. You may want to focus your attention to some code page or/and not others, use cp_isolation and cp_exclusion for that purpose. This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32. rzXGiven content is empty, stopping the process very early, returning empty utf_8 str matchÚutf_8gFÚNz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, cSsg|]}t|dƒ‘qS)F)r)Ú.0Úcp©r+úq/private/var/folders/7j/8686xlfs15q3tgljmghtvg0r0000gn/T/pip-target-isidps9b/lib/python/charset_normalizer/api.pyú Nszfrom_bytes..zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.cSsg|]}t|dƒ‘qS)F)r)r)r*r+r+r,r-Wsz^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.éz>Trying to detect encoding from a tiny portion of ({}) byte(s).Tz@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.Úascii>Úutf_16Úutf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecoderg€„A)Úencodingz9Code page %s does not fit given bytes sequence at ALL. %szW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!z Code page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes. Should not be a coincidence. Priority +1 given.ééÚignore)Úerrorséÿÿÿÿzc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.édé)Úndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}gš™™™™™¹?ú,z We detected language {} using {}z0%s is most likely the one. Stopping the process.z[%s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.z:Using %s code page we detected the following languages: %s)*ÚloggerÚsetLevelÚloggingÚCRITICALÚINFOÚlenÚwarningrrÚjoinÚintr r rÚformatrÚappendÚinfoÚsetrr ÚaddrrÚModuleNotFoundErrorÚ ImportErrorÚdebugÚstrÚUnicodeDecodeErrorÚ LookupErrorrÚrangeÚdecoder ÚsumÚroundrrrrÚ languagesZ _languages)+rrr r!r"r#r$r%ÚlengthZis_too_small_sequenceZis_too_large_sequenceZprioritized_encodingsZspecified_encodingZtestedZtested_but_hard_failureZtested_but_soft_failureZsingle_byte_hard_failure_countZsingle_byte_soft_failure_countÚresultsZ sig_encodingZ sig_payloadZ encoding_ianaZdecoded_payloadZbom_or_sig_availableZstrip_sig_or_bomZis_multi_byte_decoderÚeZsimilar_soft_failure_testZencoding_soft_failedZr_Zmulti_byte_bonusZmax_chunk_gave_upZearly_stop_countZ md_chunksZ md_ratiosÚiZ cut_sequenceÚchunkZmean_mess_ratioZtarget_languagesZ cd_ratiosZchunk_languagesZcd_ratios_mergedr+r+r,Ú from_bytessj                         ,                    rZ) Úfprr r!r"r#r$r%r&c Cst| ¡|||||||ƒS)z† Same thing than the function from_bytes but using a file pointer that is already ready. Will not close the file pointer. )rZÚread)r[rr r!r"r#r$r%r+r+r,Úfrom_fp@sr]) Úpathrr r!r"r#r$r%r&c Cs,t|dƒ}t||||||||ƒSQRXdS)z• Same thing than the function from_bytes but with one extra step. Opening and reading given file path in binary mode. Can raise IOError. ÚrbN)Úopenr]) r^rr r!r"r#r$r%r[r+r+r,Ú from_pathZs ra)r^rr r!r"r#r$r&c Csœt|||||||ƒ}t|ƒ}tt|ƒƒ} t|ƒdkrBtd |¡ƒ‚| ¡} | dd| j7<t d |  |d  | ¡¡¡dƒ} |   |   ¡¡WdQRX| S)zi Take a (text-based) file path and try to create another file next to it, this time using UTF-8. rz;Unable to normalize "{}", no encoding charset seems to fit.ú-z{}r(ÚwbN)rarÚlistrrAÚIOErrorrEÚbestr2r`ÚreplacerCÚwriteÚoutput) r^rr r!r"r#r$rVÚfilenameZtarget_extensionsÚresultr[r+r+r,Ú normalizels$   rl)rrrNNTF)rrrNNTF)rrrNNTF)rrrNNT)6Úos.pathrrÚtypingrrrrrÚosr rKrMZcharset_normalizer.constantr r r Zcharset_normalizer.mdr Zcharset_normalizer.modelsrrÚwarningsrr>Zcharset_normalizer.utilsrrrrrrZcharset_normalizer.cdrrrrÚ getLoggerr<r=ÚDEBUGÚ StreamHandlerÚhandlerÚ setFormatterÚ FormatterÚ addHandlerÚbytesrDÚfloatÚboolrZr]rarlr+r+r+r,ÚsX