a *Na4@sddlZddlmZddlmZddlmZddlmZddl m Z m Z ddl mZmZmZmZmZmZmZmZdd lmZdd lmZdd lmZmZmZGd d d ZGdddZee e!fZ"ee"Z#GdddZ$dS)N)Counter)aliases)sha256)dumps)compilesub)AnyDictIteratorListOptionalSetTupleUnion)TOO_BIG_SEQUENCE) mess_ratio) iana_nameis_multi_byte_encoding unicode_rangec@s eZdZdBeeeedeedddZe edddZ e edd d Z e ed d d Z e ed ddZe ed ddZe ed ddZed ddZed ddZdddddZe ed ddZe eed ddZe ed ddZe ed d d!Ze eed d"d#Ze ed d$d%Ze ed d&d'Ze ed d(d)Ze ed d*d+Ze ed d,d-Ze ed d.d/Z e edd d0d1Z!e ed d2d3Z"e eed d4d5Z#e eed d6d7Z$dd d8d9Z%dd d:d;Z&dCeed=d>d?Z'e ed d@dAZ(dS)D CharsetMatchNCoherenceMatches)payloadguessed_encodingmean_mess_ratiohas_sig_or_bom languagesdecoded_payloadcCsF||_||_||_||_||_d|_g|_d|_d|_d|_ ||_ dS)N) _payload _encoding_mean_mess_ratio _languages_has_sig_or_bom_unicode_ranges_leavesZ_mean_coherence_ratio_output_payload_output_encoding_string)selfrrrrrrr*t/private/var/folders/js/6pj4vh5d4zd0k6bxv74qrbhr0000gr/T/pip-target-22xwyzbs/lib/python/charset_normalizer/models.py__init__s zCharsetMatch.__init__)otherreturncCs>t|ts&tdt|jt|j|j|jko<|j|jkS)Nz&__eq__ cannot be invoked on {} and {}.) isinstancer TypeErrorformatstr __class__encoding fingerprintr)r-r*r*r+__eq__(s zCharsetMatch.__eq__cCs^t|tstt|j|j}|dkrR|dkrF|j|jkrF|j|jkS|j|jkS|j|jkS)zQ Implemented to make sorted available upon CharsetMatches items. g{Gz?r)r/r ValueErrorabschaos coherencemulti_byte_usage)r)r-Zchaos_differencer*r*r+__lt__1s   zCharsetMatch.__lt__r.cCsdtt|t|jS)N?)lenr2rawr)r*r*r+r<CszCharsetMatch.multi_byte_usagecCstdttt|dS)z Check once again chaos in decoded text, except this time, with full content. Use with caution, this can be very slow. Notice: Will be removed in 3.0 z=chaos_secondary_pass is deprecated and will be removed in 3.0r?)warningswarnDeprecationWarningrr2rBr*r*r+chaos_secondary_passGs z!CharsetMatch.chaos_secondary_passcCstdtdS)zy Coherence ratio on the first non-latin language detected if ANY. Notice: Will be removed in 3.0 z)r1r4r5rBr*r*r+__repr__tszCharsetMatch.__repr__cCs8t|tr||kr"td|jd|_|j|dS)Nz;Unable to add instance <{}> as a submatch of a CharsetMatch)r/rr8r1r3r(r%appendr6r*r*r+ add_submatchwszCharsetMatch.add_submatchcCs|jSN)r rBr*r*r+r4szCharsetMatch.encodingcCsDg}tD]2\}}|j|kr*||q |j|kr ||q |S)z Encoding name are known by many name, using this could help when searching for IBM855 when it's listed as CP855. )ritemsr4rP)r)Z also_known_asupr*r*r+encoding_aliasess    zCharsetMatch.encoding_aliasescCs|jSrRr#rBr*r*r+bomszCharsetMatch.bomcCs|jSrRrWrBr*r*r+byte_order_markszCharsetMatch.byte_order_markcCsdd|jDS)z Return the complete list of possible languages found in decoded sequence. Usually not really useful. Returned list may be empty even if 'language' property return something != 'Unknown'. cSsg|] }|dqS)rr*).0er*r*r+ z*CharsetMatch.languages..r"rBr*r*r+rszCharsetMatch.languagescCsp|jsbd|jvrdSddlm}m}t|jr8||jn||j}t|dksVd|vrZdS|dS|jddS)z Most probable language found in decoded sequence. If none were detected or inferred, the property will return "Unknown". asciiZEnglishr)encoding_languagesmb_encoding_languagesz Latin BasedUnknown)r"could_be_from_charsetZcharset_normalizer.cdr`rarr4r@)r)r`rarr*r*r+languages  zCharsetMatch.languagecCs|jSrR)r!rBr*r*r+r:szCharsetMatch.chaoscCs|js dS|jddS)Nrrrr^rBr*r*r+r;szCharsetMatch.coherencecCst|jdddSNd)ndigits)roundr:rBr*r*r+ percent_chaosszCharsetMatch.percent_chaoscCst|jdddSre)rir;rBr*r*r+percent_coherenceszCharsetMatch.percent_coherencecCs|jS)z+ Original untouched bytes. )rrBr*r*r+rAszCharsetMatch.rawcCs|jSrR)r%rBr*r*r+submatchszCharsetMatch.submatchcCst|jdkSNr)r@r%rBr*r*r+ has_submatchszCharsetMatch.has_submatchcCsN|jdur|jSt}t|D]}t|}|r||qtt||_|jSrR)r$setr2raddsortedlist)r)Zdetected_ranges characterZdetected_ranger*r*r+ alphabetss   zCharsetMatch.alphabetscCs|jgdd|jDS)z The complete list of encoding that output the exact SAME str result and therefore could be the originating encoding. This list does include the encoding available in property 'encoding'. cSsg|] }|jqSr*)r4)rZmr*r*r+r\r]z6CharsetMatch.could_be_from_charset..)r r%rBr*r*r+rcsz"CharsetMatch.could_be_from_charsetcCs|Sz> Kept for BC reasons. Will be removed in 3.0. r*rBr*r*r+firstszCharsetMatch.firstcCs|Srvr*rBr*r*r+bestszCharsetMatch.bestutf_8)r4r.cCs2|jdus|j|kr,||_t||d|_|jS)z Method to get re-encoded bytes payload using given target encoding. Default to UTF-8. Any errors will be simply ignored by the encoder NOT replaced. Nreplace)r'r2encoder&)r)r4r*r*r+outputszCharsetMatch.outputcCst|S)zw Retrieve the unique SHA256 computed using the transformed (re-encoded) payload. Not the original one. )rr| hexdigestrBr*r*r+r5 szCharsetMatch.fingerprint)N)ry))__name__ __module__ __qualname__bytesr2floatboolr r,objectr7r=propertyr<rFrGrrLrNrOrQr4r rVrXrYrrdr:r;rjrkrArlrnrtrcrwrxr|r5r*r*r*r+rsr         rc@seZdZdZdeedddZeedddZe e e fed d d Z e dd d Z edddZedd ddZeddddZeddddZdS)CharsetMatchesz Container with every CharsetMatch items ordered by default from most probable to the less one. Act like a list(iterable) but does not implements all related methods. N)resultscCs|r t|ng|_dSrR)rq_results)r)rr*r*r+r,szCharsetMatches.__init__r>ccs|jD] }|VqdSrRr)r)resultr*r*r+__iter__s zCharsetMatches.__iter__)itemr.cCsNt|tr|j|St|trFt|d}|jD]}||jvr.|Sq.tdS)z Retrieve a single item either by its position or encoding name (alias may be used here). Raise KeyError upon invalid index or encoding not present in results. FN)r/intrr2rrcKeyError)r)rrr*r*r+ __getitem__!s       zCharsetMatches.__getitem__cCs t|jSrRr@rrBr*r*r+__len__/szCharsetMatches.__len__cCst|jdkSrmrrBr*r*r+__bool__2szCharsetMatches.__bool__cCs|t|tstdt|jt|jtkr`|j D],}|j |j kr2|j |j kr2| |dSq2|j |t|j |_ dS)z~ Insert a single match. Will be inserted accordingly to preserve sort. Can be inserted as a submatch. z-Cannot append instance '{}' to CharsetMatchesN)r/rr8r1r2r3r@rArrr5r:rQrPrq)r)rmatchr*r*r+rP5s    zCharsetMatches.appendrcCs|js dS|jdS)zQ Simply return the first match. Strict equivalent to matches[0]. NrrrBr*r*r+rxIszCharsetMatches.bestcCs|S)zP Redundant method, call the method best(). Kept for BC reasons. )rxrBr*r*r+rwQszCharsetMatches.first)N)r~rr__doc__r rr,r rrrr2rrrrrPr rxrwr*r*r*r+rsrc @sjeZdZeeeeeeeeeeeeeeeed ddZe e ee fdddZ edddZ d S) CliDetectionResult pathr4rValternative_encodingsrdrtrr:r; unicode_path is_preferredc CsF||_| |_||_||_||_||_||_||_||_| |_ | |_ dSrR) rrr4rVrrdrtrr:r;r) r)rr4rVrrdrtrr:r;rrr*r*r+r,]szCliDetectionResult.__init__r>c Cs2|j|j|j|j|j|j|j|j|j|j |j d S)NrrrBr*r*r+__dict__wszCliDetectionResult.__dict__cCst|jdddS)NT) ensure_asciiindent)rrrBr*r*r+to_jsonszCliDetectionResult.to_jsonN)r~rrr2r r rrr,rr rrrr*r*r*r+r\s r)%rC collectionsrZencodings.aliasesrhashlibrjsonrrerrIrtypingrr r r r r rrZconstantrZmdrutilsrrrrrr2rZCoherenceMatchrrr*r*r*r+s     (  D