B _oa4@sddlZddlmZddlmZddlmZddlmZddl m Z ddl m Z m Z mZmZmZmZmZdd lmZmZdd lmZdd lmZmZmZGd d d ZGdddZeeefZ ee Z!GdddZ"dS)N)Counter)aliases)sha256)dumps)sub)AnyDictIteratorListOptionalTupleUnion)NOT_PRINTABLE_PATTERNTOO_BIG_SEQUENCE) mess_ratio) iana_nameis_multi_byte_encoding unicode_rangec@s eZdZdBeeeedeedddZe edddZ e edd d Z e ed d d Z e ed ddZe ed ddZe ed ddZed ddZed ddZdddddZe ed ddZe eed ddZe ed ddZe ed d d!Ze eed d"d#Ze ed d$d%Ze ed d&d'Ze ed d(d)Ze ed d*d+Ze ed d,d-Ze ed d.d/Z e edd d0d1Z!e ed d2d3Z"e eed d4d5Z#e eed d6d7Z$dd d8d9Z%dd d:d;Z&dCeed=d>d?Z'e ed d@dAZ(dS)D CharsetMatchNCoherenceMatches)payloadguessed_encodingmean_mess_ratiohas_sig_or_bom languagesdecoded_payloadcCsF||_||_||_||_||_d|_g|_d|_d|_d|_ ||_ dS)Ng) _payload _encoding_mean_mess_ratio _languages_has_sig_or_bom_unicode_ranges_leavesZ_mean_coherence_ratio_output_payload_output_encoding_string)selfrrrrrrr(@/tmp/pip-target-avibdbtm/lib/python/charset_normalizer/models.py__init__s zCharsetMatch.__init__)otherreturncCs>t|ts&tdt|jt|j|j|jko<|j|jkS)Nz&__eq__ cannot be invoked on {} and {}.) isinstancer TypeErrorformatstr __class__encoding fingerprint)r'r+r(r(r)__eq__(s  zCharsetMatch.__eq__cCsvt|tstt|j|j}t|j|j}|dkrj|dkrj|dkr^|j|jkr^|j|jkS|j|jkS|j|jkS)zQ Implemented to make sorted available upon CharsetMatches items. g{Gz?g{Gz?g)r-r ValueErrorabschaos coherencemulti_byte_usage)r'r+Zchaos_differenceZcoherence_differencer(r(r)__lt__1s   zCharsetMatch.__lt__)r,cCsdtt|t|jS)Ng?)lenr0raw)r'r(r(r)r9DszCharsetMatch.multi_byte_usagecCstdttt|dS)z Check once again chaos in decoded text, except this time, with full content. Use with caution, this can be very slow. Notice: Will be removed in 3.0 z=chaos_secondary_pass is deprecated and will be removed in 3.0g?)warningswarnDeprecationWarningrr0)r'r(r(r)chaos_secondary_passHsz!CharsetMatch.chaos_secondary_passcCstdtdS)zy Coherence ratio on the first non-latin language detected if ANY. Notice: Will be removed in 3.0 zr?)r'r(r(r)coherence_non_latinUsz CharsetMatch.coherence_non_latincCs,tdtttdt|}t|S)z_ Word counter instance on decoded text. Notice: Will be removed in 3.0 z2w_counter is deprecated and will be removed in 3.0 ) r=r>r?rrr0lowerrsplit)r'Zstring_printable_onlyr(r(r) w_counteraszCharsetMatch.w_countercCs"|jdkrt|j|jd|_|jS)Nstrict)r&r0rr)r'r(r(r)__str__os zCharsetMatch.__str__cCsd|j|jS)Nz)r/r2r3)r'r(r(r)__repr__uszCharsetMatch.__repr__cCs8t|tr||kr"td|jd|_|j|dS)Nz;Unable to add instance <{}> as a submatch of a CharsetMatch)r-rr5r/r1r&r#append)r'r+r(r(r) add_submatchxs  zCharsetMatch.add_submatchcCs|jS)N)r)r'r(r(r)r2szCharsetMatch.encodingcCsHg}x>tD]2\}}|j|kr,||q|j|kr||qW|S)z Encoding name are known by many name, using this could help when searching for IBM855 when it's listed as CP855. )ritemsr2rI)r'Z also_known_asupr(r(r)encoding_aliasess   zCharsetMatch.encoding_aliasescCs|jS)N)r!)r'r(r(r)bomszCharsetMatch.bomcCs|jS)N)r!)r'r(r(r)byte_order_markszCharsetMatch.byte_order_markcCsdd|jDS)z Return the complete list of possible languages found in decoded sequence. Usually not really useful. Returned list may be empty even if 'language' property return something != 'Unknown'. cSsg|] }|dqS)rr().0er(r(r) sz*CharsetMatch.languages..)r )r'r(r(r)rszCharsetMatch.languagescCsp|jsbd|jkrdSddlm}m}t|jr8||jn||j}t|dksVd|krZdS|dS|jddS)z Most probable language found in decoded sequence. If none were detected or inferred, the property will return "Unknown". asciiZEnglishr)encoding_languagesmb_encoding_languagesz Latin BasedUnknown)r could_be_from_charsetZcharset_normalizer.cdrUrVrr2r;)r'rUrVrr(r(r)languages  zCharsetMatch.languagecCs|jS)N)r)r'r(r(r)r7szCharsetMatch.chaoscCs|js dS|jddS)Ngrr)r )r'r(r(r)r8szCharsetMatch.coherencecCst|jdddS)Nd)ndigits)roundr7)r'r(r(r) percent_chaosszCharsetMatch.percent_chaoscCst|jdddS)NrZr[)r\)r]r8)r'r(r(r)percent_coherenceszCharsetMatch.percent_coherencecCs|jS)z+ Original untouched bytes. )r)r'r(r(r)r<szCharsetMatch.rawcCs|jS)N)r#)r'r(r(r)submatchszCharsetMatch.submatchcCst|jdkS)Nr)r;r#)r'r(r(r) has_submatchszCharsetMatch.has_submatchcCs@|jdk r|jSddt|D}ttdd|D|_|jS)NcSsg|] }t|qSr()r)rQcharr(r(r)rSsz*CharsetMatch.alphabets..cSsh|] }|r|qSr(r()rQrr(r(r) sz)CharsetMatch.alphabets..)r"r0sortedlist)r'Zdetected_rangesr(r(r) alphabetss  zCharsetMatch.alphabetscCs|jgdd|jDS)z The complete list of encoding that output the exact SAME str result and therefore could be the originating encoding. This list does include the encoding available in property 'encoding'. cSsg|] }|jqSr()r2)rQmr(r(r)rSsz6CharsetMatch.could_be_from_charset..)rr#)r'r(r(r)rXsz"CharsetMatch.could_be_from_charsetcCs|S)z> Kept for BC reasons. Will be removed in 3.0. r()r'r(r(r)firstszCharsetMatch.firstcCs|S)z> Kept for BC reasons. Will be removed in 3.0. r()r'r(r(r)bestszCharsetMatch.bestutf_8)r2r,cCs2|jdks|j|kr,||_t||d|_|jS)z Method to get re-encoded bytes payload using given target encoding. Default to UTF-8. Any errors will be simply ignored by the encoder NOT replaced. Nreplace)r%r0encoder$)r'r2r(r(r)outputszCharsetMatch.outputcCst|S)zw Retrieve the unique SHA256 computed using the transformed (re-encoded) payload. Not the original one. )rrn hexdigest)r'r(r(r)r3 szCharsetMatch.fingerprint)N)rk))__name__ __module__ __qualname__bytesr0floatboolr r*objectr4r:propertyr9r@rArrErGrHrJr2r rNrOrPrrYr7r8r^r_r<r`rargrXrirjrnr3r(r(r(r)rsb        rc@seZdZdZdeedddZeedddZe e e fed d d Z e dd d Z edddZedd ddZeddddZeddddZdS)CharsetMatchesz Container with every CharsetMatch items ordered by default from most probable to the less one. Act like a list(iterable) but does not implements all related methods. N)resultscCs|r t|ng|_dS)N)re_results)r'ryr(r(r)r*szCharsetMatches.__init__)r,ccsx|jD] }|VqWdS)N)rz)r'resultr(r(r)__iter__s zCharsetMatches.__iter__)itemr,cCsNt|tr|j|St|trFt|d}x|jD]}||jkr0|Sq0WtdS)z Retrieve a single item either by its position or encoding name (alias may be used here). Raise KeyError upon invalid index or encoding not present in results. FN)r-intrzr0rrXKeyError)r'r}r{r(r(r) __getitem__"s      zCharsetMatches.__getitem__cCs t|jS)N)r;rz)r'r(r(r)__len__0szCharsetMatches.__len__cCst|jdkS)Nr)r;rz)r'r(r(r)__bool__3szCharsetMatches.__bool__cCs~t|tstdt|jt|jtkrbx4|j D]*}|j |j kr4|j |j kr4| |dSq4W|j |t|j |_ dS)z~ Insert a single match. Will be inserted accordingly to preserve sort. Can be inserted as a submatch. z-Cannot append instance '{}' to CharsetMatchesN)r-rr5r/r0r1r;r<rrzr3r7rJrIre)r'r}matchr(r(r)rI6s    zCharsetMatches.appendrcCs|js dS|jdS)zQ Simply return the first match. Strict equivalent to matches[0]. Nr)rz)r'r(r(r)rjJszCharsetMatches.bestcCs|S)zP Redundant method, call the method best(). Kept for BC reasons. )rj)r'r(r(r)riRszCharsetMatches.first)N)rprqrr__doc__r rr*r r|r r~r0rrrurrIr rjrir(r(r(r)rxsrxc @sjeZdZeeeeeeeeeeeeeeeed ddZe e ee fdddZ edddZ d S) CliDetectionResult) pathr2rNalternative_encodingsrYrgrr7r8 unicode_path is_preferredc CsF||_| |_||_||_||_||_||_||_||_| |_ | |_ dS)N) rrr2rNrrYrgrr7r8r) r'rr2rNrrYrgrr7r8rrr(r(r)r*^szCliDetectionResult.__init__)r,c Cs2|j|j|j|j|j|j|j|j|j|j |j d S)N) rr2rNrrYrgrr7r8rr) rr2rNrrYrgrr7r8rr)r'r(r(r)__dict__xszCliDetectionResult.__dict__cCst|jdddS)NT) ensure_asciiindent)rr)r'r(r(r)to_jsonszCliDetectionResult.to_jsonN)rprqrrr0r r rurtr*rwrrrrr(r(r(r)r]sr)#r= collectionsrZencodings.aliasesrhashlibrjsonrrertypingrrr r r r r ZconstantrrZmdrutilsrrrrrxr0rtZCoherenceMatchrrr(r(r(r)s      $  D