B `R @sDdZdZddlZddlmZddlZddlZddlZdZyddl Z ddZ WnFe k ryddl Z ddZ Wne k rddZ YnXYnXy ddl Z Wne k rYnXd Zd ZeZeed ejeed ejd ee<eeejeeejd ee<Gd ddeZGdddZGdddZdS)aBBeautiful Soup bonus library: Unicode, Dammit This library converts a bytestream to Unicode through any means necessary. It is heavily based on code from Mark Pilgrim's Universal Feed Parser. It works best on XML and HTML, but it does not rewrite the XML or HTML to reflect a new encoding; that's the tree builder's job. MITN)codepoint2namecCst|trdSt|dS)Nencoding) isinstancestrcchardetdetect)sr j/private/var/folders/fw/jsxvvqfs4sz4tdnfdvg5typ5vk77qg/T/pip-install-p7nfy4dm/beautifulsoup4/bs4/dammit.pychardet_dammits r cCst|trdSt|dS)Nr)rrchardetr)r r r r r "s cCsdS)Nr )r r r r r *sz$^\s*<\?.*encoding=['"](.*?)['"].*\?>z0<\s*meta[^>]+charset\s*=\s*["']?([^>]*?)[ /;'">]ascii)htmlxmlc@seZdZdZddZe\ZZZdddddd Ze d Z e d Z e d d Ze ddZe ddZe dddZe dddZe ddZdS)EntitySubstitutionzFThe ability to substitute XML or HTML entities for certain characters.cCsxi}i}g}dg}xFtt|D]2\}}t|}|dkrN|||||<|||<q$Wdd|}||t|fS)N)'apos)"rz[%s])listritemschrappendjoinrecompile)lookupZreverse_lookupZcharacters_for_reextra codepointname characterZ re_definitionr r r _populate_class_variablesGs  z,EntitySubstitution._populate_class_variablesrquotampltgt)'"&<>z&([<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;))z([<>&])cCs|j|d}d|S)ziUsed with a regular expression to substitute the appropriate HTML entity for a special character.rz&%s;)CHARACTER_TO_HTML_ENTITYgetgroup)clsmatchobjentityr r r _substitute_html_entityqsz*EntitySubstitution._substitute_html_entitycCs|j|d}d|S)zhUsed with a regular expression to substitute the appropriate XML entity for a special character.rz&%s;)CHARACTER_TO_XML_ENTITYr.)r/r0r1r r r _substitute_xml_entityxsz)EntitySubstitution._substitute_xml_entitycCs6d}d|kr*d|kr&d}|d|}nd}|||S)a*Make a value into a quoted XML attribute, possibly escaping it. Most strings will be quoted using double quotes. Bob's Bar -> "Bob's Bar" If a string contains double quotes, it will be quoted using single quotes. Welcome to "my bar" -> 'Welcome to "my bar"' If a string contains both single and double quotes, the double quotes will be escaped, and the string will be quoted using double quotes. Welcome to "Bob's Bar" -> "Welcome to "Bob's bar" r(r'z")replace)selfvalueZ quote_withZ replace_withr r r quoted_attribute_valuesz)EntitySubstitution.quoted_attribute_valueFcCs"|j|j|}|r||}|S)a Substitute XML entities for special XML characters. :param value: A string to be substituted. The less-than sign will become <, the greater-than sign will become >, and any ampersands will become &. If you want ampersands that appear to be part of an entity definition to be left alone, use substitute_xml_containing_entities() instead. :param make_quoted_attribute: If True, then the string will be quoted, as befits an attribute value. )AMPERSAND_OR_BRACKETsubr4r8)r/r7make_quoted_attributer r r substitute_xmls   z!EntitySubstitution.substitute_xmlcCs"|j|j|}|r||}|S)aSubstitute XML entities for special XML characters. :param value: A string to be substituted. The less-than sign will become <, the greater-than sign will become >, and any ampersands that are not part of an entity defition will become &. :param make_quoted_attribute: If True, then the string will be quoted, as befits an attribute value. )BARE_AMPERSAND_OR_BRACKETr:r4r8)r/r7r;r r r "substitute_xml_containing_entitiess   z5EntitySubstitution.substitute_xml_containing_entitiescCs|j|j|S)aReplace certain Unicode characters with named HTML entities. This differs from data.encode(encoding, 'xmlcharrefreplace') in that the goal is to make the result more readable (to those with ASCII displays) rather than to recover from errors. There's absolutely nothing wrong with a UTF-8 string containg a LATIN SMALL LETTER E WITH ACUTE, but replacing that character with "é" will make it more readable to some people. :param s: A Unicode string. )CHARACTER_TO_HTML_ENTITY_REr:r2)r/r r r r substitute_htmlsz"EntitySubstitution.substitute_htmlN)F)F)__name__ __module__ __qualname____doc__r"r,ZHTML_ENTITY_TO_CHARACTERr?r3rrr=r9 classmethodr2r4r8r<r>r@r r r r rDs$      %  rc@sHeZdZdZdddZddZedd Zed d Z edd d Z dS)EncodingDetectora^Suggests a number of possible encodings for a bytestring. Order of precedence: 1. Encodings you specifically tell EncodingDetector to try first (the override_encodings argument to the constructor). 2. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a tag (if the bytestring is to be interpreted as an HTML document.) 3. An encoding detected through textual analysis by chardet, cchardet, or a similar external library. 4. UTF-8. 5. Windows-1252. NFcCsN|pg|_|pg}tdd|D|_d|_||_d|_||\|_|_dS)aConstructor. :param markup: Some markup in an unknown encoding. :param override_encodings: These encodings will be tried first. :param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML. :param exclude_encodings: These encodings will not be tried, even if they otherwise would be. cSsg|] }|qSr )lower).0xr r r sz-EncodingDetector.__init__..N) override_encodingssetexclude_encodingschardet_encodingis_htmldeclared_encodingstrip_byte_order_markmarkupsniffed_encoding)r6rRrKrOrMr r r __init__s zEncodingDetector.__init__cCs8|dk r4|}||jkrdS||kr4||dSdS)zShould we even bother to try this encoding? :param encoding: Name of an encoding. :param tried: Encodings that have already been tried. This will be modified as a side effect. NFT)rGrMadd)r6rtriedr r r _usable s  zEncodingDetector._usableccst}x |jD]}|||r|VqW||j|r>|jV|jdkrZ||j|j|_||j|rp|jV|jdkrt |j|_||j|r|jVxdD]}|||r|VqWdS)zmYield a number of encodings that might work for this markup. :yield: A sequence of strings. N)zutf-8z windows-1252) rLrKrWrSrPfind_declared_encodingrRrOrNr )r6rVer r r encodingss$        zEncodingDetector.encodingscCsd}t|tr||fSt|dkrT|dddkrT|dddkrTd}|dd}nt|dkr|dddkr|dddkrd}|dd}nd|dd d krd }|d d}nB|ddd krd }|dd}n |dddkrd}|dd}||fS)zIf a byte-order mark is present, strip it and return the encoding it implies. :param data: Some markup. :return: A 2-tuple (modified data, implied encoding) Nszzutf-16beszutf-16leszutf-8szutf-32beszutf-32le)rrlen)r/datarr r r rQ>s*  z&EncodingDetector.strip_byte_order_markc Cs|rt|}}nd}tdtt|d}t|tr@tt}ntt}|d}|d}d} |j||d} | s|r|j||d} | dk r| d} | rt| tr| d d } | SdS) aGiven a document, tries to find its declared encoding. An XML encoding is declared at the beginning of the document. An HTML encoding is declared in a tag, hopefully near the beginning of the document. :param markup: Some markup. :param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML. :param search_entire_document: Since an encoding is supposed to declared near the beginning of the document, most of the time it's only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document. iig?rrN)endposrrr5) r^maxintrbytes encoding_resrsearchgroupsdecoderG) r/rRrOZsearch_entire_documentZ xml_endposZ html_endposresxml_reZhtml_rerPZdeclared_encoding_matchr r r rX\s(     z'EncodingDetector.find_declared_encoding)NFN)FF) rArBrCrDrTrWpropertyrZrErQrXr r r r rFs  $ rFc@seZdZdZdddZdddgZgdd gfd d Zd d ZdddZdddZ e ddZ ddZ ddZ dddddddd d!d"d#d$d%d&d'd&d&d(d)d*d+d,d-d.d/d0d1d2d3d&d4d5d6 Zd7dd8d9d:d;dd?d@dAdBd&dCd&d&dDdDdEdEdFdGdHdIdJdKdLdMd&dNdOddPdQdRdSdTdUd@dVdWdXdYdPddZdGd[d\d]d^d_d`dadFd8dbdXdcdddedfd&dgdgdgdgdgdgdhdidjdjdjdjdkdkdkdkdldmdndndndndndFdndododododOdpdqdrdrdrdrdrdrdsdQdtdtdtdtdudududud[dvd[d[d[d[d[dwd[d`d`d`d`dxdpdxdyZdzd{d|d}d~ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddzZdddgZeddZeddZedddZdS( UnicodeDammitzA class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, can replace MS smart quotes with their HTML or XML equivalents.z mac-romanz shift-jis) macintoshzx-sjis windows-1252z iso-8859-1z iso-8859-2NFcCs||_g|_d|_||_tt|_t|||||_ t |t sF|dkr`||_ t ||_ d|_dS|j j |_ d}x,|j jD] }|j j }||}|dk rxPqxW|sx@|j jD]4}|dkr||d}|dk r|jdd|_PqW||_ |sd|_dS)aPConstructor. :param markup: A bytestring representing markup in an unknown encoding. :param override_encodings: These encodings will be tried first, before any sniffing code is run. :param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead. Setting it to 'xml' will convert them to XML entity references, and setting it to 'html' will convert them to HTML entity references. :param is_html: If True, this markup is considered to be HTML. Otherwise it's assumed to be XML. :param exclude_encodings: These encodings will not be considered, even if the sniffing code thinks they might make sense. FrNrr5zSSome characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.T)smart_quotes_totried_encodingsZcontains_replacement_charactersrOlogging getLoggerrAlogrFdetectorrrrRZunicode_markuporiginal_encodingrZ _convert_fromwarning)r6rRrKrnrOrMurr r r rTs>     zUnicodeDammit.__init__cCs|d}|jdkr&|j|}nf|j|}t|tkr|jdkrfd|dd}qd|dd}n|}|S)z[Changes a MS smart quote character to an XML or HTML entity, or an ASCII character.rrz&#x;r)r)r.rnMS_CHARS_TO_ASCIIr-encodeMS_CHARStypetuple)r6matchorigr:r r r _sub_ms_chars     zUnicodeDammit._sub_ms_charstrictc Cs||}|r||f|jkr dS|j||f|j}|jdk rf||jkrfd}t|}||j |}y| |||}||_||_ Wn"t k r}zdSd}~XYnX|jS)z|Attempt to convert the markup to the proposed encoding. :param proposed: The name of a character encoding. Ns([-])) find_codecrorrRrnENCODINGS_WITH_SMART_QUOTESrrr:r _to_unicodert Exception)r6ZproposederrorsrRZsmart_quotes_reZsmart_quotes_compiledrwrYr r r rus"     zUnicodeDammit._convert_fromcCs t|||S)z}Given a string and its encoding, decodes the string into Unicode. :param encoding: The name of an encoding. )r)r6r_rrr r r r szUnicodeDammit._to_unicodecCs|js dS|jjS)zhIf the markup is an HTML document, returns the encoding declared _within_ the document. N)rOrsrP)r6r r r declared_html_encodingsz$UnicodeDammit.declared_html_encodingcCs`||j||pN|r*||ddpN|r@||ddpN|rL|pN|}|r\|SdS)zConvert the name of a character set to a codec name. :param charset: The name of a character set. :return: The name of a codec. -r_N)_codecCHARSET_ALIASESr-r5rG)r6charsetr7r r r rs zUnicodeDammit.find_codecc Cs<|s|Sd}yt||}Wnttfk r6YnX|S)N)codecsr LookupError ValueError)r6rcodecr r r r)s zUnicodeDammit._codec)euroZ20AC )sbquoZ201A)fnofZ192)bdquoZ201E)hellipZ2026)daggerZ2020)DaggerZ2021)circZ2C6)permilZ2030)ScaronZ160)lsaquoZ2039)OEligZ152?)z#x17DZ17D)lsquoZ2018)rsquoZ2019)ldquoZ201C)rdquoZ201D)bullZ2022)ndashZ2013)mdashZ2014)tildeZ2DC)tradeZ2122)scaronZ161)rsaquoZ203A)oeligZ153)z#x17EZ17E)Yumlr) ZEUR,fz,,z...+z++^%Sr*ZOEZr'r(*rz--~z(TM)r r+ZoezY!cZGBP$ZYEN|z..rz(th)z<>z1/4z1/2z3/4AZAECEIDNOUbBaZaerYin/y)rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrs€s‚sƒs„s…s†s‡sˆs‰sŠs‹sŒsŽs‘s’s“s”s•s–s—s˜s™sšs›sœsžsŸs s¡s¢s£s¤s¥s¦s§s¨s©sªs«s¬s­s®s¯s°s±s²s³s´sµs¶s·s¸s¹sºs»s¼s½s¾s¿sÀsÁsÂsÃsÄsÅsÆsÇsÈsÉsÊsËsÌsÍsÎsÏsÐsÑsÒsÓsÔsÕsÖs×sØsÙsÚsÛsÜsÝsÞsßsàrsâsãsäsåsæsçsèsésêsësìsísîsïsðsñsòsósôsõsös÷søsùsúsûsüsýsþ)z)rrr\)rrr])rrr[rrxutf8c Cs"|dddkrtd|dkr0tdg}d}d}x|t|kr||}t|tsdt|}||jkr||jkrxz|j D]$\}} } ||kr|| kr|| 7}PqWq>|dkr||j kr| |||| |j ||d 7}|}q>|d 7}q>W|dkr|S| ||d d |S) aFix characters from one encoding embedded in some other encoding. Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8. :param in_bytes: A bytestring that you suspect contains characters from multiple encodings. Note that this _must_ be a bytestring. If you've already converted the document to Unicode, you're too late. :param main_encoding: The primary encoding of `in_bytes`. :param embedded_encoding: The encoding that was used to embed characters in the main document. :return: A bytestring in which `embedded_encoding` characters have been converted to their `main_encoding` equivalents. rr)z windows-1252 windows_1252zPWindows-1252 and ISO-8859-1 are the only currently supported embedded encodings.)rzutf-8z4UTF-8 is the only currently supported main encoding.rrQrxN) r5rGNotImplementedErrorr^rrbordFIRST_MULTIBYTE_MARKERLAST_MULTIBYTE_MARKERMULTIBYTE_MARKERS_AND_SIZESWINDOWS_1252_TO_UTF8rr) r/Zin_bytesZ main_encodingZembedded_encodingZ byte_chunksZ chunk_startposbytestartendsizer r r detwingleis<      zUnicodeDammit.detwingle)r)r)rrm)rArBrCrDrrrTrrurrjrrrr|rzrrrrrErr r r r rks`@       rk)rD __license__r html.entitiesrrrpstringZ chardet_typerr ImportErrorr Z iconv_codecZ xml_encodingZ html_metadictrdrr{rrcrobjectrrFrkr r r r s@     %