U Dx`!@spdZddlZddlZddlmZddlmZddlm Z ddl m Z m Z m Z zeZWnek rpeefZYnXzddlmZWn ek rddlmZYnXzddlmZWn ek rddlmZYnXGd d d eZzdd lmZWnek r YnXGd d d eZeZddZdddZdddZdddZd ddZ d!ddZ!ddZ"eZ#dS)"z? An interface to html5lib that mimics the lxml.html interface. N) HTMLParser) TreeBuilder)etree)ElementXHTML_NAMESPACE_contains_block_level_tag)urlopen)urlparsec@seZdZdZdddZdS)rz*An html5lib HTML parser with lxml as tree.FcKstj|f|td|dSN)stricttree) _HTMLParser__init__rselfr kwargsrd}t|}|rrt|dtrh|d|_|d=|||S|st dt |dkrt d|d}|j r|j rt d|j d |_ |S) aParses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element. If 'create_parent' is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is allowed. If `guess_charset` is true, the `chardet` library will perform charset guessing on the string. r)r)r*r1r-rzNo elements foundzMultiple elements foundzElement followed by text: %rN) r!r"r#boolr3rtextextendrr0lentailr/)r(Z create_parentr)r*Zaccept_leading_textelementsnew_rootresultrrrfragment_fromstringqs8       r=cCst|tstdt|||d}|dd}t|trB|dd}|}|dsb|drf|St |d }t |r||St |d }t |d kr|j r|j s|d j r|d j s|d St|rd|_nd|_|S)aParse the html, returning a single element/document. This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document. 'base_url' will set the document's base_url attribute (and the tree's docinfo.URL) If `guess_charset` is true, or if the input is not Unicode but a byte string, the `chardet` library will perform charset guessing on the string. r)r*r)N2asciireplacezsN     " , 6 $