U ®Dx`Ywã@sÚddlmZddlZddlmZddlmZddlZddgZzddl m Z Wn e k rlddl m Z YnXzeZWnek rŽeZYnXzeWnek r°eZYnXdd „Zefd d„Zd d „Zd d„Zdd„Zdd„Zdd„Zdd„Zdd„Zdd„Zdidd„Zdd„ZGdd „d ƒZGd!d"„d"ƒZ Gd#d$„d$e!ƒZ"d%d&„Z#d'd(„Z$d)d*„Z%d+d,„Z&d-d.„Z'd/d0„Z(Gd1d2„d2eƒZ)Gd3d4„d4e)ƒZ*Gd5d6„d6e)ƒZ+djd8d9„Z,dkd:d;„Z-e .dej/ej0B¡Z3d?d@„Z4e .dA¡Z5dBdC„Z6dDdE„Z7dFZ8dGZ9dHZ:dldIdJ„Z;e .dKej<¡Z=dLdM„Z>e .dN¡Z?dOdP„Z@dQdR„ZAdSdT„ZBdUdV„ZCdWdX„ZDdYdZ„ZEdmd[d\„ZFd]d^„ZGd_d`„ZHdadb„ZIdcdd„ZJGdedf„dfejKƒZLeMdgkrÖddhlmNZNeN O¡dS)né)Úabsolute_importN)Úetree)Úfragment_fromstringÚ html_annotateÚhtmldiff)ÚescapecCsdtt|ƒdƒ|fS)Nz%sé)Ú html_escapeÚ_unicode)ÚtextÚversion©r ú5/tmp/pip-target-zr53vnty/lib/python/lxml/html/diff.pyÚdefault_markups ÿrcCsVdd„|Dƒ}|d}|dd…D]}t||ƒ|}q"t|ƒ}t||ƒ}d |¡ ¡S)a doclist should be ordered from oldest to newest, like:: >>> version1 = 'Hello World' >>> version2 = 'Goodbye World' >>> print(html_annotate([(version1, 'version 1'), ... (version2, 'version 2')])) Goodbye World The documents must be *fragments* (str/UTF8 or unicode), not complete documents The markup argument is a function to markup the spans of words. This function is called like markup('Hello', 'version 2'), and returns HTML. The first argument is text and never includes any markup. The default uses a span with a title: >>> print(default_markup('Some Text', 'by Joe')) Some Text cSsg|]\}}t||ƒ‘qSr )Útokenize_annotated)Ú.0Údocr r r rÚ =sÿz!html_annotate..rrNÚ)Úhtml_annotate_merge_annotationsÚcompress_tokensÚmarkup_serialize_tokensÚjoinÚstrip)ZdoclistZmarkupÚ tokenlistZ cur_tokensÚtokensÚresultr r rr#sÿ  cCs t|dd}|D] }||_q|S)zFTokenize a document and add an annotation attribute to each token F©Ú include_hrefs)ÚtokenizeÚ annotation)rr rÚtokr r rrKs rc CsVt||d}| ¡}|D]8\}}}}}|dkr|||…} |||…} t| | ƒqdS)zˆMerge the annotations from tokens_old into tokens_new, when the tokens in the new document already existed in the old document. ©ÚaÚbÚequalN)ÚInsensitiveSequenceMatcherÚ get_opcodesÚcopy_annotations) Z tokens_oldZ tokens_newÚsÚcommandsÚcommandÚi1Úi2Új1Új2Zeq_oldZeq_newr r rrSs   rcCs4t|ƒt|ƒkst‚t||ƒD]\}}|j|_qdS)zN Copy annotations from the tokens listed in src to the tokens in dest N)ÚlenÚAssertionErrorÚzipr )ÚsrcÚdestZsrc_tokZdest_tokr r rr(`sr(cCsV|dg}|dd…D]:}|djsF|jsF|dj|jkrFt||ƒq| |¡q|S)zm Combine adjacent tokens when there is no HTML between the tokens, and they share an annotation rrNéÿÿÿÿ)Ú post_tagsÚpre_tagsr Úcompress_merge_backÚappend)rrr!r r rrhs  ÿþ  rcCsv|d}t|ƒtk s t|ƒtk r,| |¡nFt|ƒ}|jrD||j7}||7}t||j|j|jd}|j|_||d<dS)zY Merge tok into the last element of tokens (modifying the list of tokens in-place). r5©r7r6Útrailing_whitespaceN)ÚtypeÚtokenr9r r;r7r6r )rr!Úlastr Úmergedr r rr8ws  ýr8ccs\|D]R}|jD] }|Vq| ¡}|||jƒ}|jr>||j7}|V|jD] }|VqJqdS)zz Serialize the list of tokens into a list of text chunks, calling markup_func around text to add annotations. N)r7Úhtmlr r;r6)rZ markup_funcr=Úprer@Úpostr r rr‰s    rcCs0t|ƒ}t|ƒ}t||ƒ}d |¡ ¡}t|ƒS)aŒ Do a diff of the old and new document. The documents are HTML *fragments* (str/UTF8 or unicode), they are not complete documents (i.e., no tag). Returns HTML with and tags added around the appropriate text. Markup is generally ignored, with the markup from new_html preserved, and possibly some markup from old_html (though it is considered acceptable to lose some of the old markup). Only the words in the HTML are diffed. The exception is tags, which are treated like words, and the href attribute of tags, which are noted inside the tag itself when there are changes. r)rÚhtmldiff_tokensrrÚfixup_ins_del_tags)Zold_htmlZnew_htmlZold_html_tokensZnew_html_tokensrr r rržs  c Cs°t||d}| ¡}g}|D]†\}}}}} |dkrN| t||| …dd¡q|dks^|dkrxt||| …ƒ} t| |ƒ|dksˆ|dkrt|||…ƒ} t| |ƒqt|ƒ}|S)z] Does a diff on the tokens themselves, returning a list of text chunks (not tokens). r"r%T)r%ÚinsertÚreplaceÚdelete)r&r'ÚextendÚ expand_tokensÚ merge_insertÚ merge_deleteÚcleanup_delete) Z html1_tokensZ html2_tokensr)r*rr+r,r-r.r/Z ins_tokensZ del_tokensr r rrCµs   rCFccs^|D]T}|jD] }|Vq|r$|jsF|jr<| ¡|jVn | ¡V|jD] }|VqLqdS)zeGiven a list of tokens, return a generator of the chunks of text for the data in the tokens. N)r7Úhide_when_equalr;r@r6)rr%r=rArBr r rrIÛs    rIcCsŒt|ƒ\}}}| |¡|r:|d d¡s:|dd7<| d¡|rj|d d¡rj|ddd…|d<| |¡| d¡| |¡dS)z| doc is the already-handled document (as a list of text chunks); here we add ins_chunks to the end of that. r5ú zNz )Úsplit_unbalancedrHÚendswithr9)Z ins_chunksrÚunbalanced_startÚbalancedÚunbalanced_endr r rrJês    rJc@s eZdZdS)Ú DEL_STARTN©Ú__name__Ú __module__Ú __qualname__r r r rrTsrTc@s eZdZdS)ÚDEL_ENDNrUr r r rrYsrYc@seZdZdZdS)Ú NoDeleteszY Raised when the document no longer contains any pending deletes (DEL_START/DEL_END) N)rVrWrXÚ__doc__r r r rrZsrZcCs"| t¡| |¡| t¡dS)z¾ Adds the text chunks in del_chunks to the document doc (another list of text chunks) with marker to show it is a delete. cleanup_delete later resolves these markers into tags.N)r9rTrHrY)Z del_chunksrr r rrK s  rKcCsÐzt|ƒ\}}}Wntk r*YqÌYnXt|ƒ\}}}t|||ƒt|||ƒ|}|rx|d d¡sx|dd7<| d¡|r¨|d d¡r¨|ddd…|d<| |¡| d¡| |¡|}q|S)a¹ Cleans up any DEL_START/DEL_END markers in the document, replacing them with . To do this while keeping the document valid, it may need to drop some tags (either start or end tags). It may also move the del into adjacent tags to try to move it to a similar location where it was originally located (e.g., moving a delete into preceding
tag, if the del looks like (DEL_START, 'Text
', DEL_END)r5rNzNz )Ú split_deleterZrOÚlocate_unbalanced_startÚlocate_unbalanced_endrPr9rH)ÚchunksÚ pre_deleterGÚ post_deleterQrRrSrr r rrLs$        rLc Csg}g}g}g}|D]Ø}| d¡s.| |¡q|ddk}| ¡d d¡}|tkr`| |¡q|rÎ|rš|dd|krš| |¡| ¡\}}} | ||<qì|rÂ| dd„|Dƒ¡g}| |¡qì| |¡q| |t|ƒ|f¡| d ¡q| d d„|Dƒ¡d d„|Dƒ}|||fS) a]Return (unbalanced_start, balanced, unbalanced_end), where each is a list of text and tag chunks. unbalanced_start is a list of all the tags that are opened, but not closed in this span. Similarly, unbalanced_end is a list of tags that are closed but were not opened. Extracting these might mean some reordering of the chunks.ú/r5cSsg|]\}}}|‘qSr r )rÚnameÚposÚtagr r rrTsz$split_unbalanced..NcSsg|]\}}}|‘qSr r )rrerfÚchunkr r rr]scSsg|]}|dk r|‘qS©Nr )rrhr r rr^s)Ú startswithr9ÚsplitrÚ empty_tagsÚpoprHr0) r_ÚstartÚendZ tag_stackrRrhZendtagrerfrgr r rrO9s<          ÿrOcCs\z| t¡}Wntk r&t‚YnX| t¡}|d|…||d|…||dd…fS)zæ Returns (stuff_before_DEL_START, stuff_inside_DEL_START_END, stuff_after_DEL_END). Returns the first case found (there may be more DEL_STARTs in stuff_after_DEL_END). Raises NoDeletes if there's no DEL_START found. Nr)ÚindexrTÚ ValueErrorrZrY)r_rfÚpos2r r rr\as   r\cCs¬|sq¨|d}| ¡d d¡}|s&q¨|d}|tks¨| d¡sBq¨|ddkrPq¨| ¡d d¡}|dkrlq¨|dks€td|ƒ‚||kr¨| d¡| | d¡¡qq¨qd S) a° pre_delete and post_delete implicitly point to a place in the document (where the two were split). This moves that point (by popping items from one and pushing them onto the other). It moves the point to try to find a place where unbalanced_start applies. As an example:: >>> unbalanced_start = ['
'] >>> doc = ['

', 'Text', '

', '
', 'More Text', '
'] >>> pre, post = doc[:3], doc[3:] >>> pre, post (['

', 'Text', '

'], ['
', 'More Text', '
']) >>> locate_unbalanced_start(unbalanced_start, pre, post) >>> pre, post (['

', 'Text', '

', '
'], ['More Text', '
']) As you can see, we moved the point so that the dangling
that we found will be effectively replaced by the div in the original document. If this doesn't work out, we just throw away unbalanced_start without doing anything. rz<>rbrrcÚinsÚdelzUnexpected delete tag: %rN)rkrrTrjr1rmr9)rQr`raÚfindingÚ finding_nameÚnextrer r rr]ms*  ÿ r]cCs|sqŒ|d}| ¡d d¡}|s&qŒ|d}|tksŒ| d¡sBqŒ| ¡d d¡}|dksŒ|dkrfqŒ||krŒ| ¡| d| ¡¡qqŒqdS)zt like locate_unbalanced_start, except handling end tags and possibly moving the point earlier in the document. r5rrdú tag, which takes up visible space just like a word but is only represented in a document by a tag. NrcCs2tj|dt|f|||d}||_||_||_|S)Nz%s: %sr:)r=ryr<rgÚdataÚ html_repr)rzrgr€rr7r6r;r{r r rryèsýztag_token.__new__cCs d|j|j|j|j|j|jfS)NzRtag_token(%s, %s, html_repr=%s, post_tags=%r, pre_tags=%r, trailing_whitespace=%r))rgr€rr7r6r;r}r r rr|ósúztag_token.__repr__cCs|jSri)rr}r r rr@ûsztag_token.html)NNr)rVrWrXr[ryr|r@r r r rrâsÿ rc@seZdZdZdZdd„ZdS)Ú href_tokenzh Represents the href in an anchor tag. Unlike other words, we only show the href when it changes. TcCsd|S)Nz Link: %sr r}r r rr@szhref_token.htmlN)rVrWrXr[rMr@r r r rr‚þsr‚TcCs2t |¡r|}n t|dd}t|d|d}t|ƒS)ak Parse the given HTML and returns token objects (words with attached tags). This parses only the content of a page; anything in the head is ignored, and the and elements are themselves optional. The content is then parsed by lxml, which ensures the validity of the resulting parsed document (though lxml may make incorrect guesses when the markup is particular bad). and tags are also eliminated from the document, as that gets confusing. If include_hrefs is true, then the href attribute of tags is included as a special kind of diffable token.T©Úcleanup)Úskip_tagr)rÚ iselementÚ parse_htmlÚ flatten_elÚ fixup_chunks)r@rZbody_elr_r r rrs   rcCs|r t|ƒ}t|ddS)a Parses an HTML fragment, returning an lxml element. Note that the HTML will be wrapped in a
tag that was not in the original document. If cleanup is true, make sure there's no or , and get rid of any and tags. T)Z create_parent)Ú cleanup_htmlr)r@r„r r rr‡ sr‡z z zcCsLt |¡}|r|| ¡d…}t |¡}|r<|d| ¡…}t d|¡}|S)z³ This 'cleans' the HTML, meaning that any page structure is removed (only the contents of are used, if there is any and tags are removed. Nr)Ú_body_reÚsearchroÚ _end_body_rernÚ _ins_del_reÚsub)r@Úmatchr r rrŠ1s   rŠz [ \t\n\r]$cCs$t| ¡ƒ}|d|…||d…fS)zP This function takes a word, such as 'test ' and returns ('test',' ') rN)r0Úrstrip)ÚwordZstripped_lengthr r rÚsplit_trailing_whitespaceAs r“c CsRg}d}g}|D]}t|tƒr˜|ddkrf|d}t|dƒ\}}td||||d}g}| |¡q|ddkr|d}t||dd }g}| |¡qt|ƒrÊt|ƒ\}}t|||d }g}| |¡qt|ƒrÞ| |¡qt |ƒr |rø| |¡n&|st d ||||fƒ‚|j  |¡qd st ‚q|s>td |d gS|dj   |¡|S)zM This function takes a list of chunks and produces a list of tokens. NrÚimgré)rr7r;ÚhrefrN)r7r;z4Weird state, cur_word=%r, result=%r, chunks=%r of %rFr)r7r5) Ú isinstanceÚtupler“rr9r‚Úis_wordr=Ú is_start_tagÚ is_end_tagr1r6rH) r_Z tag_accumZcur_wordrrhr3rgr;r–r r rr‰IsR   þ         ÿÿ r‰) Úparamr”ÚareaÚbrÚbasefontÚinputÚbaseÚmetaÚlinkÚcol)ÚaddressÚ blockquoteÚcenterÚdirÚdivÚdlÚfieldsetÚformÚh1Úh2Úh3Úh4Úh5Úh6ÚhrÚisindexÚmenuÚnoframesÚnoscriptÚolÚprAÚtableÚul) ÚddÚdtÚframesetÚliÚtbodyÚtdÚtfootÚthÚtheadÚtrccsê|s0|jdkr&d| d¡t|ƒfVn t|ƒV|jtkrR|jsRt|ƒsR|jsRdSt|jƒ}|D]}t|ƒVq`|D]}t ||dD] }|Vq„qt|jdkrº| d¡rº|rºd| d¡fV|sæt |ƒVt|jƒ}|D]}t|ƒVqÖdS)a Takes an lxml element el, and generates all the text chunks for that tag. Each start tag is a chunk, each word is a chunk, and each end tag is a chunk. If skip_tag is true, then the outermost container tag is not returned (just its contents).r”r3Nrr#r–) rgÚgetÚ start_tagrlr r0ÚtailÚ split_wordsr rˆÚend_tag)Úelrr…Z start_wordsr’ÚchildÚitemZ end_wordsr r rrˆ¬s&       rˆz \S+(?:\s+|$)cCs|r | ¡sgSt |¡}|S)z_ Splits some text into words. Includes trailing whitespace on each word when appropriate. )rÚsplit_words_reÚfindall)r Úwordsr r rrÉÊs  rÉz ^[ \t\n\r]cCs$d|jd dd„|j ¡Dƒ¡fS)z= The text representation of the start tag for a tag. z<%s%s>rcSs"g|]\}}d|t|dƒf‘qS)z %s="%s"T)r )rreÚvaluer r rrÚsÿzstart_tag..)rgrÚattribÚitems)rËr r rrÇÕs  ÿÿrÇcCs*|jrt |j¡rd}nd}d|j|fS)zg The text representation of an end tag for a tag. Includes trailing whitespace when appropriate. rNrz%s)rÈÚstart_whitespace_rerŒrg)rËÚextrar r rrÊÝsrÊcCs | d¡ S)Nrb©rj©r!r r rr™æsr™cCs | d¡S)NrxrÖr×r r rr›ésr›cCs| d¡o| d¡ S)NrbrxrÖr×r r rršìsršcCs$t|dd}t|ƒt|dd}|S)z  Given an html string, move any or tags inside of any block-level elements, e.g. transform

word

to

word

FrƒT)Ú skip_outer)r‡Ú_fixup_ins_del_tagsÚserialize_html_fragment)r@rr r rrDïs  rDcCsbt|tƒrtd|ƒ‚tj|dtd}|rZ|| d¡dd…}|d| d¡…}| ¡S|SdS)z¨ Serialize a single lxml element as HTML. The serialized form includes the elements tail. If skip_outer is true, then don't serialize the outermost tag z3You should pass in an element, not a string like %rr@)ÚmethodÚencodingú>rNrb) r—Ú basestringr1rÚtostringr ÚfindÚrfindr)rËrØr@r r rrÚøs ÿrÚcCs@dD]6}| d|¡D]"}t|ƒs$qt||d| ¡qqdS)z?fixup_ins_del_tags that works on an lxml document in-place )rsrtzdescendant-or-self::%s)rgN)ZxpathÚ_contains_block_level_tagÚ_move_el_inside_blockZdrop_tag)rrgrËr r rrÙ s  rÙcCs4|jtks|jtkrdS|D]}t|ƒrdSqdS)zPTrue if the element contains any block-level elements, like

, , etc. TF)rgÚblock_level_tagsÚblock_level_container_tagsrâ)rËrÌr r rrâs râcCsò|D]}t|ƒrqNqt |¡}|j|_d|_| t|ƒ¡|g|dd…<dSt|ƒD]l}t|ƒr¢t||ƒ|jrÂt |¡}|j|_d|_| |  |¡d|¡qVt |¡}|  ||¡|  |¡qV|jrît |¡}|j|_d|_| d|¡dS)zt helper for _fixup_ins_del_tags; actually takes the etc tags and moves them inside any block-level tags. Nrr) rârÚElementr rHÚlistrãrÈrErprFr9)rËrgrÌZ children_tagZtail_tagZ child_tagZtext_tagr r rrãs2        rãcCsÚ| ¡}|jpd}|jrXt|ƒs,||j7}n,|djrL|dj|j7_n |j|d_| |¡}|rÂ|dkrtd}n ||d}|dkr¦|jrž|j|7_qÂ||_n|jr¼|j|7_n||_| ¡|||d…<dS)z© Removes an element, but merges its contents into its place, e.g., given

Hi there!

, if you remove the element you get

Hi there!

rr5rNr)Z getparentr rÈr0rpÚ getchildren)rËÚparentr rpÚpreviousr r rÚ_merge_element_contents?s*      rëc@seZdZdZdZdd„ZdS)r&zt Acts like SequenceMatcher, but tries not to find very small equal blocks amidst large spans of changes r•csDtt|jƒt|jƒƒ}t|j|dƒ‰tj |¡}‡fdd„|DƒS)Nécs$g|]}|dˆks|ds|‘qS)r•r )rrÍ©Ú thresholdr rrms þzBInsensitiveSequenceMatcher.get_matching_blocks..)Úminr0r$rîÚdifflibÚSequenceMatcherÚget_matching_blocks)r~ÚsizeÚactualr rírròis z.InsensitiveSequenceMatcher.get_matching_blocksN)rVrWrXr[rîròr r r rr&asr&Ú__main__)Ú _diffcommand)F)T)T)F)F)PÚ __future__rrðÚlxmlrZ lxml.htmlrÚreÚ__all__r@rr Ú ImportErrorÚcgiÚunicoder Ú NameErrorÚstrrÞrrrrr(rr8rrrCrIrJrTrYÚ ExceptionrZrKrLrOr\r]r^r=rr‚rr‡ÚcompileÚIÚSr‹rrŽrŠZend_whitespace_rer“r‰rlrärårˆÚUrÎrÉrÔrÇrÊr™r›ršrDrÚrÙrârãrërñr&rVröÚmainr r r rÚs      ( & '( 2)   6       "