� ��^c@sBddlZddlZddlmZdefd��YZdS(i����Ni(t ProbingStatet CharSetProbercBs�eZdZd d�Zd�Zed��Zd�Zed��Z d�Z e d��Z e d��Z e d ��ZRS( gffffff�?cCs(d|_||_tjt�|_dS(N(tNonet_statet lang_filtertloggingt getLoggert__name__tlogger(tselfR((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pyt__init__'s  cCstj|_dS(N(Rt DETECTINGR(R ((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pytreset,scCsdS(N(R(R ((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pyt charset_name/scCsdS(N((R tbuf((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pytfeed3scCs|jS(N(R(R ((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pytstate6scCsdS(Ng((R ((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pytget_confidence:scCstjdd|�}|S(Ns([-])+t (tretsub(R((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pytfilter_high_byte_only=scCszt�}tjd|�}xX|D]P}|j|d �|d}|j� re|dkred}n|j|�q"W|S(s5 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [�-�] marker: everything else [^a-zA-Z�-�] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. s%[a-zA-Z]*[�-�]+[a-zA-Z]*[^a-zA-Z�-�]?i����s�R(t bytearrayRtfindalltextendtisalpha(Rtfilteredtwordstwordt last_char((s6/tmp/pip-build-1THPZW/chardet/chardet/charsetprober.pytfilter_international_wordsBs      cCs�t�}t}d}x�tt|��D]�}|||d!}|dkrTt}n|dkrit}n|dkr(|j� r(||kr�| r�|j|||!�|jd�n|d}q(q(W|s�|j||�n|S(s� Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. Also retains English alphabet and high byte characters immediately before occurrences of >. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. iit>ts