import re
import math

try:
    import numpy
except ImportError:
    pass

from nltk.tokenize.api import TokenizerI

BLOCK_COMPARISON, VOCABULARY_INTRODUCTION = 0, 1
LC, HC = 0, 1
DEFAULT_SMOOTHING = [0]


class TextTilingTokenizer(TokenizerI):
    """Tokenize a document into topical sections using the TextTiling algorithm.
    This algorithm detects subtopic shifts based on the analysis of lexical
    co-occurrence patterns.

    The process starts by tokenizing the text into pseudosentences of
    a fixed size w. Then, depending on the method used, similarity
    scores are assigned at sentence gaps. The algorithm proceeds by
    detecting the peak differences between these scores and marking
    them as boundaries. The boundaries are normalized to the closest
    paragraph break and the segmented text is returned.

    :param w: Pseudosentence size
    :type w: int
    :param k: Size (in sentences) of the block used in the block comparison method
    :type k: int
    :param similarity_method: The method used for determining similarity scores:
        `BLOCK_COMPARISON` (default) or `VOCABULARY_INTRODUCTION`.
    :type similarity_method: constant
    :param stopwords: A list of stopwords that are filtered out (defaults to NLTK's stopwords corpus)
    :type stopwords: list(str)
    :param smoothing_method: The method used for smoothing the score plot:
        `DEFAULT_SMOOTHING` (default)
    :type smoothing_method: constant
    :param smoothing_width: The width of the window used by the smoothing method
    :type smoothing_width: int
    :param smoothing_rounds: The number of smoothing passes
    :type smoothing_rounds: int
    :param cutoff_policy: The policy used to determine the number of boundaries:
        `HC` (default) or `LC`
    :type cutoff_policy: constant

    >>> from nltk.corpus import brown
    >>> tt = TextTilingTokenizer(demo_mode=True)
    >>> text = brown.raw()[:10000]
    >>> s, ss, d, b = tt.tokenize(text)
    >>> b
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    """

    def __init__(self,
                 w=20,
                 k=10,
                 similarity_method=BLOCK_COMPARISON,
                 stopwords=None,
                 smoothing_method=DEFAULT_SMOOTHING,
                 smoothing_width=2,
                 smoothing_rounds=1,
                 cutoff_policy=HC,
                 demo_mode=False):

        if stopwords is None:
            from nltk.corpus import stopwords
            stopwords = stopwords.words('english')
        self.__dict__.update(locals())
        del self.__dict__['self']

    def tokenize(self, text):
        """Return a tokenized copy of *text*, where each "token" represents
        a separate topic."""

        lowercase_text = text.lower()
        paragraph_breaks = self._mark_paragraph_breaks(text)
        text_length = len(lowercase_text)

        # Tokenization step: strip punctuation and split into pseudosentences
        nopunct_text = ''.join(c for c in lowercase_text
                               if re.match(r"[a-z\-' \n\t]", c))
        nopunct_par_breaks = self._mark_paragraph_breaks(nopunct_text)

        tokseqs = self._divide_to_tokensequences(nopunct_text)

        # Filter stopwords
        for ts in tokseqs:
            ts.wrdindex_list = [wi for wi in ts.wrdindex_list
                                if wi[0] not in self.stopwords]

        token_table = self._create_token_table(tokseqs, nopunct_par_breaks)

        # Lexical score determination
        if self.similarity_method == BLOCK_COMPARISON:
            gap_scores = self._block_comparison(tokseqs, token_table)
        elif self.similarity_method == VOCABULARY_INTRODUCTION:
            raise NotImplementedError("Vocabulary introduction not implemented")

        if self.smoothing_method == DEFAULT_SMOOTHING:
            smooth_scores = self._smooth_scores(gap_scores)

        # Boundary identification
        depth_scores = self._depth_scores(smooth_scores)
        segment_boundaries = self._identify_boundaries(depth_scores)

        normalized_boundaries = self._normalize_boundaries(text,
                                                           segment_boundaries,
                                                           paragraph_breaks)

        segmented_text = []
        prevb = 0

        for b in normalized_boundaries:
            if b == 0:
                continue
            segmented_text.append(text[prevb:b])
            prevb = b

        if prevb < text_length:  # append any remaining text
            segmented_text.append(text[prevb:])

        if not segmented_text:
            segmented_text = [text]

        if self.demo_mode:
            return gap_scores, smooth_scores, depth_scores, segment_boundaries
        return segmented_text
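    # The gap score used below is a cosine-style similarity between the two
    # blocks of pseudosentences that surround a gap:
    #
    #     score = sum_t f(t, b1) * f(t, b2)
    #             / sqrt(sum_t f(t, b1)**2 * sum_t f(t, b2)**2)
    #
    # where f(t, b) counts the occurrences of token t inside block b.  A low
    # score means little lexical overlap across the gap, i.e. a likely
    # subtopic shift.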
    def _block_comparison(self, tokseqs, token_table):
        "Implements the block comparison method"

        def blk_frq(tok, block):
            ts_occs = filter(lambda o: o[0] in block,
                             token_table[tok].ts_occurences)
            freq = sum([tsocc[1] for tsocc in ts_occs])
            return freq

        gap_scores = []
        numgaps = len(tokseqs) - 1

        for curr_gap in range(numgaps):
            score_dividend, score_divisor_b1, score_divisor_b2 = 0.0, 0.0, 0.0
            score = 0.0
            # adjust window size for boundary conditions
            if curr_gap < self.k - 1:
                window_size = curr_gap + 1
            elif curr_gap > numgaps - self.k:
                window_size = numgaps - curr_gap
            else:
                window_size = self.k

            b1 = [ts.index
                  for ts in tokseqs[curr_gap - window_size + 1: curr_gap + 1]]
            b2 = [ts.index
                  for ts in tokseqs[curr_gap + 1: curr_gap + window_size + 1]]

            for t in token_table:
                score_dividend += blk_frq(t, b1) * blk_frq(t, b2)
                score_divisor_b1 += blk_frq(t, b1) ** 2
                score_divisor_b2 += blk_frq(t, b2) ** 2
            try:
                score = score_dividend / math.sqrt(score_divisor_b1 *
                                                   score_divisor_b2)
            except ZeroDivisionError:
                pass  # score stays 0.0

            gap_scores.append(score)

        return gap_scores

    def _smooth_scores(self, gap_scores):
        "Wraps the smooth function from the SciPy Cookbook"
        return list(smooth(numpy.array(gap_scores[:]),
                           window_len=self.smoothing_width + 1))

    def _mark_paragraph_breaks(self, text):
        """Identifies indented text or line breaks as the beginning of
        paragraphs"""
        MIN_PARAGRAPH = 100
        pattern = re.compile("[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")
        matches = pattern.finditer(text)

        last_break = 0
        pbreaks = [0]
        for pb in matches:
            if pb.start() - last_break < MIN_PARAGRAPH:
                continue
            else:
                pbreaks.append(pb.start())
                last_break = pb.start()

        return pbreaks

    def _divide_to_tokensequences(self, text):
        "Divides the text into pseudosentences of fixed size"
        w = self.w
        wrdindex_list = []
        matches = re.finditer(r"\w+", text)
        for match in matches:
            wrdindex_list.append((match.group(), match.start()))
        return [TokenSequence(i // w, wrdindex_list[i: i + w])
                for i in range(0, len(wrdindex_list), w)]

    def _create_token_table(self, token_sequences, par_breaks):
        "Creates a table of TokenTableFields"
        token_table = {}
        current_par = 0
        current_tok_seq = 0
        pb_iter = iter(par_breaks)
        current_par_break = next(pb_iter)
        if current_par_break == 0:
            try:
                current_par_break = next(pb_iter)  # skip break at 0
            except StopIteration:
                raise ValueError(
                    "No paragraph breaks were found (text too short perhaps?)")

        for ts in token_sequences:
            for word, index in ts.wrdindex_list:
                try:
                    while index > current_par_break:
                        current_par_break = next(pb_iter)
                        current_par += 1
                except StopIteration:
                    # hit the last paragraph break
                    pass

                if word in token_table:
                    token_table[word].total_count += 1

                    if token_table[word].last_par != current_par:
                        token_table[word].last_par = current_par
                        token_table[word].par_count += 1

                    if token_table[word].last_tok_seq != current_tok_seq:
                        token_table[word].last_tok_seq = current_tok_seq
                        token_table[word].ts_occurences.append(
                            [current_tok_seq, 1])
                    else:
                        token_table[word].ts_occurences[-1][1] += 1
                else:  # new word
                    token_table[word] = TokenTableField(
                        first_pos=index,
                        ts_occurences=[[current_tok_seq, 1]],
                        total_count=1,
                        par_count=1,
                        last_par=current_par,
                        last_tok_seq=current_tok_seq)

            current_tok_seq += 1

        return token_table

    def _identify_boundaries(self, depth_scores):
        """Identifies boundaries at the peaks of similarity score
        differences"""

        boundaries = [0 for x in depth_scores]

        avg = sum(depth_scores) / len(depth_scores)
        stdev = numpy.std(depth_scores)
        cutoff = avg - stdev / 2.0

        depth_tuples = sorted(zip(depth_scores, range(len(depth_scores))))
        depth_tuples.reverse()
        hp = list(filter(lambda x: x[0] > cutoff, depth_tuples))

        for dt in hp:
            boundaries[dt[1]] = 1
            for dt2 in hp:  # undo if there is a boundary close already
                if dt[1] != dt2[1] and abs(dt2[1] - dt[1]) < 4 \
                        and boundaries[dt2[1]] == 1:
                    boundaries[dt[1]] = 0
        return boundaries

    def _depth_scores(self, scores):
        """Calculates the depth of each gap, i.e. the average difference
        between the left and right peaks and the gap's score"""

        depth_scores = [0 for x in scores]
        # clip boundaries: a section should not be smaller than a couple of
        # pseudosentences for small texts and around five for larger ones
        clip = min(max(len(scores) // 10, 2), 5)
        index = clip

        for gapscore in scores[clip:-clip]:
            lpeak = gapscore
            for score in scores[index::-1]:
                if score >= lpeak:
                    lpeak = score
                else:
                    break
            rpeak = gapscore
            for score in scores[index:]:
                if score >= rpeak:
                    rpeak = score
                else:
                    break
            depth_scores[index] = lpeak + rpeak - 2 * gapscore
            index += 1

        return depth_scores

    def _normalize_boundaries(self, text, boundaries, paragraph_breaks):
        """Normalize the boundaries identified to the original text's
        paragraph breaks"""

        norm_boundaries = []
        char_count, word_count, gaps_seen = 0, 0, 0
        seen_word = False

        for char in text:
            char_count += 1
            if char in " \t\n" and seen_word:
                seen_word = False
                word_count += 1
            if re.match(r"[\w\-']", char) and not seen_word:
                seen_word = True
            if gaps_seen < len(boundaries) and \
                    word_count > (max(gaps_seen * self.w, self.w)):
                if boundaries[gaps_seen] == 1:
                    # find the closest paragraph break
                    best_fit = len(text)
                    for br in paragraph_breaks:
                        if best_fit > abs(br - char_count):
                            best_fit = abs(br - char_count)
                            bestbr = br
                        else:
                            break
                    if bestbr not in norm_boundaries:  # avoid duplicates
                        norm_boundaries.append(bestbr)
                gaps_seen += 1

        return norm_boundaries


class TokenTableField(object):
    """A field in the token table holding parameters for each token,
    used later in the process"""

    def __init__(self,
                 first_pos,
                 ts_occurences,
                 total_count=1,
                 par_count=1,
                 last_par=0,
                 last_tok_seq=None):
        self.__dict__.update(locals())
        del self.__dict__['self']


class TokenSequence(object):
    "A token list with its original length and its index"

    def __init__(self,
                 index,
                 wrdindex_list,
                 original_length=None):
        original_length = original_length or len(wrdindex_list)
        self.__dict__.update(locals())
        del self.__dict__['self']
# Adapted from the SciPy Cookbook signal-smoothing recipe.
def smooth(x, window_len=11, window='flat'):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.

    :param x: the input signal
    :param window_len: the dimension of the smoothing window; should be an odd integer
    :param window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman';
        a flat window will produce a moving average smoothing.

    :return: the smoothed signal

    example::

        t = linspace(-2, 2, 0.1)
        x = sin(t) + randn(len(t)) * 0.1
        y = smooth(x)

    :see also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman,
        numpy.convolve, scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead
    of a string
    """

    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")

    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")

    if window_len < 3:
        return x

    if window not in ('flat', 'hanning', 'hamming', 'bartlett', 'blackman'):
        raise ValueError("Window must be one of 'flat', 'hanning', 'hamming', "
                         "'bartlett', 'blackman'")

    # pad both ends with reflected copies of the signal
    s = numpy.r_[2 * x[0] - x[window_len:1:-1],
                 x,
                 2 * x[-1] - x[-1:-window_len:-1]]

    if window == 'flat':  # moving average
        w = numpy.ones(window_len, 'd')
    else:
        w = eval('numpy.' + window + '(window_len)')

    y = numpy.convolve(w / w.sum(), s, mode='same')

    return y[window_len - 1: -window_len + 1]


def demo(text=None):
    from nltk.corpus import brown
    from matplotlib import pylab

    tt = TextTilingTokenizer(demo_mode=True)
    if text is None:
        text = brown.raw()[:10000]
    s, ss, d, b = tt.tokenize(text)
    pylab.xlabel("Sentence Gap index")
    pylab.ylabel("Gap Scores")
    pylab.plot(range(len(s)), s, label="Gap Scores")
    pylab.plot(range(len(ss)), ss, label="Smoothed Gap scores")
    pylab.plot(range(len(d)), d, label="Depth scores")
    pylab.stem(range(len(b)), b)
    pylab.legend()
    pylab.show()
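# Minimal usage sketch, mirroring the doctest in TextTilingTokenizer above; it
# assumes the Brown corpus and the stopwords corpus from nltk_data are
# installed locally.
if __name__ == "__main__":
    from nltk.corpus import brown

    tt = TextTilingTokenizer()
    tiles = tt.tokenize(brown.raw()[:10000])
    for i, tile in enumerate(tiles):
        print("tile %d: %d characters" % (i, len(tile)))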