import re
import math

try:
    import numpy
except ImportError:
    pass

from nltk.tokenize.api import TokenizerI

BLOCK_COMPARISON, VOCABULARY_INTRODUCTION = 0, 1
LC, HC = 0, 1
DEFAULT_SMOOTHING = [0]


class TextTilingTokenizer(TokenizerI):
    """Tokenize a document into topical sections using the TextTiling algorithm.
    This algorithm detects subtopic shifts based on the analysis of lexical
    co-occurrence patterns.

    The process starts by tokenizing the text into pseudosentences of
    a fixed size w. Then, depending on the method used, similarity
    scores are assigned at sentence gaps. The algorithm proceeds by
    detecting the peak differences between these scores and marking
    them as boundaries. The boundaries are normalized to the closest
    paragraph break and the segmented text is returned.

    :param w: Pseudosentence size
    :type w: int
    :param k: Size (in sentences) of the block used in the block comparison method
    :type k: int
    :param similarity_method: The method used for determining similarity scores:
        `BLOCK_COMPARISON` (default) or `VOCABULARY_INTRODUCTION`.
    :type similarity_method: constant
    :param stopwords: A list of stopwords that are filtered out (defaults to NLTK's stopwords corpus)
    :type stopwords: list(str)
    :param smoothing_method: The method used for smoothing the score plot:
        `DEFAULT_SMOOTHING` (default)
    :type smoothing_method: constant
    :param smoothing_width: The width of the window used by the smoothing method
    :type smoothing_width: int
    :param smoothing_rounds: The number of smoothing passes
    :type smoothing_rounds: int
    :param cutoff_policy: The policy used to determine the number of boundaries:
        `HC` (default) or `LC`
    :type cutoff_policy: constant

    >>> from nltk.corpus import brown
    >>> tt = TextTilingTokenizer(demo_mode=True)
    >>> text = brown.raw()[:10000]
    >>> s, ss, d, b = tt.tokenize(text)
    >>> b
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    """

    def __init__(self,
                 w=20,
                 k=10,
                 similarity_method=BLOCK_COMPARISON,
                 stopwords=None,
                 smoothing_method=DEFAULT_SMOOTHING,
                 smoothing_width=2,
                 smoothing_rounds=1,
                 cutoff_policy=HC,
                 demo_mode=False):

        if stopwords is None:
            from nltk.corpus import stopwords
            stopwords = stopwords.words('english')
        self.__dict__.update(locals())
        del self.__dict__['self']

    def tokenize(self, text):
        """Return a tokenized copy of *text*, where each "token" represents
        a separate topic."""

        lowercase_text = text.lower()
        paragraph_breaks = self._mark_paragraph_breaks(text)
        text_length = len(lowercase_text)

        # Tokenization step: strip punctuation and split into pseudosentences
        nopunct_text = ''.join(c for c in lowercase_text
                               if re.match(r"[a-z\-' \n\t]", c))
        nopunct_par_breaks = self._mark_paragraph_breaks(nopunct_text)

        tokseqs = self._divide_to_tokensequences(nopunct_text)

        # Filter stopwords
        for ts in tokseqs:
            ts.wrdindex_list = [wi for wi in ts.wrdindex_list
                                if wi[0] not in self.stopwords]

        token_table = self._create_token_table(tokseqs, nopunct_par_breaks)

        # Lexical score determination
        if self.similarity_method == BLOCK_COMPARISON:
            gap_scores = self._block_comparison(tokseqs, token_table)
        elif self.similarity_method == VOCABULARY_INTRODUCTION:
            raise NotImplementedError("Vocabulary introduction not implemented")

        if self.smoothing_method == DEFAULT_SMOOTHING:
            smooth_scores = self._smooth_scores(gap_scores)

        # Boundary identification
        depth_scores = self._depth_scores(smooth_scores)
        segment_boundaries = self._identify_boundaries(depth_scores)

        normalized_boundaries = self._normalize_boundaries(text,
                                                           segment_boundaries,
                                                           paragraph_breaks)

        segmented_text = []
        prevb = 0

        for b in normalized_boundaries:
            if b == 0:
                continue
            segmented_text.append(text[prevb:b])
            prevb = b

        if prevb < text_length:  # append any remaining text
            segmented_text.append(text[prevb:])

        if not segmented_text:
            segmented_text = [text]

        if self.demo_mode:
            return gap_scores, smooth_scores, depth_scores, segment_boundaries
        return segmented_text
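    # The gap score used below is a cosine-style similarity between the two
    # blocks of pseudosentences that surround a gap:
    #
    #     score = sum_t f(t, b1) * f(t, b2)
    #             / sqrt(sum_t f(t, b1)**2 * sum_t f(t, b2)**2)
    #
    # where f(t, b) counts the occurrences of token t inside block b.  A low
    # score means little lexical overlap across the gap, i.e. a likely
    # subtopic shift.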
    def _block_comparison(self, tokseqs, token_table):
        "Implements the block comparison method"

        def blk_frq(tok, block):
            ts_occs = filter(lambda o: o[0] in block,
                             token_table[tok].ts_occurences)
            freq = sum([tsocc[1] for tsocc in ts_occs])
            return freq

        gap_scores = []
        numgaps = len(tokseqs) - 1

        for curr_gap in range(numgaps):
            score_dividend, score_divisor_b1, score_divisor_b2 = 0.0, 0.0, 0.0
            score = 0.0
            # adjust window size for boundary conditions
            if curr_gap < self.k - 1:
                window_size = curr_gap + 1
            elif curr_gap > numgaps - self.k:
                window_size = numgaps - curr_gap
            else:
                window_size = self.k

            b1 = [ts.index
                  for ts in tokseqs[curr_gap - window_size + 1: curr_gap + 1]]
            b2 = [ts.index
                  for ts in tokseqs[curr_gap + 1: curr_gap + window_size + 1]]

            for t in token_table:
                score_dividend += blk_frq(t, b1) * blk_frq(t, b2)
                score_divisor_b1 += blk_frq(t, b1) ** 2
                score_divisor_b2 += blk_frq(t, b2) ** 2
            try:
                score = score_dividend / math.sqrt(score_divisor_b1 *
                                                   score_divisor_b2)
            except ZeroDivisionError:
                pass  # score stays 0.0

            gap_scores.append(score)

        return gap_scores

    def _smooth_scores(self, gap_scores):
        "Wraps the smooth function from the SciPy Cookbook"
        return list(smooth(numpy.array(gap_scores[:]),
                           window_len=self.smoothing_width + 1))

    def _mark_paragraph_breaks(self, text):
        """Identifies indented text or line breaks as the beginning of
        paragraphs"""
        MIN_PARAGRAPH = 100
        pattern = re.compile("[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")
        matches = pattern.finditer(text)

        last_break = 0
        pbreaks = [0]
        for pb in matches:
            if pb.start() - last_break < MIN_PARAGRAPH:
                continue
            else:
                pbreaks.append(pb.start())
                last_break = pb.start()

        return pbreaks

    def _divide_to_tokensequences(self, text):
        "Divides the text into pseudosentences of fixed size"
        w = self.w
        wrdindex_list = []
        matches = re.finditer(r"\w+", text)
        for match in matches:
            wrdindex_list.append((match.group(), match.start()))
        return [TokenSequence(i // w, wrdindex_list[i: i + w])
                for i in range(0, len(wrdindex_list), w)]

    def _create_token_table(self, token_sequences, par_breaks):
        "Creates a table of TokenTableFields"
        token_table = {}
        current_par = 0
        current_tok_seq = 0
        pb_iter = iter(par_breaks)
        current_par_break = next(pb_iter)
        if current_par_break == 0:
            try:
                current_par_break = next(pb_iter)  # skip break at 0
            except StopIteration:
                raise ValueError(
                    "No paragraph breaks were found (text too short perhaps?)")

        for ts in token_sequences:
            for word, index in ts.wrdindex_list:
                try:
                    while index > current_par_break:
                        current_par_break = next(pb_iter)
                        current_par += 1
                except StopIteration:
                    # hit the last paragraph break
                    pass

                if word in token_table:
                    token_table[word].total_count += 1

                    if token_table[word].last_par != current_par:
                        token_table[word].last_par = current_par
                        token_table[word].par_count += 1

                    if token_table[word].last_tok_seq != current_tok_seq:
                        token_table[word].last_tok_seq = current_tok_seq
                        token_table[word].ts_occurences.append(
                            [current_tok_seq, 1])
                    else:
                        token_table[word].ts_occurences[-1][1] += 1
                else:  # new word
                    token_table[word] = TokenTableField(
                        first_pos=index,
                        ts_occurences=[[current_tok_seq, 1]],
                        total_count=1,
                        par_count=1,
                        last_par=current_par,
                        last_tok_seq=current_tok_seq)

            current_tok_seq += 1

        return token_table

    def _identify_boundaries(self, depth_scores):
        """Identifies boundaries at the peaks of similarity score
        differences"""

        boundaries = [0 for x in depth_scores]

        avg = sum(depth_scores) / len(depth_scores)
        stdev = numpy.std(depth_scores)
        cutoff = avg - stdev / 2.0

        depth_tuples = sorted(zip(depth_scores, range(len(depth_scores))))
        depth_tuples.reverse()
        hp = list(filter(lambda x: x[0] > cutoff, depth_tuples))

        for dt in hp:
            boundaries[dt[1]] = 1
            for dt2 in hp:  # undo if there is a boundary close already
                if dt[1] != dt2[1] and abs(dt2[1] - dt[1]) < 4 \
                        and boundaries[dt2[1]] == 1:
                    boundaries[dt[1]] = 0
        return boundaries

    def _depth_scores(self, scores):
        """Calculates the depth of each gap, i.e. the average difference
        between the left and right peaks and the gap's score"""

        depth_scores = [0 for x in scores]
        # clip boundaries: a section should not be smaller than a couple of
        # pseudosentences for small texts and around five for larger ones
        clip = min(max(len(scores) // 10, 2), 5)
        index = clip

        for gapscore in scores[clip:-clip]:
            lpeak = gapscore
            for score in scores[index::-1]:
                if score >= lpeak:
                    lpeak = score
                else:
                    break
            rpeak = gapscore
            for score in scores[index:]:
                if score >= rpeak:
                    rpeak = score
                else:
                    break
            depth_scores[index] = lpeak + rpeak - 2 * gapscore
            index += 1

        return depth_scores

    def _normalize_boundaries(self, text, boundaries, paragraph_breaks):
        """Normalize the boundaries identified to the original text's
        paragraph breaks"""

        norm_boundaries = []
        char_count, word_count, gaps_seen = 0, 0, 0
        seen_word = False

        for char in text:
            char_count += 1
            if char in " \t\n" and seen_word:
                seen_word = False
                word_count += 1
            if re.match(r"[\w\-']", char) and not seen_word:
                seen_word = True
            if gaps_seen < len(boundaries) and \
                    word_count > (max(gaps_seen * self.w, self.w)):
                if boundaries[gaps_seen] == 1:
                    # find the closest paragraph break
                    best_fit = len(text)
                    for br in paragraph_breaks:
                        if best_fit > abs(br - char_count):
                            best_fit = abs(br - char_count)
                            bestbr = br
                        else:
                            break
                    if bestbr not in norm_boundaries:  # avoid duplicates
                        norm_boundaries.append(bestbr)
                gaps_seen += 1

        return norm_boundaries


class TokenTableField(object):
    """A field in the token table holding parameters for each token,
    used later in the process"""

    def __init__(self,
                 first_pos,
                 ts_occurences,
                 total_count=1,
                 par_count=1,
                 last_par=0,
                 last_tok_seq=None):
        self.__dict__.update(locals())
        del self.__dict__['self']


class TokenSequence(object):
    "A token list with its original length and its index"

    def __init__(self,
                 index,
                 wrdindex_list,
                 original_length=None):
        original_length = original_length or len(wrdindex_list)
        self.__dict__.update(locals())
        del self.__dict__['self']
# Adapted from the SciPy Cookbook signal-smoothing recipe.
def smooth(x, window_len=11, window='flat'):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.

    :param x: the input signal
    :param window_len: the dimension of the smoothing window; should be an odd integer
    :param window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman';
        a flat window will produce a moving average smoothing.

    :return: the smoothed signal

    example::

        t = linspace(-2, 2, 0.1)
        x = sin(t) + randn(len(t)) * 0.1
        y = smooth(x)

    :see also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman,
        numpy.convolve, scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead
    of a string
    """

    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")

    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")

    if window_len < 3:
        return x

    if window not in ('flat', 'hanning', 'hamming', 'bartlett', 'blackman'):
        raise ValueError("Window must be one of 'flat', 'hanning', 'hamming', "
                         "'bartlett', 'blackman'")

    # pad both ends with reflected copies of the signal
    s = numpy.r_[2 * x[0] - x[window_len:1:-1],
                 x,
                 2 * x[-1] - x[-1:-window_len:-1]]

    if window == 'flat':  # moving average
        w = numpy.ones(window_len, 'd')
    else:
        w = eval('numpy.' + window + '(window_len)')

    y = numpy.convolve(w / w.sum(), s, mode='same')

    return y[window_len - 1: -window_len + 1]


def demo(text=None):
    from nltk.corpus import brown
    from matplotlib import pylab

    tt = TextTilingTokenizer(demo_mode=True)
    if text is None:
        text = brown.raw()[:10000]
    s, ss, d, b = tt.tokenize(text)
    pylab.xlabel("Sentence Gap index")
    pylab.ylabel("Gap Scores")
    pylab.plot(range(len(s)), s, label="Gap Scores")
    pylab.plot(range(len(ss)), ss, label="Smoothed Gap scores")
    pylab.plot(range(len(d)), d, label="Depth scores")
    pylab.stem(range(len(b)), b)
    pylab.legend()
    pylab.show()
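# Minimal usage sketch, mirroring the doctest in TextTilingTokenizer above; it
# assumes the Brown corpus and the stopwords corpus from nltk_data are
# installed locally.
if __name__ == "__main__":
    from nltk.corpus import brown

    tt = TextTilingTokenizer()
    tiles = tt.tokenize(brown.raw()[:10000])
    for i, tile in enumerate(tiles):
        print("tile %d: %d characters" % (i, len(tile)))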