ó <¿CVc@`sÆddlmZmZmZddlZddlZddlZddlZddlm Z ddl m Z m Z ddl mZmZddlmZmZmZd„Zd„Zd „Zd „Zd „Zd „Zd „Zd„Zd„Zd„Zd„Zd„Zdddddddde!de!e!ddddde!dd„Z"d„Z#ddd„Z$ed.d/gƒZ%ed0d1d2d3d4d5d6d7d8g ƒZ&d,„Z'e(d-krÂeƒndS(9i(tprint_functiontabsolute_importtdivisionN(ttreebank(t error_listtTemplate(tWordtPos(tBrillTaggerTrainert RegexpTaggert UnigramTaggercC`s tƒdS(s„ Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions. N(tpostag(((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemoscC`stddƒdS(sN Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose")) t ruleformattreprN(R (((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_repr_rule_formatscC`stddƒdS(sN Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose")) R tstrN(R (((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_str_rule_format%scC`stddƒdS(s* Exemplify Rule.format("verbose") R tverboseN(R (((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_verbose_rule_format+scC`s)tdttdddgƒƒgƒdS(s¾ The feature/s of a template takes a list of positions relative to the current word where the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right. For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is the same as the arg below. t templatesiýÿÿÿiþÿÿÿiÿÿÿÿN(R RR(((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_multiposition_feature1s cC`s2tdttdgƒtddgƒƒgƒdS(s8 Templates can have more than a single feature. RiiþÿÿÿiÿÿÿÿN(R RRR(((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_multifeature_template?scC`stdtdtƒdS(sh Show aggregate statistics per template. Little used templates are candidates for deletion, much used templates may possibly be refined. Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor). tincremental_statsttemplate_statsN(R tTrue(((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_template_statisticsEs cC`s¨tjdddgddgdtƒ}tjddddgddgdtƒ}ttj||gddƒƒ}td jt |ƒƒƒt d |d td tƒd S(s  Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details. Note: training with 500 templates can easily fill all available even on relatively small corpora iÿÿÿÿiiit excludezeroiþÿÿÿt combinationsis9Generated {0} templates for transformation-based learningRRRN(ii( RtexpandtFalseRRtlistRtprinttformattlenR (twordtplsttagtplsR((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_generated_templatesPs '*!cC`stdtdtddƒdS(s‚ Plot a learning curve -- the contribution on tagging accuracy of the individual rules. Note: requires matplotlib Rtseparate_baseline_datatlearning_curve_outputslearningcurve.pngN(R R(((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_learning_curve_scC`stddƒdS(sW Writes a file with context for each erroneous word after tagging testing data t error_outputs errors.txtN(R (((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_error_analysisgscC`stddƒdS(sm Serializes the learned tagger to a file in pickle format; reloads it and validates the process. tserialize_outputs tagger.pclN(R (((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_serialize_taggermscC`stddddddƒdS(s˜ Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting read to a human. t num_sentsi¸ tmin_accg¸…ëQ¸î?t min_scorei N(R (((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pytdemo_high_accuracy_rulestsièi,igš™™™™™é?Rc& C`s-|p t}|dkr:ddlm}m}|ƒ}nt|||||ƒ\}}}}|rtjj|ƒsÆt |d|ƒ}t |dƒ}t j ||ƒWdQXt dj|ƒƒnt |dƒ)}t j|ƒ}t dj|ƒƒWdQXnt |d|ƒ}t d ƒ|rDt d j|j|ƒƒƒntjƒ}t|||d | ƒ}t d ƒ|j||||ƒ}t d jtjƒ|ƒƒ|rÇt d|j|ƒƒn|dkr%t dƒxEt|jƒdƒD]+\}}t dj||j| ƒƒƒqóWn| rÁt dƒ|j||ƒ\} }!t dƒ|sjt dƒn|jƒ}"| rŒ|j|!ƒn|rít||!|"d|ƒt dj|ƒƒqín,t dƒ|j|ƒ} | rí|jƒn| dk rdt | dƒD}#|#jd| ƒ|#jdjt|| ƒƒjdƒdƒWdQXt dj| ƒƒn| dk r)|j|ƒ} t | dƒ}t j ||ƒWdQXt dj| ƒƒt | dƒ}t j|ƒ}$WdQXt dj| ƒƒ|j|ƒ}%| |%krt dƒq)t dƒndS( s’ Brill Tagger Demonstration :param templates: how many sentences of training and testing data to use :type templates: list of Template :param tagged_data: maximum number of rule instances to create :type tagged_data: C{int} :param num_sents: how many sentences of training and testing data to use :type num_sents: C{int} :param max_rules: maximum number of rule instances to create :type max_rules: C{int} :param min_score: the minimum score for a rule in order for it to be considered :type min_score: C{int} :param min_acc: the minimum score for a rule in order for it to be considered :type min_acc: C{float} :param train: the fraction of the the corpus to be used for training (1=all) :type train: C{float} :param trace: the level of diagnostic tracing output to produce (0-4) :type trace: C{int} :param randomize: whether the training data should be a random subset of the corpus :type randomize: C{bool} :param ruleformat: rule output format, one of "str", "repr", "verbose" :type ruleformat: C{str} :param incremental_stats: if true, will tag incrementally and collect stats for each rule (rather slow) :type incremental_stats: C{bool} :param template_stats: if true, will print per-template statistics collected in training and (optionally) testing :type template_stats: C{bool} :param error_output: the file where errors will be saved :type error_output: C{string} :param serialize_output: the file where the learned tbl tagger will be saved :type serialize_output: C{string} :param learning_curve_output: filename of plot of learning curve(s) (train and also test, if available) :type learning_curve_output: C{string} :param learning_curve_take: how many rules plotted :type learning_curve_take: C{int} :param baseline_backoff_tagger: the file where rules will be saved :type baseline_backoff_tagger: tagger :param separate_baseline_data: use a fraction of the training data exclusively for training baseline :type separate_baseline_data: C{bool} :param cache_baseline_tagger: cache baseline tagger to this file (only interesting as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions) :type cache_baseline_tagger: C{string} Note on separate_baseline_data: if True, reuse training data both for baseline and rule learner. This is fast and fine for a demo, but is likely to generalize worse on unseen data. Also cannot be sensibly used for learning curves on training data (the baseline will be artificially high). i(tdescribe_template_setstbrill24tbackofftwNs*Trained baseline tagger, pickled it to {0}trs Reloaded pickled tagger from {0}sTrained baseline taggers" Accuracy on test set: {0:0.4f}R sTraining tbl tagger...s&Trained tbl tagger in {0:0.2f} secondss Accuracy on test set: %.4fis Learned rules: s {0:4d} {1:s}sJIncrementally tagging the test data, collecting individual rule statisticss Rule statistics collectedsbWARNING: train_stats asked for separate_baseline_data=True; the baseline will be artificially highttakes#Wrote plot of learning curve to {0}sTagging the test datasErrors for Brill Tagger %r u sutf-8s s,Wrote tagger errors including context to {0}sWrote pickled tagger to {0}s4Reloaded tagger tried on test set, results identicals;PROBLEM: Reloaded tagger gave different results on test set(t REGEXP_TAGGERtNonetnltk.tag.brillR1R2t_demo_prepare_datatostpathtexistsR topentpickletdumpR R!tloadtevaluatettimeRttraint enumeratetrulestbatch_tag_incrementalt train_statstprint_template_statisticst _demo_plott tag_sentstwritetjoinRtencode(&Rt tagged_dataR-t max_rulesR/R.RDttracet randomizeR RRR)R+R'tlearning_curve_taketbaseline_backoff_taggerR&tcache_baseline_taggerR1R2t training_datat baseline_datat gold_datat testing_datatbaseline_taggert print_rulesttbrillttrainert brill_taggertrulenotrulet taggedtestt teststatst trainstatstftbrill_tagger_reloadedttaggedtest_reloaded((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pyR {s|W   $     "&       2   cC`s|dkr%tdƒtjƒ}n|dksCt|ƒ|krRt|ƒ}n|r{tjt|ƒƒtj|ƒnt||ƒ}|| }|||!}g|D]#}g|D]} | d^q¶^q©} |sá|} n%t|ƒd} || || } }t |ƒ\} }t | ƒ\}}t | ƒ\}}tdj ||ƒƒtdj | |ƒƒtdj |||rƒdndƒƒ|| || fS( Ns%Loading tagged data from treebank... iis)Read testing data ({0:d} sents/{1:d} wds)s*Read training data ({0:d} sents/{1:d} wds)s0Read baseline data ({0:d} sents/{1:d} wds) {2:s}ts[reused the training set]( R8R Rt tagged_sentsR"trandomtseedtshuffletintt corpus_sizeR!(RORDR-RRR&tcutoffRVRXtsentttRYRWt bl_cutofft trainseqst traintokensttestseqst testtokenst bltrainseqst bltraintokens((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pyR:)s0    0  c C`s'|dg}x'|dD]}|j|d|ƒqWg|| D]}d||d^qB}|dg}x'|dD]}|j|d|ƒqxWg|| D]}d||d^q¢}ddlj}ttt|ƒƒƒ} |j| || |ƒ|jddddgƒ|j |ƒdS(Nt initialerrorst rulescoresiÿÿÿÿit tokencountigð?( tappendtmatplotlib.pyplottpyplotRtrangeR"tplottaxisR8tsavefig( R'RbRcR6t testcurvet rulescoretxt traincurvetpltR5((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pyRJGs ) )s^-?[0-9]+(.[0-9]+)?$tCDs.*tNNs(The|the|A|a|An|an)$tATs.*able$tJJs.*ness$s.*ly$tRBs.*s$tNNSs.*ing$tVBGs.*ed$tVBDcC`s t|ƒtd„|DƒƒfS(Ncs`s|]}t|ƒVqdS(N(R"(t.0R„((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pys ks(R"tsum(tseqs((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pyRmjst__main__(s^-?[0-9]+(.[0-9]+)?$R‡(s.*Rˆ(s^-?[0-9]+(.[0-9]+)?$R‡(s(The|the|A|a|An|an)$sAT(s.*able$RŠ(s.*ness$Rˆ(s.*ly$R‹(s.*s$RŒ(s.*ing$R(s.*ed$RŽ(s.*Rˆ()t __future__RRRR;R?RiRCt nltk.corpusRtnltk.tblRRR9RRtnltk.tagRR R R RRRRRRR%R(R*R,R0R8RR R:RJt NN_CD_TAGGERR7Rmt__name__(((s_/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/tbl/demo.pyt sr                ›