"""
CorpusReader for reviews corpora (syntax based on Customer Review Corpus).

- Customer Review Corpus information -
Annotated by: Minqing Hu and Bing Liu, 2004.
    Department of Computer Science
    University of Illinois at Chicago

Contact: Bing Liu, liub@cs.uic.edu
        http://www.cs.uic.edu/~liub

Distributed with permission.

The "product_reviews_1" and "product_reviews_2" datasets respectively contain
annotated customer reviews of 5 and 9 products from amazon.com.

Related papers:

- Minqing Hu and Bing Liu. "Mining and summarizing customer reviews".
    Proceedings of the ACM SIGKDD International Conference on Knowledge
    Discovery & Data Mining (KDD-04), 2004.

- Minqing Hu and Bing Liu. "Mining Opinion Features in Customer Reviews".
    Proceedings of Nineteenth National Conference on Artificial Intelligence
    (AAAI-2004), 2004.

- Xiaowen Ding, Bing Liu and Philip S. Yu. "A Holistic Lexicon-Based Approach to
    Opinion Mining." Proceedings of First ACM International Conference on Web
    Search and Data Mining (WSDM-2008), Feb 11-12, 2008, Stanford University,
    Stanford, California, USA.

Symbols used in the annotated reviews:

    [t] : the title of the review: Each [t] tag starts a review.
    xxxx[+|-n]: xxxx is a product feature.
    [+n]: Positive opinion, n is the opinion strength: 3 strongest, and 1 weakest.
          Note that the strength is quite subjective.
          You may want to ignore it and consider only + and -.
    [-n]: Negative opinion
    ##  : start of each sentence. Each line is a sentence.
    [u] : the feature does not appear in the sentence.
    [p] : the feature does not appear in the sentence. Pronoun resolution is needed.
    [s] : suggestion or recommendation.
    [cc]: comparison with a competing product from a different brand.
    [cs]: comparison with a competing product from the same brand.

Note: Some of the files (e.g. "ipod.txt", "Canon PowerShot SD500.txt") do not
    provide separation between different reviews. This is due to the fact that
    the dataset was specifically designed for aspect/feature-based sentiment
    analysis, for which sentence-level annotation is sufficient. For document-
    level classification and analysis, this peculiarity should be taken into
    consideration.
"""

import re

# The star imports provide CorpusReader, StreamBackedCorpusView, concat,
# compat and string_types (api) and WordPunctTokenizer (tokenize).
from nltk.corpus.reader.api import *
from nltk.tokenize import *

TITLE = re.compile(r'^\[t\](.*)$')  # title line, e.g. "[t] great camera"
FEATURES = re.compile(r'((?:(?:\w+\s)+)?\w+)\[((?:\+|\-)\d)\]')  # "picture quality" and "+2" in "picture quality[+2]"
NOTES = re.compile(r'\[(?!t)(p|u|s|cc|cs)\]')  # "p" in "camera[+2][p]"
SENT = re.compile(r'##(.*)$')  # sentence text following "##"


@compat.python_2_unicode_compatible
class Review(object):
    """
    A Review is the main block of a ReviewsCorpusReader.
    """
    def __init__(self, title=None, review_lines=None):
        """
        :param title: the title of the review.
        :param review_lines: the list of the ReviewLines that belong to the Review.
        """
        self.title = title
        if review_lines is None:
            self.review_lines = []
        else:
            self.review_lines = review_lines

    def add_line(self, review_line):
        """
        Add a line (ReviewLine) to the review.

        :param review_line: a ReviewLine instance that belongs to the Review.
        """
        assert isinstance(review_line, ReviewLine)
        self.review_lines.append(review_line)

    def features(self):
        """
        Return a list of features in the review. Each feature is a tuple made of
        the specific item feature and the opinion strength about that feature.

        :return: all features of the review as a list of tuples (feat, score).
        :rtype: list(tuple)
        """
        features = []
        for review_line in self.review_lines:
            features.extend(review_line.features)
        return features

    def sents(self):
        """
        Return all tokenized sentences in the review.

        :return: all sentences of the review as lists of tokens.
        :rtype: list(list(str))
        """
        return [review_line.sent for review_line in self.review_lines]

    def __repr__(self):
        return 'Review(title="{}", review_lines={})'.format(self.title, self.review_lines)


@compat.python_2_unicode_compatible
class ReviewLine(object):
    """
    A ReviewLine represents a sentence of the review, together with (optional)
    annotations of its features and notes about the reviewed item.
    """
    def __init__(self, sent, features=None, notes=None):
        self.sent = sent
        if features is None:
            self.features = []
        else:
            self.features = features

        if notes is None:
            self.notes = []
        else:
            self.notes = notes

    def __repr__(self):
        return 'ReviewLine(features={}, notes={}, sent={})'.format(
            self.features, self.notes, self.sent)


class ReviewsCorpusReader(CorpusReader):
    """
    Reader for the Customer Review Data dataset by Hu, Liu (2004).
    Note: we are not applying any sentence tokenization at the moment, just word
    tokenization.

        >>> from nltk.corpus import product_reviews_1
        >>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
        >>> review = camera_reviews[0]
        >>> review.sents()[0]
        ['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
        'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
        >>> review.features()
        [('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
        ('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
        ('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
        ('option', '+1')]

    We can also reach the same information directly from the stream:

        >>> product_reviews_1.features('Canon_G3.txt')
        [('canon powershot g3', '+3'), ('use', '+2'), ...]

    We can compute stats for specific product features:

        >>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
        >>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
        >>> # We use float for backward compatibility with division in Python2.7
        >>> mean = float(tot)/n_reviews
        >>> print(n_reviews, tot, mean)
        15 24 1.6

    """
    CorpusView = StreamBackedCorpusView

    def __init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(),
                 encoding='utf8'):
        """
        :param root: The root directory for the corpus.
        :param fileids: a list or regexp specifying the fileids in the corpus.
        :param word_tokenizer: a tokenizer for breaking sentences or paragraphs
            into words. Default: `WordPunctTokenizer`
        :param encoding: the encoding that should be used to read the corpus.
        """
        CorpusReader.__init__(self, root, fileids, encoding)
        self._word_tokenizer = word_tokenizer

    def features(self, fileids=None):
        """
        Return a list of features. Each feature is a tuple made of the specific
        item feature and the opinion strength about that feature.

        :param fileids: a list or regexp specifying the ids of the files whose
            features have to be returned.
        :return: all features for the item(s) in the given file(s).
        :rtype: list(tuple)
        """
        if fileids is None:
            fileids = self._fileids
        elif isinstance(fileids, string_types):
            fileids = [fileids]
        return concat([self.CorpusView(fileid, self._read_features, encoding=enc)
                       for (fileid, enc) in self.abspaths(fileids, True)])

    def raw(self, fileids=None):
        """
        :param fileids: a list or regexp specifying the fileids of the files that
            have to be returned as a raw string.
        :return: the given file(s) as a single string.
        :rtype: str
        """
        if fileids is None:
            fileids = self._fileids
        elif isinstance(fileids, string_types):
            fileids = [fileids]
        return concat([self.open(f).read() for f in fileids])

    def readme(self):
        """
        Return the contents of the corpus README.txt file.
        """
        return self.open("README.txt").read()

    def reviews(self, fileids=None):
        """
        Return all the reviews as a list of Review objects. If `fileids` is
        specified, return all the reviews from each of the specified files.

        :param fileids: a list or regexp specifying the ids of the files whose
            reviews have to be returned.
        :return: the given file(s) as a list of reviews.
        """
        if fileids is None:
            fileids = self._fileids
        return concat([self.CorpusView(fileid, self._read_review_block, encoding=enc)
                       for (fileid, enc) in self.abspaths(fileids, True)])

    def sents(self, fileids=None):
        """
        Return all sentences in the corpus or in the specified files.

        :param fileids: a list or regexp specifying the ids of the files whose
            sentences have to be returned.
        :return: the given file(s) as a list of sentences, each encoded as a
            list of word strings.
        :rtype: list(list(str))
        """
        return concat([self.CorpusView(path, self._read_sent_block, encoding=enc)
                       for (path, enc, fileid)
                       in self.abspaths(fileids, True, True)])

    def words(self, fileids=None):
        """
        Return all words and punctuation symbols in the corpus or in the specified
        files.

        :param fileids: a list or regexp specifying the ids of the files whose
            words have to be returned.
        :return: the given file(s) as a list of words and punctuation symbols.
        :rtype: list(str)
        """
        return concat([self.CorpusView(path, self._read_word_block, encoding=enc)
                       for (path, enc, fileid)
                       in self.abspaths(fileids, True, True)])

    def _read_features(self, stream):
        features = []
        for i in range(20):  # read up to 20 lines per block
            line = stream.readline()
            if not line:
                return features
            features.extend(re.findall(FEATURES, line))
        return features

    def _read_review_block(self, stream):
        # Skip ahead to the next title line, which opens a new review.
        while True:
            line = stream.readline()
            if not line:
                return []  # end of file
            title_match = re.match(TITLE, line)
            if title_match:
                review = Review(title=title_match.group(1).strip())
                break

        # Collect lines until the next title line or the end of the file.
        while True:
            oldpos = stream.tell()
            line = stream.readline()
            # End of file: return the review collected so far.
            if not line:
                return [review]
            # Start of a new review: rewind to just before its title line and
            # return the review collected so far.
            if re.match(TITLE, line):
                stream.seek(oldpos)
                return [review]
            # Anything else is a line of the current review.
            feats = re.findall(FEATURES, line)
            notes = re.findall(NOTES, line)
            sent = re.findall(SENT, line)
            if sent:
                sent = self._word_tokenizer.tokenize(sent[0])
            review_line = ReviewLine(sent=sent, features=feats, notes=notes)
            review.add_line(review_line)

    def _read_sent_block(self, stream):
        sents = []
        for review in self._read_review_block(stream):
            sents.extend([sent for sent in review.sents()])
        return sents

    def _read_word_block(self, stream):
        words = []
        for i in range(20):  # read up to 20 lines per block
            line = stream.readline()
            sent = re.findall(SENT, line)
            if sent:
                words.extend(self._word_tokenizer.tokenize(sent[0]))
        return words