"""
A reader for corpora that consist of Tweets. It is assumed that the Tweets
have been serialised into line-delimited JSON.
"""

import json
import os

from nltk import compat
from nltk.tokenize import TweetTokenizer

from nltk.corpus.reader.util import StreamBackedCorpusView, concat, ZipFilePathPointer
from nltk.corpus.reader.api import CorpusReader


class TwitterCorpusReader(CorpusReader):
    r"""
    Reader for corpora that consist of Tweets represented as a list of
    line-delimited JSON.

    Individual Tweets can be tokenized using the default tokenizer, or by a
    custom tokenizer specified as a parameter to the constructor.

    Construct a new Tweet corpus reader for a set of documents
    located at the given root directory.

    If you made your own tweet collection in a directory called
    `twitter-files`, then you can initialise the reader as::

        from nltk.corpus import TwitterCorpusReader
        reader = TwitterCorpusReader(root='/path/to/twitter-files', '.*\.json')

    However, the recommended approach is to set the relevant directory as the
    value of the environmental variable `TWITTER`, and then invoke the reader
    as follows::

        root = os.environ['TWITTER']
        reader = TwitterCorpusReader(root, '.*\.json')

    If you want to work directly with the raw Tweets, the `json` library can
    be used::

        import json
        for tweet in reader.docs():
            print(json.dumps(tweet, indent=1, sort_keys=True))
    """

    CorpusView = StreamBackedCorpusView
    """
    The corpus view class used by this reader.
    """

    def __init__(self, root, fileids=None,
                 word_tokenizer=TweetTokenizer(),
                 encoding='utf8'):
        """
        :param root: The root directory for this corpus.
        :param fileids: A list or regexp specifying the fileids in this corpus.
        :param word_tokenizer: Tokenizer for breaking the text of Tweets into
            smaller units, including but not limited to words.
        """
        CorpusReader.__init__(self, root, fileids, encoding)

        # Check that all user-created corpus files are non-empty.
        for path in self.abspaths(self._fileids):
            if isinstance(path, ZipFilePathPointer):
                pass
            elif os.path.getsize(path) == 0:
                raise ValueError("File {} is empty".format(path))

        self._word_tokenizer = word_tokenizer

    def docs(self, fileids=None):
        """
        Returns the full Tweet objects, as specified by `Twitter
        documentation on Tweets
        <https://dev.twitter.com/docs/platform-objects/tweets>`_

        :return: the given file(s) as a list of dictionaries deserialised
            from JSON.
        :rtype: list(dict)
        """
        return concat([self.CorpusView(path, self._read_tweets, encoding=enc)
                       for (path, enc, fileid)
                       in self.abspaths(fileids, True, True)])

    def strings(self, fileids=None):
        """
        Returns only the text content of Tweets in the file(s)

        :return: the given file(s) as a list of Tweets.
        :rtype: list(str)
        """
        fulltweets = self.docs(fileids)
        tweets = []
        for jsono in fulltweets:
            try:
                text = jsono['text']
                if isinstance(text, bytes):
                    text = text.decode(self.encoding)
                tweets.append(text)
            except KeyError:
                pass
        return tweets

    def tokenized(self, fileids=None):
        """
        :return: the given file(s) as a list of the text content of Tweets as
            a list of words, screennames, hashtags, URLs and punctuation
            symbols.
        :rtype: list(list(str))
        """
        tweets = self.strings(fileids)
        tokenizer = self._word_tokenizer
        return [tokenizer.tokenize(t) for t in tweets]

    def raw(self, fileids=None):
        """
        Return the corpora in their raw form.
        """
        if fileids is None:
            fileids = self._fileids
        if isinstance(fileids, compat.string_types):
            fileids = [fileids]
        return concat([self.open(f).read() for f in fileids])

    def _read_tweets(self, stream):
        """
        Assumes that each line in ``stream`` is a JSON-serialised object.
        """
        tweets = []
        for i in range(10):
            line = stream.readline()
            if not line:
                return tweets
            tweet = json.loads(line)
            tweets.append(tweet)
        return tweets
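The core of the reader is the line-delimited JSON parsing performed by `_read_tweets`: the corpus view calls it repeatedly, and each call consumes up to ten lines from the stream. A minimal standalone sketch of that logic (the function name `read_line_delimited_json` is illustrative, not part of NLTK's API):

```python
import io
import json

def read_line_delimited_json(stream, limit=10):
    """Parse up to `limit` JSON objects, one per line, from a text stream.

    Mirrors the batching behaviour assumed by TwitterCorpusReader:
    stop early at end-of-stream, otherwise return a block of objects.
    """
    objects = []
    for _ in range(limit):
        line = stream.readline()
        if not line:  # end of stream
            break
        objects.append(json.loads(line))
    return objects

# Example: two serialised "tweets", one JSON object per line.
data = io.StringIO('{"text": "hello"}\n{"text": "world"}\n')
tweets = read_line_delimited_json(data)
print([t["text"] for t in tweets])  # -> ['hello', 'world']
```

Reading in fixed-size blocks rather than the whole file at once is what lets `StreamBackedCorpusView` expose a large corpus lazily, without loading every Tweet into memory.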