# coding: utf-8
"""Text datasets."""
__all__ = ['WikiText2', 'WikiText103']

import io
import os
import zipfile
import shutil

import numpy as np

from . import _constants as C
from ...data import dataset
from ...utils import download, check_sha1, _get_repo_file_url
from ....contrib import text
from .... import nd, base


class _LanguageModelDataset(dataset._DownloadedDataset):
    def __init__(self, root, namespace, vocabulary):
        self._vocab = vocabulary
        self._counter = None
        self._namespace = namespace
        super(_LanguageModelDataset, self).__init__(root, None)

    @property
    def vocab(self):
        return self._vocab

    @property
    def frequencies(self):
        return self._counter

    def _build_vocab(self, content):
        if not self._counter:
            self._counter = text.utils.count_tokens_from_str(content)
        if not self._vocab:
            self._vocab = text.vocab.Vocabulary(counter=self.frequencies,
                                                reserved_tokens=[C.EOS_TOKEN])


class _WikiText(_LanguageModelDataset):
    def _read_batch(self, filename):
        with io.open(filename, 'r', encoding='utf8') as fin:
            content = fin.read()

        self._build_vocab(content)

        raw_data = [line for line in
                    [x.strip().split() for x in content.splitlines()]
                    if line]
        for line in raw_data:
            line.append(C.EOS_TOKEN)
        raw_data = self.vocab.to_indices(
            [x for line in raw_data for x in line if x])
        data = raw_data[0:-1]
        label = raw_data[1:]
        return np.array(data, dtype=np.int32), np.array(label, dtype=np.int32)

    def _get_data(self):
        archive_file_name, archive_hash = self._archive_file
        data_file_name, data_hash = self._data_file[self._segment]
        path = os.path.join(self._root, data_file_name)
        if not os.path.exists(path) or not check_sha1(path, data_hash):
            namespace = 'gluon/dataset/' + self._namespace
            downloaded_file_path = download(
                _get_repo_file_url(namespace, archive_file_name),
                path=self._root, sha1_hash=archive_hash)

            with zipfile.ZipFile(downloaded_file_path, 'r') as zf:
                for member in zf.namelist():
                    filename = os.path.basename(member)
                    if filename:
                        dest = os.path.join(self._root, filename)
                        with zf.open(member) as source, \
                                open(dest, 'wb') as target:
                            shutil.copyfileobj(source, target)

        data, label = self._read_batch(path)

        self._data = nd.array(data, dtype=data.dtype).reshape(
            (-1, self._seq_len))
        self._label = nd.array(label, dtype=label.dtype).reshape(
            (-1, self._seq_len))

    def __getitem__(self, idx):
        return self._data[idx], self._label[idx]

    def __len__(self):
        return len(self._label)


class WikiText2(_WikiText):
    """WikiText-2 word-level dataset for language modeling, from Salesforce
    research.

    From
    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

    License: Creative Commons Attribution-ShareAlike

    Each sample is a vector of length equal to the specified sequence length.
    At the end of each sentence, an end-of-sentence token '<eos>' is added.

    Parameters
    ----------
    root : str, default $MXNET_HOME/datasets/wikitext-2
        Path to temp folder for storing data.
    segment : str, default 'train'
        Dataset segment. Options are 'train', 'validation', 'test'.
    vocab : :class:`~mxnet.contrib.text.vocab.Vocabulary`, default None
        The vocabulary to use for indexing the text dataset. If None, a
        default vocabulary is created.
    seq_len : int, default 35
        The sequence length of each sample, regardless of the sentence
        boundary.
    """
    def __init__(self, root=os.path.join(base.data_dir(), 'datasets',
                                         'wikitext-2'),
                 segment='train', vocab=None, seq_len=35):
        self._archive_file = ('wikitext-2-v1.zip',
                              '3c914d17d80b1459be871a5039ac23e752a53cbe')
        self._data_file = {'train': ('wiki.train.tokens',
                                     '863f29c46ef9d167fff4940ec821195882fe29d1'),
                           'validation': ('wiki.valid.tokens',
                                          '0418625c8b4da6e4b5c7a0b9e78d4ae8f7ee5422'),
                           'test': ('wiki.test.tokens',
                                    'c7b8ce0aa086fb34dab808c5c49224211eb2b172')}
        self._segment = segment
        self._seq_len = seq_len
        super(WikiText2, self).__init__(root, 'wikitext-2', vocab)


class WikiText103(_WikiText):
    """WikiText-103 word-level dataset for language modeling, from Salesforce
    research.

    From
    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

    License: Creative Commons Attribution-ShareAlike

    Each sample is a vector of length equal to the specified sequence length.
    At the end of each sentence, an end-of-sentence token '<eos>' is added.

    Parameters
    ----------
    root : str, default $MXNET_HOME/datasets/wikitext-103
        Path to temp folder for storing data.
    segment : str, default 'train'
        Dataset segment. Options are 'train', 'validation', 'test'.
    vocab : :class:`~mxnet.contrib.text.vocab.Vocabulary`, default None
        The vocabulary to use for indexing the text dataset. If None, a
        default vocabulary is created.
    seq_len : int, default 35
        The sequence length of each sample, regardless of the sentence
        boundary.
    """
    def __init__(self, root=os.path.join(base.data_dir(), 'datasets',
                                         'wikitext-103'),
                 segment='train', vocab=None, seq_len=35):
        self._archive_file = ('wikitext-103-v1.zip',
                              '0aec09a7537b58d4bb65362fee27650eeaba625a')
        self._data_file = {'train': ('wiki.train.tokens',
                                     'b7497e2dfe77e72cfef5e3dbc61b7b53712ac211'),
                           'validation': ('wiki.valid.tokens',
                                          'c326ac59dc587676d58c422eb8a03e119582f92b'),
                           'test': ('wiki.test.tokens',
                                    '8a5befc548865cec54ed4273cf87dbbad60d1e47')}
        self._segment = segment
        self._seq_len = seq_len
        super(WikiText103, self).__init__(root, 'wikitext-103', vocab)
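The `_read_batch` method builds one long token stream, then offsets it by a single position to form (data, label) pairs before the dataset reshapes them into fixed-length rows. A minimal sketch of that pipeline with a hypothetical toy corpus and a dict standing in for `text.vocab.Vocabulary` (pure Python, no mxnet required):

```python
# Toy corpus: two sentences, each terminated by an '<eos>' token,
# mirroring what _read_batch appends to every line.
tokens = ['the', 'cat', 'sat', '<eos>', 'dogs', 'bark', '<eos>']

# Hypothetical stand-in for Vocabulary.to_indices: map each token to an id.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
indices = [vocab[t] for t in tokens]

# data/label are the same stream shifted by one token: at every position
# the label is the next token the model should predict.
data, label = indices[:-1], indices[1:]

# Chop into fixed-length samples of seq_len, ignoring sentence boundaries,
# as the dataset's reshape((-1, seq_len)) does; the remainder is dropped.
seq_len = 3
usable = len(data) // seq_len * seq_len
samples = [(data[i:i + seq_len], label[i:i + seq_len])
           for i in range(0, usable, seq_len)]
```

With 6 usable positions and `seq_len=3`, this yields two (data, label) samples.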
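The `_get_data` method extracts every archive member directly into the dataset root, discarding any directory prefix inside the zip. A self-contained sketch of that flatten-on-extract pattern, using hypothetical file names in a temporary directory:

```python
import os
import shutil
import tempfile
import zipfile

root = tempfile.mkdtemp()
archive = os.path.join(root, 'demo.zip')

# Build a demo archive whose member sits inside a nested directory.
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('wikitext-2/wiki.train.tokens', 'hello world\n')

# Flatten on extract: keep only the basename of each member so the
# token files land directly in root, not in a subdirectory.
with zipfile.ZipFile(archive, 'r') as zf:
    for member in zf.namelist():
        filename = os.path.basename(member)
        if filename:  # skip pure directory entries ('foo/')
            dest = os.path.join(root, filename)
            with zf.open(member) as source, open(dest, 'wb') as target:
                shutil.copyfileobj(source, target)

extracted = os.path.join(root, 'wiki.train.tokens')
```

This is why the subsequent `check_sha1` looks for `wiki.train.tokens` at the top of `self._root` rather than inside the archive's own folder.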