U C^ @sddlmZddlZddlZddlmZddlZddlZddl Z ddl Z ddl Z ddl m ZddlmZddlmZdd lmZdd lmZzeWnek reZYnXd Zd jed dZdjeddZdZdZdZ d8ddZ!d9ddZ"d:ddZ#d;ddZ$dd&d'Z(d(d)Z)d*d+Z*d?d,d-Z+dddd.Z,dddd.Z-d@d/d0Z.dAd1d2Z/d3d4Z0dBd6d7Z1dS)C)unicode_literalsN)Counter) cloudpickle)Path)get_file) partition) basestringz)https://github.com/UniversalDependencies/z"{github}/{ancora}/archive/r1.4.zipzUD_Spanish-AnCora)githubZancoraz {github}/{ewtb}/archive/r1.4.zipZ UD_English)r Zewtbz2http://nlp.stanford.edu/projects/snli/snli_1.0.zipz8http://qim.fs.quoracdn.net/quora_duplicate_questions.tsvz>http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gzFcCs8tdtdd}tj|d}tj|d}t|||dS)NzUD_Spanish-AnCora-r1.4Tunzipzes_ancora-ud-train.conlluzes_ancora-ud-dev.conllu) encode_words)rANCORA_1_4_ZIPospathjoin ud_pos_tags)rdata_dir train_locdev_locr7/tmp/pip-install-6_kvzl1k/thinc/thinc/extra/datasets.pyancora_pos_tags&srcCs:tdtdd}tj|d}tj|d}t||||dS)NzUD_English-EWT-r1.4Tr zen-ud-train.conlluzen-ud-dev.conllu) encode_tagsr)r EWTB_1_4_ZIPrrrr)rrrrrrrr ewtb_pos_tags-srTc stt|}tt|}it}|D]<\}}|D]} | tq2|D]} || d7<qLq&ddt|Dfdd} | || |tfS)NrcSs"i|]\}\}}|dkr||qS)r).0iwordfreqrrr Bs zud_pos_tags..csg}g}|D]l\}}r<|tjfdd|Dddn ||rn|tjfdd|Dddq ||q t||S)Ncsg|]}|tqSr)getlen)rr )vocabrr Msz0ud_pos_tags.._encode..Zuint64)Zdtypecsg|] }|qSrrrtagtagmaprrr&SsZint32)appendnumpyZasarrayzip)ZsentsXywordstagsrrr*r%rr_encodeFs  $ zud_pos_tags.._encode)list read_conllr setdefaultr$ enumerate most_common) rrrrZ train_sentsZ dev_sentsZfreqsr0r1r(r r3rr2rr6s    rcCsH|dkrtdtddd}t|d}t|d}t||dt||dfS)NZaclImdbT)Zuntarr traintest)limit)rIMDB_URLr read_imdb)locr;rZtest_locrrrimdb[s   r?c stg}g}|D]\}|sq dd|D}t|\}}}dk rTfdd|D}||||q t||S)NcSsg|]}|ddqS)|r)rsplit)rtrrrr&isz read_wikiner..csg|]}|tqSr)r6r$r'r)rrr&ls)stripsplitr-r+) file_r*ZXsZyslinetokensr0_r1rr)r read_wikinercs  rIc Csg}dD]^\}}||D]H}|jddd}|}W5QRX|dd}|r|||fqqt||dkr|d|}|S)N))posr)negrrutf8encodingz
 r)iterdiropenreadreplacerCr+randomshuffle)rr;ZexamplessubdirlabelfilenamerEtextrrrr=rs    r=c cstj|dd}|d}W5QRX|D]}dd|dD}g}g}t|D]X\}}t|dkrz|\} } } } n|\ } } }}} }} } }}d| krqX|| || qX||fVq0dS) NrMrNrPcSsg|]}|ds|qS)#) startswithrD)rrFrrrr&s zread_conll.. -)iorRrSrCrDr7r$r+)r>rEZ sent_strsZsent_strlinesr0r1rpiecesr rJheadrXidxZlemmaZpos1ZmorphrHZ_2rrrr5s"   r5c csB|0}t|D]}||}||}||fVqW5QRXdS)N)rRcsvreader)Zcsv_locZ label_colZtext_colrErowZ label_strrZrrrread_csvs  ric Csddlm}|\\}}\}}|dd}|dd}|d}|d}|d}|d}tt||}|jd}t||dt |d }|t |d}tt||}|||fS) Nr) load_mnisti`ii'Zfloat32go@rg?) _vendorized.keras_datasetsrjZreshapeZastyper4r-shaperUrVintr$) rjX_trainy_trainX_testy_testZ train_dataZnr_trainZ heldout_dataZ test_datarrrmnists       rrcCs.ddlm}|\\}}\}}||f||ffS)Nr) load_reuters)rkrs)rsrnrorprqrrrreuterss rtc Cs|dkrtdt}t|tr$t|}d}g}|jddd}tj|ddD]l}|rXd}qJ|\}}}}} } t|ts| d }t| ts| d } |rJ| rJ| || ft | fqJW5QRXt |d \} } | | fS) Nzquora_similarity.tsvTrLrMrN ) delimiterFg?)rQUORA_QUESTIONS_URL isinstancer rrRrfrgunicodedecoderCr+rmr ) r>Z is_headerrarErhZid_Zqid1Zqid2Zsent1Zsent2Z is_duplicater9devrrrquora_questionss(    "r|)Z entailmentZ contradictionZneutralcCs`|rtnt}|dkr"tdtdd}t|tr4t|}tt|d|}tt|d|}||fS)Nzsnli_1.0Tr zsnli_1.0_train.jsonlzsnli_1.0_dev.jsonl) THREE_LABELS TWO_LABELSrSNLI_URLrxr r read_snli)r>Zternary label_schemer9r{rrrsnlis  rc Csz|dkrtdg}|jddd<}|D]0}t|}||d|dft|dfq(W5QRXt|d\}}||fS) Nz&No default path for Stack Exchange yetrLrMrNZtext1Ztext2rXgffffff?) ValueErrorrRjsonloadsr+rmr )r>rowsrErFegr9r{rrrstack_exchanges .rc Csfg}|jdddJ}|D]>}t|}|d}|dkr8q||d|df||fqW5QRX|S)NrLrMrNZ gold_labelr_Z sentence1Z sentence2)rRrrr+)r>rrrErFrrXrrrrs *rreuters_word_index.pklcCsFt|dd}t|d}tjdkr,t|}ntj|dd}||S)Nz=https://s3.amazonaws.com/text-datasets/reuters_word_index.pkl)originrb)latin1rN)rrRsys version_infopickleloadclose)rfdatarrrget_word_indexs   r)F)FF)TT)Nr)N)r)rre)N)NF)N)r)2 __future__rrUr` collectionsros.pathrrfr,rrZsrslyrrpathlibrZ_vendorized.keras_data_utilsrZ neural.utilr compatr ry NameErrorstrZGITHUBformatrrrrwr<rrrr?rIr=r5rirrrtr|r}r~rrrrrrrrs`          %