# coding: utf-8
"""Text token embeddings."""
from __future__ import absolute_import
from __future__ import print_function

import io
import logging
import os
import tarfile
import warnings
import zipfile

from . import _constants as C
from . import vocab
from ... import ndarray as nd
from ... import registry
from ... import base


def register(embedding_cls):
    """Registers a new token embedding.

    Once an embedding is registered, we can create an instance of this embedding with
    :func:`~mxnet.contrib.text.embedding.create`.

    Examples
    --------
    >>> @mxnet.contrib.text.embedding.register
    ... class MyTextEmbed(mxnet.contrib.text.embedding._TokenEmbedding):
    ...     def __init__(self, pretrained_file_name='my_pretrain_file'):
    ...         pass
    >>> embed = mxnet.contrib.text.embedding.create('MyTextEmbed')
    >>> print(type(embed))
    """

    register_text_embedding = registry.get_register_func(_TokenEmbedding, 'token embedding')
    return register_text_embedding(embedding_cls)


def create(embedding_name, **kwargs):
    """Creates an instance of token embedding.

    Creates a token embedding instance by loading embedding vectors from an externally hosted
    pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid
    `embedding_name` and `pretrained_file_name`, use
    `mxnet.contrib.text.embedding.get_pretrained_file_names()`.

    Parameters
    ----------
    embedding_name : str
        The token embedding name (case-insensitive).

    Returns
    -------
    An instance of `mxnet.contrib.text.embedding._TokenEmbedding`:
        A token embedding instance that loads embedding vectors from an externally hosted
        pre-trained token embedding file.
    """

    create_text_embedding = registry.get_create_func(_TokenEmbedding, 'token embedding')
    return create_text_embedding(embedding_name, **kwargs)


def get_pretrained_file_names(embedding_name=None):
    """Get valid token embedding names and their pre-trained file names.

    To load token embedding vectors from an externally hosted pre-trained token embedding file,
    such as those of GloVe and FastText, one should use
    `mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name)`. This method
    returns all the valid names of `pretrained_file_name` for the specified `embedding_name`. If
    `embedding_name` is set to None, this method returns all the valid names of `embedding_name`
    with their associated `pretrained_file_name`.

    Parameters
    ----------
    embedding_name : str or None, default None
        The pre-trained token embedding name.

    Returns
    -------
    dict or list:
        A list of all the valid pre-trained token embedding file names
        (`pretrained_file_name`) for the specified token embedding name (`embedding_name`). If
        the token embedding name is set to None, returns a dict mapping each valid token
        embedding name to a list of valid pre-trained files (`pretrained_file_name`). They can be
        plugged into `mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name)`.
    """

    text_embedding_reg = registry.get_registry(_TokenEmbedding)

    if embedding_name is not None:
        if embedding_name not in text_embedding_reg:
            raise KeyError('Cannot find `embedding_name` %s. Use '
                           '`get_pretrained_file_names(embedding_name=None).keys()` to get all '
                           'the valid embedding names.' % embedding_name)
        return list(text_embedding_reg[embedding_name].pretrained_file_name_sha1.keys())
    else:
        return {embedding_name: list(embedding_cls.pretrained_file_name_sha1.keys())
                for embedding_name, embedding_cls in
                registry.get_registry(_TokenEmbedding).items()}

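# A minimal usage sketch of the two entry points above. The pre-trained file name
# 'glove.6B.50d.txt' is an illustrative choice; query get_pretrained_file_names() for the names
# actually registered in your build, and note that create() downloads the file on first use.
#
# >>> from mxnet.contrib.text import embedding
# >>> sorted(embedding.get_pretrained_file_names().keys())
# ['fasttext', 'glove']
# >>> embedding.get_pretrained_file_names('glove')[:1]
# ['glove.42B.300d.txt']
# >>> glove = embedding.create('glove', pretrained_file_name='glove.6B.50d.txt')
# >>> glove.vec_len
# 50
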
class _TokenEmbedding(vocab.Vocabulary):
    """Token embedding base class.

    To load token embeddings from an externally hosted pre-trained token embedding file, such as
    those of GloVe and FastText, use
    :func:`~mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name)`. To get
    all the available `embedding_name` and `pretrained_file_name`, use
    :func:`~mxnet.contrib.text.embedding.get_pretrained_file_names()`.

    Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use
    :class:`~mxnet.contrib.text.embedding.CustomEmbedding`.

    Moreover, to load composite embedding vectors, such as to concatenate embedding vectors, use
    :class:`~mxnet.contrib.text.embedding.CompositeEmbedding`.

    For every unknown token, if its representation `self.unknown_token` is encountered in the
    pre-trained token embedding file, index 0 of `self.idx_to_vec` maps to the pre-trained token
    embedding vector loaded from the file; otherwise, index 0 of `self.idx_to_vec` maps to the
    token embedding vector initialized by `init_unknown_vec`.

    If a token is encountered multiple times in the pre-trained token embedding file, only the
    first-encountered token embedding vector will be loaded and the rest will be skipped.

    The indexed tokens in a text token embedding may come from a vocabulary or from the loaded
    embedding vectors. In the former case, only the indexed tokens in a vocabulary are associated
    with the loaded embedding vectors, such as loaded from a pre-trained token embedding file. In
    the latter case, all the tokens from the loaded embedding vectors, such as loaded from a
    pre-trained token embedding file, are taken as the indexed tokens of the embedding.

    Properties
    ----------
    token_to_idx : dict mapping str to int
        A dict mapping each token to its index integer.
    idx_to_token : list of strs
        A list of indexed tokens where the list indices and the token indices are aligned.
    unknown_token : hashable object
        The representation for any unknown token. In other words, any unknown token will be
        indexed as the same representation.
    reserved_tokens : list of strs or None
        A list of reserved tokens that will always be indexed.
    vec_len : int
        The length of the embedding vector for each token.
    idx_to_vec : mxnet.ndarray.NDArray
        For all the indexed tokens in this embedding, this NDArray maps each token's index to an
        embedding vector. The largest valid index maps to the initialized embedding vector for
        every reserved token, such as an unknown_token token and a padding token.
    """

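    # Note on layout: the loaders below reserve row C.UNKNOWN_IDX (0) of `idx_to_vec` for the
    # unknown token before any pre-trained vectors are read. When a vocabulary is supplied, the
    # embedding is then re-indexed so that only the vocabulary's tokens keep vectors.
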
    def __init__(self, **kwargs):
        super(_TokenEmbedding, self).__init__(**kwargs)

    @classmethod
    def _get_download_file_name(cls, pretrained_file_name):
        return pretrained_file_name

    @classmethod
    def _get_pretrained_file_url(cls, pretrained_file_name):
        repo_url = os.environ.get('MXNET_GLUON_REPO', C.APACHE_REPO_URL)
        embedding_cls = cls.__name__.lower()

        url_format = '{repo_url}gluon/embeddings/{cls}/{file_name}'
        return url_format.format(repo_url=repo_url, cls=embedding_cls,
                                 file_name=cls._get_download_file_name(pretrained_file_name))

    @classmethod
    def _get_pretrained_file(cls, embedding_root, pretrained_file_name):
        from ...gluon.utils import check_sha1, download

        embedding_cls = cls.__name__.lower()
        embedding_root = os.path.expanduser(embedding_root)
        url = cls._get_pretrained_file_url(pretrained_file_name)

        embedding_dir = os.path.join(embedding_root, embedding_cls)
        pretrained_file_path = os.path.join(embedding_dir, pretrained_file_name)
        downloaded_file = os.path.basename(url)
        downloaded_file_path = os.path.join(embedding_dir, downloaded_file)

        expected_file_hash = cls.pretrained_file_name_sha1[pretrained_file_name]

        if hasattr(cls, 'pretrained_archive_name_sha1'):
            expected_downloaded_hash = cls.pretrained_archive_name_sha1[downloaded_file]
        else:
            expected_downloaded_hash = expected_file_hash

        # Download and extract the archive only if the target file is absent or fails the SHA-1
        # check.
        if not os.path.exists(pretrained_file_path) \
           or not check_sha1(pretrained_file_path, expected_file_hash):
            download(url, downloaded_file_path, sha1_hash=expected_downloaded_hash)

            ext = os.path.splitext(downloaded_file)[1]
            if ext == '.zip':
                with zipfile.ZipFile(downloaded_file_path, 'r') as zf:
                    zf.extractall(embedding_dir)
            elif ext == '.gz':
                with tarfile.open(downloaded_file_path, 'r:gz') as tar:
                    tar.extractall(path=embedding_dir)
        return pretrained_file_path

    def _load_embedding(self, pretrained_file_path, elem_delim, init_unknown_vec,
                        encoding='utf8'):
        """Load embedding vectors from the pre-trained token embedding file."""

        pretrained_file_path = os.path.expanduser(pretrained_file_path)

        if not os.path.isfile(pretrained_file_path):
            raise ValueError('`pretrained_file_path` must be a valid path to the pre-trained '
                             'token embedding file.')

        logging.info('Loading pre-trained token embedding vectors from %s',
                     pretrained_file_path)
        vec_len = None
        all_elems = []
        tokens = set()
        loaded_unknown_vec = None
        line_num = 0
        with io.open(pretrained_file_path, 'r', encoding=encoding) as f:
            for line in f:
                line_num += 1
                elems = line.rstrip().split(elem_delim)

                assert len(elems) > 1, \
                    'At line %d of the pre-trained token embedding file: the data format of the ' \
                    'pre-trained token embedding file %s is unexpected.' \
                    % (line_num, pretrained_file_path)

                token, elems = elems[0], [float(i) for i in elems[1:]]

                if token == self.unknown_token and loaded_unknown_vec is None:
                    loaded_unknown_vec = elems
                    tokens.add(self.unknown_token)
                elif token in tokens:
                    warnings.warn('At line %d of the pre-trained token embedding file: the '
                                  'embedding vector for token %s has been loaded and a duplicate '
                                  'embedding for the same token is seen and skipped.'
                                  % (line_num, token))
                elif len(elems) == 1:
                    warnings.warn('At line %d of the pre-trained token embedding file: token %s '
                                  'with 1-dimensional vector %s is likely a header and is '
                                  'skipped.' % (line_num, token, elems))
                else:
                    if vec_len is None:
                        vec_len = len(elems)
                        # Reserve a vector slot for the unknown token at the very beginning,
                        # because the unknown token index is 0.
                        all_elems.extend([0] * vec_len)
                    else:
                        assert len(elems) == vec_len, \
                            'At line %d of the pre-trained token embedding file: the dimension ' \
                            'of token %s is %d but the dimension of previous tokens is %d. ' \
                            'Dimensions of all the tokens must be the same.' \
                            % (line_num, token, len(elems), vec_len)
                    all_elems.extend(elems)
                    self._idx_to_token.append(token)
                    self._token_to_idx[token] = len(self._idx_to_token) - 1
                    tokens.add(token)

        self._vec_len = vec_len
        self._idx_to_vec = nd.array(all_elems).reshape((-1, self.vec_len))

        if loaded_unknown_vec is None:
            self._idx_to_vec[C.UNKNOWN_IDX] = init_unknown_vec(shape=self.vec_len)
        else:
            self._idx_to_vec[C.UNKNOWN_IDX] = nd.array(loaded_unknown_vec)

    def _index_tokens_from_vocabulary(self, vocabulary):
        self._token_to_idx = vocabulary.token_to_idx.copy() \
            if vocabulary.token_to_idx is not None else None
        self._idx_to_token = vocabulary.idx_to_token[:] \
            if vocabulary.idx_to_token is not None else None
        self._unknown_token = vocabulary.unknown_token
        self._reserved_tokens = vocabulary.reserved_tokens[:] \
            if vocabulary.reserved_tokens is not None else None

    def _set_idx_to_vec_by_embeddings(self, token_embeddings, vocab_len, vocab_idx_to_token):
        """Sets the mapping between token indices and token embedding vectors."""

        new_vec_len = sum(embed.vec_len for embed in token_embeddings)
        new_idx_to_vec = nd.zeros(shape=(vocab_len, new_vec_len))

        col_start = 0
        # Concatenate all the embedding vectors in token_embeddings.
        for embed in token_embeddings:
            col_end = col_start + embed.vec_len
            # Concatenate vectors of the unknown token.
            new_idx_to_vec[0, col_start:col_end] = embed.idx_to_vec[0]
            new_idx_to_vec[1:, col_start:col_end] = \
                embed.get_vecs_by_tokens(vocab_idx_to_token[1:])
            col_start = col_end

        self._vec_len = new_vec_len
        self._idx_to_vec = new_idx_to_vec

    def _build_embedding_for_vocabulary(self, vocabulary):
        if vocabulary is not None:
            assert isinstance(vocabulary, vocab.Vocabulary), \
                'The argument `vocabulary` must be an instance of ' \
                'mxnet.contrib.text.vocab.Vocabulary.'

            # Set _idx_to_vec so that indices of tokens from vocabulary are associated with the
            # loaded token embedding vectors.
            self._set_idx_to_vec_by_embeddings([self], len(vocabulary), vocabulary.idx_to_token)

            # Index tokens from vocabulary.
            self._index_tokens_from_vocabulary(vocabulary)

    @property
    def vec_len(self):
        return self._vec_len

    @property
    def idx_to_vec(self):
        return self._idx_to_vec

    def get_vecs_by_tokens(self, tokens, lower_case_backup=False):
        """Look up embedding vectors of tokens.

        Parameters
        ----------
        tokens : str or list of strs
            A token or a list of tokens.
        lower_case_backup : bool, default False
            If False, each token in the original case will be looked up; if True, each token in
            the original case will be looked up first, if not found in the keys of the property
            `token_to_idx`, the token in the lower case will be looked up.

        Returns
        -------
        mxnet.ndarray.NDArray:
            The embedding vector(s) of the token(s). According to numpy conventions, if `tokens`
            is a string, returns a 1-D NDArray of shape `self.vec_len`; if `tokens` is a list of
            strings, returns a 2-D NDArray of shape=(len(tokens), self.vec_len).
        """

        to_reduce = False
        if not isinstance(tokens, list):
            tokens = [tokens]
            to_reduce = True

        if not lower_case_backup:
            indices = [self.token_to_idx.get(token, C.UNKNOWN_IDX) for token in tokens]
        else:
            indices = [self.token_to_idx[token] if token in self.token_to_idx
                       else self.token_to_idx.get(token.lower(), C.UNKNOWN_IDX)
                       for token in tokens]

        vecs = nd.Embedding(nd.array(indices), self.idx_to_vec, self.idx_to_vec.shape[0],
                            self.idx_to_vec.shape[1])

        return vecs[0] if to_reduce else vecs

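    # A lookup sketch, assuming `glove` was built with the default 300-d GloVe file: a single
    # str input yields a 1-D NDArray, a list yields a 2-D NDArray, and any token missing from
    # `token_to_idx` falls back to the vector at index C.UNKNOWN_IDX.
    #
    # >>> glove.get_vecs_by_tokens('hello').shape
    # (300,)
    # >>> glove.get_vecs_by_tokens(['hello', 'world']).shape
    # (2, 300)
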
    def update_token_vectors(self, tokens, new_vectors):
        """Updates embedding vectors for tokens.

        Parameters
        ----------
        tokens : str or a list of strs
            A token or a list of tokens whose embedding vector are to be updated.
        new_vectors : mxnet.ndarray.NDArray
            An NDArray to be assigned to the embedding vectors of `tokens`. Its length must be
            equal to the number of `tokens` and its width must be equal to the dimension of
            embeddings of the glossary. If `tokens` is a singleton, it must be 1-D or 2-D. If
            `tokens` is a list of multiple strings, it must be 2-D.
        """

        assert self.idx_to_vec is not None, \
            'The property `idx_to_vec` has not been properly set.'

        if not isinstance(tokens, list) or len(tokens) == 1:
            assert isinstance(new_vectors, nd.NDArray) and len(new_vectors.shape) in [1, 2], \
                '`new_vectors` must be a 1-D or 2-D NDArray if `tokens` is a singleton.'
            if not isinstance(tokens, list):
                tokens = [tokens]
            if len(new_vectors.shape) == 1:
                new_vectors = new_vectors.expand_dims(0)
        else:
            assert isinstance(new_vectors, nd.NDArray) and len(new_vectors.shape) == 2, \
                '`new_vectors` must be a 2-D NDArray if `tokens` is a list of multiple strings.'
        assert new_vectors.shape == (len(tokens), self.vec_len), \
            'The length of `new_vectors` must be equal to the number of tokens and the width ' \
            'of `new_vectors` must be equal to the dimension of embeddings of the glossary.'

        indices = []
        for token in tokens:
            if token in self.token_to_idx:
                indices.append(self.token_to_idx[token])
            else:
                raise ValueError('Token %s is unknown. To update the embedding vector for an '
                                 'unknown token, please specify it explicitly as the '
                                 '`unknown_token` %s in `tokens`. This is to avoid unintended '
                                 'updates.' % (token, self.idx_to_token[C.UNKNOWN_IDX]))

        self._idx_to_vec[nd.array(indices)] = new_vectors

    @classmethod
    def _check_pretrained_file_names(cls, pretrained_file_name):
        """Checks if a pre-trained token embedding file name is valid.

        Parameters
        ----------
        pretrained_file_name : str
            The pre-trained token embedding file.
        """

        embedding_name = cls.__name__.lower()
        if pretrained_file_name not in cls.pretrained_file_name_sha1:
            raise KeyError('Cannot find pretrained file %s for token embedding %s. Valid '
                           'pretrained files for embedding %s: %s' %
                           (pretrained_file_name, embedding_name, embedding_name,
                            ', '.join(cls.pretrained_file_name_sha1.keys())))


@register
class GloVe(_TokenEmbedding):
    """The GloVe word embedding.

    GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
    Training is performed on aggregated global word-word co-occurrence statistics from a corpus,
    and the resulting representations showcase interesting linear substructures of the word
    vector space. (Source from https://nlp.stanford.edu/projects/glove/)

    Reference:

    GloVe: Global Vectors for Word Representation.
    Jeffrey Pennington, Richard Socher, and Christopher D. Manning.
    https://nlp.stanford.edu/pubs/glove.pdf

    Website:

    https://nlp.stanford.edu/projects/glove/

    To get the updated URLs to the externally hosted pre-trained token embedding files, visit
    https://nlp.stanford.edu/projects/glove/

    License for pre-trained embeddings:

    https://fedoraproject.org/wiki/Licensing/PDDL

    Parameters
    ----------
    pretrained_file_name : str, default 'glove.840B.300d.txt'
        The name of the pre-trained token embedding file.
    embedding_root : str, default $MXNET_HOME/embeddings
        The root directory for storing embedding-related files.
    init_unknown_vec : callback
        The callback used to initialize the embedding vector for the unknown token.
    vocabulary : :class:`~mxnet.contrib.text.vocab.Vocabulary`, default None
        It contains the tokens to index. Each indexed token will be associated with the loaded
        embedding vectors, such as loaded from a pre-trained token embedding file. If None, all
        the tokens from the loaded embedding vectors, such as loaded from a pre-trained token
        embedding file, will be indexed.

    Properties
    ----------
    token_to_idx : dict mapping str to int
        A dict mapping each token to its index integer.
    idx_to_token : list of strs
        A list of indexed tokens where the list indices and the token indices are aligned.
    unknown_token : hashable object
        The representation for any unknown token. In other words, any unknown token will be
        indexed as the same representation.
    reserved_tokens : list of strs or None
        A list of reserved tokens that will always be indexed.
    vec_len : int
        The length of the embedding vector for each token.
    idx_to_vec : mxnet.ndarray.NDArray
        For all the indexed tokens in this embedding, this NDArray maps each token's index to an
        embedding vector. The largest valid index maps to the initialized embedding vector for
        every reserved token, such as an unknown_token token and a padding token.
    """

    # Map a pre-trained token embedding archive file and its SHA-1 hash.
    pretrained_archive_name_sha1 = C.GLOVE_PRETRAINED_ARCHIVE_SHA1

    # Map a pre-trained token embedding file and its SHA-1 hash.
    pretrained_file_name_sha1 = C.GLOVE_PRETRAINED_FILE_SHA1

    @classmethod
    def _get_download_file_name(cls, pretrained_file_name):
        # Map a pre-trained embedding file to the archive that contains it.
        src_archive = {archive.split('.')[1]: archive for archive in
                       GloVe.pretrained_archive_name_sha1.keys()}
        archive = src_archive[pretrained_file_name.split('.')[1]]
        return archive

    def __init__(self, pretrained_file_name='glove.840B.300d.txt',
                 embedding_root=os.path.join(base.data_dir(), 'embeddings'),
                 init_unknown_vec=nd.zeros, vocabulary=None, **kwargs):
        GloVe._check_pretrained_file_names(pretrained_file_name)

        super(GloVe, self).__init__(**kwargs)
        pretrained_file_path = GloVe._get_pretrained_file(embedding_root, pretrained_file_name)

        self._load_embedding(pretrained_file_path, ' ', init_unknown_vec)

        if vocabulary is not None:
            self._build_embedding_for_vocabulary(vocabulary)

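# A download-and-lookup sketch for GloVe. The first use downloads and extracts a large archive
# under embedding_root; 'glove.6B.50d.txt' is an illustrative file-name choice, validated by
# _check_pretrained_file_names() against C.GLOVE_PRETRAINED_FILE_SHA1.
#
# >>> glove = GloVe(pretrained_file_name='glove.6B.50d.txt')
# >>> glove.get_vecs_by_tokens('computer')[:3]  # first 3 of 50 dimensions
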
@register
class FastText(_TokenEmbedding):
    """The fastText word embedding.

    FastText is an open-source, free, lightweight library that allows users to learn text
    representations and text classifiers. It works on standard, generic hardware. Models can
    later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)

    References:

    Enriching Word Vectors with Subword Information.
    Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov.
    https://arxiv.org/abs/1607.04606

    Bag of Tricks for Efficient Text Classification.
    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov.
    https://arxiv.org/abs/1607.01759

    FastText.zip: Compressing text classification models.
    Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and
    Tomas Mikolov.
    https://arxiv.org/abs/1612.03651

    For 'wiki.multi' embeddings:

    Word Translation Without Parallel Data.
    Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou.
    https://arxiv.org/abs/1710.04087

    Website:

    https://fasttext.cc/

    To get the updated URLs to the externally hosted pre-trained token embedding files, visit
    https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

    License for pre-trained embeddings:

    https://creativecommons.org/licenses/by-sa/3.0/

    Parameters
    ----------
    pretrained_file_name : str, default 'wiki.simple.vec'
        The name of the pre-trained token embedding file.
    embedding_root : str, default $MXNET_HOME/embeddings
        The root directory for storing embedding-related files.
    init_unknown_vec : callback
        The callback used to initialize the embedding vector for the unknown token.
    vocabulary : :class:`~mxnet.contrib.text.vocab.Vocabulary`, default None
        It contains the tokens to index. Each indexed token will be associated with the loaded
        embedding vectors, such as loaded from a pre-trained token embedding file. If None, all
        the tokens from the loaded embedding vectors, such as loaded from a pre-trained token
        embedding file, will be indexed.

    Properties
    ----------
    token_to_idx : dict mapping str to int
        A dict mapping each token to its index integer.
    idx_to_token : list of strs
        A list of indexed tokens where the list indices and the token indices are aligned.
    unknown_token : hashable object
        The representation for any unknown token. In other words, any unknown token will be
        indexed as the same representation.
    reserved_tokens : list of strs or None
        A list of reserved tokens that will always be indexed.
    vec_len : int
        The length of the embedding vector for each token.
    idx_to_vec : mxnet.ndarray.NDArray
        For all the indexed tokens in this embedding, this NDArray maps each token's index to an
        embedding vector. The largest valid index maps to the initialized embedding vector for
        every reserved token, such as an unknown_token token and a padding token.
    """

    # Map a pre-trained token embedding archive file and its SHA-1 hash.
    pretrained_archive_name_sha1 = C.FAST_TEXT_ARCHIVE_SHA1

    # Map a pre-trained token embedding file and its SHA-1 hash.
    pretrained_file_name_sha1 = C.FAST_TEXT_FILE_SHA1

    @classmethod
    def _get_download_file_name(cls, pretrained_file_name):
        # Map a pre-trained embedding file to the .zip archive that contains it.
        return '.'.join(pretrained_file_name.split('.')[:-1]) + '.zip'

    def __init__(self, pretrained_file_name='wiki.simple.vec',
                 embedding_root=os.path.join(base.data_dir(), 'embeddings'),
                 init_unknown_vec=nd.zeros, vocabulary=None, **kwargs):
        FastText._check_pretrained_file_names(pretrained_file_name)

        super(FastText, self).__init__(**kwargs)
        pretrained_file_path = FastText._get_pretrained_file(embedding_root,
                                                             pretrained_file_name)

        self._load_embedding(pretrained_file_path, ' ', init_unknown_vec)

        if vocabulary is not None:
            self._build_embedding_for_vocabulary(vocabulary)


class CustomEmbedding(_TokenEmbedding):
    """User-defined token embedding.

    This is to load embedding vectors from a user-defined pre-trained text embedding file.

    Denote by '[ed]' the argument `elem_delim`. Denote by [v_ij] the j-th element of the token
    embedding vector for [token_i], the expected format of a custom pre-trained token embedding
    file is:

    '[token_1][ed][v_11][ed][v_12][ed]...[ed][v_1k]\\n[token_2][ed][v_21][ed][v_22][ed]...[ed]
    [v_2k]\\n...'

    where k is the length of the embedding vector `vec_len`.

    Parameters
    ----------
    pretrained_file_path : str
        The path to the custom pre-trained token embedding file.
    elem_delim : str, default ' '
        The delimiter for splitting a token and every embedding vector element value on the same
        line of the custom pre-trained token embedding file.
    encoding : str, default 'utf8'
        The encoding scheme for reading the custom pre-trained token embedding file.
    init_unknown_vec : callback
        The callback used to initialize the embedding vector for the unknown token.
    vocabulary : :class:`~mxnet.contrib.text.vocab.Vocabulary`, default None
        It contains the tokens to index. Each indexed token will be associated with the loaded
        embedding vectors, such as loaded from a pre-trained token embedding file. If None, all
        the tokens from the loaded embedding vectors, such as loaded from a pre-trained token
        embedding file, will be indexed.

    Properties
    ----------
    token_to_idx : dict mapping str to int
        A dict mapping each token to its index integer.
    idx_to_token : list of strs
        A list of indexed tokens where the list indices and the token indices are aligned.
    unknown_token : hashable object
        The representation for any unknown token. In other words, any unknown token will be
        indexed as the same representation.
    reserved_tokens : list of strs or None
        A list of reserved tokens that will always be indexed.
    vec_len : int
        The length of the embedding vector for each token.
    idx_to_vec : mxnet.ndarray.NDArray
        For all the indexed tokens in this embedding, this NDArray maps each token's index to an
        embedding vector. The largest valid index maps to the initialized embedding vector for
        every reserved token, such as an unknown_token token and a padding token.
    """

    def __init__(self, pretrained_file_path, elem_delim=' ', encoding='utf8',
                 init_unknown_vec=nd.zeros, vocabulary=None, **kwargs):
        super(CustomEmbedding, self).__init__(**kwargs)
        self._load_embedding(pretrained_file_path, elem_delim, init_unknown_vec, encoding)

        if vocabulary is not None:
            self._build_embedding_for_vocabulary(vocabulary)

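# A sketch of the custom file format described above. The path 'my_embedding.txt' and its
# contents are made up for illustration; with elem_delim=' ', each line is a token followed by
# its vector elements.
#
# my_embedding.txt:
#   hello 0.1 0.2 0.3
#   world 0.4 0.5 0.6
#
# >>> embed = CustomEmbedding('my_embedding.txt')
# >>> embed.vec_len
# 3
# >>> embed.get_vecs_by_tokens('hello')
# [0.1 0.2 0.3]
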
class CompositeEmbedding(_TokenEmbedding):
    """Composite token embeddings.

    For each indexed token in a vocabulary, multiple embedding vectors, such as concatenated
    multiple embedding vectors, will be associated with it. Such embedding vectors can be loaded
    from externally hosted or custom pre-trained token embedding files, such as via token
    embedding instances.

    Parameters
    ----------
    vocabulary : :class:`~mxnet.contrib.text.vocab.Vocabulary`
        For each indexed token in a vocabulary, multiple embedding vectors, such as concatenated
        multiple embedding vectors, will be associated with it.
    token_embeddings : instance or list of `mxnet.contrib.text.embedding._TokenEmbedding`
        One or multiple pre-trained token embeddings to load. If it is a list of multiple
        embeddings, these embedding vectors will be concatenated for each token.

    Properties
    ----------
    token_to_idx : dict mapping str to int
        A dict mapping each token to its index integer.
    idx_to_token : list of strs
        A list of indexed tokens where the list indices and the token indices are aligned.
    unknown_token : hashable object
        The representation for any unknown token. In other words, any unknown token will be
        indexed as the same representation.
    reserved_tokens : list of strs or None
        A list of reserved tokens that will always be indexed.
    vec_len : int
        The length of the embedding vector for each token.
    idx_to_vec : mxnet.ndarray.NDArray
        For all the indexed tokens in this embedding, this NDArray maps each token's index to an
        embedding vector. The largest valid index maps to the initialized embedding vector for
        every reserved token, such as an unknown_token token and a padding token.
    """

    def __init__(self, vocabulary, token_embeddings):

        # Sanity checks.
        assert isinstance(vocabulary, vocab.Vocabulary), \
            'The argument `vocabulary` must be an instance of ' \
            'mxnet.contrib.text.vocab.Vocabulary.'

        if not isinstance(token_embeddings, list):
            token_embeddings = [token_embeddings]

        for embed in token_embeddings:
            assert isinstance(embed, _TokenEmbedding), \
                'The argument `token_embeddings` must be an instance or a list of instances ' \
                'of `mxnet.contrib.text.embedding._TokenEmbedding` whose embedding vectors ' \
                'will be loaded or concatenated-then-loaded to map to the indexed tokens.'

        # Index tokens from vocabulary.
        self._index_tokens_from_vocabulary(vocabulary)

        # Set _idx_to_vec so that indices of tokens from the vocabulary are associated with
        # token embedding vectors from `token_embeddings`.
        self._set_idx_to_vec_by_embeddings(token_embeddings, len(self), self.idx_to_token)
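
# A concatenation sketch: a composite embedding whose per-token vector is the GloVe vector
# followed by the fastText vector. The counter and file names are illustrative; the composite
# vec_len is the sum of the component vec_lens (50 + 300 here, since 'wiki.simple.vec' is
# 300-dimensional).
#
# >>> import collections
# >>> from mxnet.contrib.text import vocab as text_vocab
# >>> my_vocab = text_vocab.Vocabulary(collections.Counter(['hello', 'world', 'hello']))
# >>> glove = GloVe(pretrained_file_name='glove.6B.50d.txt')
# >>> fasttext = FastText(pretrained_file_name='wiki.simple.vec')
# >>> composite = CompositeEmbedding(my_vocab, [glove, fasttext])
# >>> composite.vec_len
# 350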