σ ω΅Θ[c@@s\dZddlmZddlmZddlZddlmZdefd„ƒYZ dS( sText token indexer.i(tabsolute_import(tprint_functionNi(t _constantst VocabularycB@seZdZd d ddd d„Zd„Zd„Zd„Zed„ƒZ ed„ƒZ ed „ƒZ ed „ƒZ d „Z d „ZRS(sr Indexing for text tokens. Build indices for the unknown token, reserved tokens, and input counter keys. Indexed tokens can be used by token embeddings. Parameters ---------- counter : collections.Counter or None, default None Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as `most_freq_count` and `min_freq`. Keys of `counter`, `unknown_token`, and values of `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple. most_freq_count : None or int, default None The maximum possible number of the most frequent tokens in the keys of `counter` that can be indexed. Note that this argument does not count any token from `reserved_tokens`. Suppose that there are different keys of `counter` whose frequency are the same, if indexing all of them will exceed this argument value, such keys will be indexed one by one according to their __cmp__() order until the frequency threshold is met. If this argument is None or larger than its largest possible value restricted by `counter` and `reserved_tokens`, this argument has no effect. min_freq : int, default 1 The minimum frequency required for a token in the keys of `counter` to be indexed. unknown_token : hashable object, default '<unk>' The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. Keys of `counter`, `unknown_token`, and values of `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple. reserved_tokens : list of hashable objects or None, default None A list of reserved tokens that will always be indexed, such as special symbols representing padding, beginning of sentence, and end of sentence. It cannot contain `unknown_token`, or duplicate reserved tokens. Keys of `counter`, `unknown_token`, and values of `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple. Properties ---------- token_to_idx : dict mapping str to int A dict mapping each token to its index integer. idx_to_token : list of strs A list of indexed tokens where the list indices and the token indices are aligned. unknown_token : hashable object The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. reserved_tokens : list of strs or None A list of reserved tokens that will always be indexed. iscC@s«|dkstdƒ‚|dk rot|ƒ}||ksHtdƒ‚t|ƒt|ƒksotdƒ‚n|j||ƒ|dk r§|j|||||ƒndS(Nis+`min_freq` must be set to a positive value.s0`reserved_token` cannot contain `unknown_token`.s;`reserved_tokens` cannot contain duplicate reserved tokens.(tAssertionErrortNonetsettlent"_index_unknown_and_reserved_tokenst_index_counter_keys(tselftcountertmost_freq_counttmin_freqt unknown_tokentreserved_tokenstreserved_token_set((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyt__init__Os     cC@sg||_|g|_|dkr-d|_n||_|jj|ƒd„t|jƒDƒ|_dS(s$Indexes unknown and reserved tokens.cS@si|]\}}||“qS(((t.0tidxttoken((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pys os N(t_unknown_tokent _idx_to_tokenRt_reserved_tokenstextendt enumeratet _token_to_idx(R RR((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyRbs     c C@s!t|tjƒstdƒ‚|dk r6t|ƒntƒ}|j|ƒt|jƒdd„ƒ}|j dd„dt ƒt |ƒ|dkr‘t |ƒn|}xr|D]j\} } | |ksάt |j ƒ|krΰPn| |kr―|j j | ƒt |j ƒd|j| €scS@s|dS(Ni((R((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyRstreverseiN(t isinstancet collectionstCounterRRRtaddtsortedtitemstsorttTrueRRtappendR( R R RRR R tunknown_and_reserved_tokenst token_freqst token_capRtfreq((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyR qs  !  ! cC@s t|jƒS(N(Rt idx_to_token(R ((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyt__len__scC@s|jS(N(R(R ((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyt token_to_idxscC@s|jS(N(R(R ((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyR,”scC@s|jS(N(R(R ((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyR˜scC@s|jS(N(R(R ((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyRœscC@sqt}t|tƒs'|g}t}ng|D]+}||jkrP|j|ntj^q.}|rm|dS|S(sSConverts tokens to indices according to the vocabulary. Parameters ---------- tokens : str or list of strs A source token or tokens to be converted. Returns ------- int or list of ints A token index or a list of token indices according to the vocabulary. i(tFalseRtlistR&R.tCt UNKNOWN_IDX(R ttokenst to_reduceRtindices((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyt to_indices s  5cC@s¦t}t|tƒs'|g}t}nt|jƒd}g}xQ|D]I}t|tƒ si||kr|td|ƒ‚qG|j|j|ƒqGW|r’|dS|S(sZConverts token indices to tokens according to the vocabulary. Parameters ---------- indices : int or list of ints A source token index or token indices to be converted. Returns ------- str or list of strs A token or a list of tokens according to the vocabulary. is4Token index %d in the provided `indices` is invalid.i( R/RR0R&RR,tintt ValueErrorR'(R R5R4tmax_idxR3R((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyt to_tokensΊs   N(t__name__t __module__t__doc__RRRR R-tpropertyR.R,RRR6R:(((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyRs/      ( R=t __future__RRR tRR1tobjectR(((sX/usr/local/lib/python2.7/site-packages/mxnet-1.3.1-py2.7.egg/mxnet/contrib/text/vocab.pyts