Gensim Dictionary¶
The Predictor class uses a gensim dictionary to convert words (also called “tokens”) into IDs
for the keras model’s embedding layer. Each word has a unique numeric ID, which allows the model to
create deep embeddings that capture lexical context.
This section of the documentation details the use of gensim dictionaries in poetic, covering the
default and custom dictionary options along with examples.
Default Dictionary¶
The poetic package ships with a default dictionary in both its pip and conda distributions.
Given its relatively manageable size, it is included as package data with each release, so there
is no need to download the dictionary separately, unlike the default keras model.
The dictionary is constructed from the entire corpus of 18th- and 19th-century literary works on
Project Gutenberg, as opposed to the random sample used for training. It should cover a wide range
of the vocabulary encountered in both the literature of the time and everyday usage, although the
newest words and slang may be missing. In the Predictor class, all words not found in the
dictionary are assigned the value 0 for consistency.
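The fallback to 0 can be pictured with a plain mapping. This is a minimal sketch of the idea only, not the Predictor's actual implementation: the token2id mapping, its IDs, and the words_to_ids helper are hypothetical names and values made up for illustration.

```python
from typing import Dict, List

# Hypothetical word-to-ID mapping; the IDs are invented for illustration
# and do not come from the default dictionary.
token2id: Dict[str, int] = {"the": 12, "rose": 345, "is": 27, "red": 981}


def words_to_ids(words: List[str], mapping: Dict[str, int], unknown_id: int = 0) -> List[int]:
    """Map each word to its ID, using unknown_id for words not in the mapping."""
    return [mapping.get(word, unknown_id) for word in words]


# "crimson" is not in the mapping, so it falls back to 0.
ids = words_to_ids(["the", "rose", "is", "crimson"], token2id)
print(ids)  # [12, 345, 27, 0]
```

Because every unknown word shares the same ID, the model cannot tell two out-of-vocabulary words apart, which is one reason a broad dictionary matters.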
Format¶
The dictionary is saved with the save_as_text() method of the
gensim.corpora.dictionary.Dictionary class. The file has the following format, which is
also described in the gensim documentation:
76242402
440 ! 4922258
36666 # 12419
17501 $ 23781
142 ' 2078602
174 '' 5630856
The first line is the number of documents processed when building the dictionary (num_docs in gensim), and each following line has three fields separated by tabs: ID, word, and document frequency.
Only dictionaries in this format are supported at this time because a tab-separated text file
can be used without the gensim dependency, which gives the best future support for custom
classes. If a custom dictionary is saved with the save() method, it needs to be loaded manually
as documented below.
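To illustrate why the text format is usable without gensim, here is a minimal sketch that parses it with the standard library only. The read_text_dictionary function is a hypothetical helper, not part of poetic or gensim, and it assumes a well-formed file with exactly three tab-separated fields per line.

```python
import io
from typing import Dict, TextIO, Tuple


def read_text_dictionary(handle: TextIO) -> Tuple[int, Dict[str, int], Dict[int, int]]:
    """Parse a dictionary in the tab-separated text format.

    Returns the integer from the first line, a word -> ID mapping,
    and an ID -> document frequency mapping.
    """
    header = int(handle.readline().strip())
    token2id: Dict[str, int] = {}
    doc_freqs: Dict[int, int] = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        word_id, word, doc_freq = line.split("\t")
        token2id[word] = int(word_id)
        doc_freqs[int(word_id)] = int(doc_freq)
    return header, token2id, doc_freqs


# First rows of the format example above, written with explicit tabs.
sample = "76242402\n440\t!\t4922258\n36666\t#\t12419\n17501\t$\t23781\n"
header, token2id, doc_freqs = read_text_dictionary(io.StringIO(sample))
print(token2id["!"], doc_freqs[440])  # 440 4922258
```

For a file on disk, the same function can be given a handle from open(path, encoding="utf-8") in place of the in-memory sample.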
Loading¶
The Predictor class loads the default dictionary if the dict parameter is not specified, as in
the example below:
import poetic
pred = poetic.Predictor()
One important thing to note is that the dict and model parameters are independent of
each other: loading a custom model does not require a custom dictionary and vice versa. Because
the package does not check that the two match, it is recommended, though not required, to pair a
custom model with a custom dictionary.
Under the hood, the Predictor calls the Initializer class to load the dictionary, which is
also a valid way of loading the dictionary independently:
import poetic
dictionary = poetic.util.Initializer.load_dict()
# If a Predictor is to be used:
pred = poetic.Predictor(dict=dictionary)
The above two snippets are functionally equivalent as shown, but the latter approach allows the dictionary to be used independently, including access to its own methods, attributes, etc.
The dictionary is stored in the data directory of the package. To access the path of the dictionary, use this snippet:
import pkg_resources
import poetic
data_dir = pkg_resources.resource_filename("poetic", "data/")
dictionary_path = data_dir + "word_dictionary_complete.txt"
Updating¶
The dictionary will be updated with the package itself. However, on the current roadmap, there is no update planned for the gensim dictionary itself. Should there be a change, the process will be automated without manual downloading, renaming, etc.
Custom Dictionary¶
The current version, v1.0.x, does not have full support for custom dictionaries, although
the Predictor class does allow a custom dictionary during initialization. Since there is
not yet support for custom models, using a custom dictionary is practically meaningless,
with one exception: pairing it with a custom model that has the same input shape as the default model.
See the “Keras Models” section for a more detailed explanation of the state of custom models.
To use a custom dictionary with the Predictor, the following snippet will work:
import poetic
import gensim
dictionary = gensim.corpora.Dictionary.load_from_text(fname="<PATH>")
pred = poetic.Predictor(dict=dictionary)
To use a dictionary saved with the save() method:
import poetic
import gensim
dictionary = gensim.corpora.Dictionary.load(fname="<PATH>")
pred = poetic.Predictor(dict=dictionary)