Gensim Dictionary¶
The Predictor class uses a gensim dictionary to convert words (also called “tokens”) into IDs
for the keras model’s embedding layer. Each word has a unique numeric ID, which allows the model to
learn deep embeddings that capture lexical context.
This section of the documentation covers how poetic uses gensim dictionaries, including the default
and custom dictionary options, along with examples.
Default Dictionary¶
The poetic package ships with a default dictionary in both pip and conda distributions.
Given its relatively manageable size, it is included as package data with each release, and there
is no need to download the dictionary separately, unlike the default keras model.
The dictionary is constructed using the entire corpus of 18th- and 19th-century literary works on
Project Gutenberg, as opposed to the random sample used for training. It should cover a wide range
of the vocabulary encountered in both the literature of the time and everyday usage, although the
newest words and slang may be missing. In the Predictor class, all words not found in the dictionary
are assigned the value 0 for consistency.
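The fallback to 0 can be illustrated with a minimal sketch. The `token2id` mapping, its toy IDs, and the `tokens_to_ids` helper are hypothetical stand-ins for illustration only; they are not the Predictor's actual internals.

```python
# Toy word-to-ID mapping; a real one would come from the loaded dictionary.
token2id = {"the": 12, "raven": 845, "nevermore": 9021}


def tokens_to_ids(tokens, mapping, unknown_id=0):
    """Map each token to its ID, using unknown_id for words not in mapping."""
    return [mapping.get(token, unknown_id) for token in tokens]


# "yeet" is not in the toy mapping, so it falls back to 0.
ids = tokens_to_ids(["the", "raven", "yeet"], token2id)
print(ids)
```

Because every unknown word shares the same ID, the model cannot tell such words apart from one another.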
Format¶
The dictionary is saved with the save_as_text() method of the
gensim.corpora.dictionary.Dictionary class. The file has the following format, which is
also documented here:
76242402
440 ! 4922258
36666 # 12419
17501 $ 23781
142 ' 2078602
174 '' 5630856
The first line is a single count (in gensim's format, the number of documents processed), and each following line has three fields separated by tabs: ID, word, and document frequency.
Only dictionaries in this format are supported at this time, because a tab-separated text file
can be read without the gensim dependency, which gives the best future support for custom classes. If a
custom dictionary is saved with the save() method, it needs to be loaded manually as documented
below.
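As a rough illustration of why the text format can be read without gensim, the sketch below parses it with the standard library only. The `parse_dictionary_text` helper is hypothetical and is not part of poetic or gensim; it assumes one count on the first line followed by tab-separated ID, word, and document frequency fields.

```python
import io


def parse_dictionary_text(lines):
    """Parse lines in the save_as_text() layout.

    Returns the count from the first line, a word-to-ID mapping,
    and a word-to-document-frequency mapping.
    """
    iterator = iter(lines)
    header_count = int(next(iterator).strip())
    token2id = {}
    doc_freq = {}
    for line in iterator:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, such as a trailing newline
        word_id, word, freq = line.split("\t")
        token2id[word] = int(word_id)
        doc_freq[word] = int(freq)
    return header_count, token2id, doc_freq


# An in-memory sample stands in for a real dictionary file.
sample = io.StringIO("76242402\n440\t!\t4922258\n36666\t#\t12419\n17501\t$\t23781\n")
header_count, token2id, doc_freq = parse_dictionary_text(sample)
print(header_count, token2id["!"], doc_freq["#"])
```

The same function accepts an open file handle, for example one returned by `open(path, encoding="utf-8")`, since it only iterates over lines.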
Loading¶
The Predictor class loads the default dictionary if the dictionary parameter is
not specified, as in the example below:
import poetic
pred = poetic.Predictor()
Under the hood, the Predictor calls the Initializer class to load the dictionary. Calling it
directly is also a valid way of loading the dictionary independently:
import poetic
dictionary = poetic.util.Initializer.load_dict()
# If a Predictor is to be used:
pred = poetic.Predictor(dictionary=dictionary)
The two snippets above are functionally equivalent as shown, but the latter approach allows the dictionary to be used independently, including access to its own methods, attributes, etc.
The dictionary is stored in the data directory of the package. To access the path of the dictionary, use this snippet:
import pkg_resources
import poetic
data_dir = pkg_resources.resource_filename("poetic", "data/")
dictionary_path = data_dir + "word_dictionary_complete.txt"
Updating¶
The dictionary is updated along with the package itself. However, no update to the gensim dictionary itself is planned on the current roadmap. Should that change, the process will be automated, with no manual downloading, renaming, etc.
Custom Dictionary¶
Both the Predictor and the Initializer classes now fully support custom dictionaries, provided
they are gensim dictionaries saved as text files with the save_as_text() method or files of
the same format. There are two main use cases for a custom dictionary, both similar to the
usage of the default dictionary, as documented below.
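For a file "of the same format" produced without gensim, a minimal sketch might look like the following. The `build_dictionary_text` helper is hypothetical. It assumes, as gensim's own save_as_text() does, that the first line holds the number of documents, and it starts IDs at 1 purely as an illustrative choice to keep 0 free for unknown words. Whether load_dict() accepts such a hand-built file has not been verified here.

```python
from collections import Counter


def build_dictionary_text(documents):
    """Return dictionary text for a list of tokenized documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    # First line: the document count (assumed to match gensim's convention).
    lines = [str(len(documents))]
    # IDs start at 1 here (an assumption) so that 0 stays free for unknown words.
    for word_id, word in enumerate(sorted(doc_freq), start=1):
        lines.append(f"{word_id}\t{word}\t{doc_freq[word]}")
    return "\n".join(lines) + "\n"


documents = [["quoth", "the", "raven"], ["the", "raven", "nevermore"]]
text = build_dictionary_text(documents)
print(text)
```

The resulting string can be written to disk with `pathlib.Path("<PATH>").write_text(text, encoding="utf-8")` before the file path is passed on.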
Loading¶
The load_dict() method of the Initializer class now supports loading a dictionary stored
elsewhere through the dictionary_path parameter:
import poetic
dictionary = poetic.util.Initializer.load_dict(dictionary_path="<PATH>")
Custom Dictionary with Predictor¶
Using a custom dictionary with the Predictor class amounts to combining
the loading snippet with the initialization of a predictor:
import poetic
dictionary = poetic.util.Initializer.load_dict(dictionary_path="<PATH>")
pred = poetic.Predictor(dictionary=dictionary)
One important thing to note is that the dictionary and model parameters of the Predictor’s
constructor are independent of each other: loading a custom model does not require a custom dictionary,
and vice versa. Even so, if a custom model is used, it is recommended, though not required by the
package, to use a custom dictionary as well. A more common way of using a custom model and dictionary
combination looks like this:
import poetic
model = poetic.util.Initializer.load_model(model_path="<PATH>")
dictionary = poetic.util.Initializer.load_dict(dictionary_path="<PATH>")
pred = poetic.Predictor(model=model, dictionary=dictionary)
Dictionary Saved with “save()”¶
Gensim’s gensim.corpora.dictionary.Dictionary class has a save() method that saves
a dictionary in a format that is not compatible with the load_dict() method. Therefore,
gensim needs to be imported to load the dictionary separately:
import poetic
import gensim
dictionary = gensim.corpora.Dictionary.load(fname="<PATH>")
pred = poetic.Predictor(dictionary=dictionary)