Making Predictions¶
The most important functionality of poetic is its interface for making predictions with
pretrained keras models. The Predictor class provides a simple-to-use, one-stop solution
for predicting how poetic any given input is and, if the default models are used, how much
it resembles 18th- and 19th-century poetry.
This page documents the usage of the Predictor class along with common topics and examples.
Initialization¶
The Predictor class must be instantiated before use because it has no utility functions
or class methods. To create an instance, simply call poetic.Predictor(), which is
available at the package level:
import poetic
pred = poetic.Predictor()
In this example, all default parameters are used, and the default lexical model and dictionary
will be loaded. Starting in v1.1.0, the Predictor can also be initialized with a custom
model and dictionary, loaded through either the keras and gensim APIs or poetic’s Initializer class:
import poetic
model = poetic.util.Initializer.load_model(model_path="<PATH>", weights_path="<PATH>")
dictionary = poetic.util.Initializer.load_dict(dictionary_path="<PATH>")
pred = poetic.Predictor(model=model, dictionary=dictionary)
If the default models have not yet been downloaded from the package's GitHub repository, there is an option to bypass the user input prompt:
import poetic
pred = poetic.Predictor(force_download_assets=True)
Once a Predictor object is instantiated, it can be reused to make multiple predictions and to
preprocess different inputs. No method has meaningful side effects: the tokenize() method
does modify the internal _sentences attribute, but that attribute only stores the tokenized
input temporarily and is overwritten by each subsequent operation. A Predictor instance is
therefore fully reusable and safe.
Making Predictions¶
The Predictor allows users to make poetic predictions using either the predict() or the
predict_file() method. All preprocessing steps are handled automatically, so there is no
need to clean inputs manually.
Prediction with Strings¶
To predict a string, use the predict() method of the Predictor instance. The input
string can consist of multiple sentences, which are then tokenized by the preprocessor. The
longest supported sentence (after sentence tokenization) depends on the model input shape,
which the Predictor recognizes automatically. For the default lexical model, the maximum
is 456 tokens, counting both words and punctuation, and a longer sentence will raise an
InputLengthError.
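The length limit can also be checked ahead of time. The following is a minimal sketch, not part of poetic: find_overlong_sentences is a hypothetical helper that takes already-tokenized sentences (a nested list of the kind that tokenize() returns) and reports which ones exceed a limit. The default of 456 is the figure quoted above for the default lexical model; a custom model would need its own value.

```python
def find_overlong_sentences(tokenized_sentences, max_tokens=456):
    """Return (index, length) pairs for sentences longer than max_tokens.

    Hypothetical helper for illustration; it is not part of poetic.
    """
    return [
        (index, len(tokens))
        for index, tokens in enumerate(tokenized_sentences)
        if len(tokens) > max_tokens
    ]


# One short sentence and one artificially long one.
sentences = [["Hi", "."], ["word"] * 500]
print(find_overlong_sentences(sentences))  # [(1, 500)]
```

An empty result means every sentence fits within the limit used for the check.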
As an example of string prediction:
import poetic
pred = poetic.Predictor()
result = pred.predict("Hi. I am poetic. Are you?")
The predict() method returns a Predictions object, which in turn supports post-processing,
such as running diagnostics and saving results to a file.
Prediction with Text Files¶
Plain text files are also supported. To load a file and predict its contents, use the
predict_file() method; preprocessing and the returned object work exactly as they do for
the predict() method.
Under the hood, it reads the whole file into a single string and then calls the predict()
method. For large files that could exceed system RAM, it is better to load the file
manually and make predictions on its contents piece by piece.
import poetic
pred = poetic.Predictor()
result = pred.predict_file("<PATH>")
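For files too large to load at once, one option is to read the text in pieces. The sketch below is an assumption about how that could be done, not a feature of poetic: iter_text_chunks is a hypothetical generator, and the max_chars default is an arbitrary illustrative value. It cuts only at line breaks, so a sentence that continues across a line break may end up split between two chunks.

```python
def iter_text_chunks(path, max_chars=1_000_000, encoding="utf-8"):
    """Yield pieces of a text file, each roughly max_chars long.

    Chunks always end on a line boundary. Hypothetical helper for
    illustration; it is not part of poetic.
    """
    buffer = []
    size = 0
    with open(path, encoding=encoding) as handle:
        for line in handle:
            buffer.append(line)
            size += len(line)
            if size >= max_chars:
                # Emit the accumulated lines and start a new chunk.
                yield "".join(buffer)
                buffer, size = [], 0
    if buffer:
        # Emit whatever is left after the last full chunk.
        yield "".join(buffer)
```

Each chunk can then be passed to pred.predict() in a loop, with the returned Predictions object handled before the next chunk is read.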
Preprocessing¶
The preprocessing toolchain consists of the following steps: tokenization, word ID conversion, lower-case conversion, and padding. The latter two steps are primarily for keras models, while tokenization can apply to other NLP workflows. This section documents some of the details and their supported usage.
One-step Preprocessing¶
To preprocess the input for the default model of poetic:
import poetic
pred = poetic.Predictor()
model_input = pred.preprocess("This is poetic. Isn't it?")
The preprocess() method returns a 2-d numpy array of word IDs that can be passed
directly to the keras model’s predict() method. However, the Predictor’s
predict() method does not accept preprocessed input: only raw string input
is supported.
Tokenization¶
Tokenization is the process of separating a string input into tokens, which are the units
of text that the algorithms support. The Predictor performs tokenization in two steps,
using NLTK’s sent_tokenize() and word_tokenize() functions respectively: first,
the string, regardless of length, is tokenized into complete sentences; then, each
sentence is tokenized into words and punctuation.
The tokenize() method can be used on its own, although, for compatibility with the
Predictions class, it is an instance method rather than a proper classmethod.
As an example:
import poetic
pred = poetic.Predictor()
tokens = pred.tokenize("This is poetic. Isn't it?")
The output will be a 2-d nested list in the following format:
[['This', 'is', 'poetic', '.'], ['Is', "n't", 'it', '?']]
Padding¶
Padding is part of the preprocess() method, and it cannot be called separately.
It pads each tokenized input in accordance with the input shape of the keras model
supplied or loaded by default during instantiation. The default lexical model pads
to the length of 456. All custom keras models with the general input shape of
(None, int) and an embedding layer are fully supported.
Under the hood, the tf.keras.preprocessing.sequence.pad_sequences() function is called,
and its default pre-padding is used. Given that the default lexical model uses an LSTM
architecture, pre-padding is a sensible choice because it places the actual tokens at the
end of the sequence, next to the final LSTM state. Currently, there is no support for
other types of padding.
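To make pre-padding concrete, here is a pure-Python illustration. It is not the keras implementation: pre_pad is a hypothetical stand-in that imitates the default settings of pad_sequences() (padding on the left with 0), and the tiny maxlen is chosen only for readability.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence of word IDs to exactly maxlen entries.

    Pure-Python illustration of pre-padding; not the keras implementation.
    Sequences longer than maxlen keep only their last maxlen entries.
    maxlen is expected to be a positive integer.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pre_pad([[5, 3, 9], [7]], maxlen=5))
# [[0, 0, 5, 3, 9], [0, 0, 0, 0, 7]]
```

With the default lexical model, the same idea applies with a length of 456 instead of 5.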
Word IDs¶
All tokens (mostly words, contractions, and punctuation after tokenization) are converted
into word IDs, which are all positive integers. By default, the gensim dictionary shipped
with the package is used.
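The conversion itself amounts to a lookup. The sketch below uses a plain Python dict in place of a gensim dictionary, so it runs without any downloads. The names tokens_to_ids and toy_vocab are hypothetical, lower-casing before the lookup is an assumption about step order, and mapping unknown tokens to 0 is an assumption made only for this sketch; this page does not state how poetic treats tokens missing from its dictionary.

```python
def tokens_to_ids(tokenized_sentences, token_to_id, unknown_id=0):
    """Map each token to an integer ID using a plain dict.

    token_to_id stands in for a real dictionary object. Lower-casing
    before the lookup and the unknown_id fallback are assumptions of
    this sketch, not documented poetic behavior.
    """
    return [
        [token_to_id.get(token.lower(), unknown_id) for token in sentence]
        for sentence in tokenized_sentences
    ]


toy_vocab = {"this": 1, "is": 2, "poetic": 3, ".": 4}
tokenized = [["This", "is", "poetic", "."], ["Is", "n't", "it", "?"]]
print(tokens_to_ids(tokenized, toy_vocab))
# [[1, 2, 3, 4], [2, 0, 0, 0]]
```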
If a custom dictionary is supplied when the Predictor is initialized, using a custom model
as well is recommended, if not necessary, even though the constructor does not enforce it.
Custom dictionaries, which have different word IDs, will likely be incompatible with the
default model because a model is trained with one specific set of word IDs.
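One kind of mismatch can be detected mechanically: an ID outside the range that a model's embedding layer accepts. The check below is a rough sketch under stated assumptions. ids_fit_vocabulary is a hypothetical helper, token_to_id is a plain dict standing in for a dictionary object, and vocabulary_size is a placeholder for the embedding layer's input dimension, which has to be read from the model in use.

```python
def ids_fit_vocabulary(token_to_id, vocabulary_size):
    """Return True if every ID lies in the range [0, vocabulary_size).

    Hypothetical helper for illustration; it is not part of poetic.
    """
    return all(0 <= word_id < vocabulary_size for word_id in token_to_id.values())


print(ids_fit_vocabulary({"a": 1, "b": 2}, vocabulary_size=3))  # True
print(ids_fit_vocabulary({"a": 1, "b": 5}, vocabulary_size=3))  # False
```

Passing this check does not make a dictionary compatible: two dictionaries can share the same ID range and still assign different IDs to the same word, in which case predictions from a model trained on the other dictionary are not meaningful.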