Making Predictions¶
The most important functionality of poetic
is the interface to make predictions using
pretrained keras models. The Predictor
class provides a simple-to-use, one-stop solution
to predict how poetic any given input is, that is, how much it resembles 18th- and 19th-century
poetry when the default models are used.
This page documents the usage of the Predictor
class along with common topics and examples.
Initialization¶
The Predictor
class must be instantiated before use since it provides no utility
functions or class methods. Because it is a package-level class, an instance can be created simply with poetic.Predictor():
import poetic
pred = poetic.Predictor()
In this example, all default parameters are used, and the default lexical model and dictionary
will be loaded. There is not yet official support for custom models and dictionaries. However,
Predictor
does accept previously loaded assets, via either the keras and gensim APIs or poetic's
Initializer
class:
import poetic
model = poetic.util.Initializer.load_model()
dictionary = poetic.util.Initializer.load_dict()
pred = poetic.Predictor(model=model, dict=dictionary)
If the default models have not been downloaded from their GitHub repo, there is an option to bypass the user input prompt:
import poetic
pred = poetic.Predictor(force_download_assets=True)
Once a Predictor
object is instantiated, it can be reused to make multiple predictions and to
preprocess different inputs. No method has meaningful side effects, although the tokenize()
method modifies the internal _sentences
attribute, which temporarily stores the tokenized input and
is overwritten by each subsequent operation. Therefore, a Predictor
instance is fully
reusable and safe.
Making Predictions¶
The Predictor
allows users to make poetic predictions using either the predict()
or the
predict_file()
method. All preprocessing steps are automatically handled without any need
to manually clean inputs.
Prediction with Strings¶
To predict a string, use the predict()
method of the Predictor
instance. The input
string can consist of multiple sentences, which are then tokenized by preprocessor. The longest
supported sentence (after sentence tokenization) is 456 tokens, including words and
punctuations.
As an example of string prediction:
import poetic
pred = poetic.Predictor()
result = pred.predict("Hi. I am poetic. Are you?")
The predict()
method returns a Predictions
object, which in turn supports post-processing, such as running diagnostics and saving results to a file.
Prediction with Text Files¶
Plain text files are also supported. To load and predict a file, use the predict_file()
method; preprocessing and the returned object behave exactly as with the
predict()
method.
Under the hood, it loads the file into a single string and then calls the predict()
method. For large files that can potentially exceed system RAM, it is better to manually
load the files and make predictions.
import poetic
pred = poetic.Predictor()
result = pred.predict_file("<PATH>")
Preprocessing¶
The preprocessing toolchain consists of the following steps: tokenization, word ID conversion, lower-case conversion, and padding. The latter two steps are primarily for keras models, while tokenization can apply to other NLP workflows. This section documents some of the details and their supported usage.
One-step Preprocessing¶
To preprocess the input for the default model of poetic
:
import poetic
pred = poetic.Predictor()
model_input = pred.preprocess("This is poetic. Isn't it?")
The preprocess()
method returns a 2-d numpy array of tokenized word IDs that can
be directly predicted using the keras model's predict()
method. However, the
predictor's predict()
method does not support preprocessed input: only raw
string input is supported.
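The toolchain described above can be sketched in plain Python. The vocabulary, regular expressions, and helper below are illustrative stand-ins, not poetic's actual internals; in particular, NLTK's word_tokenize() would split "Isn't" into "Is" and "n't", unlike this simplified regex:

```python
import re

# Illustrative vocabulary; poetic's real dictionary is a gensim Dictionary.
VOCAB = {"this": 1, "is": 2, "poetic": 3, ".": 4, "isn't": 5, "it": 6, "?": 7}
MAXLEN = 456  # input length of the default lexical model

def preprocess(text):
    # 1. Split into sentences on terminal punctuation (a crude
    #    stand-in for NLTK's sent_tokenize).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sentence in sentences:
        # 2. Lowercase, then split into word/punctuation tokens.
        tokens = re.findall(r"[\w']+|[^\w\s]", sentence.lower())
        # 3. Map tokens to word IDs (0 for out-of-vocabulary tokens here).
        ids = [VOCAB.get(tok, 0) for tok in tokens]
        # 4. Pre-pad with zeros to the model's fixed input length.
        processed.append([0] * (MAXLEN - len(ids)) + ids)
    return processed

model_input = preprocess("This is poetic. Isn't it?")
```

Each row of the result is a fixed-length sequence of word IDs, one per sentence, matching the shape the default model expects.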
Tokenization¶
Tokenization is the process of separating a string input into tokens, which are units
of text that the algorithms operate on. The Predictor
uses NLTK's sent_tokenize()
and word_tokenize()
functions respectively to perform two-step tokenization: first,
the string, regardless of length, is tokenized into complete sentences; then, each
sentence is tokenized into words and punctuation.
The tokenize()
method can be used as a stand-alone function, although it is not a
proper classmethod; this design keeps it compatible with the Predictions
class.
As an example:
import poetic
pred = poetic.Predictor()
model_input = pred.tokenize("This is poetic. Isn't it?")
The output will be a 2-d nested list in the following format:
[['This', 'is', 'poetic', '.'], ['Is', "n't", 'it', '?']]
Padding¶
Padding is part of the preprocess()
method, and it cannot be called separately.
It pads each tokenized input in accordance with the input shape of the default lexical
model used, which is 456. This release does not yet support adjusting the padding
length, which is why custom model support is very limited.
Under the hood, the tf.keras.preprocessing.sequence.pad_sequences()
function is called,
and the default pre-padding is used. Given that the default lexical model uses an LSTM
architecture, the pre-padding strategy is a sensible choice. Currently, there is no support for
other types of padding.
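The behavior of pre-padding can be illustrated with a small pure-Python sketch that mirrors pad_sequences() with its defaults (padding='pre', truncating='pre'); a maxlen of 6 is used here for readability in place of the model's 456:

```python
def pad_pre(sequences, maxlen, value=0):
    """Pre-pad (or left-truncate) each sequence to exactly maxlen,
    mirroring pad_sequences(..., padding='pre', truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncating='pre' keeps the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

batch = pad_pre([[1, 2, 3], [4, 5]], maxlen=6)
```

Pre-padding places the zeros before the real tokens, so by the final timestep the LSTM's recurrent state has been updated by actual content rather than trailing padding.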
Word IDs¶
All tokens (mostly words, contractions, and punctuation after tokenization) are converted
into word IDs, which are all positive int
. By default, the gensim dictionary shipped
with the package is used. However, if a custom dictionary is supplied when initializing
the Predictor
, it will likely be incompatible with the default model because models
are specifically trained with one set of word IDs. Therefore, it is not recommended
to use a custom dictionary.
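Conceptually, the dictionary is just a fixed token-to-integer mapping. The sketch below uses a plain dict as a stand-in for a gensim Dictionary's token2id mapping; the vocabulary and IDs are illustrative, not poetic's actual assets (gensim's own doc2idx() maps out-of-vocabulary tokens to -1 by default):

```python
# Illustrative stand-in for a gensim Dictionary's token2id mapping.
token2id = {"the": 1, "rose": 2, "is": 3, "red": 4, ".": 5}

def to_ids(tokens, unknown_id=0):
    """Map each token to its word ID, using unknown_id for
    out-of-vocabulary tokens."""
    return [token2id.get(tok.lower(), unknown_id) for tok in tokens]

ids = to_ids(["The", "rose", "is", "red", "."])
```

Because a trained model's embedding layer is tied to one specific mapping like this, swapping in a different dictionary scrambles which embedding each token receives, which is why custom dictionaries are discouraged with the default model.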