Keras Models

poetic relies on machine learning models trained with the tensorflow.keras framework. This page gives an overview of the default models, the package’s infrastructure for loading the models and making predictions, and future model support.


Default Models

The default models are the ones that are “shipped” with the package, and currently, poetic supports only one model: the lexical model. See The Lexical Model section for a detailed explanation of the model’s performance and its background.

Downloading

The trained model and weights combined are very large (~838MB), which makes them impractical to ship with the pip or conda package. To address this issue, the models are hosted as releases in their own GitHub repository, poetic-models. This not only addresses the package size issue but also allows users to decide whether and when to download the models in case of bandwidth limitations.

A manual download of the models is not necessary. Whenever the Initializer class is called to load the default models, it automatically checks for their local presence; if they are missing, the poetic.util.Initializer.download_assets() method is called to fetch the models and place them in the correct directory. There is no need to call the method manually upon first installation or update.

By default, the download_assets() method will ask for user input, with sample output like the following:

The following important assets are missing:

Downloading from: https://github.com/kevin931/poetic-models/releases/download/v0.1-alpha/sent_model.zip
Download size: 835MB.


Would you like to download? [y/n]

If the user denies the download by entering anything other than “Y” or “y”, the program may halt because of the missing model. If there is a need to bypass the command-line input, set force_download_assets=True when initializing the Predictor class or force_download=True for the download_assets() and load_model() methods of the Initializer class. The following demonstrates a few valid ways of force downloading without command-line input:

import poetic

# Approach #1
poetic.Predictor(force_download_assets=True)
# Approach #2
poetic.util.Initializer.download_assets(force_download=True)
# Approach #3
poetic.util.Initializer.load_model(force_download=True)

Loading

In the simplest use case of poetic through the Predictor class, there is no need to manually load the model: the constructor automatically loads the default model if the class is initialized with the following:

import poetic

pred = poetic.Predictor(force_download_assets=True)

However, there are benefits to loading the keras models directly, as doing so exposes the full keras interface. The util module provides a few functions to conveniently load the default model:

  • poetic.util.Initializer.initialize(): This class method returns the command-line arguments, the default keras model, and the gensim dictionary.

  • poetic.util.Initializer.load_model(): This class method returns the default keras model. Both methods are demonstrated below.
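
For instance, a minimal sketch of both approaches, assuming the return values described in the list above:

import poetic

# Load only the default keras model.
model = poetic.util.Initializer.load_model()

# Obtain the command-line arguments, the default keras model,
# and the gensim dictionary in a single call.
args, model, dictionary = poetic.util.Initializer.initialize()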

The advantage of using these two methods is that only one call is necessary to load the model without having to know the data directory. However, to access the paths of the model and the weights themselves, use the following snippet:

import pkg_resources
import poetic

# Locate the package's bundled data directory and the default model files.
data_dir = pkg_resources.resource_filename("poetic", "data/")
weights_path = data_dir + "sent_model.h5"
model_path = data_dir + "sent_model.json"
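
With these paths, the model can also be reconstructed directly through the keras API. The following is only a sketch, assuming the architecture is stored as JSON and the weights as HDF5, as the file names suggest:

import pkg_resources
import tensorflow as tf

data_dir = pkg_resources.resource_filename("poetic", "data/")

# Rebuild the architecture from the JSON definition.
with open(data_dir + "sent_model.json") as f:
    model = tf.keras.models.model_from_json(f.read())

# Restore the trained weights from the HDF5 file.
model.load_weights(data_dir + "sent_model.h5")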

Updating

Currently, model updates are planned to be handled with package updates. As of now, there is no plan to update the existing model, except for changing its name from “sentence” to “lexical” model.

On the roadmap, there are plans to support metrical models as well as combined lexical and metrical models. With the release of such models, the package will be updated with the new model URLs or a new update mechanism.

If a qualitative update occurs, re-downloading the models will likely be necessary, and procedures similar to those for the initial download will be in place.


The Lexical Model

The lexical model is currently the only default model available in poetic. It is trained on 18th- and 19th-century works using their lexical content through word embeddings (i.e. the contents of the works themselves in the form of words).

Essentially, the model is a classifier that determines whether a given input is poetic. More precisely, its output can be interpreted as whether an input resembles eighteenth- and nineteenth-century poetry. This definition is the basis of the “poetic score” used throughout the package and in its main use case.
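
As an illustration of that main use case, the following sketch scores a single input with the default model; the predict() call is an assumption about the Predictor interface, which is not documented on this page:

import poetic

# Hypothetical usage: score how closely an input resembles
# 18th- and 19th-century poetry (the predict() name is assumed).
pred = poetic.Predictor()
score = pred.predict("Shall I compare thee to a summer's day?")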

A quick note on naming: in v1.0, the model is called the “sentence model” and is stored as “sent_model.h5” and “sent_model.json” because all training sets and inputs are sentence tokenized. Since all future models will take the same sentence-tokenized data format even though they are not necessarily lexically based, the model will be renamed to the lexical model to better reflect how it was trained and what it represents.

Training and Validation Data

All training and validation data come from Project Gutenberg. The datasets consist solely of 18th- and 19th-century works separated into two categories: poetry and prose (non-poetry). The rationale for this time period is that works from these two centuries are widely available in the public domain and digitized. Further, it is also a time when formal poetry was still the norm, before the rapid rise of free verse. Thus, this dataset allows the lexical model to train on the most distinguishing features of poetry.

Given the amount of data available on Project Gutenberg, the training and validation sets consist of a random sample of the aforementioned works. Although a different sample or the entire corpus might result in a different model, the amount of data in the sample allows a reasonable assumption that the sample is representative.

Model Architecture

The overall architecture of the lexical model is a bidirectional long short-term memory (LSTM) neural network trained using the keras API of tensorflow. LSTMs are known to work well with lexical data, although their performance has since been surpassed by large language models, such as Google’s BERT.

Below is a high-level overview of the layers used in the model (in sequential order), followed by a keras sketch of the same stack:

Layer            Output Shape
Input            (None, 456)
Embedding        (None, 456, 128)
LSTM             (None, 456, 128)
LSTM Forward     (None, 128)
LSTM Backward    (None, 128)
Concatenate      (None, 256)
Dropout          (None, 256)
Dense            (None, 64)
Dropout          (None, 64)
Dense/Output     (None, 1)
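
The same stack can be written with the tf.keras functional API. The following is only a sketch: the vocabulary size, dropout rates, and activation functions are assumptions chosen for illustration, not the trained model’s actual hyperparameters.

import tensorflow as tf
from tensorflow.keras import layers

# Placeholder vocabulary size; the real value depends on the
# gensim dictionary shipped with the package.
VOCAB_SIZE = 20000

inputs = layers.Input(shape=(456,))                 # (None, 456)
x = layers.Embedding(VOCAB_SIZE, 128)(inputs)       # (None, 456, 128)
x = layers.LSTM(128, return_sequences=True)(x)      # (None, 456, 128)
forward = layers.LSTM(128)(x)                       # (None, 128)
backward = layers.LSTM(128, go_backwards=True)(x)   # (None, 128)
x = layers.Concatenate()([forward, backward])       # (None, 256)
x = layers.Dropout(0.5)(x)                          # (None, 256)
x = layers.Dense(64, activation="relu")(x)          # (None, 64)
x = layers.Dropout(0.5)(x)                          # (None, 64)
outputs = layers.Dense(1, activation="sigmoid")(x)  # (None, 1)

model = tf.keras.Model(inputs, outputs)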

Model Performance

The confusion matrix (rows: predicted class; columns: actual class):

            Prose     Poetry
Prose       129168    42082
Poetry      38230     125316

Classification Diagnostics (poetry as the positive class; see the check after this list):

  • Accuracy: 0.7601

  • Precision: 0.7662

  • Sensitivity: 0.7486
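
These figures follow directly from the confusion matrix above. A minimal check, treating poetry as the positive class and rows as predictions:

# Cells of the confusion matrix above (poetry = positive class).
tp, fp = 125316, 38230   # predicted poetry: actual poetry, actual prose
fn, tn = 42082, 129168   # predicted prose: actual poetry, actual prose

accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.7601
precision = tp / (tp + fp)                   # ~0.7662
sensitivity = tp / (tp + fn)                 # ~0.7486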


Custom Models

There is infrastructure in place for the Predictor class to utilize custom models. However, v1.0.x does not support custom models because they will likely require a different input shape, which the preprocessing pipeline does not support.

Future Updates

Custom keras models with the same input dimension and an embedding layer will be fully supported starting with v1.1.0, which is already in development on the dev branch of poetic. This will also be accompanied by support for custom gensim dictionaries, which are often necessary for different models. Support for other types of models is not planned at this stage of development.