This article demonstrates a modeling example using the tidymodels framework for text classification. Data are downloaded via the gutenbergr package: 5 books written by either Emily Brontë or Charlotte Brontë. The goal is to predict the author given the words in a line, that is, the probability that a line was written by one sister rather than the other. The cleaned books dataset contains one line of text per row.
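As a rough sketch of how the data could be pulled, assuming hypothetical Gutenberg IDs (the actual books used in this analysis may differ):

```r
library(gutenbergr)
library(dplyr)

# Hypothetical Gutenberg IDs: Wuthering Heights (Emily) plus four novels
# by Charlotte; the IDs used in the original analysis may differ.
bronte_ids <- c(768, 1260, 9182, 30486, 1028)

books <- gutenberg_download(bronte_ids, meta_fields = "author") %>%
  filter(text != "") %>%          # drop empty lines
  mutate(line = row_number())     # keep a line id for later widening
```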
To obtain the tidy text structure illustrated in Text Mining with R, I use unnest_tokens() to perform tokenization and remove all the stop words. I also removed characters such as ', 's and whitespace so that the words remain valid column names after widening. It turns out this served as a crude form of stemming too (heathcliff's becomes heathcliff). Then low-frequency words (those whose frequency is less than 0.05% of an author's total word count) are removed. The cutoff may be a little too high if you plot the histogram, but I really need it to save computational effort on my laptop :sweat_smile:.
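A minimal sketch of this cleaning step, assuming the books data frame above with author, line and text columns:

```r
library(tidytext)
library(stringr)

tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  # strip apostrophes, possessive 's and whitespace so the tokens
  # survive as valid column names after widening
  mutate(word = str_remove_all(word, "['’]s?|\\s")) %>%
  filter(word != "") %>%
  # drop words rarer than 0.05% of an author's total word count
  add_count(author, name = "author_total") %>%
  add_count(author, word, name = "word_total") %>%
  filter(word_total / author_total >= 0.0005) %>%
  select(-author_total, -word_total)
```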
Comparing Word Frequency
Before building an actual predictive model, let's do some EDA to see each sister's tendency to use particular words. This will also shed light on what we should expect from the text classification. We will now compare word frequency (proportion) between the two sisters.
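One way the comparison plot could be built, assuming tidy_books from the sketch above and gutenbergr's author labels ("Brontë, Charlotte" and "Brontë, Emily"):

```r
library(ggplot2)
library(tidyr)
library(scales)

word_freq <- tidy_books %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion)

ggplot(word_freq, aes(`Brontë, Charlotte`, `Brontë, Emily`)) +
  geom_abline(lty = 2) +    # dotted line of equal frequency
  geom_text(aes(label = word), check_overlap = TRUE, alpha = 0.6) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format())
```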
Words that fall on or near the dotted line (such as “home”, “head”, and “half”) are used with similar frequency by both sisters. In contrast, words positioned far from the line indicate a preference by one author over the other. For instance, “heathcliff”, “linton”, and “catherine” appear more frequently in one sister’s works compared to the other’s.
What does this plot tell us? Judged only by word frequency, it appears that there are a number of words that are quite characteristic of Emily Brontë (upper left corner). Charlotte, on the other hand, has few representative words (bottom right corner). We will investigate this further in the model.
Modeling
Data Preprocessing
There are 430 features (words) and 32,029 observations in total. Approximately 18% of the responses are 1 (Emily Brontë).
Now it's time to widen our data into an appropriate model structure, similar to a document-term matrix, with one row per line and one column of word counts per word.
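A sketch of the widening step, assuming the line identifier from the download sketch and a 0/1 response I'll call is_emily (my name, not necessarily the original one):

```r
books_wide <- tidy_books %>%
  count(author, line, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0) %>%
  mutate(is_emily = factor(if_else(author == "Brontë, Emily", 1, 0))) %>%
  select(-author, -line)
```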
Train a Penalized Logistic Regression Model
Split the data into a training set and a testing set. Then specify an L1-penalized logistic model, center and scale all predictors, and combine them into a workflow object.
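A minimal sketch of that setup, assuming books_wide and is_emily from above (object names are mine):

```r
library(tidymodels)

set.seed(2021)
books_split <- initial_split(books_wide, strata = is_emily)
books_train <- training(books_split)
books_test  <- testing(books_split)

# L1-penalized (lasso) logistic regression via glmnet
lasso_spec <- logistic_reg(penalty = 0.05, mixture = 1) %>%
  set_engine("glmnet")

# center and scale all predictors
books_rec <- recipe(is_emily ~ ., data = books_train) %>%
  step_normalize(all_predictors())

books_wf <- workflow() %>%
  add_recipe(books_rec) %>%
  add_model(lasso_spec)
```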
initial_fit is a simple fitted regression model without any hyperparameter tuning. By default glmnet fits a path of 100 values of lambda even if I specify lambda = 0.05, so the extracted results aren't all that helpful on their own.
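For reference, a sketch of how such a fit could be produced (not the original code):

```r
initial_fit <- fit(books_wf, data = books_train)

# peek at the underlying glmnet coefficients
initial_fit %>%
  pull_workflow_fit() %>%
  tidy()
```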
We can make predictions with initial_fit anyway, and examine metrics like overall accuracy.
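For example, using the test split defined above:

```r
initial_preds <- predict(initial_fit, books_test) %>%
  bind_cols(books_test %>% select(is_emily))

initial_preds %>%
  accuracy(truth = is_emily, estimate = .pred_class)
```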
How good is our initial model? Nearly 84% of all predictions are correct. This isn't a very satisfactory result, since "Charlotte Brontë" accounts for 81% of the author labels, making our model only slightly better than a classifier that assigns "Charlotte Brontë" to every line anyway.
Tuning lambda
We can figure out an appropriate penalty by tuning the model with resampling. Here we build a set of 10 cross-validation resamples and set levels = 100 to try 100 values of lambda ranging from 0 to 1. The model can then be tuned over the lambda grid with the resamples.
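A sketch of the tuning setup; note that penalty() builds its grid on a log10 scale, so the range below only approximates the 0-to-1 interval described above:

```r
set.seed(2021)
books_folds <- vfold_cv(books_train, v = 10, strata = is_emily)

# mark the penalty for tuning and swap it into the workflow
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

tune_wf <- books_wf %>%
  update_model(tune_spec)

lambda_grid <- grid_regular(penalty(range = c(-5, 0)), levels = 100)

tune_res <- tune_grid(
  tune_wf,
  resamples = books_folds,
  grid      = lambda_grid,
  metrics   = metric_set(roc_auc, accuracy)
)
```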
We can plot the model metrics across different choices of lambda.
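For instance, via the built-in autoplot() method for tuning results:

```r
autoplot(tune_res)
```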
OK, both metrics display a monotone decrease as lambda increases, but they do not change much once lambda is greater than 0.1, at which point the model is essentially guessing at random according to each author's proportion in the data. This plot shows that the model is generally better at small penalties, suggesting that the majority of the predictors are fairly important to the model. We may still lean towards a larger penalty with slightly worse performance, because it leads to a simpler model. It follows that we may want to choose a lambda from the top rows of the following data frame.
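One way such a data frame could be produced (an assumption about how the original table was built):

```r
tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
```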
select_best() will return the 9th row with lambda = 0.000586 for its highest performance on roc_auc. But I'll stick to the parsimony principle and pick $\lambda \approx 0.00376$, at the cost of a fall in roc_auc by 0.005 and in accuracy by 0.001.
Now the model specification in the workflow is filled with the picked lambda:
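A sketch of how that could be done (the exact selection code is an assumption):

```r
# select_best() would pick the penalty with the highest roc_auc ...
best_auc <- select_best(tune_res, metric = "roc_auc")

# ... but pin the more parsimonious value discussed above instead
chosen_lambda <- tibble(penalty = 0.00376)

final_wf <- finalize_workflow(tune_wf, chosen_lambda)
final_wf
```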
The next thing is to fit the best model on the training set and evaluate it against the test set.
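last_fit() does both steps in one call, using the original split (a sketch with my object names):

```r
final_res <- last_fit(
  final_wf,
  books_split,
  metrics = metric_set(accuracy, roc_auc)
)

collect_metrics(final_res)
```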
The accuracy of our logistic model rises by roughly 9 percentage points to 93.8%, with roc_auc at nearly 0.904. This is pretty good!
There is also the confusion matrix to check. The model did well in identifying Charlotte Brontë (high sensitivity, few of her lines misclassified), yet it mistakenly labels 39% of Emily Brontë's lines as Charlotte Brontë, i.e. a high false positive rate and low specificity. In part, this is due to class imbalance (four out of five books were written by Charlotte).
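For example:

```r
final_res %>%
  collect_predictions() %>%
  conf_mat(truth = is_emily, estimate = .pred_class)
```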
To examine the effect of the predictors, we again use fit() and pull_workflow_fit() to extract the model fit. Variable importance plots implemented in the vip package provide an intuitive way to visualize the importance of predictors in this scenario, using the absolute value of the coefficients (on the centered and scaled predictors) as the measure of VI.
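A sketch of that plot, assuming the objects defined above and that positive glmnet coefficients push predictions toward Emily Brontë (the second factor level):

```r
library(vip)
library(forcats)

final_fit <- fit(final_wf, data = books_train)

final_fit %>%
  pull_workflow_fit() %>%
  vi(lambda = 0.00376) %>%
  mutate(
    Importance = abs(Importance),
    # POS coefficients assumed to favour Emily, NEG to favour Charlotte
    Sign = if_else(Sign == "POS", "Emily Brontë", "Charlotte Brontë"),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  group_by(Sign) %>%
  slice_max(Importance, n = 15, with_ties = FALSE) %>%
  ungroup() %>%
  ggplot(aes(Importance, Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)
```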
Is it cheating to use character names to classify authors? Perhaps I should consider including more books and removing names for text classification next time.
Note that variable importance in the left panel is generally smaller than in the right panel; this corresponds to what we found in the word frequency plot, namely that Emily Brontë has more, and stronger, characteristic words.