Jed Rembold
March 12, 2024
Doing this sort of work benefits greatly from a streamlined architecture and common syntax
In Python, the gold standard is Scikit-Learn:
pip install -U scikit-learn
In R, TidyModels is the best similar resource I’ve seen:
install.packages('tidymodels')
Both can take a bit of time to install
To split your data, train_test_split can assist:
from sklearn.model_selection import train_test_split
Include the option test_size=frac, where frac is the fraction of the data you want to set aside for testing
train_test_split shuffles the data before making the splits, so you don't need to worry about that
Usage:
train_df, test_df = train_test_split(df, test_size=0.2)
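A minimal sketch (the column names here are invented for illustration): passing random_state fixes the shuffle so the split is reproducible, much like set.seed() in R:
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny stand-in dataset; real data would come from elsewhere
df = pd.DataFrame({'length': [1.2, 3.4, 2.2, 4.1, 2.8],
                   'width': [0.8, 1.1, 0.9, 1.4, 1.0],
                   'species': ['A', 'B', 'A', 'B', 'A']})

# random_state pins the shuffle, so reruns give the same split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)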
To split your data in R, several functions from rsample (part of tidymodels) are useful
You indicate the proportion of observations you want to use for training
splits <- initial_split(df, prop=0.8)
train_df <- training(splits)
test_df <- testing(splits)
Useful to also set the seed before splitting for reproducibility:
set.seed(num)
The logistic regression model is provided directly from Scikit-Learn:
from sklearn.linear_model import LogisticRegression
You need to initialize a model before you can try to fit anything to it
At its simplest:
model = LogisticRegression()
Note that there are many options that can be provided to the model as further arguments
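For example, a few of the more commonly adjusted arguments (shown here at their default values, so this sketch behaves identically to a plain LogisticRegression()):
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    penalty='l2',   # type of regularization applied to the coefficients
    C=1.0,          # inverse regularization strength (smaller = stronger)
    max_iter=100,   # cap on the number of solver iterations
)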
Tidymodels distinguishes between Logistic Regression (2 categories) and Multinomial Regression (>2 categories)
You still need to initialize a model before you can try to fit anything to it
model <- logistic_reg() # or
model <- multinom_reg()
You can tweak the underlying engine that powers these, but the default is fine
Fitting the model is the act of iteratively improving the fit parameters
You need to provide your model the training data when doing so, both the feature data and the classification labels (this is supervised learning, remember!)
model_fit = model.fit(train_df[feature_cols],  # feature_cols: a list of column names
                      train_df[label_col])     # label_col: a single column name
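Concretely, with the invented column names from the earlier splitting sketch:
feature_cols = ['length', 'width']   # hypothetical list of feature columns
label_col = 'species'                # hypothetical label column

model_fit = model.fit(train_df[feature_cols], train_df[label_col])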
model_fit <- model %>%
fit(label_col ~ feature_cols, data=train_df)
Generally, the first predictions you make should be on the testing data!
test_df['predicted'] = model_fit.predict(test_df[feature_cols])
test_df <- test_df %>%
    bind_cols(model_fit %>% predict(test_df))
Either way, this returns the predicted labels: a NumPy array in Python, or a tibble with a .pred_class column in R
You could construct the confusion matrix manually, but imports can help
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay as CMD
Armed with the predictions, you can construct your confusion matrix
confusion_matrix(test_df[label_col], test_df['predicted'])
which will return the matrix as a NumPy array
Want graphics?
CMD.from_predictions(test_df[label_col], test_df['predicted'])
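A fuller sketch of the graphical version (display_labels is optional and just renames the axis ticks; matplotlib handles the drawing):
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay as CMD

CMD.from_predictions(test_df[label_col], test_df['predicted'],
                     display_labels=['A', 'B'])  # hypothetical class names
plt.show()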
The yardstick library in tidymodels provides the conf_mat function
It provides a special confusion matrix object, which a variety of other functions can then act on
cm <- conf_mat(test_df, label_col, .pred_class)
Printing this will just give a text representation of the matrix
Want graphics?
autoplot(cm, type='heatmap')
Classification models usually internally assign a probability to a point as to what label it should have
It can be useful sometimes to see the predicted probabilities for each point, rather than the final category
model_fit.predict_proba(test_df[feature_cols])
predict(model_fit, test_df, type='prob')
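In Python, predict_proba returns one column of probabilities per class, ordered to match model_fit.classes_. A sketch of labeling those columns for readability:
import pandas as pd

probs = model_fit.predict_proba(test_df[feature_cols])
# Attach the class names so each probability column is identifiable
prob_df = pd.DataFrame(probs, columns=model_fit.classes_, index=test_df.index)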
Need to import:
from sklearn.inspection import DecisionBoundaryDisplay as DBD
Create the plot from the estimator:
DBD.from_estimator(model_fit, df[features])
Note that the boundary can only be drawn for models fit on exactly two features
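A fuller sketch, overlaying the actual observations on the boundary (using the invented columns from the earlier sketches):
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import DecisionBoundaryDisplay as DBD

features = ['length', 'width']  # must be exactly two columns
disp = DBD.from_estimator(model_fit, df[features],
                          response_method='predict', alpha=0.4)
# Overlay the raw points, colored by their true class
disp.ax_.scatter(df['length'], df['width'],
                 c=pd.factorize(df['species'])[0], edgecolor='k')
plt.show()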