Logistic Regression

Jed Rembold

March 12, 2024

Announcements

HW4 is out!
- You can already do the first two problems in their entirety!
- We’ll be spending the rest of this week and perhaps some of next week getting you the techniques for Problem 3
I’m about 50% of the way through the HW2 feedback
- Things to keep in mind still:
  - If I remove every code cell from your computational essay, it should still make sense
  - Just because you did something when figuring out a problem doesn’t mean a reader needs to see it

Galaxies largely form as the result of large regions of mostly hydrogen gas collapsing inwards under gravity
Galaxies come in three main “flavors”:
- Spiral Galaxies
- Elliptical Galaxies
- Irregular
Classification models commonly use confusion matrices to judge the relative quality of the model
- There are ways to attach numbers to these for comparison, but the goal is almost never to just optimize a number
Machine Learning methods are more concerned with predicting, as opposed to statistical methods focusing on inference
Supervised learning requires labeled data for training, where the categories are already known

Basic data prep in Python and R
Our first model: Logistic Regression
- Building, training, and testing models in Python
- Building, training, and testing models in R

Because of the iterative approach, many models will, if given enough time, perfectly model the data
- THIS IS A BAD THING!
If a model too perfectly matches a given set of data, the chances of it being able to accurately predict other data are greatly diminished
- Generally called overfitting
It is common then to set aside a portion of data that the model is not trained on to serve as a test to compare the model against
These are generally denoted as the “training” and “testing” data sets
- A common split is to put about 80% of the observations into the training set, and reserve the remaining 20% for the testing

Doing this sort of work benefits greatly from a streamlined architecture and common syntax
- It helps greatly for the code to look similar despite whatever model is used
In Python, the gold standard is Scikit-Learn:
```
pip install -U scikit-learn
```
In R, TidyModels is the best similar resource I’ve seen:
```
install.packages('tidymodels')
```
Both can take a bit of time to install

To split your data, train_test_split can assist:

```
from sklearn.model_selection import train_test_split
```

You need to include the option test_size=frac, where frac is the fraction of the data that you want to set aside for testing
train_test_split shuffles the data before making the splits, so you don’t need to worry about that

Usage:

```
train_df, test_df = train_test_split(df, test_size=0.2)
```
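As a minimal sketch of the split in action, the block below builds a small made-up DataFrame (the column names and values are purely illustrative) and splits it 80/20. It also passes random_state, an optional argument to train_test_split that fixes the shuffle so the split comes out the same on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations with two features and a label
df = pd.DataFrame({
    "x1": range(100),
    "x2": range(100, 200),
    "label": ["a", "b"] * 50,
})

# random_state fixes the shuffle, so the same rows land in each set every run
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))
```

Every row ends up in exactly one of the two sets, so nothing the model is tested on was seen during training.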

To split your data in R, several functions from rsample (part of tidymodels) are useful

You indicate the proportion of observations you want to use for training

```
splits <- initial_split(df, prop=0.8)
train_df <- training(splits)
test_df <- testing(splits)
```

It is also useful to set the seed before splitting for reproducibility:
```
set.seed(num)
```

Within supervised machine learning for classification, there are many specific models that can be trained
We’ll end up looking at several in this class:
- Logistic Regression (Multinomial Regression)
- Decision Trees
- Random Forests
Our ML libraries make working with any of these a very similar experience!

The logistic regression model is provided directly from Scikit-Learn:
```
from sklearn.linear_model import LogisticRegression
```
You need to initialize a model before you can try to fit anything to it
At its simplest:
```
model = LogisticRegression()
```
Note that there are a lot of options that can be further provided to the model as arguments
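As one illustration of those options, the sketch below sets two commonly adjusted arguments. The particular values are arbitrary choices for the example, not recommendations.

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength (smaller means stronger regularization)
# max_iter caps the number of solver iterations (the default is 100)
model = LogisticRegression(C=0.5, max_iter=1000)

# get_params() lists every option along with its current value
print(model.get_params()["C"], model.get_params()["max_iter"])
```

Raising max_iter is a common first step if the solver warns that it failed to converge.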

Tidymodels distinguishes between Logistic Regression (2 categories) and Multinomial Regression (>2 categories)
You still need to initialize a model before you can try to fit anything to it
```
model <- logistic_reg() # or
model <- multinom_reg()
```
You can change the underlying engine that powers these, but the default is fine

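Scikit-Learn can handle logistic regression with more than two categories using a “one-vs-rest” model (note that recent versions default to a single multinomial fit with the default solver)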
Binary logistic regressions are run on each category vs all the other categories
Final assignment is determined by whichever model is most confident about that point’s category
- Confidence usually builds as you move away from the division line
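The one-vs-rest idea can be sketched in plain Python. The weights below are hypothetical hand-picked numbers rather than anything fitted to data; each triple stands in for one binary "this category vs everything else" model on two features.

```python
import math

def sigmoid(z):
    """Squash a linear score into a confidence between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (w1, w2, bias) for three binary one-vs-rest models
OVR_WEIGHTS = {
    "spiral": (2.0, -1.0, 0.0),
    "elliptical": (-1.0, 2.0, 0.0),
    "irregular": (-1.0, -1.0, 0.5),
}

def ovr_scores(point):
    """Confidence of each binary model that the point belongs to its category."""
    x1, x2 = point
    return {
        label: sigmoid(w1 * x1 + w2 * x2 + b)
        for label, (w1, w2, b) in OVR_WEIGHTS.items()
    }

def ovr_predict(point):
    """Assign the category whose binary model is most confident."""
    scores = ovr_scores(point)
    return max(scores, key=scores.get)

print(ovr_predict((3.0, 0.5)))
```

Points far from a model's division line produce a large positive or negative linear score, which the sigmoid pushes toward 1 or 0.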

To fit the model, you need to provide it the training data: both the feature data and the classification labels (this is supervised, remember!)

```
model_fit = model.fit(train_df[[feature_cols]],
                      train_df[label_col])
```

```
model_fit <- model %>% 
  fit(label_col ~ feature_cols, data=train_df)
```
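To see the Python pieces working together, here is a minimal sketch on synthetic data. The data from make_blobs and the column names x1, x2, and label are stand-ins chosen for illustration, not the course dataset.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 points in three clusters, two features each
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["label"] = y

feature_cols = ["x1", "x2"]  # hypothetical feature column names
label_col = "label"          # hypothetical label column name

train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
# feature_cols is already a list here, so one pair of brackets selects the columns
model_fit = model.fit(train_df[feature_cols], train_df[label_col])

# classes_ holds the labels the model learned during fitting
print(model_fit.classes_)
```

In Scikit-Learn, fit returns the model itself, so model_fit and model refer to the same fitted object.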

Once the model has been fit, the fit can be used to make predictions
- Generally, the first predictions you make should be on the testing data!
```
test_df['predicted'] = model_fit.predict(test_df[[feature_cols]])
```
```
test_df <- test_df %>% 
  bind_cols(model_fit %>% predict(test_df))
```
- This should return a list of label predictions
These could be compared directly to the known labels of the testing data, or, more likely, you may want to create a confusion matrix
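The direct comparison amounts to counting how often the prediction matches the known label. A small sketch with made-up labels:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Made-up labels purely for illustration
actual = ["spiral", "spiral", "elliptical", "irregular", "elliptical"]
predicted = ["spiral", "elliptical", "elliptical", "irregular", "elliptical"]

print(accuracy(actual, predicted))
```

With a pandas DataFrame, the same number comes from (test_df[label_col] == test_df['predicted']).mean(). A single accuracy value hides which categories get confused with which, and that is what the confusion matrix shows.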

You could construct the confusion matrix manually, but imports can help

```
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay as CMD
```

Armed with the predictions, you can construct your confusion matrix
```
confusion_matrix(test_df[label_col], test_df['predicted'])
```
which returns the matrix as a NumPy array

Want graphics?

```
CMD.from_predictions(test_df[label_col], test_df['predicted'])
```
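For comparison, the manual construction is not much work either. The sketch below uses made-up labels and follows the same layout as Scikit-Learn's confusion_matrix: rows are the actual labels and columns are the predicted labels.

```python
from collections import Counter

def manual_confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Made-up labels purely for illustration
labels = ["elliptical", "irregular", "spiral"]
actual = ["spiral", "spiral", "elliptical", "irregular", "elliptical", "spiral"]
predicted = ["spiral", "elliptical", "elliptical", "irregular", "elliptical", "spiral"]

matrix = manual_confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
```

The diagonal holds the correct predictions, and everything off the diagonal is a misclassification. Here one spiral was mistaken for an elliptical.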

The yardstick library in tidymodels provides the conf_mat function
It returns a special confusion matrix object, which can then be summarized or plotted
```
cm <- conf_mat(test_df, label_col, .pred_class)
```
Printing this will just give a text representation of the matrix
Want graphics?
```
autoplot(cm, type='heatmap')
```

Classification models usually internally assign a probability to a point as to what label it should have
- The dominant probability is what wins, and that label gets assigned
It can be useful sometimes to see the predicted probabilities for each point, rather than the final category
```
model_fit.predict_proba(test_df[[feature_cols]])
```
```
predict(model_fit, test_df, type='prob')
```
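As a rough illustration of how the probabilities relate to the final labels, this sketch fits a model to synthetic data (make_blobs is a stand-in for real data) and checks that each row of probabilities sums to 1 and that the largest entry in each row picks the assigned label.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: three clusters, two features
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
model_fit = LogisticRegression(max_iter=1000).fit(X, y)

# One row per point, one column per class (in the order of model_fit.classes_)
probs = model_fit.predict_proba(X[:5])

# The dominant probability in each row determines the assigned label
winners = model_fit.classes_[np.argmax(probs, axis=1)]

print(probs.round(3))
print(winners)
```

Rows where the top two probabilities are close together mark points near a decision boundary, which are the ones most worth a second look.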

It can be a useful aid to visualize where the decision boundaries lie
This is not quite as simple as extracting the lines that bisect each region, since the decision regions will involve the areas of most confidence in a particular classification

Need to import:

```
from sklearn.inspection import DecisionBoundaryDisplay as DBD
```

The dataset here has two independent variables and then a label column that can be one of three options
Fit a multinomial logistic regression model to the data and compute the resulting confusion matrix and model accuracy