Logistic Regression

Jed Rembold

March 12, 2024

Announcements

  • HW4 is out!
    • You can already do the first two problems in their entirety!
    • We’ll be spending the rest of this week and perhaps some of next week getting you the techniques for Problem 3
  • I’m about 50% of the way through the HW2 feedback
    • Things to keep in mind still:
      • If I remove every code cell from your computational essay, it should still make sense
      • Just because you did something when figuring out a problem, doesn’t mean that is something a reader needs to see

Recap

  • Galaxies largely form as the result of large regions of mostly hydrogen gas collapsing inwards under gravity
  • Galaxies come in three main “flavors”:
    • Spiral Galaxies
    • Elliptical Galaxies
    • Irregular
  • Classification models commonly use confusion matrices to judge the relative quality of the model
    • There are ways to attach numbers to these for comparison, but the goal is almost never to just optimize a number
  • Machine Learning methods are more concerned with predicting, as opposed to statistical methods focusing on inference
  • Supervised learning requires labeled data for training, where the categories are already known

Discussing Today

  • Basic data prep in Python and R
  • Our first model: Logistic Regression
    • Building, training, and testing models in Python
    • Building, training, and testing models in R

Training vs Testing

  • Because of the iterative approach, many models will, if given enough time, perfectly model the data
    • THIS IS A BAD THING!
  • If a model too perfectly matches a given set of data, the chances of it being able to accurately predict other data are greatly diminished
    • Generally called overfitting
  • It is common then to set aside a portion of data that the model is not trained on to serve as a test to compare the model against
  • These are generally denoted as the “training” and “testing” data sets
    • A common split is to put about 80% of the observations into the training set, and reserve the remaining 20% for the testing

The Libraries

  • Doing this sort of work benefits greatly from a streamlined architecture and common syntax

    • It helps greatly for the code to look similar despite whatever model is used
  • In Python, the gold standard is Scikit-Learn:

    pip install -U scikit-learn
  • In R, TidyModels is the best similar resource I’ve seen:

    install.packages('tidymodels')
  • Both can take a bit of time to install

Making the Split (Python)

  • To split your data, train_test_split can assist:

    from sklearn.model_selection import train_test_split
  • Need to include an option test_size=frac where frac is the fraction of the data that you want to set aside for testing

  • train_test_split shuffles the data before making the splits, so you don’t need to worry about that

  • Usage:

    train_df, test_df = train_test_split(df, test_size=0.2)
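
A minimal sketch of the split on a small made-up DataFrame. The column names (`x1`, `x2`, `label`) and the seed value are illustrative assumptions, not part of the course data. Passing `random_state` fixes the shuffle so the same split comes back on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two feature columns and one label column
df = pd.DataFrame({
    "x1": range(10),
    "x2": [v * 0.5 for v in range(10)],
    "label": ["a", "b"] * 5,
})

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Every row lands in exactly one of the two sets
print(len(train_df), len(test_df))
```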

Making the Split (R)

  • To split your data in R, several functions from rsample (part of tidymodels) are useful

  • You indicate the proportion of observations you want to use for training

    splits <- initial_split(df, prop=0.8)
    train_df <- training(splits)
    test_df <- testing(splits)
  • Useful to also set the seed before splitting for reproducibility:

    set.seed(num)

Modeling

  • Within supervised machine learning for classification, there are many specific models that can be trained
  • We’ll end up looking at several in this class:
    • Logistic Regression (Multinomial Regression)
    • Decision Trees
    • Random Forests
  • Our ML libraries make working with any of these a very similar experience!

The Logistic Regression Model (Python)

  • The logistic regression model is provided directly from Scikit-Learn:

    from sklearn.linear_model import LogisticRegression
  • You need to initialize a model before you can try to fit anything to it

  • At its simplest:

    model = LogisticRegression()
  • Note that there are a lot of options that can be further provided to the model as arguments
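
For illustration, here are a few of the options that commonly get adjusted. The specific values are arbitrary assumptions for this sketch, not recommendations.

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength (smaller = stronger regularization),
# max_iter caps the solver's iterations, random_state fixes any solver randomness
model = LogisticRegression(C=0.5, max_iter=500, random_state=0)

# get_params() reports every option, including the ones left at their defaults
params = model.get_params()
print(params["C"], params["max_iter"])
```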

The Logistic Regression Model (R)

  • Tidymodels distinguishes between Logistic Regression (2 categories) and Multinomial Regression (>2 categories)

  • You still need to initialize a model before you can try to fit anything to it

    model <- logistic_reg() # or
    model <- multinom_reg()
  • Can tweak the underlying engine that powers these, but the default is fine

Behind the Scenes: Multi-Logistic Regression

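  • One common way to handle multinomial logistic regression is a “one-vs-rest” model (Scikit-Learn can fit this way, though recent versions default to a single multinomial fit with the default solver)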
  • Binary logistic regressions are run on each category vs all the other categories
  • Final assignment is determined by whichever model is most confident about that point’s category
    • Confidence usually builds as you move away from the division line
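
The one-vs-rest idea can be sketched in plain Python. The three sets of weights below are hand-picked, hypothetical values standing in for three already-fitted binary models; this is an illustration of the idea, not Scikit-Learn's internal code.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (w1, w2, bias) for three binary "this class vs the rest" models
binary_models = {
    "spiral":     (2.0, 0.0, -1.0),
    "elliptical": (-2.0, 0.0, -1.0),
    "irregular":  (0.0, 2.0, -1.0),
}

def one_vs_rest_predict(x1, x2):
    # Score the point with every binary model, then keep the most confident one
    scores = {
        label: sigmoid(w1 * x1 + w2 * x2 + b)
        for label, (w1, w2, b) in binary_models.items()
    }
    return max(scores, key=scores.get), scores

label, scores = one_vs_rest_predict(3.0, 0.5)
print(label, scores)
```

Note that the three scores come from separate models, so they need not sum to 1. A point sitting right on one model's division line gets a score of 0.5 from that model, and the score moves toward 0 or 1 as the point moves away from the line.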

Fitting the Model

  • Fitting the model is the act of iteratively improving on the fit parameters

  • You need to provide your model the training data when doing so, both the feature data and the classification labels (this is supervised remember!)

    # Python
    model_fit = model.fit(train_df[[feature_cols]],
                          train_df[label_col])
    # R
    model_fit <- model %>% 
      fit(label_col ~ feature_cols, data=train_df)

Evaluating the Model

  • Once the model has been fit, the fit can be used to make predictions
    • Generally, the first predictions that should be made should be made on the testing data!

      # Python
      test_df['predicted'] = model_fit.predict(test_df[[feature_cols]])
      # R
      test_df <- test_df %>% 
        bind_cols(model_fit %>% predict(test_df))
    • This should return a list of label predictions

  • These could be compared directly to the known labels of the testing data, or, more likely, you may want to create a confusion matrix
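
Putting the split, fit, and predict steps together on made-up data might look like the sketch below. The column names, the data values, and the `stratify` option are assumptions added for this example; the accuracy is just the fraction of testing rows predicted correctly.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two well-separated clusters labeled "a" and "b"
df = pd.DataFrame({
    "x1": [0.0, 0.2, 0.1, 0.3, 0.4, 5.0, 5.2, 5.1, 5.3, 5.4],
    "x2": [1.0, 0.8, 1.1, 0.9, 1.2, 4.0, 4.2, 3.9, 4.1, 4.3],
    "label": ["a"] * 5 + ["b"] * 5,
})

# stratify keeps both labels present in the training and testing sets
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["label"]
)

# Fit on the training features and labels only
model_fit = LogisticRegression().fit(train_df[["x1", "x2"]], train_df["label"])

# Copy first so adding a column does not touch a slice of df
test_df = test_df.copy()
test_df["predicted"] = model_fit.predict(test_df[["x1", "x2"]])

# Fraction of testing rows whose prediction matches the known label
accuracy = (test_df["predicted"] == test_df["label"]).mean()
print(accuracy)
```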

Avoiding Confusion (Python)

  • You could construct the confusion matrix manually, but imports can help

    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import ConfusionMatrixDisplay as CMD
  • Armed with the predictions, you can construct your confusion matrix

    confusion_matrix(test_df[label_col], test_df['predicted'])

    which returns the matrix as a NumPy array

  • Want graphics?

    CMD.from_predictions(test_df[label_col], test_df['predicted'])
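
To see what the imports are doing, here is a sketch of the manual construction using only the standard library. The two label lists are made-up values for illustration.

```python
from collections import Counter

# Hypothetical known labels and model predictions for six testing points
actual    = ["spiral", "spiral", "elliptical", "irregular", "elliptical", "spiral"]
predicted = ["spiral", "elliptical", "elliptical", "irregular", "elliptical", "spiral"]

labels = sorted(set(actual) | set(predicted))

# Count how often each (actual, predicted) pair occurs
pair_counts = Counter(zip(actual, predicted))

# Rows are the actual labels and columns the predicted labels,
# the same layout confusion_matrix uses
matrix = [[pair_counts[(a, p)] for p in labels] for a in labels]

# Correct predictions sit on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

for label, row in zip(labels, matrix):
    print(label, row)
print(accuracy)
```

Here one spiral was mislabeled as elliptical, so it shows up off the diagonal in the spiral row, and the accuracy is 5 correct out of 6.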

Avoiding Confusion (R)

  • The yardstick library in tidymodels provides the conf_mat function

  • Provides a special confusion matrix object, which can then have a variety of things done with it

    cm <- conf_mat(test_df, label_col, .pred_class)
  • Printing this will just give a text representation of the matrix

  • Want graphics?

    autoplot(cm, type='heatmap')

Understanding Probabilities

  • Classification models usually internally assign a probability to a point as to what label it should have

    • The dominant probability is what wins, and that label gets assigned
  • It can be useful sometimes to see the predicted probabilities for each point, rather than the final category

    # Python
    model_fit.predict_proba(test_df[[feature_cols]])
    # R
    predict(model_fit, test_df, type='prob')
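
A short sketch of how the probabilities relate to the final labels, using made-up points in three clusters (the data and variable names are assumptions for this example). Each row of probabilities sums to 1, and picking the largest entry in each row gives back the predicted label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: three clusters of points, one per label
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],
              [5.0, 0.0], [5.3, 0.3], [4.8, 0.1],
              [2.5, 5.0], [2.7, 5.2], [2.3, 4.9]])
y = np.array(["a", "a", "a", "b", "b", "b", "c", "c", "c"])

model_fit = LogisticRegression().fit(X, y)

# One row per point, one column per label (column order matches model_fit.classes_)
probs = model_fit.predict_proba(X)

# Taking the label with the largest probability in each row reproduces predict()
dominant = model_fit.classes_[probs.argmax(axis=1)]
print(probs.round(2))
print(dominant)
```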

Decision Boundaries

  • It can be a useful aid to visualize where the decision boundaries lie
  • This is not quite as simple as extracting the lines that bisect each region, since the decision regions will involve the areas of most confidence in a particular classification

Decision Boundary (Python)

  • Need to import:

    from sklearn.inspection import DecisionBoundaryDisplay as DBD
  • Create the plot from the estimator:

    DBD.from_estimator(model, df[[features]])
    • Unlike the confusion matrix, here from_estimator needs both the fitted model and the feature values to predict from
    • Can also pass in other arguments, like axis labels or the actual axis you want to add the plot to

Activity!

  • The dataset here has two independent variables and then a label column that can be one of three options
  • Fit a Logistic Multinomial Regression model to the data and compute the resulting confusion matrix and model accuracy