Logistic regression on the Iris dataset¶
This notebook applies logistic classification to the Iris dataset. For information about the logistic classification model see here, and for information about maximum-likelihood fitting see here.
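As a brief reminder of the setup from those chapters: each input \(\mathbf{x}\) is augmented with a constant \(1\) to absorb the bias, the model passes a linear function of the input through the logistic sigmoid, and the weights \(\mathbf{w}\) are fitted by maximising the log-likelihood of the training labels,

\[ p(y = 1 | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})}, \qquad \mathcal{L}(\mathbf{w}) = \sum_{n} \big[ y_n \log \sigma(\mathbf{w}^\top \mathbf{x}_n) + (1 - y_n) \log\left(1 - \sigma(\mathbf{w}^\top \mathbf{x}_n)\right) \big]. \]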
The task is to classify the variety of Iris flower based on measurements of its sepal length, sepal width, petal length and petal width. The input \(\mathbf{x}\) is therefore four dimensional. The full problem involves three varieties of Iris (Iris setosa, versicolor and virginica), but here we consider just the first two classes, so we can model this as a binary classification problem where \(y = 1 \; \text{or} \; 0\). Here’s a set of pairwise plots showing the data plotted in terms of pairs of inputs.
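A pairwise plot of this kind can be produced along the following lines. This is only a minimal sketch, not the notebook's original plotting code, and it assumes the same iris_inputs_full.npy and iris_labels.npy files that are loaded further down.

# Minimal sketch of a pairwise plot for the two retained classes (plotting details
# are assumptions; the .npy file names match the loading cell further down)
import numpy as np
import matplotlib.pyplot as plt

x_all = np.load('iris_inputs_full.npy')   # inputs, shape (num_points, 4)
y_all = np.load('iris_labels.npy')        # labels 0, 1, 2
keep = y_all != 2                         # keep classes 0 and 1 only
x_plot, y_plot = x_all[keep], y_all[keep]

feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for i in range(4):
    for j in range(4):
        # Scatter feature j (x-axis) against feature i (y-axis) for each class;
        # diagonal panels are left as simple feature-vs-itself scatters for brevity
        for c, colour in zip((0, 1), ('tab:blue', 'tab:red')):
            axes[i, j].scatter(x_plot[y_plot == c, j], x_plot[y_plot == c, i],
                               s=10, color=colour, label='class {}'.format(c))
        if i == 3:
            axes[i, j].set_xlabel(feature_names[j])
        if j == 0:
            axes[i, j].set_ylabel(feature_names[i])
axes[0, 0].legend()
plt.tight_layout()
plt.show()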
We have already written all the code that we need to train logistic classification on these data in the previous chapter. Below we recapitulate the functions that implement the logistic (sigmoid) function and gradient ascent.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_ascent(x, y, init_weights, num_steps, stepsize):
    # Add 1's to the inputs as usual, absorbing the bias term
    x = np.append(np.ones(shape = (x.shape[0], 1)), x, axis = 1)
    # Copy weights (to prevent changing init_weights as a side-effect)
    w = init_weights.copy()
    # Arrays for storing weights and log-likelihoods at each step
    w_history = []
    log_liks = []
    for n in range(num_steps):
        # Record current log-likelihood and current weights
        log_lik = np.sum(y * np.log(sigmoid(x.dot(w))) + (1 - y) * np.log(1 - sigmoid(x.dot(w))))
        log_liks.append(log_lik)
        w_history.append(w.copy())
        # Compute σ(x^T w) for every datapoint
        sigma = sigmoid(x.dot(w))
        # Calculate gradient of the (mean) log-likelihood w.r.t. w
        dL_dw = np.mean((y - sigma) * x.T, axis = 1)
        # Update weights and repeat
        w += stepsize * dL_dw
    return np.array(w_history), np.array(log_liks)
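Before using it, a quick sanity check can be reassuring: the analytic gradient computed inside gradient_ascent should match a finite-difference estimate of the gradient of the mean log-likelihood. The snippet below is an illustrative check on random data, not part of the original notebook, and its variable names are arbitrary.

# Illustrative sanity check: compare the analytic gradient used in gradient_ascent
# with a central finite-difference estimate of the mean log-likelihood gradient
rng = np.random.default_rng(0)
x_chk = rng.normal(size = (20, 4))
y_chk = rng.integers(0, 2, size = 20).astype(float)
w_chk = rng.normal(size = 5)
x_aug = np.append(np.ones(shape = (20, 1)), x_chk, axis = 1)

def mean_log_lik(w):
    s = sigmoid(x_aug.dot(w))
    return np.mean(y_chk * np.log(s) + (1 - y_chk) * np.log(1 - s))

# Analytic gradient, exactly as computed inside gradient_ascent
analytic = np.mean((y_chk - sigmoid(x_aug.dot(w_chk))) * x_aug.T, axis = 1)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.array([(mean_log_lik(w_chk + eps * np.eye(5)[d])
                     - mean_log_lik(w_chk - eps * np.eye(5)[d])) / (2 * eps)
                    for d in range(5)])
print(np.allclose(analytic, numeric, atol = 1e-6))   # should print True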
Now we can load the Iris data, remove the third class, split into training and testing sets (75:25), and train the model.
# Import Iris dataset
x = np.load('iris_inputs_full.npy')
y = np.load('iris_labels.npy')
# Original iris dataset has three classes -- remove the last class
x = x[np.where(np.logical_not(y == 2))[0]]
y = y[np.where(np.logical_not(y == 2))[0]]
# Split the data into a training and a test dataset (75:25)
num_train = (x.shape[0] * 3) // 4
x_train = x[:num_train]
y_train = y[:num_train]
x_test = x[num_train:]
y_test = y[num_train:]
# Set training parameters
init_weights = np.array([0.] * (x.shape[1] + 1))
num_steps = 1000
stepsize = 1e-1
# Optimise weights by maximum likelihood
w_history, log_liks = gradient_ascent(x_train, y_train, init_weights, num_steps, stepsize)
plt.figure(figsize=(8, 3))
plt.plot(log_liks, color='black')
plt.title('Training set log-likelihood', fontsize=18)
plt.xlabel('Step #', fontsize=16)
plt.ylabel('Log-likelihood (nats)', fontsize=16)
plt.show()
The plot of the objective function (the log-likelihood) suggests that the algorithm has converged. Let's now test the model by computing its classification accuracy on the test set.
def test_accuracy(test_x, test_y, w):
    # Add 1's to the inputs, exactly as in training
    x_ = np.append(np.ones(shape = (test_x.shape[0], 1)), test_x, axis = 1)
    # Predicted probabilities p(y = 1 | x, w)
    y_ = sigmoid(x_.dot(w))
    # Round the probabilities to hard 0/1 predictions and return the fraction classified correctly
    return 1 - abs((y_.round() - test_y)).mean()
print('Classifier accuracy on the test set = {}%'.format(test_accuracy(x_test, y_test, w_history[-1])*100))
Classifier accuracy on the test set = 100.0%
The classification accuracy is very high, which is to be expected given that the two classes are well separated. Now let's visualise how the model's predictions change through learning.
[Animation: pairwise plots of the training data, with the model's predicted probability \(p(y=1|\mathbf{x},\mathbf{w}_i)\) shown as a coloured background at each optimisation step \(i\).]
In the above animation, each subplot corresponds to a pair of input features. For example, the plot in (row, column) = (1, 2) has feature \(1\) on the x-axis and feature \(2\) on the y-axis. Note that the whole picture is symmetric: (i, j) and (j, i) are really the same plot, just reflected along the “\(y = x\)” line. The coloured background, which changes with time, is a coarse visualisation of \(p(y=1|\mathbf{x},\mathbf{w}_i)\) at each iteration \(i\). Red means a high probability of belonging to class 1 and blue means a low probability.
The data and associated predictions live in a four-dimensional space, and the plots above show only two-dimensional projections, so it is not immediately clear how to produce the contour plots of the predictions. Here we took the following approach. First, the weights \(\mathbf{w}_i\) of the current optimisation step were used to evaluate the probability

\[ p(y = 1 | \mathbf{x}, \mathbf{w}_i) = \sigma(\mathbf{w}_i^\top \mathbf{x}) \]

at several points on a four-dimensional grid. Then, in order to visualise these predictions in each of the pairwise plots, the predictions were averaged over the two axes not shown in that plot. For example, to plot the contours in the \((1, 3)\) subplot, axes \(2\) and \(4\) were eliminated by averaging the predictions over them. The resulting contours are shown above. We see that as the optimisation progresses, the decision boundary changes to accommodate the data.
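Concretely, that averaging step might look something like the sketch below for the \((1, 3)\) subplot. This is an illustrative reconstruction rather than the code used to generate the animation; the grid resolution (20 points per axis) and the variable names are assumptions.

# Illustrative reconstruction of the averaging used for the (1, 3) subplot
w_final = w_history[-1]                   # weights at the final optimisation step
grids = [np.linspace(x[:, d].min(), x[:, d].max(), 20) for d in range(4)]
g1, g2, g3, g4 = np.meshgrid(*grids, indexing = 'ij')
grid_points = np.stack([g1, g2, g3, g4], axis = -1).reshape(-1, 4)

# Evaluate p(y = 1 | x, w) at every grid point (prepending 1's for the bias)
grid_aug = np.append(np.ones(shape = (grid_points.shape[0], 1)), grid_points, axis = 1)
probs = sigmoid(grid_aug.dot(w_final)).reshape(20, 20, 20, 20)

# Average over features 2 and 4 (array axes 1 and 3) to obtain the (1, 3) map
probs_13 = probs.mean(axis = (1, 3))

plt.contourf(grids[0], grids[2], probs_13.T, cmap = 'coolwarm')
plt.xlabel('feature 1')
plt.ylabel('feature 3')
plt.colorbar(label = 'average p(y=1|x,w)')
plt.show()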
Questions¶
- Applying the binary logistic classification model to other pairs of classes in the Iris dataset: Here is a visualisation of the full Iris dataset showing all three classes.

The classification problem above looked at classes 0 and 1. This was fairly simple as the classes were linearly separable. How would binary logistic regression fare on the test set for the following classification problems:

- classes 0 and 2

- classes 1 and 2

Write code to train and test logistic classification on these two tasks (a sketch is given after the answer below).
Answer
You should find that distinguishing classes 0 and 2 is easy, as these classes are still linearly separable. Distinguishing classes 1 and 2 is harder because there is much more overlap between the classes.
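A sketch of one possible solution, reusing gradient_ascent and test_accuracy from above, is given below. The variable names are illustrative, and it assumes the saved arrays are not ordered by class, so that the same unshuffled 75:25 split as before is sensible.

# Sketch: train and test on classes 0 vs 2 and classes 1 vs 2
x_full = np.load('iris_inputs_full.npy')
y_full = np.load('iris_labels.npy')

for class_a, class_b in [(0, 2), (1, 2)]:
    # Keep only the two classes of interest and relabel them as 0 and 1
    keep = np.logical_or(y_full == class_a, y_full == class_b)
    x_pair = x_full[keep]
    y_pair = (y_full[keep] == class_b).astype(float)

    # 75:25 train/test split, as before
    num_train = (x_pair.shape[0] * 3) // 4
    x_tr, y_tr = x_pair[:num_train], y_pair[:num_train]
    x_te, y_te = x_pair[num_train:], y_pair[num_train:]

    # Fit by maximum likelihood and report the test accuracy
    init_w = np.zeros(x_pair.shape[1] + 1)
    w_hist, _ = gradient_ascent(x_tr, y_tr, init_w, num_steps = 1000, stepsize = 1e-1)
    print('Classes {} vs {}: test accuracy = {:.1f}%'.format(
        class_a, class_b, test_accuracy(x_te, y_te, w_hist[-1]) * 100))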