DS-UA 301

Advanced Topics in Data Science

Instructor: Parijat Dube

Due: March 3, 2024

Homework 3

General Instructions

This homework must be turned in on Gradescope by 11:59 pm on the due date. It must be your own work

and your own work only—you must not copy anyone’s work, or allow anyone to copy yours. This extends to

writing code. You may consult with others, but when you write up, you must do so alone. Your homework

submission must be written and submitted using Jupyter Notebook (.ipynb). No handwritten solutions

will be accepted. You should submit:

1. One Jupyter Notebook containing all of your solutions in this homework.

2. One .pdf file generated from the notebook.

Please make sure your answers are clearly structured in the Jupyter Notebooks:

1. Label each question part clearly. Do not include written answers as code comments. The code used to

obtain the answer for each question part should accompany the written answer.

2. All plots should include informative axis labels and legends. All code should be accompanied by

informative comments. All output of the code should be retained.

3. Math formulas can be typeset in Markdown in the same way as in LaTeX. A Markdown Guide is

provided on Brightspace for reference.

For more homework-related policies, please refer to the syllabus.

Problem 1 – Softmax Activation Function

10 points

Consider the softmax activation function in the output layer, in which real-valued outputs $v_1, \ldots, v_k$ are converted into probabilities as follows:

$$o_i = \frac{\exp(v_i)}{\sum_{j=1}^{k} \exp(v_j)} \quad \forall i \in \{1, \ldots, k\}.$$

1. Show that the value of $\frac{\partial o_i}{\partial v_j}$ is $o_i(1 - o_i)$ when $i = j$ and $-o_i o_j$ when $i \neq j$. (5)

2. Assume that we are using cross-entropy loss $L = -\sum_{i=1}^{k} y_i \log(o_i)$, where $y_i \in \{0, 1\}$ is the one-hot encoded class label over different values of $i \in \{1, \ldots, k\}$. Use the result in part 1 to show the correctness of the following equation: (5)

$$\frac{\partial L}{\partial v_i} = o_i - y_i.$$
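Before writing the proofs, it can help to check both identities numerically. The sketch below is not a substitute for the derivation: it compares the closed-form expressions from parts 1 and 2 against central finite differences. The logits `v`, the label `y`, and the helper names are arbitrary choices made for illustration.

```python
import math

def softmax(v):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(v, y):
    """L = -sum_i y_i * log(o_i) for logits v and one-hot label y."""
    o = softmax(v)
    return -sum(yi * math.log(oi) for yi, oi in zip(y, o))

def numeric_grad(f, v, h=1e-6):
    """Central-difference gradient of a scalar function f at the point v."""
    grad = []
    for j in range(len(v)):
        up = v[:j] + [v[j] + h] + v[j + 1:]
        down = v[:j] + [v[j] - h] + v[j + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

v = [1.0, -0.5, 2.0]  # arbitrary example logits (assumption)
y = [0, 0, 1]         # one-hot label for the third class (assumption)
o = softmax(v)
k = len(v)

# Part 1 in closed form: o_i (1 - o_i) on the diagonal, -o_i o_j elsewhere.
jac_analytic = [[o[i] * ((1.0 if i == j else 0.0) - o[j]) for j in range(k)]
                for i in range(k)]
# Row i of the numerical Jacobian is the gradient of o_i with respect to v.
jac_numeric = [numeric_grad(lambda u, i=i: softmax(u)[i], v) for i in range(k)]

# Part 2 in closed form, compared against finite differences of the loss.
grad_analytic = [oi - yi for oi, yi in zip(o, y)]
grad_numeric = numeric_grad(lambda u: cross_entropy(u, y), v)
```

If the two Jacobians or the two gradients disagree by more than finite-difference error, the algebra in the written proof is worth rechecking.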

Problem 2 – Neural Network Training and Backpropagation

25 points

In this problem, you will first review this notebook on training a 2-layered neural network (one input and one

hidden layer) using sigmoid activations and the backpropagation algorithm, with and without a regularized cost

function. The notebook implements functions for forward propagation, cost calculation, and backpropagation.

Next, in the template provided, you will make a copy of this notebook and modify it to train a 3-layered

neural network with two hidden layers using the same dataset. The number of hidden units in the first and

second hidden layers is 20 and 20. The activation function you will use in hidden layers is scaled sigmoid,

given by:

$$\hat{\sigma}(z) = \frac{1}{1 + e^{-2z}}$$
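As a rough scalar illustration of this definition, the sketch below evaluates the scaled sigmoid and the derivative that follows from the chain rule, $2\hat{\sigma}(z)(1 - \hat{\sigma}(z))$, and compares the derivative with a finite difference. The function names are placeholders, not the notebook's own; the notebook's versions presumably operate on arrays, so treat this only as a sanity check.

```python
import math

def scaled_sigmoid(z):
    """Scaled sigmoid 1 / (1 + exp(-2z)) for a scalar z."""
    return 1.0 / (1.0 + math.exp(-2.0 * z))

def scaled_sigmoid_gradient(z):
    """Chain rule: d/dz scaled_sigmoid(z) = 2 * s * (1 - s), s = scaled_sigmoid(z)."""
    s = scaled_sigmoid(z)
    return 2.0 * s * (1.0 - s)

def finite_difference(f, z, h=1e-6):
    """Central-difference approximation to f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Largest gap between the closed-form derivative and the numerical one.
check_points = [-2.0, -0.3, 0.0, 1.5]
max_gap = max(abs(scaled_sigmoid_gradient(z) - finite_difference(scaled_sigmoid, z))
              for z in check_points)
```

The extra factor of 2 relative to the ordinary sigmoid derivative is easy to miss when adapting existing backpropagation code.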

You will need to make changes to the following functions in the original notebook.

1. sigmoid() to return scaled sigmoid. (1)

2. forward_propagate() to account for 2 hidden layers (the original has one hidden layer). (3)

3. cost() to calculate predictions from the 3-layered neural network and hence the cost. You need to make

changes to both versions of the cost() function, with and without regularization, as in the original

notebook. (4)

4. sigmoid_gradient() to return gradient of scaled sigmoid function. (2)

5. backprop() to compute the gradients. Your function should return both the cost and the gradient

vector, as in the original notebook. Also, you will need to implement two versions of these functions,

with and without regularization, as in the original notebook. (8)

Then you will

6. Train your 3-layered neural network by minimizing the objective function, as in the original notebook,

keeping the hyperparameters (learning rate, method, jac, options) unchanged. (3)

7. Make forward predictions from your trained model and compute the accuracy. (2)

8. How does your model accuracy compare with the accuracy of the 2-layered neural network in the

original notebook? (2)

Problem 3 – Weight Initialization, Dead Neurons, Leaky ReLU

25 points

Read the two blogs referenced below on weight initialization. You will reuse the code in the GitHub repo

linked in the blog to explain vanishing and exploding gradients. You can use the same 5-layer neural network

model as in the repo and the same dataset.
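To build intuition for vanishing gradients before running the Keras experiments, here is a minimal pure-Python sketch. It is a stand-in, not the repo's model: the width of 50, the synthetic input, and the gradient of ones injected at the last layer are all assumptions made for illustration. It pushes one input through 5 tanh layers, backpropagates, and records the gradient norm reaching each layer's output. The Glorot normal rule used for comparison sets the standard deviation to sqrt(2 / (fan_in + fan_out)).

```python
import math
import random

def layer_gradient_norms(std_for_layer, width=50, depth=5, seed=0):
    """Gradient norm reaching each layer's output, first layer first.

    std_for_layer(fan_in, fan_out) returns the standard deviation of the
    zero-mean normal distribution used to draw that layer's weights.
    """
    rng = random.Random(seed)
    h = [rng.uniform(-1.0, 1.0) for _ in range(width)]  # synthetic input
    weights, activations = [], []
    for _ in range(depth):
        std = std_for_layer(width, width)
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        h = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        weights.append(W)
        activations.append(h)

    # Backward pass, seeded with a gradient of ones at the last layer's output.
    g = [1.0] * width
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        norms.append(math.sqrt(sum(x * x for x in g)))
        # d tanh(u)/du = 1 - tanh(u)^2, then multiply by W transposed.
        delta = [gi * (1.0 - ai * ai) for gi, ai in zip(g, a)]
        g = [sum(W[i][j] * delta[i] for i in range(width)) for j in range(width)]
    return norms[::-1]

small = layer_gradient_norms(lambda fan_in, fan_out: 0.01)
glorot = layer_gradient_norms(lambda fan_in, fan_out: math.sqrt(2.0 / (fan_in + fan_out)))
```

Comparing `small` and `glorot` layer by layer shows how quickly the gradient shrinks toward the first layer when the weights are drawn with a small standard deviation. A large standard deviation can be tried the same way by passing a different lambda.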

1. Explain the vanishing gradients phenomenon using RandomNormal initialization with three different

values of standard deviation. You will conduct two groups of experiments: train the model with tanh

and sigmoid activation functions. For each group of experiments, you should have one plot containing

3 subplots of gradients at each of the 5 layers in the neural network, which is similar to the plots in the

blog post. Summarize and explain your observations. (6)

2. Next, show how Xavier (aka Glorot normal) initialization of weights helps in dealing with this problem.

For each of the activation functions in the previous part, you should plot the gradients at each of the

5 layers in the neural network. Compare the plots with the previous part, and briefly discuss your

observations. (4)

3. The dying ReLU is a kind of vanishing gradient, which refers to the problem when ReLU neurons

become inactive and only output 0 for any input. In the worst case of dying ReLU, ReLU neurons at

a certain layer are all dead, i.e., the entire network dies and is referred to as the dying ReLU neural

network in Lu et al. (reference below). A dying ReLU neural network collapses to a constant function.

Show this phenomenon using any one of the three 1-dimensional functions on page 13 of Lu et al. Use a

ReLU network with 10 hidden layers, each of width 2 (hidden units per layer). Use a minibatch size of 64 and draw training data uniformly from $[-\sqrt{7}, \sqrt{7}]$. Perform 1000 independent training simulations, each with 3,000 training points. Out of these 1000 simulations, what fraction resulted in neural network collapse? Is your answer close to the figure of over 90% reported in Lu et al.? (10)

4. Instead of ReLU, consider the Leaky ReLU activation as defined below:

$$\phi(z) = \begin{cases} z & \text{if } z > 0 \\ 0.01z & \text{if } z \le 0. \end{cases}$$

Run the 1000 training simulations in the previous part with Leaky ReLU activation and keep everything

else the same. Again calculate the fraction of simulations that resulted in neural network collapse. Did

Leaky ReLU help in preventing dying neurons? If so, why do you think it helps? (5)
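One way to make "collapses to a constant function" concrete is to test whether a network returns the same output for every input in the sampling interval. The sketch below does this at initialization only, so it counts networks that start out dead rather than reproducing the trained-network experiment asked for above. The zero biases, the normal weights with standard deviation sqrt(2 / fan_in), the 200 trials, and all helper names are assumptions made for illustration.

```python
import math
import random

def relu(z):
    return z if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z

def init_network(rng, depth=10, width=2):
    """Weights for a 1 -> (width x depth) -> 1 network; biases are taken as zero."""
    sizes = [1] + [width] * depth + [1]
    return [[[rng.gauss(0.0, math.sqrt(2.0 / n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers, act):
    """Apply act after every hidden layer; the output layer is linear."""
    h = [x]
    for W in layers[:-1]:
        h = [act(sum(w * v for w, v in zip(row, h))) for row in W]
    return sum(w * v for w, v in zip(layers[-1][0], h))

def is_collapsed(layers, act, n_points=50):
    """True if the output is identical on a grid over [-sqrt(7), sqrt(7)]."""
    lim = math.sqrt(7.0)
    xs = [-lim + 2.0 * lim * i / (n_points - 1) for i in range(n_points)]
    ys = [forward(x, layers, act) for x in xs]
    return max(ys) == min(ys)

def collapse_fraction(act, trials=200, seed=0):
    """Fraction of freshly initialized networks that are constant functions."""
    rng = random.Random(seed)
    return sum(is_collapsed(init_network(rng), act) for _ in range(trials)) / trials

relu_fraction = collapse_fraction(relu)
leaky_fraction = collapse_fraction(leaky_relu)
```

The same `is_collapsed` idea can be applied to a trained model by evaluating its predictions on a grid. The Leaky ReLU keeps a slope of 0.01 for negative inputs, so a unit's output and gradient are never exactly cut off.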

References:

• Andre Perunicic. Understand neural network weight initialization.

Available at https://intoli.com/blog/neural-network-initialization/

• Daniel Godoy. Hyper-parameters in Action Part II — Weight Initializers.

Available at https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404

• Initializers – Keras documentation. https://keras.io/initializers/.

• Lu Lu et al. Dying ReLU and Initialization: Theory and Numerical Examples.

Available at https://arxiv.org/pdf/1903.06733.pdf.

Problem 4 – Batch Normalization, Dropout, MNIST

20 points

Batch normalization and Dropout are used as effective regularization techniques in training neural networks.

However, it’s unclear which one should be preferred and whether their benefits add up when used in conjunction. In this problem, we will compare batch normalization, dropout, and their conjunction using MNIST and

LeNet-5 (see e.g., http://yann.lecun.com/exdb/lenet/). LeNet-5 is one of the earliest convolutional neural

networks developed for image classification and its implementation in all major frameworks is available. You

can refer to the lecture slides for the definition of standardization and batch normalization.

1. Read the two papers referenced below and explain the terms co-adaptation and internal covariate shift.

Use examples if needed. (2)

2. Batch normalization is traditionally used in hidden layers, while for the input layer, standard normalization

is used. In standard normalization, the mean and standard deviation are calculated using the entire

training dataset whereas in batch normalization these statistics are calculated for each mini-batch.

Train LeNet-5 for 10 epochs with standard normalization of input and batch normalization for hidden

layers. You may use SGD as the optimizer, and a batch size of 128. What are the learned batch normalization parameters for each layer? Plot the distribution of learned batch normalization parameters

for each layer using violin plots. You should have one figure for each batch normalization parameter,

and within each figure, you will have a violin plot for the distribution of that parameter on each layer.

(5)

3. Next, instead of standard normalization, use batch normalization for the input layer as well. Train the

network again with the same hyperparameters. Plot the distribution of learned batch norm parameters

for each layer (including input) using violin plots similar to the previous part. However, now for each

figure you should have one more violin plot compared with the previous part, which corresponds to the

distribution of the batch normalization parameters on the input layer.

In addition, compare the train/test accuracy and loss for the two cases, by plotting them over epochs. You

should have one plot for accuracy, which includes two subplots: standard normalization for the input

layer (the experiment in part 2) and batch normalization for the input layer (the experiment in part

3). For each subplot, you should have two line plots for training and testing accuracy as a function of

epochs. For the sake of comparison, please make sure both subplots share the exact same scale on the

y-axis. In a similar manner, you should have another plot (which includes two subplots) for training

and testing loss. Briefly summarize your observations. Did batch normalization for the input layer

improve performance? (5)

4. Using the same hyperparameters, train the network without batch normalization but this time use

dropout. For hidden layers, use a dropout probability of 0.5 and for the input layer, take it to be 0.2.

Compare train/test accuracy using dropout to the previous two experiments using batch normalization

in parts 2 and 3 by adding a subplot in the previous accuracy plot. Again, please make sure all three

subplots share the same scale on the y-axis. Briefly summarize your observations. Did dropout help

improve performance? (4)

5. Now, still using the same set of hyperparameters, train the network using both batch normalization (including the input layer) and dropout. How does the performance of the network compare with the

cases with dropout alone (in part 4) and with batch normalization alone (in part 3)? You should have

one accuracy plot similar to those in previous parts, but this time containing three subplots, each corresponding

to experiments in parts 3, 4, and 5. Briefly summarize your observations. (4)
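The distinction drawn in part 2 between standard normalization and batch normalization comes down to where the mean and variance are computed. The toy sketch below uses a single feature and two mini-batches of size 4, and omits the learned scale and shift parameters that a real batch normalization layer applies afterwards. The data, the `eps` value, and the helper name are assumptions made for illustration.

```python
import math
import statistics

def normalize(values, mean, var, eps=1e-5):
    """Shift and scale values using the given mean and variance."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]

data = [float(i) for i in range(8)]  # toy one-feature "training set"
batches = [data[:4], data[4:]]       # two mini-batches of size 4

# Standard normalization: one mean and variance from the entire training set.
data_mean = statistics.mean(data)
data_var = statistics.pvariance(data)
standard = [normalize(b, data_mean, data_var) for b in batches]

# Batch normalization (training mode, no learned scale or shift):
# the statistics are recomputed from each mini-batch.
batch_norm = [normalize(b, statistics.mean(b), statistics.pvariance(b))
              for b in batches]

standard_means = [statistics.mean(b) for b in standard]
batch_norm_means = [statistics.mean(b) for b in batch_norm]
```

Each batch-normalized mini-batch is centered at zero on its own, whereas under standard normalization the individual mini-batches keep their offsets relative to the dataset mean.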

References:

• N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. Dropout: A Simple Way to

Prevent Neural Networks from Overfitting. Available at https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf.

• S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal

Covariate Shift. Available at https://arxiv.org/abs/1502.03167.

Problem 5 – Learning Rate, Batch Size, FashionMNIST

20 points

Recall the cyclical learning rate policy discussed in Lecture 4. The learning rate changes in a cyclical manner

between $lr_{min}$ and $lr_{max}$, which are hyperparameters that need to be specified. For this problem, you first

need to read carefully the article referenced below as you will be making use of the code there (in Keras) and

modifying it as needed. For those who want to work in PyTorch, there are open-source implementations of this

policy available that you can find and build on. You will work with the FashionMNIST

dataset and LeNet-5.
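For orientation, a triangular cyclical schedule with exponential decay can be written as a function of the iteration count. The sketch below follows the commonly used form in which the learning rate rises linearly from the minimum to the maximum over `step_size` iterations, falls back over the next `step_size`, and has its amplitude multiplied by `gamma ** iteration`. The parameter names and example values are assumptions, and the referenced Keras code may organize this differently.

```python
import math

def cyclical_lr(iteration, lr_min, lr_max, step_size, gamma=1.0):
    """Triangular cyclical learning rate with optional exponential decay.

    step_size is the number of iterations in half a cycle. gamma < 1 shrinks
    the amplitude by gamma ** iteration; gamma = 1 gives the plain triangle.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    amplitude = (lr_max - lr_min) * max(0.0, 1.0 - x)
    return lr_min + amplitude * gamma ** iteration

# Example values (assumptions): bounds of 1e-4 and 1e-2, half-cycle of 100 iterations.
schedule = [cyclical_lr(t, 1e-4, 1e-2, step_size=100, gamma=0.999)
            for t in range(400)]
```

Plotting `schedule` against the iteration index gives a quick visual check that the bounds and the decay behave as intended before training starts.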

1. Fix batch size to 64 and start with 11 candidate learning rates: $10^{-9}, 10^{-8}, \ldots, 10^{1}$. Train your model for 5 epochs for each learning rate. Plot the training loss as a function of the learning rate. You should see a curve like Figure 2 in the referenced post below. Based on your plot, identify the values of $lr_{min}$ and $lr_{max}$. (4)

2. Use the cyclical learning rate policy (with exponential decay) and train your network using batch size

64 and the $lr_{min}$ and $lr_{max}$ values obtained in part 1. Plot the train/validation loss and accuracy curves over

the number of epochs (similar to Figure 4 in reference). (8)

3. We want to test if increasing batch size for a fixed learning rate has the same effect as decreasing

learning rate for a fixed batch size. Fix the learning rate to $lr_{max}$ and train your network starting with batch size 32 and incrementally going up to 4096 (each time by a power of 2, i.e., $2^5, 2^6, \ldots, 2^{12}$). You can choose a step size (in terms of the number of epochs) to increment the batch size. Plot the training loss vs. $\log_2(\text{batch size})$. Is the generalization of your final model similar to or different from that of the cyclical learning rate policy? Briefly discuss your observations. (8)
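The two sweeps described in parts 1 and 3 are easy to get off by one, so it may be worth generating them explicitly. The sketch below builds the 11 candidate learning rates and the 8 batch sizes, and adds a hypothetical helper, `batch_size_at`, that maps an epoch to a batch size for a chosen step size. The helper and its argument names are illustrative only.

```python
# Candidate learning rates for part 1: 1e-9, 1e-8, ..., 1e1.
learning_rates = [10.0 ** e for e in range(-9, 2)]

# Batch sizes for part 3: 2^5, 2^6, ..., 2^12.
batch_sizes = [2 ** p for p in range(5, 13)]

def batch_size_at(epoch, epochs_per_step):
    """Batch size for a 0-based epoch, doubling every epochs_per_step epochs."""
    index = min(epoch // epochs_per_step, len(batch_sizes) - 1)
    return batch_sizes[index]
```

With `epochs_per_step=2`, for example, covering all eight batch sizes takes 16 epochs, which gives a sense of the training budget part 3 requires.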

References:

1. Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks.

Available at https://arxiv.org/abs/1506.01186.

2. Keras implementation of cyclical learning rate policy.

Available at https://www.pyimagesearch.com/2019/08/05/keras-learning-rate-finder/.
