    Optimization
    BoostCourse · 2023. 1. 11. 03:28

    From the BoostCourse lecture by 최성준 (Department of Artificial Intelligence, Korea University)


    1. Gradient Descent Methods

    A. Stochastic gradient descent vs Mini-batch gradient descent vs Batch gradient descent

    - large-batch methods tend to converge to sharp minimizers

    - small-batch methods consistently converge to flat minimizers

    >> flat minimizers tend to generalize better to unseen data, so small-batch methods are generally preferable to large-batch methods (the three variants are sketched below)
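
    A minimal NumPy sketch of the three variants, assuming a hypothetical gradient function grad(w, X, y) and dataset (X, y); only the amount of data used per update differs.

        import numpy as np

        def gd_epoch(w, X, y, grad, lr=0.01, batch_size=32):
            # batch_size = 1       -> stochastic gradient descent
            # batch_size = len(X)  -> (full) batch gradient descent
            # anything in between  -> mini-batch gradient descent
            idx = np.random.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                w = w - lr * grad(w, X[batch], y[batch])  # one gradient step
            return w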

    B. Momentum

    - momentum accumulates the gradients of past steps to determine the direction of the current update (sketched below).
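
    A minimal sketch of one momentum update, with hypothetical names w (parameters), v (velocity), g (current gradient):

        def momentum_step(w, v, g, lr=0.01, beta=0.9):
            # v accumulates an exponentially decaying sum of past gradients
            v = beta * v + g
            w = w - lr * v  # step in the accumulated direction
            return w, v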

    C. Nesterov Accelerated Gradient

    - similar to momentum, but it first moves along the accumulated gradient to a lookahead point and then computes the gradient there ("move first, then calculate"; sketched below).
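
    A sketch of one Nesterov update using the same hypothetical names as above; the gradient is evaluated at the lookahead point reached by the accumulated velocity:

        def nag_step(w, v, grad, lr=0.01, beta=0.9):
            g = grad(w - lr * beta * v)  # "move" first, then "calculate"
            v = beta * v + g
            w = w - lr * v
            return w, v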

    D. Adagrad

    - adapts the learning rate per parameter, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones (sketched below).
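
    A sketch of one Adagrad update; G is a hypothetical name for the per-parameter sum of squared gradients:

        import numpy as np

        def adagrad_step(w, G, g, lr=0.01, eps=1e-8):
            G = G + g * g                        # grows monotonically over training
            w = w - lr * g / (np.sqrt(G) + eps)  # large G -> small effective step
            return w, G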

    E. Adadelta

    - extends Adagrad to counter its monotonically decreasing learning rate by restricting the accumulation to a window (an exponential moving average) instead of the full history (sketched below).
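
    A sketch of one Adadelta update; Eg2 and Edw2 are hypothetical names for the exponential moving averages of squared gradients and squared updates. Note the absence of an explicit learning rate:

        import numpy as np

        def adadelta_step(w, Eg2, Edw2, g, rho=0.95, eps=1e-6):
            Eg2 = rho * Eg2 + (1 - rho) * g * g          # windowed average of g^2
            dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g
            Edw2 = rho * Edw2 + (1 - rho) * dw * dw      # windowed average of dw^2
            w = w + dw
            return w, Eg2, Edw2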

    F. RMSprop

    - extends Adagrad by replacing the full sum of squared gradients with an exponential moving average, scaled by an explicit stepsize (sketched below).
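
    A sketch of one RMSprop update; Eg2 is again a hypothetical name for the moving average of squared gradients, and lr is the explicit stepsize:

        import numpy as np

        def rmsprop_step(w, Eg2, g, lr=0.001, rho=0.9, eps=1e-8):
            Eg2 = rho * Eg2 + (1 - rho) * g * g    # forget old squared gradients
            w = w - lr * g / (np.sqrt(Eg2) + eps)  # Adagrad-style scaling plus a stepsize
            return w, Eg2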

    G. Adam

    - Adaptive Moment Estimation keeps moving averages of both the gradients and the squared gradients, effectively combining Momentum and RMSprop (sketched below).
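
    A sketch of one Adam update with hypothetical state names m (first moment) and v (second moment); t is the 1-based step count used for bias correction:

        import numpy as np

        def adam_step(w, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
            m = b1 * m + (1 - b1) * g       # Momentum-style average of gradients
            v = b2 * v + (1 - b2) * g * g   # RMSprop-style average of squared gradients
            m_hat = m / (1 - b1 ** t)       # bias correction for the
            v_hat = v / (1 - b2 ** t)       # zero-initialized moving averages
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
            return w, m, v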

     

    2. Regularization

    A. Early Stopping

    B. Parameter Norm Penalty

    C. Data Augmentation

    D. Label Smoothing

    - related techniques that mix or occlude training examples and soften their labels: Mixup (blend two inputs and their labels), Cutout (remove a random patch), CutMix (paste a patch from another image and mix the labels proportionally); Mixup is sketched below.
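
    A minimal sketch of Mixup on one batch, assuming NumPy arrays x (inputs) and one-hot labels y; the Beta parameter alpha is illustrative:

        import numpy as np

        def mixup_batch(x, y, alpha=0.2):
            lam = np.random.beta(alpha, alpha)     # mixing ratio in [0, 1]
            perm = np.random.permutation(len(x))
            x_mix = lam * x + (1 - lam) * x[perm]  # blend pairs of inputs
            y_mix = lam * y + (1 - lam) * y[perm]  # blend (soften) the labels
            return x_mix, y_mix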

    E. Dropout

    - in each forward pass, randomly set some neurons to zero so the network cannot rely on any single unit (an inverted-dropout sketch follows below)
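
    A sketch of inverted dropout at training time, assuming an activation array a and keep probability p:

        import numpy as np

        def dropout(a, p=0.5, training=True):
            if not training:
                return a                               # identity at inference
            mask = (np.random.rand(*a.shape) < p) / p  # keep with prob. p, rescale by 1/p
            return a * mask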

    F. Batch Normalization

     

     
