one dimension than in another [1], which are common around local optima. ... This anticipatory update prevents us from going too fast and results in increased responsiveness, which ... a positive slope, while the other dimension has a negative slope, which pose a difficulty for SGD as ... This only works if the input data is sparse, as each update will only modify a fraction of all parameters ... Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014).