Gradient descent is a fundamental optimization algorithm used to find the minimum of a function. This simulator visualizes the process in two dimensions, where the function f(x,y) represents a surface, such as a bowl or an elliptic well. The level sets, or contour lines, of f(x,y) are curves of constant function value, analogous to elevation lines on a topographic map.

The algorithm iteratively updates the current point (x,y) by moving a small step in the direction opposite the function's gradient, following the update rule: (x,y) ← (x,y) − η∇f(x,y). Here, ∇f(x,y) is the gradient vector, which points in the direction of steepest ascent, and η (eta) is the learning rate, a positive scalar controlling the step size. By repeatedly stepping against the gradient, the path descends toward a local minimum.

The visualization demonstrates how the choice of learning rate and starting position affects convergence. A rate that is too small leads to slow progress, while one that is too large can cause overshooting and oscillation, or even divergence. The model simplifies real-world optimization by using smooth, convex functions with a single global minimum, avoiding complexities like saddle points, noise, or high-dimensional parameter spaces. Interacting with this simulation helps learners build intuition for the core mechanics of gradient-based optimization, a principle underpinning machine learning training, engineering design, and various scientific fitting procedures.
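The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's actual code; the example function f(x,y) = x² + 2y², the starting point, and the learning rate are arbitrary choices for demonstration:

```python
def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + 2*y**2, an elliptic bowl
    # whose single minimum sits at the origin.
    return 2 * x, 4 * y

def gradient_descent(x, y, eta=0.1, steps=100):
    # Repeatedly apply (x, y) <- (x, y) - eta * grad f(x, y).
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y

x_min, y_min = gradient_descent(3.0, -2.0)
print(x_min, y_min)  # both very close to 0, the global minimum
```

Note that the two coordinates shrink at different rates (the y-direction is steeper), which is exactly why the descent path on an elliptic well curves rather than heading straight for the center.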
Who it's for: Undergraduate students in calculus, multivariable analysis, or introductory machine learning courses, as well as anyone seeking an intuitive grasp of numerical optimization.
Key terms
Gradient Descent
Gradient
Learning Rate
Level Sets
Contour Plot
Optimization
Convex Function
Local Minimum
How it works
Visualize steepest descent on a smooth convex bowl: a bridge between calculus and optimization.
Frequently asked questions
Why do we move against the gradient, not with it?
The gradient vector ∇f points in the direction of the steepest increase of the function. To minimize the function, we want to go downhill, which is the direction of steepest decrease. Therefore, we subtract the gradient, moving in the direction of -∇f.
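A quick numerical check makes this concrete. Taking f(x,y) = x² + y² as an illustrative bowl (an assumption; the simulator may use other functions), stepping with the gradient increases f while stepping against it decreases f:

```python
def f(x, y):
    return x**2 + y**2  # a simple convex bowl

def grad_f(x, y):
    return 2 * x, 2 * y

x, y = 1.0, 2.0
gx, gy = grad_f(x, y)  # (2.0, 4.0), pointing uphill
step = 0.01
uphill = f(x + step * gx, y + step * gy)    # move with the gradient
downhill = f(x - step * gx, y - step * gy)  # move against it
print(downhill, f(x, y), uphill)  # downhill < f(x, y) < uphill
```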
What happens if I set the learning rate (η) too high?
An excessively high learning rate causes the algorithm to take steps that are too large. This can lead to overshooting the minimum, resulting in oscillations around it or even causing the path to diverge and move away from the minimum entirely, failing to converge.
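The divergence case can be seen analytically on the one-dimensional parabola f(x) = x² (a simplified stand-in for the simulator's 2D surfaces): each step multiplies x by (1 − 2η), so whenever |1 − 2η| > 1 the iterates grow instead of shrinking. A small sketch:

```python
def minimize_parabola(eta, x0=1.0, steps=20):
    # Gradient descent on f(x) = x**2, whose gradient is 2*x.
    # Each update is x <- x - eta * 2 * x = (1 - 2*eta) * x,
    # so the iterates diverge whenever |1 - 2*eta| > 1 (i.e. eta > 1).
    x = x0
    for _ in range(steps):
        x -= eta * 2 * x
    return x

print(minimize_parabola(0.1))  # converges toward 0
print(minimize_parabola(1.1))  # overshoots and grows: divergence
```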
Is the minimum found always the global (lowest) minimum?
Not necessarily. Gradient descent converges to a local minimum. In this simulator, the functions are convex (shaped like a bowl), so there is only one local minimum, which is also the global minimum. In more complex, non-convex functions, the algorithm could get stuck in a local minimum that is not the lowest point overall.
How is this related to machine learning?
Training a machine learning model often involves minimizing a 'loss function' that measures prediction error. Gradient descent is the core algorithm used to adjust the model's parameters (weights and biases) by following the negative gradient of this loss, thereby reducing error step-by-step.
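The same loop applies to a toy learning problem. As a hedged sketch (the data, model y = w·x, and learning rate here are invented for illustration), fitting a single weight by descending the mean squared error loss looks like this:

```python
# Fit y = w * x to toy data by gradient descent on the mean squared
# error loss L(w) = mean((w*x_i - y_i)**2), whose derivative is
# dL/dw = mean(2 * x_i * (w*x_i - y_i)).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated with true weight w = 2

w, eta = 0.0, 0.05
for _ in range(200):
    grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
    w -= eta * grad  # step against the gradient of the loss
print(w)  # close to 2.0
```

Real models have millions of parameters rather than one, but the update is structurally identical: compute the gradient of the loss with respect to each parameter, then step against it.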