Toy 2-Layer MLP + Backprop (XOR / spiral)

This page implements the smallest nonlinear classifier that still matches the “neural network” story taught in intro ML: a single hidden layer with tanh activations and a logistic (sigmoid) output for binary labels. Given labeled points (x, y), the network computes z₁ = W₁ x + b₁, a₁ = tanh(z₁), z₂ = w₂ᵀ a₁ + b₂, and ŷ = σ(z₂). Training minimizes mean binary cross-entropy over the full dataset using full-batch gradient descent—the same gradient-averaging idea as large minibatches, but deterministic and easy to read on a chalkboard. Backpropagation is just the chain rule applied in reverse: the output error ŷ − y propagates to W₂ and b₂, then through tanh′(z₁) = 1 − tanh²(z₁) to W₁ and b₁. A dense heatmap of σ(z₂(x, y)) shows how the decision boundary (roughly the 0.5 level set) reshapes after each epoch block; misclassified training points get a white ring so overfitting is visible instantly. Presets include XOR (linearly inseparable) and two spirals (highly non-convex).
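
A minimal NumPy sketch of the forward pass and loss just described is below. The function names, array shapes, and the eps clamp are illustrative assumptions, not the simulator's actual source.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass: X is (n, 2), W1 is (h, 2), b1 is (h,), W2 is (h,), b2 is a scalar."""
    Z1 = X @ W1.T + b1                    # (n, h) hidden pre-activations
    A1 = np.tanh(Z1)                      # (n, h) hidden activations
    z2 = A1 @ W2 + b2                     # (n,)   output pre-activation
    y_hat = 1.0 / (1.0 + np.exp(-z2))     # (n,)   sigmoid probability of class 1
    return Z1, A1, z2, y_hat

def mean_bce(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over the full dataset (y in {0, 1})."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```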

Who it's for: Students learning MLP notation, BCE loss, and manual backprop before frameworks; pairs naturally with the axis-aligned decision-tree lab on this site.

Key terms

  • Multilayer perceptron
  • Tanh activation
  • Sigmoid output
  • Binary cross-entropy
  • Full-batch gradient descent
  • Backpropagation
  • Decision boundary
  • XOR problem

Architecture: z₁ = W₁x + b₁, a₁ = tanh(z₁), ŷ = σ(W₂a₁ + b₂). Loss = mean BCE; gradients averaged over all points (full batch).

Shortcuts

  • Click — add point (class buttons)
  • Drag — move point
  • Shift+click — delete nearest
  • R — re-randomize weight init (same data)

Measured values

  • Points: 96
  • Epochs trained: 0
  • Mean BCE loss: 0.7865
  • Train accuracy: 45.8%

How it works

A two-layer perceptron maps (x, y) through a tanh hidden layer, then a logistic output σ(z) trained by full-batch gradient descent on binary cross-entropy. Backpropagation applies the chain rule: the error at the sigmoid output is ŷ − y; it flows to W₂ and b₂, then through tanh′(z₁) = 1 − tanh²(z₁) to W₁ and b₁. The heatmap shows P(class 1) over the plane so you can watch the decision boundary (≈0.5 contour) move after each epoch block. Try XOR (needs curvature) versus intertwined spirals (needs many cuts); increase the hidden width or learning rate to see the speed vs. stability trade-off.
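
Continuing the hypothetical sketch above, one full-batch training step could look like the following. It reuses the assumed forward and mean_bce helpers; the (ŷ − y)/n term is where the sigmoid and BCE derivatives cancel, and the rest is the chain rule written out.

```python
def train_step(X, y, W1, b1, W2, b2, lr=0.5):
    """One full-batch gradient-descent step; returns updated parameters and the loss."""
    n = X.shape[0]
    Z1, A1, z2, y_hat = forward(X, W1, b1, W2, b2)

    # Sigmoid + BCE: dL/dz2 simplifies to (y_hat - y), averaged over the batch.
    delta2 = (y_hat - y) / n                      # (n,)
    gW2 = A1.T @ delta2                           # (h,)
    gb2 = delta2.sum()

    # Chain rule through tanh: tanh'(Z1) = 1 - tanh(Z1)**2 = 1 - A1**2.
    delta1 = np.outer(delta2, W2) * (1 - A1**2)   # (n, h)
    gW1 = delta1.T @ X                            # (h, 2)
    gb1 = delta1.sum(axis=0)                      # (h,)

    # Plain gradient descent on every parameter.
    return (W1 - lr * gW1, b1 - lr * gb1,
            W2 - lr * gW2, b2 - lr * gb2,
            mean_bce(y_hat, y))
```

Calling a step like this in a loop and re-drawing the σ(z₂) heatmap after every block of epochs reproduces the behaviour described above.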

Frequently asked questions

Why use tanh instead of ReLU here?
ReLU is the default in deep networks for speed and sparsity, but tanh is smooth and bounded, which makes the loss landscape easier for this tiny teaching model and matches many textbook derivations. You can still see vanishing gradients if weights saturate.
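
As a quick numeric illustration of that saturation (an aside, not from the page):

```python
import numpy as np

# tanh'(z) = 1 - tanh(z)**2 collapses toward zero once |z| reaches a few units,
# so a saturated hidden unit passes almost no gradient back to W1.
for z in (0.0, 1.0, 3.0, 6.0):
    print(f"z = {z}:  tanh'(z) = {1 - np.tanh(z)**2:.5f}")
# z = 0.0: 1.00000   z = 1.0: 0.41997   z = 3.0: 0.00987   z = 6.0: 0.00002
```
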
Is full-batch the same as stochastic gradient descent?
SGD uses small random subsets (or a single point) per step; full-batch averages the gradient over all points for each update. Here the batch is the entire dataset for clarity: the loss curve is noise-free, but with only one update per pass over the data, progress can be slow for large N.
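
A rough sketch of the mechanical difference, assuming a hypothetical grad(params, X, y) that returns the averaged BCE gradient for whatever batch it receives:

```python
import numpy as np

def full_batch_epoch(params, X, y, lr, grad):
    # One deterministic update per pass: gradient averaged over every point.
    return params - lr * grad(params, X, y)

def sgd_epoch(params, X, y, lr, grad, batch_size=8, seed=0):
    # Many noisy updates per pass, each from a small random subset.
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        params = params - lr * grad(params, X[batch], y[batch])
    return params
```
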
Why might spirals stay partly wrong even after many epochs?
This is a very small network; unlike a decision tree it is not restricted to axis-aligned cuts, but its capacity is still limited, and spirals are hard. Try more hidden units, a smaller learning rate for stability, or many more epochs; sometimes the boundary needs fine "wiggles" that require both width and patience.