Toy 2-Layer MLP + Backprop (XOR / spiral)

This page implements the smallest nonlinear classifier that still matches the “neural network” story taught in intro ML: a single hidden layer with tanh activations and a logistic (sigmoid) output for binary labels. Given labeled points (x, y), the network computes z₁ = W₁ x + b₁, a₁ = tanh(z₁), z₂ = w₂ᵀ a₁ + b₂, and ŷ = σ(z₂). Training minimizes mean binary cross-entropy over the full dataset using full-batch gradient descent—the same gradient-averaging idea as large minibatches, but deterministic and easy to read on a chalkboard. Backpropagation is just the chain rule applied in reverse: the output error ŷ − y propagates to W₂ and b₂, then through tanh′(z₁) = 1 − tanh²(z₁) to W₁ and b₁. A dense heatmap of σ(z₂(x, y)) shows how the decision boundary (roughly the 0.5 level set) reshapes after each epoch block; misclassified training points get a white ring so overfitting is visible instantly. Presets include XOR (linearly inseparable) and two spirals (highly non-convex).
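
A minimal NumPy sketch of the forward pass and loss just described is below. The function names, array shapes, and the eps clamp are illustrative assumptions, not the simulator's actual source.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass: X is (n, 2), W1 is (h, 2), b1 is (h,), W2 is (h,), b2 is a scalar."""
    Z1 = X @ W1.T + b1                    # (n, h) hidden pre-activations
    A1 = np.tanh(Z1)                      # (n, h) hidden activations
    z2 = A1 @ W2 + b2                     # (n,)   output pre-activation
    y_hat = 1.0 / (1.0 + np.exp(-z2))     # (n,)   sigmoid probability of class 1
    return Z1, A1, z2, y_hat

def mean_bce(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over the full dataset (y in {0, 1})."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```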

Who it's for: Students learning MLP notation, BCE loss, and manual backprop before frameworks; pairs naturally with the axis-aligned decision-tree lab on this site.

Key terms

  • Multilayer perceptron
  • Tanh activation
  • Sigmoid output
  • Binary cross-entropy
  • Full-batch gradient descent
  • Backpropagation
  • Decision boundary
  • XOR problem

Architecture: z₁ = W₁x + b₁, a₁ = tanh(z₁), ŷ = σ(W₂a₁ + b₂). Loss = mean BCE; gradients averaged over all points (full batch).

Shortcuts

  • Click — add point (class buttons)
  • Drag — move point
  • Shift+click — delete nearest
  • R — re-randomize weight init (same data)

Measured values

  • Points: 96
  • Epochs trained: 0
  • Mean BCE loss: 0.7865
  • Train accuracy: 45.8%

How it works

A two-layer perceptron maps (x, y) through a tanh hidden layer, then a logistic output σ(z) trained by full-batch gradient descent on binary cross-entropy. Backpropagation applies the chain rule: the error at the sigmoid output is ŷ − y; it flows to W₂ and b₂, then through tanh′(z₁) = 1 − tanh²(z₁) to W₁ and b₁. The heatmap shows P(class 1) over the plane so you can watch the decision boundary (≈0.5 contour) move after each epoch block. Try XOR (needs curvature) versus intertwined spirals (needs many cuts); increase the hidden width or learning rate to see the speed vs. stability trade-off.
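
Continuing the hypothetical sketch above, one full-batch training step could look like the following. It reuses the assumed forward and mean_bce helpers; the (ŷ − y)/n term is where the sigmoid and BCE derivatives cancel, and the rest is the chain rule written out.

```python
def train_step(X, y, W1, b1, W2, b2, lr=0.5):
    """One full-batch gradient-descent step; returns updated parameters and the loss."""
    n = X.shape[0]
    Z1, A1, z2, y_hat = forward(X, W1, b1, W2, b2)

    # Sigmoid + BCE: dL/dz2 simplifies to (y_hat - y), averaged over the batch.
    delta2 = (y_hat - y) / n                      # (n,)
    gW2 = A1.T @ delta2                           # (h,)
    gb2 = delta2.sum()

    # Chain rule through tanh: tanh'(Z1) = 1 - tanh(Z1)**2 = 1 - A1**2.
    delta1 = np.outer(delta2, W2) * (1 - A1**2)   # (n, h)
    gW1 = delta1.T @ X                            # (h, 2)
    gb1 = delta1.sum(axis=0)                      # (h,)

    # Plain gradient descent on every parameter.
    return (W1 - lr * gW1, b1 - lr * gb1,
            W2 - lr * gW2, b2 - lr * gb2,
            mean_bce(y_hat, y))
```

Calling a step like this in a loop and re-drawing the σ(z₂) heatmap after every block of epochs reproduces the behaviour described above.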

Frequently asked questions

Why use tanh instead of ReLU here?
ReLU is the default in deep networks for speed and sparsity, but tanh is smooth and bounded, which makes the loss landscape easier for this tiny teaching model and matches many textbook derivations. You can still see vanishing gradients if weights saturate.
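
As a quick numeric illustration of that saturation (an aside, not from the page):

```python
import numpy as np

# tanh'(z) = 1 - tanh(z)**2 collapses toward zero once |z| reaches a few units,
# so a saturated hidden unit passes almost no gradient back to W1.
for z in (0.0, 1.0, 3.0, 6.0):
    print(f"z = {z}:  tanh'(z) = {1 - np.tanh(z)**2:.5f}")
# z = 0.0: 1.00000   z = 1.0: 0.41997   z = 3.0: 0.00987   z = 6.0: 0.00002
```
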
Is full-batch the same as stochastic gradient descent?
SGD uses small random subsets (or a single point) per step; full-batch averages the gradient over all points for each update. Here the batch is the entire dataset for clarity: the loss curve is noise-free, but with only one update per pass over the data, progress can be slow for large N.
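
A rough sketch of the mechanical difference, assuming a hypothetical grad(params, X, y) that returns the averaged BCE gradient for whatever batch it receives:

```python
import numpy as np

def full_batch_epoch(params, X, y, lr, grad):
    # One deterministic update per pass: gradient averaged over every point.
    return params - lr * grad(params, X, y)

def sgd_epoch(params, X, y, lr, grad, batch_size=8, seed=0):
    # Many noisy updates per pass, each from a small random subset.
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        params = params - lr * grad(params, X[batch], y[batch])
    return params
```
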
Why might spirals stay partly wrong even after many epochs?
This is a very small network; unlike a decision tree it is not restricted to axis-aligned cuts, but its capacity is still limited, and spirals are hard. Try more hidden units, a smaller learning rate for stability, or many more epochs; sometimes the boundary needs fine "wiggles" that require both width and patience.