cultivate your garden
February 2, 2026

The Sigmoid Function


While learning more about the building blocks of machine learning, I've found myself compelled by a math concept called the sigmoid function. As far as I can tell, this function was discovered back in the early 1800s, and is having a resurgence in popularity as a result of its involvement in training neural networks. But let me not get ahead of myself.



What is the Sigmoid Function?


The sigmoid function is any mathematical formula whose graph has an s-shape, formally known as a "sigmoid curve." An example would be the logistic function, which is defined by this formula:


the logistic function:

    σ(x) = 1 / (1 + e^(−x))

where σ(x) is the output (always between 0 and 1), x is the input (any number, from −∞ to +∞), and e is Euler's number (≈ 2.718, a constant like π). For a big input, e^(−x) shrinks and the output approaches 1; for a small (very negative) input, e^(−x) grows and the output approaches 0.

In the field of artificial intelligence, the term sigmoid function is used interchangeably with the logistic function, so we'll stick with this example throughout, though there are technically other sigmoid functions with similarly shaped graphs.


The power and elegance of the sigmoid function is that it takes noisy, erratic data and smoothly transforms it into a probability by mapping any number onto a scale between 0 and 1. If the outputs of our machine learning model are -181.2, 38.1, and 0.004, those numbers don't mean much at a glance. But after passing them through the sigmoid function, each lands between 0 and 1: essentially a likelihood that the input corresponds to our target output.


squashing any number into 0–1:

    raw input    sigmoid output    as a probability
    -181.2       ≈ 0.0000          ≈ 0%   (very unlikely)
    0.004        ≈ 0.5010          ≈ 50%  (uncertain)
    38.1         ≈ 1.0000          ≈ 100% (very likely)

Large negative numbers land near 0, numbers near zero land near 0.5, and large positive numbers land near 1.
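The squashing above fits in a few lines of Python. This is just a sketch of the logistic function applied to the example values, not anything from a real model:

```python
import math

def sigmoid(x: float) -> float:
    # Logistic function: 1 / (1 + e^-x); output is always in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

raw_outputs = [-181.2, 0.004, 38.1]
probabilities = [sigmoid(x) for x in raw_outputs]
# ≈ [0.0, 0.501, 1.0]
```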

This is most commonly and effectively applied to binary classification, meaning you are trying to teach a machine to predict something where the answer is either yes or no.


One example is training a computer to predict whether an image does or does not contain a flower. When we provide a computer with a picture, it uses complicated math to try to understand the image based on vectors and spits out a number that probably isn't comprehensible to humans. But by mapping it onto the sigmoid function, we're able to turn the result into a percent probability we can understand.


This is also used to improve the computer's training, as these values let us calculate "loss"—a measure of how right or wrong the model's predictions are. If the model was 90% sure each image contained a flower, and it was right each time, the loss is low—the model did well. If it was only 50% sure and guessed right, the loss is higher, signaling room for improvement. And if it was 90% sure but wrong, the loss spikes. This feedback loop is how the model learns.
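To make those three scenarios concrete, here's a sketch of binary cross-entropy, the standard loss function for this kind of yes/no prediction (the function name is mine):

```python
import math

def bce_loss(predicted: float, actual: int) -> float:
    # Binary cross-entropy: small when a confident prediction is right,
    # huge when a confident prediction is wrong
    return -(actual * math.log(predicted) + (1 - actual) * math.log(1 - predicted))

bce_loss(0.9, 1)  # 90% sure it's a flower, and it is:    ≈ 0.105 (low)
bce_loss(0.5, 1)  # only 50% sure, and it is:             ≈ 0.693 (higher)
bce_loss(0.9, 0)  # 90% sure it's a flower, but it isn't: ≈ 2.303 (spikes)
```

During training, the model's weights are nudged in whichever direction reduces this number.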



Activation Functions and Alternatives


Sigmoid is one example of an activation function—the step in a neural network where each neuron decides how strongly to "fire" based on what it received. For more on neurons, refer to my article on how Claude actually works. Think of it as a gate that shapes the signal before passing it along.


In recent years, other more efficient activation functions have replaced the sigmoid function in many contexts, especially when the input and output values are not binary (0 or 1). Another commonly used activation function is ReLU, which simply outputs zero for any negative input and passes positive values through unchanged. It's computationally cheaper and avoids some training problems that sigmoid can cause in deep networks.


sigmoid vs ReLU:

    sigmoid: smooth S-curve; output always between 0 and 1
    ReLU:    zero below 0, linear above; output from 0 to +∞
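The contrast is easy to see side by side. A quick sketch comparing the two at a few inputs:

```python
import math

def sigmoid(x: float) -> float:
    # Smooth S-curve: every input squashed into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    # ReLU: zero for negative inputs, passes positive inputs through unchanged
    return max(0.0, x)

for x in [-2.0, 0.0, 2.0]:
    print(f"x = {x:5}   sigmoid ≈ {sigmoid(x):.3f}   relu = {relu(x)}")
```

Because ReLU is just a comparison and no exponential, it's much cheaper to compute at the scale of a large network.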


I hope this has been an interesting note on a small concept that really captured my attention in its mathematical elegance.