How AI Models Shrink to Fit Your Devices: Quantisation and Compression Explained (2026)

Updated June 2026 · 11-minute read

A GPT-4 class language model has hundreds of billions of parameters and needs multiple high-end GPUs to run. The language model behind Alexa's voice processing inside an Echo device runs on a chip with a few hundred megabytes of RAM, drawing a few watts of power. The wake word detection model on the always-on microprocessor inside the same Echo device fits in under 500 KB of flash memory, on a chip drawing less than a milliwatt.

The gap between these scales is not bridged by using simpler or less capable algorithms. The mathematical operations underlying all of these systems are fundamentally similar: matrix multiplications applied to high-dimensional tensors. What bridges the scale gap is model compression: a set of techniques that reduce the size and computational requirements of AI models while preserving as much of their capability as possible.

Understanding model compression is increasingly useful for anyone working with or thinking seriously about AI in consumer hardware, because it determines what AI can realistically run on the hardware that exists in devices today and what will be possible as techniques improve.

Why Compression Is Necessary

Neural networks are conventionally trained using 32-bit floating point arithmetic (FP32). Each parameter, an individual weight in the network, is stored as a 4-byte float that can represent values with about 7 significant decimal digits of precision, with magnitudes ranging from approximately 1.2 x 10^-38 to 3.4 x 10^38.

This precision is used during training because the gradient-based optimisation algorithms used to train networks are sensitive to small changes in parameter values. A parameter update of 0.00001 needs to be distinguishable from 0.00002 for training to converge correctly.

But once a model is trained and used for inference (applying the trained model to new inputs), this level of precision is generally not required. Research has repeatedly demonstrated that the information in a trained model can be represented with much lower precision without significant loss of output quality. This is the fundamental insight that model compression exploits.

Quantisation: Reducing Numerical Precision

Quantisation converts model parameters from high-precision floating point to lower-precision representations, most often integers. The most common targets are INT8 (8-bit signed integer) and INT4 (4-bit signed integer), though conversion to 16-bit float (FP16 or BF16) is also used as a milder, intermediate form of compression.
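To make the savings concrete, here is a minimal sketch of the storage arithmetic. It counts weights only (real model files also carry scales, zero points, and metadata), and the helper names are illustrative rather than taken from any framework. The two parameter counts are the BERT-base and 7B figures used elsewhere in this article.

```python
# Rough weight-storage footprint at different precisions.
# Counts weights only; real files also store scales, zero points and metadata.

BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params: int, fmt: str) -> float:
    """Bytes needed to store num_params weights in the given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8


def human(n_bytes: float) -> str:
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit, size in (("GB", 1e9), ("MB", 1e6), ("KB", 1e3)):
        if n_bytes >= size:
            return f"{n_bytes / size:.2f} {unit}"
    return f"{n_bytes:.0f} B"


if __name__ == "__main__":
    models = (("BERT-base", 110_000_000), ("7B LLM", 7_000_000_000))
    for name, params in models:
        sizes = ", ".join(
            f"{fmt}: {human(weight_bytes(params, fmt))}" for fmt in BITS_PER_WEIGHT
        )
        print(f"{name}: {sizes}")
```

Each halving of the bit width halves the weight storage, so INT8 is a quarter of FP32 and INT4 an eighth.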

How quantisation works mathematically

For INT8 quantisation, the range of FP32 values in a given tensor is mapped to the 256 representable values of an 8-bit integer (-128 to 127 for signed, 0 to 255 for unsigned). This mapping requires two parameters: a scale factor that determines the size of each quantisation step, and a zero point that maps the float zero to an integer value.

The quantisation of a float value x to an integer q is:

q = round(x / scale) + zero_point

And dequantisation (recovering an approximate float from an integer) is:

x_approx = (q - zero_point) * scale

The quantisation error, the difference between the original float and the recovered approximation, is bounded by half the quantisation step size (scale/2) for values that fall inside the representable range; values outside it are clipped to the nearest limit and can incur a larger error. For typical neural network weight distributions, which are approximately Gaussian centred near zero, this error is small relative to the magnitude of the weights, which is why INT8 quantisation preserves most of the model's accuracy.
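The two formulas above translate almost directly into code. The sketch below is illustrative only: the scale and zero point are hypothetical values, and the clamp to the INT8 range is an added step not shown in the formulas (without it, out-of-range inputs would overflow the integer type).

```python
def quantise(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    """q = round(x / scale) + zero_point, clamped to the integer range."""
    # Python's round() rounds exact halves to the nearest even integer.
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantise(q: int, scale: float, zero_point: int) -> float:
    """x_approx = (q - zero_point) * scale."""
    return (q - zero_point) * scale


if __name__ == "__main__":
    # Hypothetical parameters covering roughly [-2.56, 2.54].
    scale, zero_point = 0.02, 0
    for x in (0.0, 0.013, -0.731, 1.999, 5.0):
        q = quantise(x, scale, zero_point)
        x_approx = dequantise(q, scale, zero_point)
        error = abs(x - x_approx)
        print(f"x={x:+.3f} q={q:+4d} x_approx={x_approx:+.3f} error={error:.4f}")
```

The first four inputs land within scale/2 of their originals. The last one (5.0) lies outside the representable range, so it is clipped to the largest integer and its error is much larger.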

Post-training quantisation

Post-training quantisation (PTQ) applies the quantisation process after training is complete, without modifying the original training procedure. A calibration dataset (a representative sample of typical inputs) is passed through the model to determine the ranges of values in each layer's activations, and the scale and zero point parameters are chosen to minimise quantisation error for these ranges.

PTQ is fast and convenient: it requires no retraining, only a calibration step that takes minutes. The accuracy cost is typically 0.5 to 2 percent for INT8 quantisation on standard benchmarks, which is acceptable for most deployment scenarios. For INT4 quantisation, the accuracy cost of naive PTQ is larger, often 2 to 5 percent, and more sophisticated techniques are needed to maintain quality.
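As a rough illustration of the calibration step, the sketch below derives a scale and zero point from the minimum and maximum of some observed values. Min/max calibration is the simplest possible choice and is assumed here for clarity; production toolchains often use more robust range estimates. The function name and sample values are hypothetical.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min and max of observed values."""
    # Extend the range to include 0.0 so that float zero maps exactly
    # onto an integer.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    if hi == lo:
        return 1.0, 0  # degenerate case: every observed value was zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


if __name__ == "__main__":
    # Hypothetical activations recorded after a ReLU, so none are negative.
    observed = [0.0, 0.4, 1.5, 3.2, 6.0]
    scale, zero_point = calibrate(observed)
    print(f"scale={scale:.6f} zero_point={zero_point}")
    # Float zero maps to the zero point; the largest value maps to qmax.
    print(round(0.0 / scale) + zero_point, round(6.0 / scale) + zero_point)
```

Because these activations are never negative, the zero point sits at the bottom of the integer range, and all 256 levels are spent on the values that actually occur.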

Quantisation-aware training

Quantisation-aware training (QAT) incorporates the quantisation operation into the training process itself, using fake quantisation nodes that simulate the quantisation error in the forward pass while letting gradients flow through them in the backward pass (typically via a straight-through estimator). The model learns to represent its function with the limited precision available after quantisation, and the resulting quantised model typically loses less accuracy than one produced by PTQ.

QAT requires access to the training pipeline and training data, and can take as long as a full training run when done from scratch (fine-tuning an already trained model is cheaper), which makes it more expensive than PTQ. For models where the accuracy loss from PTQ is unacceptable, particularly for INT4 or lower bit-width targets, QAT is typically necessary.
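The idea can be shown with a toy one-weight model. This is a sketch under simplifying assumptions, not a production QAT recipe: the model is y = w * x, the 4-bit grid uses a hypothetical step of 0.1, and the gradient is passed straight through the rounding step as if it were the identity. The full-precision weight is what gets updated, but the forward pass only ever sees its quantised value.

```python
def fake_quant(w: float, scale: float, qmin: int = -8, qmax: int = 7) -> float:
    """Quantise then immediately dequantise: the result is a float on the INT4 grid."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train(data, scale=0.1, lr=0.05, epochs=200):
    """Fit y = w * x with squared error, simulating quantisation in the forward pass."""
    w = 0.0  # latent full-precision weight, updated by gradient descent
    for _ in range(epochs):
        for x, y in data:
            w_q = fake_quant(w, scale)       # forward pass uses the quantised weight
            grad_wq = 2 * (w_q * x - y) * x  # d(loss)/d(w_q)
            w -= lr * grad_wq                # straight-through: d(w_q)/d(w) treated as 1
    return w, fake_quant(w, scale)


if __name__ == "__main__":
    # Hypothetical data generated by a true weight of 0.37, which is not on the grid.
    data = [(x, 0.37 * x) for x in (1.0, 2.0, 3.0)]
    w_latent, w_deployed = train(data)
    print(f"latent weight={w_latent:.3f} deployed weight={w_deployed:.1f}")
```

The latent weight hovers near the boundary between two grid points, and the deployed weight ends up on one of the grid values adjacent to the true weight.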

Pruning: Removing Unnecessary Parameters

Neural network pruning removes parameters that contribute minimally to the model's output, creating a smaller model. The insight behind pruning is empirical: trained neural networks are typically over-parameterised, with many weights that are close to zero or that can be removed without significantly changing the model's behaviour.

Magnitude-based pruning

The simplest pruning approach removes weights with the smallest absolute magnitude, on the principle that small weights contribute least to the model's computation. A sparsity target (for example, remove 50 percent of weights) is specified, and the weights with the smallest magnitudes are set to zero.

A single pruning step typically degrades model accuracy, because some of the removed weights were small but important. This is addressed by iterative pruning with fine-tuning: prune a fraction of weights, fine-tune the model for a few iterations to recover accuracy, prune again, and repeat until the desired sparsity is reached. This gradual approach allows the remaining weights to compensate for the removed ones.
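A minimal sketch of magnitude-based pruning with a gradual sparsity schedule follows. The weights are hypothetical, and `fine_tune` is a placeholder that returns its input unchanged; in a real pipeline it would run a few training steps while keeping the pruned positions at zero.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by magnitude, smallest first. Weights that are already
    # zero sort to the front, so repeated calls are cumulative.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def fine_tune(weights):
    """Placeholder (hypothetical): a real version would retrain the surviving weights."""
    return weights


if __name__ == "__main__":
    weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.2, -0.02, 0.4]
    # Gradual schedule: 25 percent, then 50 percent total sparsity.
    for target in (0.25, 0.5):
        weights = fine_tune(magnitude_prune(weights, target))
        print(f"sparsity {target:.0%}: {weights}")
```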

Structured vs unstructured pruning

Unstructured pruning removes individual weights regardless of their position in the weight matrices, creating sparse matrices with zeros scattered throughout. This achieves good accuracy for a given compression ratio but is difficult to accelerate on standard hardware because sparse matrix operations are not efficiently implemented on most processors.

Structured pruning removes entire neurons, attention heads, or layers, creating a smaller dense model rather than a sparse one. A dense 50-parameter model runs faster than a sparse 100-parameter model with 50 zeros, because the dense model requires fewer memory accesses and does not require sparse matrix kernels. Structured pruning is preferred for deployment on standard processors, while unstructured pruning is used when specialised sparse accelerator hardware is available.
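The difference is easiest to see in the shapes. The sketch below removes whole hidden neurons from a two-layer network and returns smaller dense matrices. Ranking neurons by the L2 norm of their incoming weights is one simple importance measure assumed here for illustration; the function name and matrices are hypothetical.

```python
import math


def prune_neurons(w1, w2, keep):
    """Keep the `keep` hidden neurons whose incoming weights have the largest L2 norm.

    w1: hidden x inputs (one row per hidden neuron)
    w2: outputs x hidden (one column per hidden neuron)
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in w1]
    ranked = sorted(range(len(w1)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    new_w1 = [w1[i] for i in kept]                   # drop rows of the first layer
    new_w2 = [[row[i] for i in kept] for row in w2]  # drop matching columns of the next
    return new_w1, new_w2


if __name__ == "__main__":
    w1 = [[1.0, 2.0], [0.01, 0.02], [3.0, -1.0], [0.0, 0.1]]
    w2 = [[0.5, 0.1, -0.2, 0.3]]
    new_w1, new_w2 = prune_neurons(w1, w2, keep=2)
    print(new_w1)
    print(new_w2)
```

Note that removing a neuron means deleting both its row in one layer and its column in the next, which is why the result needs no sparse kernels.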

Knowledge Distillation: Learning From a Larger Model

Knowledge distillation trains a small model (the student) to imitate the behaviour of a large, accurate model (the teacher) rather than to match the original training labels directly. This allows the small model to benefit from the generalisation patterns learned by the large model, achieving better performance than a small model trained from scratch.

How distillation works

In standard supervised learning, a model is trained to match hard labels: the correct answer for each training example. In knowledge distillation, the student is trained to match the teacher's soft outputs: the probability distribution the teacher assigns across all classes for each input.

The soft outputs carry more information than hard labels. For an image classification model, the teacher might assign 85 percent probability to "cat" and 10 percent probability to "small dog" and 5 percent to "rabbit" for a given image. A student trained on just the hard label (cat, 100 percent) does not learn that cats and small dogs are visually similar. A student trained on the teacher's soft outputs implicitly learns these relationships.

The training objective combines the standard cross-entropy loss against the hard labels with a distillation term that measures how closely the student's outputs match the teacher's outputs. The relative weight of these two terms is a hyperparameter that balances learning from the teacher versus learning from the ground truth labels.
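A minimal sketch of such a combined loss for a single training example is shown below. It follows the temperature-softened formulation from Hinton et al. (2015), including the temperature-squared factor on the soft term. The names `alpha` and `temperature`, their default values, and the example logits are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * cross-entropy(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: divergence between the softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    teacher = [4.0, 2.0, 1.0]   # hypothetical logits for cat, small dog, rabbit
    student = [2.5, 0.5, 1.5]
    print(f"loss={distillation_loss(student, teacher, label=0):.4f}")
```

Setting `alpha` to 1 recovers ordinary supervised training; setting it to 0 trains the student on the teacher's outputs alone.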

Real applications: TinyBERT and MobileBERT

TinyBERT and MobileBERT are distilled versions of BERT (Bidirectional Encoder Representations from Transformers), demonstrating knowledge distillation at scale. TinyBERT has 4 layers and 14.5 million parameters, compared to BERT-base's 12 layers and 110 million parameters, and achieves about 96 percent of BERT-base's performance on language understanding benchmarks while running over 9 times faster and requiring 7 times less storage. MobileBERT similarly reduces BERT's size while targeting mobile deployment. These models can run natural language understanding tasks on smartphone hardware that BERT-base cannot practically run on.

Combining Techniques: The Real-World Pipeline

In practice, deploying AI on constrained hardware requires combining multiple compression techniques rather than using any single approach in isolation. A typical pipeline for deploying a model to a mobile or embedded device:

Architecture selection: Choose a model architecture designed for efficiency at the target scale, such as MobileNetV3, EfficientNet-Lite, or a custom architecture, rather than the full ResNet or ViT used in research.
Distillation (optional): If a larger teacher model exists, distil the student to recover the knowledge of the larger model in the smaller architecture.
Quantisation-aware training: Train or fine-tune with simulated quantisation to prepare the model for INT8 or INT4 deployment.
Structured pruning with fine-tuning: Iteratively remove low-importance structures and fine-tune to recover accuracy.
Final quantisation: Convert the model to INT8 or INT4, using calibration data to set any activation ranges not already fixed during quantisation-aware training.
Hardware-specific optimisation: Convert to the target runtime format (TensorFlow Lite, ONNX, or custom format), apply any hardware-specific graph optimisations, and benchmark on the target device.

This pipeline can reduce a model that originally required gigabytes of storage and powerful cloud hardware to something that fits in megabytes and runs efficiently on an embedded processor.

What These Techniques Enable in Consumer Devices

Device	Compression Used	Model Size	What It Enables
Echo Dot (wake word chip)	Heavy quantisation and pruning	Under 500KB	Always-on keyword spotting at under 1mW
iPhone 17 (Writing Tools)	INT8 quant + distillation	Tens to hundreds of MB	On-device text AI without cloud
Sony WH-1000XM6 (ANC)	Quantised DSP model	Under 1MB	Real-time noise cancellation at 2ms latency
Roborock vacuum (obstacle AI)	Quantised MobileNet variant	5 to 20MB	Real-time object detection at 15fps
Oura Ring (health monitoring)	Heavily quantised embedded model	Under 1MB	Continuous HRV and activity on ring battery
Snapdragon 8 Elite (on-device LLM)	INT4 quant + distillation	4 to 8GB	7B parameter LLM on mobile

The Frontier: 1-Bit and Ternary Quantisation

Research has pushed quantisation below 4 bits to extreme compression ratios. Binary neural networks (1-bit weights) and ternary networks (weights of -1, 0, or +1) achieve extreme compression at significant accuracy cost. These are primarily research topics rather than production deployment options, except for very specific tasks where the accuracy requirements are modest.
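For a sense of what these representations look like, the sketch below converts a small weight list to binary and ternary codes, each with one shared floating point scale. The `threshold_ratio` value is a hypothetical tunable chosen for illustration, and this post-hoc conversion is not the training procedure used by any particular paper.

```python
def binarise(weights):
    """1-bit weights: keep only the sign, plus one shared scale for the tensor."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [1 if w >= 0 else -1 for w in weights], alpha


def ternarise(weights, threshold_ratio=0.7):
    """Weights in {-1, 0, +1}: small weights become 0, the rest keep their sign."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs  # magnitudes at or below this become 0
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0  # scale for the surviving weights
    return codes, alpha


if __name__ == "__main__":
    weights = [0.8, -0.05, 0.3, -0.6]
    for name, (codes, alpha) in (("binary", binarise(weights)),
                                 ("ternary", ternarise(weights))):
        approx = [c * alpha for c in codes]
        print(f"{name}: codes={codes} scale={alpha:.4f} approx={approx}")
```

Comparing the reconstructed values with the originals shows where the accuracy cost comes from: every surviving weight collapses to the same magnitude.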

Microsoft's BitNet work and related research have explored training transformers from scratch at 1-bit precision rather than quantising after training, finding that models trained at 1-bit precision from the beginning degrade less than models quantised to 1 bit after training. This is an active research area with potential implications for ultra-low-power language model deployment.

Frequently Asked Questions

Does quantisation always reduce accuracy?

Usually, but often only slightly. INT8 quantisation of a well-trained model typically reduces accuracy by 0.5 to 2 percent on standard benchmarks, which is acceptable for most applications. INT4 quantisation with naive post-training quantisation may reduce accuracy by 2 to 5 percent. Using quantisation-aware training and more sophisticated calibration methods can reduce this gap significantly. Some models are more sensitive to quantisation than others: vision transformers tend to be more sensitive than convolutional networks, and recurrent models can be particularly difficult to quantise effectively.

What is the difference between model compression and model distillation?

Model compression is the general term for all techniques that reduce model size or computational requirements: it includes quantisation, pruning, distillation, and architectural efficiency improvements. Distillation is one specific compression technique that involves training a smaller model to imitate a larger one. Distillation changes the model's training process, whereas quantisation and pruning are post-hoc modifications to a trained model (though they can also be incorporated into training).

Why do different frameworks give different model sizes for the same model?

Different frameworks apply different default optimisations and store models in different formats. A PyTorch model saved as a .pt file typically stores FP32 weights, unless the model was explicitly cast to a lower precision, along with additional framework metadata. The same model exported to TensorFlow Lite with INT8 quantisation might be about 75 percent smaller, since each weight shrinks from 4 bytes to 1. Converting to ONNX and then applying ONNX Runtime optimisations might give a different size again. The reported model size depends critically on what format and what optimisations have been applied, which is why model size comparisons are only meaningful when the format and quantisation level are specified.

What is layer fusion and how does it relate to model compression?

Layer fusion is a graph-level optimisation that combines multiple consecutive operations into a single operation during inference. A common example is fusing a convolution layer, a batch normalisation layer, and a ReLU activation: instead of computing three separate operations with intermediate memory accesses, all three are computed in a single fused pass. Layer fusion reduces memory bandwidth requirements and improves cache efficiency, which can produce significant speed improvements on memory-bandwidth-limited hardware like embedded processors. Most inference runtimes including TensorFlow Lite, ONNX Runtime, and TensorRT apply layer fusion automatically as part of their model optimisation pipelines.
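Part of this fusion is plain algebra: at inference time batch normalisation is a fixed affine transform, so it can be folded into the weights and bias of the preceding layer. The sketch below shows the folding for a single output channel, using a dot product as a stand-in for the convolution. All names and numbers are illustrative assumptions, not any runtime's API.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * (z - mean) / sqrt(var + eps) + beta into the preceding layer.

    Returns new weights and bias so that one affine op replaces layer + batch norm.
    """
    factor = gamma / math.sqrt(var + eps)
    return [w * factor for w in weights], (bias - mean) * factor + beta


def relu(v):
    return max(0.0, v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Hypothetical parameters for one output channel.
    weights, bias = [0.5, -1.2, 0.3], 0.1
    gamma, beta, mean, var, eps = 1.5, -0.2, 0.05, 0.8, 1e-5
    x = [1.0, 0.5, -2.0]

    # Unfused: three separate steps with intermediate values.
    z = dot(weights, x) + bias
    normalised = gamma * (z - mean) / math.sqrt(var + eps) + beta
    unfused = relu(normalised)

    # Fused: one affine op followed directly by the activation.
    fw, fb = fold_batchnorm(weights, bias, gamma, beta, mean, var, eps)
    fused = relu(dot(fw, x) + fb)
    print(unfused, fused)
```

The two results agree up to floating point rounding, but the fused version reads and writes the intermediate tensor once instead of three times.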

Technical content in this article draws on published research including the original knowledge distillation paper by Hinton et al. (2015), TinyBERT (Jiao et al., 2020), BitNet (Wang et al., 2023), and documentation from TensorFlow Lite and ONNX Runtime teams. Model size and accuracy figures reference published benchmark results and may vary across implementations.