Quantization in Deep Learning: Shrinking Models Without Sacrificing Performance
1. Introduction
Quantization is a crucial technique in deep learning that reduces the memory footprint and computational cost of large models. Instead of storing and processing parameters in high-precision floating-point formats (e.g., float32), quantization converts these parameters into lower bit-width representations (e.g., 8-bit integers, 4-bit floats) while preserving model accuracy as much as possible.
As AI models continue to grow in size—with some language models now exceeding hundreds of billions of parameters—efficient deployment becomes increasingly challenging. Quantization addresses this challenge by enabling models to run on devices with limited resources, reducing inference latency, and cutting energy consumption.
This article explores the fundamentals of quantization, its practical applications, implementation techniques, and the critical trade-offs involved when optimizing neural networks for real-world deployment.
2. Understanding Numerical Precision in Deep Learning
2.1 Floating-Point Formats
- Float32 (Single Precision)
- Bit allocation: 1-bit sign, 8-bit exponent, 23-bit fraction (mantissa)
- Memory: 32 bits (4 bytes) per parameter
- Dynamic range: Approximately ±1.18 × 10^-38 to ±3.4 × 10^38
- Usage: Standard in most deep learning frameworks; offers a good balance of precision and range but consumes significant memory for large models
- Float16 (Half Precision)
- Bit allocation: 1-bit sign, 5-bit exponent, 10-bit fraction (mantissa)
- Memory: 16 bits (2 bytes) per parameter
- Dynamic range: Approximately ±6.10 × 10^-5 to ±65,504
- Usage: Common in training on modern GPUs (mixed-precision training); reduces memory usage and can accelerate training but introduces potential numerical stability issues
- BFloat16 (Brain Floating Point)
- Bit allocation: 1-bit sign, 8-bit exponent, 7-bit fraction
- Memory: 16 bits (2 bytes) per parameter
- Advantage: Preserves the same dynamic range as float32 but with reduced precision; offers better numerical stability than float16
- Usage: Increasingly popular for AI training and inference, especially in modern hardware accelerators
Moving from higher to lower precision formats doesn’t simply reduce memory requirements—it fundamentally changes how computations are performed. When implemented correctly, these precision changes can dramatically improve performance with minimal impact on model accuracy.
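These numbers can be checked directly from Python. A quick sketch using PyTorch's torch.finfo shows that bfloat16 keeps float32's range while trading away precision, whereas float16 gains precision but loses range:

```python
import torch

# Compare range and precision of the three common deep learning dtypes.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} bits={info.bits:2d}  "
          f"max={info.max:.3e}  smallest_normal={info.tiny:.3e}  eps={info.eps:.3e}")
```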
3. What Is Quantization?
Quantization maps continuous or high-precision values into a discrete set of lower-precision values. In the context of deep learning:
- Weights and activations are converted from their original floating-point representation to a lower-bit representation with a defined mapping.
- This mapping typically involves scale factors and sometimes zero-point offsets that preserve the approximate dynamic range of the original values.
- The quantized values require fewer bits to store and process, leading to smaller models and faster computations.
The core formula for linear (affine) quantization is:

q = round(x / s) + z

Where:
- q is the quantized value constrained to the target format's range (e.g., 0-255 for uint8)
- x is the original floating-point value
- s is the scale factor that maps the original range to the quantized range
- z is the zero-point (the quantized value that represents the floating-point zero)

To recover (dequantize) the original values:

x ≈ s × (q − z)
This mapping introduces small rounding errors, but neural networks are typically robust to such minor perturbations if quantization is performed correctly.
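For example, mapping the range [-1.0, 1.0] onto uint8 gives s = 2/255 ≈ 0.00784 and z = round(1.0 / 0.00784) = 128. The value x = 0.5 then quantizes to q = round(0.5 / 0.00784) + 128 = 192 and dequantizes to 0.00784 × (192 − 128) ≈ 0.502, a rounding error of roughly 0.002.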

4. Why Quantization Matters: The Practical Benefits
4.1 Memory Efficiency
- Model size reduction: A model quantized to INT8 (8-bit integers) requires 75% less memory compared to FP32, enabling deployment on memory-constrained devices.
- Example: A 7B parameter transformer model in FP32 requires ~28GB of memory, but only ~7GB when quantized to INT8, making it accessible on consumer-grade GPUs.
- Storage considerations: Smaller quantized models can be more easily distributed and stored on edge devices.
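The 7B-parameter example above follows from simple bytes-per-parameter arithmetic; a quick sketch (weights only, excluding activations, KV caches, and runtime overhead):

```python
# Weights-only memory = parameter count x bytes per parameter.
params = 7_000_000_000
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:10s} ~{params * nbytes / 1e9:5.1f} GB")
# FP32 ~28 GB, FP16/BF16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB
```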
4.2 Computation Speed
- Reduced memory bandwidth: Lower precision means less data transferred between memory and compute units—often the primary bottleneck in model inference.
- Hardware acceleration: Many modern processors (CPUs, GPUs, and specialized AI hardware) offer dedicated low-precision matrix multiplication instructions that can be 2-4x faster.
- Batch throughput: The memory savings allow for processing larger batches, increasing overall throughput in serving systems.
4.3 Energy Efficiency
- Power consumption: Memory access is energy-intensive—quantized models require fewer memory operations and thus consume less power.
- Mobile and edge deployment: Critical for battery-powered devices, quantization can reduce energy consumption by 3-5x compared to full-precision inference.
- Data center implications: At scale, reduced energy requirements translate to significant cost savings and smaller carbon footprints.
4.4 Real-world Performance Preservation
- With modern techniques, INT8 quantization typically maintains accuracy within 0.5-1% of FP32 models.
- 4-bit quantization can preserve accuracy within 1-2% with appropriate calibration methods.
- The accuracy-efficiency trade-off can be carefully managed through selective quantization of less sensitive model components.
5. Types of Quantization
5.1 Post-Training Quantization (PTQ)
- Process: Convert a pre-trained full-precision model to lower precision without retraining.
- Calibration: Uses a small dataset to determine optimal quantization parameters (scales and zero points).
- Advantages:
- No training required
- Fast to implement
- Supported by most frameworks
- Limitations:
- May result in accuracy degradation, especially at sub-8-bit precision
- Less optimal for complex models with widely varying activation distributions
5.2 Quantization-Aware Training (QAT)
- Process: Simulate quantization effects during training by inserting fake quantization operations.
- Training modifications: Forward pass uses quantized weights, while the backward pass uses full precision for gradients.
- Advantages:
- Better accuracy preservation, especially for lower bit-widths
- Model learns to be robust to quantization noise
- Limitations:
- Requires full retraining or fine-tuning
- Computationally expensive
- More complex to implement
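A minimal eager-mode QAT sketch in PyTorch; it assumes the model already wraps its forward pass in QuantStub/DeQuantStub, omits module fusion for brevity, and treats `train_one_epoch` as a user-supplied placeholder:

```python
import torch
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

def qat_finetune(model, train_one_epoch, num_epochs=3):
    # Fake-quant modules simulate INT8 rounding in the forward pass while
    # gradients flow in full precision (straight-through estimator).
    model.train()
    model.qconfig = get_default_qat_qconfig("fbgemm")
    model = prepare_qat(model)          # inserts observer/fake-quant modules

    for _ in range(num_epochs):
        train_one_epoch(model)          # user-supplied training loop (placeholder)

    # Replace fake-quant ops with real INT8 kernels for inference.
    model.eval()
    return convert(model)
```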

5.3 By Numerical Format
5.3.1 Integer Quantization
- INT8: The most widely supported format; good balance of accuracy and performance
- INT4/INT2: Extreme compression; requires careful implementation to preserve accuracy
- Asymmetric vs. Symmetric:
- Asymmetric uses both zero-point and scale
- Symmetric uses only scale (with zero-point fixed at 0), simpler but potentially less precise
5.3.2 Low-Bit Floating Point
- FP8: Emerging standard with two variants (E4M3, E5M2) for different precision needs
- FP4/NF4: 4-bit formats such as NormalFloat (NF4, implemented in bitsandbytes), which places quantization levels according to a normal distribution of weights rather than a uniform grid
- Advantage: Better handling of outliers and dynamic range compared to integer formats
5.4 By Granularity
5.4.1 Per-Tensor Quantization
- Single scale/zero-point for entire weight tensor
- Simple and efficient but less accurate for tensors with varying distributions
5.4.2 Per-Channel Quantization
- Different scale/zero-point for each output channel (in convolutions) or each row (in fully connected layers)
- Better preserves accuracy, especially for CNNs
- Slightly higher overhead due to storing multiple scaling factors
5.4.3 Per-Group Quantization
- Compromise between per-tensor and per-channel
- Groups multiple channels together under the same quantization parameters
- Useful for balancing accuracy and parameter overhead
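The three granularities differ only in which slice of a weight tensor shares quantization parameters. A minimal NumPy sketch computing symmetric INT8 scales at each granularity (shapes and group size are illustrative):

```python
import numpy as np

def symmetric_scales(w, granularity="per_tensor", group_size=64):
    """Compute symmetric INT8 scale factors (absmax / 127) at different granularities."""
    if granularity == "per_tensor":
        return np.abs(w).max() / 127.0                  # one scale for the whole tensor
    if granularity == "per_channel":
        return np.abs(w).max(axis=1) / 127.0            # one scale per output row/channel
    if granularity == "per_group":
        groups = w.reshape(w.shape[0], -1, group_size)  # split each row into fixed-size groups
        return np.abs(groups).max(axis=2) / 127.0       # one scale per group
    raise ValueError(f"unknown granularity: {granularity}")

w = np.random.randn(8, 256).astype(np.float32)          # toy weight matrix (8 output channels)
for g in ("per_tensor", "per_channel", "per_group"):
    print(g, np.asarray(symmetric_scales(w, g)).shape)  # () vs (8,) vs (8, 4)
```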

6. Quantization Techniques in Practice
6.1 Simple Linear Quantization Example
A basic approach for uniform (asymmetric) quantization:
# For 8-bit asymmetric integer quantization (uint8, range [0, 255]):
scale = (max_val - min_val) / 255
zero_point = round(-min_val / scale)
quant_val = clamp(round(original_val / scale) + zero_point, 0, 255)
For symmetric quantization (where values are assumed to be roughly centered around zero and the zero-point is fixed at 0):
scale = max(abs(max_val), abs(min_val)) / 127
quant_val = clamp(round(original_val / scale), -127, 127)  # int8, range [-127, 127]
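Putting the snippets above into runnable form, a minimal NumPy sketch of asymmetric quantization and dequantization (the helper names are illustrative):

```python
import numpy as np

def quantize_asymmetric(x, num_bits=8):
    """Uniform asymmetric quantization to unsigned integers in [0, 2^bits - 1]."""
    qmax = 2 ** num_bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = int(round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32)
q, s, z = quantize_asymmetric(x)
print("max abs error:", np.abs(x - dequantize(q, s, z)).max())  # typically about scale / 2
```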
6.2 Advanced Techniques
6.2.1 Double Quantization
Double quantization addresses storage efficiency by quantizing not just the weights but also the quantization parameters themselves:
- Weights are quantized into blocks (e.g., 128 or 256 values per block)
- Each block gets its own scale factor
- These scale factors are themselves quantized to a lower precision
This technique is particularly effective for large language models where storing per-block scales can become expensive.
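A minimal sketch of the idea, block-wise absmax quantization whose per-block scales are themselves quantized to INT8; the block size and second-level format are illustrative rather than the exact recipe of any particular library:

```python
import numpy as np

def double_quantize(w, block_size=128):
    """Block-wise symmetric INT8 quantization with INT8-quantized block scales."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1) / 127.0           # one FP32 scale per block
    q_weights = np.round(blocks / scales[:, None]).astype(np.int8)

    # Second level: quantize the per-block scales themselves.
    scale_of_scales = np.abs(scales).max() / 127.0
    q_scales = np.round(scales / scale_of_scales).astype(np.int8)
    return q_weights, q_scales, scale_of_scales

def double_dequantize(q_weights, q_scales, scale_of_scales):
    scales = q_scales.astype(np.float32) * scale_of_scales
    return q_weights.astype(np.float32) * scales[:, None]

w = np.random.randn(4096 * 128).astype(np.float32)
args = double_quantize(w)
print("reconstruction error:", np.abs(w.reshape(-1, 128) - double_dequantize(*args)).max())
```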

6.2.2 Outlier-Aware Quantization
Outliers in weight distributions can force scales to accommodate extreme values, reducing precision for the majority of weights:
- Outlier detection: Identify weight values that lie outside a standard range (e.g., beyond 3σ)
- Separate handling: Keep outliers in higher precision while quantizing the rest
- Absmax clipping: Cap extreme values before quantization
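A sketch of the separate-handling strategy above: values beyond 3σ are kept in FP16 while the remaining bulk is quantized to INT8 (the threshold and storage layout are illustrative):

```python
import numpy as np

def split_outliers(w, sigma=3.0):
    """Partition weights into an INT8-quantized bulk and FP16 outliers."""
    threshold = sigma * w.std()
    outlier_mask = np.abs(w) > threshold

    bulk = np.where(outlier_mask, 0.0, w)            # outliers removed from the bulk
    scale = np.abs(bulk).max() / 127.0
    q_bulk = np.round(bulk / scale).astype(np.int8)

    outliers = w[outlier_mask].astype(np.float16)    # kept at higher precision
    return q_bulk, scale, outliers, outlier_mask

def reconstruct(q_bulk, scale, outliers, outlier_mask):
    x = q_bulk.astype(np.float32) * scale
    x[outlier_mask] = outliers.astype(np.float32)
    return x
```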
6.2.3 Mixed-Precision Quantization
Not all layers are equally sensitive to quantization:
- Analyze layer sensitivity through ablation studies
- Apply different precision to different layers (e.g., INT8 for less sensitive layers, FP16 for critical layers)
- Balance accuracy and performance based on model-specific characteristics
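A sketch of a simple sensitivity sweep; `evaluate` and `quantize_layer` are hypothetical helpers you would supply, the latter returning a copy of the model with a single named layer quantized:

```python
def layer_sensitivity(model, evaluate, quantize_layer, layer_names):
    """Quantize one layer at a time and record the accuracy drop (ablation study)."""
    baseline = evaluate(model)
    drops = {}
    for name in layer_names:
        candidate = quantize_layer(model, name)   # hypothetical helper: one layer in INT8
        drops[name] = baseline - evaluate(candidate)
    # Layers with the largest drops are the best candidates for higher precision.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```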
7. Hardware Considerations and Acceleration
7.1 Hardware Support
Different hardware platforms offer varying levels of quantization support:
- CPUs:
- x86 processors offer AVX-512 VNNI instructions optimized for INT8 dot products
- ARM processors include INT8 dot-product instructions (e.g., SDOT/UDOT)
- GPUs:
- NVIDIA Tensor Cores support INT8, INT4, and specialized FP8 operations
- AMD MI series accelerators offer INT8/INT4 matrix operations
- Specialized Accelerators:
- Google TPUs are heavily optimized for INT8
- Custom edge AI chips (e.g., from Qualcomm, MediaTek) maximize quantized performance
7.2 Framework Support
- PyTorch: Offers eager-mode, FX graph mode, and PyTorch 2 export quantization workflows covering dynamic, static, and QAT approaches
- TensorFlow: Provides TF Lite for quantized deployment on mobile/edge
- ONNX Runtime: Cross-platform quantization with significant performance optimizations
- Specialized Libraries:
- bitsandbytes and GPTQ for LLM-specific quantization
- Apache TVM for target-specific optimized deployment
8. Implementation Example: PyTorch Quantization
Below is a more comprehensive example of post-training quantization using PyTorch's TorchScript graph-mode API; the model is scripted before calibration:
import os
import torch
from torch.quantization import get_default_qconfig, prepare_jit, convert_jit

def ptq_example(model, calibration_data):
    # Graph-mode quantization operates on a TorchScript model
    model.eval()
    scripted_model = torch.jit.script(model)

    # Configure quantization parameters
    qconfig = get_default_qconfig("fbgemm")    # For server deployment (x86)
    # qconfig = get_default_qconfig("qnnpack") # For mobile deployment (ARM)

    # Prepare model for calibration (inserts observers)
    model_prepared = prepare_jit(scripted_model, {"": qconfig})

    # Calibrate with representative data
    with torch.no_grad():
        for input_data, _ in calibration_data:
            model_prepared(input_data)

    # Convert the observed model into a quantized model
    quantized_model = convert_jit(model_prepared)

    # Verify size reduction
    torch.jit.save(scripted_model, "fp32_model.pt")
    torch.jit.save(quantized_model, "int8_model.pt")
    fp32_size = os.path.getsize("fp32_model.pt") / (1024 * 1024)
    int8_size = os.path.getsize("int8_model.pt") / (1024 * 1024)
    print(f"FP32 Model size: {fp32_size:.2f} MB")
    print(f"INT8 Model size: {int8_size:.2f} MB")
    print(f"Compression ratio: {fp32_size / int8_size:.2f}x")

    return quantized_model
For more granular control with per-channel quantization:
import torch
import torch.quantization.quantize_fx as quantize_fx
from torch.ao.quantization import QConfigMapping, get_default_qconfig, per_channel_dynamic_qconfig

def per_channel_ptq(model, calibration_data, example_inputs):
    model.eval()

    # Per-module-type configuration: leave the model in FP32 by default,
    # statically quantize Conv2d (the fbgemm qconfig uses per-channel weight observers)
    # and dynamically quantize Linear with per-channel weight quantization.
    qconfig_mapping = (
        QConfigMapping()
        .set_global(None)
        .set_object_type(torch.nn.Conv2d, get_default_qconfig("fbgemm"))
        .set_object_type(torch.nn.Linear, per_channel_dynamic_qconfig)
    )

    # Use FX Graph Mode Quantization for more complex models;
    # example_inputs is a tuple of representative inputs used to trace the model
    model_prepared = quantize_fx.prepare_fx(model, qconfig_mapping, example_inputs)

    # Calibration
    with torch.no_grad():
        for input_data, _ in calibration_data:
            model_prepared(input_data)

    # Convert to fully quantized model
    quantized_model = quantize_fx.convert_fx(model_prepared)
    return quantized_model
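Usage mirrors the first example, except that FX graph mode also needs a representative input to trace the model; the names below are placeholders:

```python
example_inputs = (torch.randn(1, 3, 224, 224),)   # must match your model's expected input
quantized = per_channel_ptq(model, calibration_loader, example_inputs)
```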
9. Quantization for Different Model Types
9.1 Convolutional Neural Networks (CNNs)
- CNNs typically respond well to 8-bit quantization with minimal accuracy loss
- First and last layers are often more sensitive to quantization
- Per-channel quantization works particularly well for convolutional layers
- Common practice: Keep first/last layers in higher precision
9.2 Transformers and Large Language Models
- Attention mechanisms and layer normalizations require special consideration
- Weight distribution in transformers often exhibits outliers
- Common strategies:
- Use mixed precision (keeping critical layers at higher precision)
- Apply outlier-aware quantization for self-attention
- Employ specialized 4-bit data types (e.g., NF4) for weight matrices
- Use separate quantization parameters for query, key, value projections
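As a concrete illustration of the NF4 and double-quantization strategies listed above, the following sketch uses the Hugging Face transformers integration with bitsandbytes (the model identifier is a placeholder, and flag availability depends on library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4-bit weights
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls run in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-7b-model",              # placeholder model id
    quantization_config=bnb_config,
)
```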

9.3 Recurrent Neural Networks (RNNs)
- Long computation chains make RNNs sensitive to quantization error accumulation
- Methods to mitigate:
- Higher precision for state transitions
- Periodic recalibration to prevent error propagation
- Layer-specific quantization parameters
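One low-risk option consistent with the list above is dynamic quantization, which stores weights in INT8 but computes activations and recurrent state in floating point at runtime; a minimal PyTorch sketch:

```python
import torch

def dynamic_quantize_rnn(model):
    # Dynamic PTQ: INT8 weights, floating-point activations and recurrent state
    # computed on the fly (no calibration data needed).
    return torch.quantization.quantize_dynamic(
        model,
        {torch.nn.LSTM, torch.nn.Linear},  # module types to quantize
        dtype=torch.qint8,
    )
```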
10. Performance Benchmarks and Trade-offs
10.1 Accuracy vs. Size vs. Speed
| Precision | Size Reduction (vs. FP32) | Typical Accuracy Drop | Speed Improvement |
|---|---|---|---|
| FP16 | 2x | <0.1% | 1.3-2x |
| INT8 | 4x | 0.5-1% | 2-4x |
| INT4 | 8x | 1-3% | 3-6x |
| INT2 | 16x | 5%+ | 4-8x |
Note: Actual results vary by model architecture, dataset, and hardware.
10.2 Where Quantization Works Best
- Overparameterized models with significant redundancy
- Transfer learning scenarios where models have excess capacity
- Inference-only deployments where training dynamics aren’t a concern
- Hardware with dedicated low-precision acceleration
10.3 When to Be Cautious
- Small, already-efficient models (less redundancy to exploit)
- Highly accurate perception tasks requiring precise feature representation
- Models handling long-tailed distributions or rare events
- Safety-critical applications without extensive validation
11. Current Research and Future Directions
11.1 Emerging Techniques
- Vector Quantization: Representing groups of weights with codebooks
- Learned Quantization: Neural networks that learn optimal quantization parameters
- Binary/Ternary Networks: Extreme quantization to 1-2 bits with specialized training
- Low-Rank Adaptation with Quantization: Combining parameter-efficient fine-tuning with quantization
11.2 Standardization Efforts
- Increasing focus on standardizing FP8 formats across hardware vendors
- Growing ecosystem of tools for easy deployment of quantized models
- Integration of quantization into core ML frameworks rather than as add-ons
12. Practical Recommendations
For practitioners looking to implement quantization:
- Start simple: Begin with 8-bit post-training quantization
- Measure thoroughly: Compare accuracy, latency, and memory across precision levels
- Identify bottlenecks: Use profiling to find layers sensitive to quantization
- Iterate granularly: Apply mixed precision rather than forcing uniform quantization
- Consider target hardware: Align quantization strategy with deployment platform capabilities
- Validate with realistic data: Ensure calibration data matches real-world distribution
- Establish accuracy guardrails: Define acceptable performance thresholds upfront
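To support the "measure thoroughly" step, a minimal latency-comparison sketch (accuracy evaluation and the model/input names are left to you):

```python
import time

import torch

def benchmark_latency(model, example_input, runs=100):
    """Average per-inference latency in milliseconds (CPU; CUDA would need synchronization)."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):                  # warm-up
            model(example_input)
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
    return (time.perf_counter() - start) / runs * 1000

# Compare the FP32 baseline against its quantized counterpart on identical inputs, e.g.:
# print(benchmark_latency(fp32_model, x), benchmark_latency(int8_model, x))
```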
13. Conclusion
Quantization represents one of the most effective techniques for making deep learning models practical in resource-constrained environments. As models continue to grow in size and complexity, quantization becomes increasingly essential—not merely an optimization but a necessity for deployment.
The field continues to advance rapidly, with new methods pushing the boundaries of low-precision inference while maintaining high accuracy. 4-bit and even lower precision techniques are becoming viable for production use, enabling AI capabilities on a wider range of devices.
The future of quantization will likely involve closer integration with model architecture design (quantization-aware architectures), standardized hardware support across the industry, and automated tools that intelligently apply optimal quantization strategies based on model behavior and deployment targets.
For AI practitioners, quantization should be viewed as a core part of the model development lifecycle rather than a post-training consideration—a perspective that will become increasingly important as we balance the competing demands of model capability, computational efficiency, and accessibility.