CNN Model Compression
105× model size reduction with 7.6× latency speedup
42 MB → 0.4 MB · 68 ms → 8.9 ms
Compressed a 42 MB image classification CNN to 0.4 MB using structured pruning, INT8 post-training quantization, and knowledge distillation — reducing inference latency from 68 ms to 8.9 ms with only a 3–5% accuracy drop. 3rd place at Srijan OpenAImer.
Why compression matters
A model that scores 94% on a benchmark but requires 512 MB of RAM and a GPU is not deployable in most of the world. A model that scores 91%, fits in 400 KB, and runs in 9 ms on a CPU is.
This was the framing we used going into Srijan OpenAImer. The task was image classification — not the interesting part. The interesting part was how aggressively we could shrink the model without it forgetting how to do its job.
The three-stage pipeline
We applied compression sequentially, each stage targeting a different source of redundancy.
First, structured pruning: we removed entire filters whose L1 norm fell below a threshold — not individual weights, but whole channels. This keeps the computation graph dense and actually speeds up inference, unlike unstructured pruning which creates sparse matrices that are hard to accelerate. We pruned iteratively at 20% increments, retraining briefly between rounds.
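The ranking step can be sketched in a few lines. This is a minimal NumPy illustration of L1-norm filter selection for one conv layer, not the project's code: the function name, the toy layer shape, and the `keep_frac` parameter are ours, and the retraining loop between pruning rounds is omitted.

```python
import numpy as np

def prune_filters_by_l1(weights, keep_frac=0.8):
    """Rank conv filters by L1 norm and keep the strongest fraction.

    weights: array of shape (out_channels, in_channels, k, k).
    Returns the pruned weight tensor plus the kept indices, so the next
    layer can drop the matching input channels and stay consistent.
    """
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(keep_frac * weights.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # strongest filters, in order
    return weights[keep], keep

# One 20%-increment round on a toy layer: 64 filters, 32 input channels, 3x3
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32, 3, 3))
pruned, kept = prune_filters_by_l1(w, keep_frac=0.8)
print(pruned.shape)  # (51, 32, 3, 3) -- the layer stays dense, just narrower
```

Note that the output is still an ordinary dense tensor with fewer output channels, which is exactly why structured pruning translates into real latency wins while unstructured sparsity usually does not.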
Second, INT8 post-training quantization: float32 → 8-bit integers across weights and activations. 4× memory reduction essentially for free. The key is calibrating quantization ranges on a representative data subset; skip this and you get significant accuracy loss.
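To make the calibration point concrete, here is a minimal sketch of symmetric INT8 quantization where the scale is fitted to a representative sample rather than guessed. The function names and the Gaussian toy data are illustrative assumptions, not the project's pipeline (which quantized a real network's weights and activations).

```python
import numpy as np

def calibrate_scale(samples, num_bits=8):
    """Pick a symmetric quantization scale from a representative batch."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return np.max(np.abs(samples)) / qmax

def quantize(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
calib = rng.normal(0, 0.05, size=4096).astype(np.float32)  # calibration subset
scale = calibrate_scale(calib)

x = rng.normal(0, 0.05, size=1000).astype(np.float32)
q = quantize(x, scale)
x_hat = dequantize(q, scale)
print("mean |x - x_hat| =", np.mean(np.abs(x - x_hat)))
```

With a well-calibrated scale, the reconstruction error is bounded by half a quantization step for everything inside the calibrated range; calibrate on unrepresentative data and the clipping term dominates instead, which is where the "significant accuracy loss" comes from.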
Third, knowledge distillation: training a smaller student network to match the softened output distribution of the original teacher. The soft targets carry more signal than hard labels — the teacher's confidence over incorrect classes encodes similarity structure that cross-entropy training alone misses. This last stage recovered most of the accuracy lost in earlier stages.
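The standard distillation objective (Hinton-style, which is what the description above matches) blends a KL term on temperature-softened distributions with ordinary cross-entropy on hard labels. A NumPy sketch, with temperature `T` and mixing weight `alpha` as assumed hyperparameters rather than the project's actual values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha * soft-target KL + (1 - alpha) * hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student) on softened distributions; the T^2 factor keeps
    # gradient magnitudes comparable as the temperature changes
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * (T ** 2) * kl + (1 - alpha) * ce)

# Toy example: the teacher is confident about class 0 but assigns related
# mass to class 2 -- similarity structure a hard one-hot label would erase.
teacher = np.array([[5.0, 0.0, 3.0]])
student = np.array([[2.0, 1.0, 1.5]])
loss = distillation_loss(student, teacher, labels=np.array([0]))
print(float(loss) > 0)  # → True
```

The high temperature is what exposes the teacher's "confidence over incorrect classes": at T = 4 the near-zero probabilities on wrong-but-similar classes become large enough to carry gradient signal to the student.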
The result and what I took from it
42 MB to 0.4 MB. 68 ms to 8.9 ms. 3–5% accuracy drop. Third place.
The thing that stuck with me wasn't the competition result — it was discovering how much of a trained model is doing nothing. A large fraction of filters in a typical CNN learn nearly identical features. Pruning them away doesn't hurt the model because the information was redundant to begin with. The model was fat, not strong.
This directly influenced how I think about model design now. I'm much more suspicious of large models trained without compression in mind. The work also seeded the interest that later became the FPGA deployment research in TADNet — if you can get a model to 0.4 MB, suddenly a whole class of hardware becomes viable.