The frontier has shifted. Open-weight models now match or exceed closed proprietary systems on most inference tasks. The question is no longer capability — it's deployment: which models run on the hardware you actually have.
Quantisation trades a small accuracy penalty for dramatic reductions in VRAM: an INT4-quantised 70B model runs on a single 48GB workstation GPU, and a 7B INT4 model fits in 4GB — the VRAM of a gaming laptop. The column headers below are GPUs you can actually buy.
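The VRAM figures in the table follow directly from parameter count times bytes per weight. A minimal sketch of that arithmetic (weights only — the KV cache and activations add more on top):

```python
def vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Estimate weight-only VRAM in GB: parameters x bytes per weight."""
    return params_billions * bits_per_weight / 8

# Matches the table: 7B FP16 needs 14GB, INT4 quarters that to 3.5GB.
print(vram_gb(7, 16))   # 14.0
print(vram_gb(7, 4))    # 3.5
print(vram_gb(70, 4))   # 35.0
```

This is why INT4 is the pivotal precision in the table: it cuts FP16 requirements by 4×, which is exactly the jump from "datacenter only" to "single workstation GPU" for a 70B model.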
| Model size | Precision | VRAM needed | GTX 1650 (4GB) | RTX 3070 (8GB) | Arc A770 (16GB) | RTX 3090 (24GB) | RTX 6000 Ada (48GB) | A100 SXM (80GB) |
|---|---|---|---|---|---|---|---|---|
| 1B–3B | FP16 | 2–6GB | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 7B | FP16 | 14GB | ✗ | ✗ | ~ | ✓ | ✓ | ✓ |
| 7B | INT4 | 3.5GB | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 13B | FP16 | 26GB | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| 13B | INT4 | 6.5GB | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 70B | FP16 | 140GB | ✗ | ✗ | ✗ | ✗ | ✗ | ~×2 |
| 70B | INT4 | 35GB | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| 70B | INT2 | 17GB | ✗ | ✗ | ✗ | ~ | ✓ | ✓ |
| 405B | INT4 | 202GB | ✗ | ✗ | ✗ | ✗ | ✗ | ~×3 |

✓ fits · ~ borderline (expect offloading or a tight squeeze) · ~×N needs N GPUs of that class. Figures are weights only; the KV cache adds more at long context.

CPU only is also an option: a 7B INT4 model runs in system RAM on any modern CPU via llama.cpp at roughly 5 tok/s. No GPU required — even an older server like a Dell R740 is viable today.
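The CPU-only path above is a one-liner with llama.cpp's `llama-cli`. A minimal sketch — the GGUF filename is a placeholder, and the thread count should match your physical cores:

```shell
# CPU-only inference of a 7B INT4 (Q4) model with llama.cpp.
# ./model-7b-q4_k_m.gguf is a placeholder path to your quantised weights.
./llama-cli \
  -m ./model-7b-q4_k_m.gguf \
  -p "Explain quantisation in one sentence." \
  -n 128 \
  -t 8     # -n: max tokens to generate, -t: CPU threads
```

On a modern desktop CPU this lands in the same ballpark as the ~5 tok/s cited above; server-class machines with more memory bandwidth do somewhat better.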