What LLM Can I Run Locally?
If you want to run a Large Language Model (LLM) on your own computer, you need to make sure your hardware can handle it. The VRAM Calculator at apxml.com/tools/vram-calculator estimates which models will fit on your hardware and helps you tune the settings that affect memory use. Here’s a beginner-friendly guide to the choices you’ll see:
1. Model Selection
- Choose a Model: Pick from a list of available LLMs (like DeepSeek, Llama, etc.). Bigger models (more parameters) usually need more VRAM and run slower, but can be smarter.
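A quick way to see why size matters is a rough sketch in Python, assuming the usual 2 bytes per parameter for FP16 weights. The parameter counts below are illustrative, not taken from the calculator’s model list.

```python
# Rule of thumb: weight memory (GB) ~ parameters in billions x bytes per parameter.
# At FP16 (2 bytes per parameter) that is roughly 2 GB per billion parameters.
BYTES_PER_PARAM_FP16 = 2

for params_billions in (1, 7, 13, 70):
    weight_gb = params_billions * BYTES_PER_PARAM_FP16
    print(f"{params_billions}B parameters @ FP16 ~ {weight_gb} GB just for the weights")
```

Quantization (step 2) shrinks these numbers further.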
2. Inference Quantization
- Model Weights Precision: Sets the numeric precision used to store the model’s weights. Common options:
  - FP16 (Float16): Half precision. Uses half the VRAM of full precision (FP32), with only a slight accuracy cost.
  - INT8 / 4-bit: Uses even less VRAM, but may reduce output quality. Good for limited hardware.
- Tip: Lower precision = less memory needed, but possibly lower output quality.
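To make the memory/quality trade-off concrete, here is a minimal sketch that prices a hypothetical 7B-parameter model at each precision. The bytes-per-parameter values are the standard ones for FP16, INT8, and 4-bit formats; real quantized files add a small overhead for scales and zero-points, so actual numbers land slightly higher.

```python
# Weight memory for one model size under different precisions.
PARAMS = 7e9  # hypothetical 7B-parameter model

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    print(f"{name:>5}: ~{weight_gb:.1f} GB of VRAM for weights")
```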
3. KV Cache Quantization
- KV Cache Precision: Sets the precision of the key-value (KV) cache, the memory that stores the conversation’s recent context. FP16/BF16 is the usual default; quantizing the cache further (for example to 8-bit) saves VRAM, which matters most in long conversations.
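Because the KV cache grows with context length, it helps to see the usual estimate written out. The sketch below uses placeholder architecture numbers for a typical ~7B model; the layer count, head count, and head size are assumptions, not any specific model’s specs.

```python
# Standard KV-cache estimate:
#   2 (keys and values) x layers x kv_heads x head_dim x tokens x bytes per element
def kv_cache_gb(layers=32, kv_heads=32, head_dim=128, seq_len=4096,
                batch=1, bytes_per_elem=2):  # 2 bytes = FP16/BF16 cache
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9

print(f"FP16 cache @ 4k tokens:  ~{kv_cache_gb():.2f} GB")
print(f"8-bit cache @ 4k tokens: ~{kv_cache_gb(bytes_per_elem=1):.2f} GB")
```

Models that use grouped-query attention have fewer KV heads, which shrinks the cache considerably.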
4. Hardware Configuration
- Select Your GPU: Choose your graphics card (e.g., RTX 3060 12GB). The calculator will estimate what fits in your VRAM.
- Custom VRAM: If your GPU isn’t listed, enter your VRAM amount manually.
5. Number of GPUs
- Num GPUs: If you have more than one GPU, you can use them together for bigger models or faster performance.
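In the simplest view, the VRAM of identical GPUs adds up, so a basic fit check looks like the sketch below. Real multi-GPU setups lose some capacity to duplicated buffers and communication overhead, so treat this as an optimistic upper bound rather than the calculator’s exact logic.

```python
def fits(required_gb: float, vram_per_gpu_gb: float, num_gpus: int = 1) -> bool:
    # Optimistic assumption: usable VRAM is simply per-GPU VRAM times the GPU count.
    return required_gb <= vram_per_gpu_gb * num_gpus

print(fits(required_gb=16.5, vram_per_gpu_gb=12, num_gpus=1))  # False: too big for one 12 GB card
print(fits(required_gb=16.5, vram_per_gpu_gb=12, num_gpus=2))  # True: 24 GB combined
```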
6. Batch Size
- Batch Size: How many inputs are processed at once. A higher batch size gives more throughput but needs more VRAM (the sketch after step 8 shows how this scales). For most home use, keep this at 1.
7. Sequence Length
- Sequence Length: How many tokens (pieces of words) the model can handle at once. Longer sequences need more memory but allow longer context in conversations.
8. Concurrent Users
- Concurrent Users: Number of people using the model at the same time. More users = more memory needed.
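Batch size (step 6), sequence length (step 7), and concurrent users (step 8) all feed the same memory term: the KV cache grows linearly with each of them. The sketch below reuses the KV estimate from step 3 with the same placeholder ~7B architecture numbers; the exact figures are assumptions, the scaling behaviour is the point.

```python
def kv_gb(seq_len, batch=1, users=1,
          layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2):
    effective_batch = batch * users  # each concurrent user holds their own context
    elems = 2 * layers * kv_heads * head_dim * seq_len * effective_batch
    return elems * bytes_per_elem / 1e9

print(f"1 user,  2k tokens: ~{kv_gb(2048):.2f} GB")
print(f"1 user,  8k tokens: ~{kv_gb(8192):.2f} GB")            # 4x the tokens -> 4x the cache
print(f"4 users, 8k tokens: ~{kv_gb(8192, users=4):.2f} GB")   # and it multiplies per user
```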
How to Use the Calculator
- Enter your hardware details (GPU, VRAM).
- Pick a model and set quantization options.
- Adjust batch size, sequence length, and users as needed.
- The calculator shows if your setup can run the model, how much VRAM is used, and how fast it will be.
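Putting the workflow in one place, here is a self-contained sketch of the kind of total the calculator reports: weights plus KV cache plus a flat allowance for activations and runtime buffers. The model numbers and the overhead term are illustrative assumptions, not the tool’s exact formula.

```python
def estimate_total_vram_gb(params_b=7, bytes_per_param=0.5,      # 7B model, 4-bit weights
                           layers=32, kv_heads=32, head_dim=128,
                           seq_len=4096, batch=1, users=1, kv_bytes=2,
                           overhead_gb=1.0):                     # activations, buffers, runtime
    weights = params_b * bytes_per_param
    kv = 2 * layers * kv_heads * head_dim * seq_len * batch * users * kv_bytes / 1e9
    return weights + kv + overhead_gb

needed = estimate_total_vram_gb()
print(f"Estimated total: ~{needed:.1f} GB  (fits a 12 GB GPU: {needed <= 12})")
```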
What to Look For
- VRAM Usage: Make sure the model fits in your GPU’s VRAM. If it doesn’t, try lower precision or a smaller model.
- Performance: Check the estimated speed (tokens/sec). Lower batch size and sequence length can help if you’re running out of memory.
- Quality vs. Speed: Higher precision and bigger models give better results, but need more resources.
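For a feel of where the speed estimate comes from: single-user generation is usually limited by memory bandwidth, because each generated token requires reading roughly all of the weights. Dividing bandwidth by weight size gives an optimistic upper bound. The 360 GB/s figure below is an assumption for an RTX 3060-class card, and real throughput lands lower once the KV cache and other overheads are counted.

```python
weights_gb = 3.5        # e.g. a 7B model quantized to 4-bit
bandwidth_gb_s = 360    # assumed memory bandwidth for an RTX 3060-class GPU

upper_bound_tps = bandwidth_gb_s / weights_gb
print(f"Optimistic upper bound: ~{upper_bound_tps:.0f} tokens/sec")
```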
Summary Table
| Choice | What It Does | Beginner Tip |
| --- | --- | --- |
| Model | Pick size & type of LLM | Start small, upgrade later |
| Quantization | Controls memory vs. quality | FP16 is a good balance |
| KV Cache Quantization | Context memory precision | FP16/BF16 is usually fine |
| GPU/VRAM | Your hardware limits | Check your GPU specs |
| Batch Size | Inputs per step | Use 1 for home use |
| Sequence Length | Max context window | 1024+ for chat, lower for Q&A |
| Concurrent Users | Simultaneous users | Usually 1 for personal use |
By understanding these options, you can pick the best LLM for your hardware and needs. The VRAM Calculator makes it easy to experiment and see what works before you download or run anything.