## Overview
llama.cpp is a pure C/C++ implementation for running LLaMA-family models locally, providing high-performance inference on both CPU and GPU.
## Key Features
- High Performance: Optimized C/C++ implementation
- CPU/GPU Support: Runs on CPU alone or with GPU acceleration
- Multiple Platforms: Windows, macOS, Linux, and even mobile
- Model Formats: GGUF (and legacy GGML) format support
- Quantization: A range of quantization levels to fit different hardware; see the sketch after this list
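As a concrete example, a full-precision GGUF model can be shrunk with the `quantize` tool that the build produces. This is a minimal sketch: the file names are placeholders, and newer releases name the binary `llama-quantize`.

```bash
# Quantize a 16-bit GGUF model down to 4-bit (q4_0).
# Paths are illustrative; adjust to wherever your model lives.
./quantize models/llama-7b-f16.gguf models/llama-7b-q4_0.gguf q4_0
```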
## Installation
### Pre-built Binaries
Download pre-built binaries for your platform from the project's GitHub releases page: https://github.com/ggerganov/llama.cpp/releases
### Build from Source
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```
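To use GPU acceleration, the build needs the matching backend enabled. As one hedged example, older Makefile-based builds enabled NVIDIA CUDA support with the `LLAMA_CUBLAS` flag; newer releases have moved to CMake options (such as `-DGGML_CUDA=ON`), so check the build documentation for your version.

```bash
# CUDA-accelerated build, as used by older Makefile-based versions;
# newer releases configure backends through CMake instead.
make LLAMA_CUBLAS=1
```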
## Usage

### Basic Chat
```bash
./main -m models/llama-7b.gguf --prompt "Hello, how are you?"
```
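Generation can be tuned with a few common flags; the values below are illustrative, not recommendations.

```bash
# The same prompt with a few common knobs:
#   -n 128   generate at most 128 tokens
#   -c 2048  context window size in tokens
#   -t 8     number of CPU threads to use
./main -m models/llama-7b.gguf --prompt "Hello, how are you?" -n 128 -c 2048 -t 8
```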
### Interactive Mode

```bash
./main -m models/llama-7b.gguf -i
```
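In interactive mode, a reverse prompt hands control back to you whenever the model emits a given string. The marker below is an assumption and should match whatever turn prefix your prompt template uses.

```bash
# Interactive chat that pauses for user input whenever the model
# emits the string "User:" (marker chosen for illustration).
./main -m models/llama-7b.gguf -i -r "User:"
```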
### Server Mode

```bash
./server -m models/llama-7b.gguf
```
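The server exposes an HTTP API, by default on port 8080. A minimal sketch of a completion request is shown below; the `/completion` endpoint and JSON fields match the server's documentation at the time of writing, but verify against your version.

```bash
# Ask the running server for a completion over HTTP.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "n_predict": 64}'
```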
## Supported Models

- Llama 2/3 series
- Code Llama
- Other Llama-compatible models in GGUF format; see the conversion sketch below
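Models distributed as Hugging Face checkpoints first need converting to GGUF. The repo ships a conversion script for this; the script name and flags have varied across versions (this sketch assumes `convert_hf_to_gguf.py`, and the paths are placeholders).

```bash
# Convert a Hugging Face checkpoint directory to a single GGUF file.
# Script name and options vary by llama.cpp version; check the repo docs.
python3 convert_hf_to_gguf.py models/My-Llama-Model --outfile models/my-llama-model.gguf
```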