## Overview
llama.cpp is a pure C/C++ implementation for running LLaMA-family models locally, providing high-performance inference on both CPU and GPU.
## Key Features
- High Performance: Optimized C/C++ implementation
- CPU/GPU Support: Runs on CPU alone or with GPU acceleration
- Multiple Platforms: Windows, macOS, Linux, and even mobile
- Model Formats: GGUF (and legacy GGML) format support
- Quantization: A range of quantization levels to fit different hardware; see the sketch after this list
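As a concrete example, a full-precision GGUF model can be shrunk with the `quantize` tool that the build produces. This is a minimal sketch: the file names are placeholders, and newer releases name the binary `llama-quantize`.

```bash
# Quantize a 16-bit GGUF model down to 4-bit (q4_0).
# Paths are illustrative; adjust to wherever your model lives.
./quantize models/llama-7b-f16.gguf models/llama-7b-q4_0.gguf q4_0
```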
## Installation
### Pre-built Binaries
Download pre-built binaries for your platform from the project's GitHub releases page: https://github.com/ggerganov/llama.cpp/releases
### Build from Source
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```
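To use GPU acceleration, the build needs the matching backend enabled. As one hedged example, older Makefile-based builds enabled NVIDIA CUDA support with the `LLAMA_CUBLAS` flag; newer releases have moved to CMake options (such as `-DGGML_CUDA=ON`), so check the build documentation for your version.

```bash
# CUDA-accelerated build, as used by older Makefile-based versions;
# newer releases configure backends through CMake instead.
make LLAMA_CUBLAS=1
```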
## Usage

### Basic Chat
```bash
./main -m models/llama-7b.gguf --prompt "Hello, how are you?"
```
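Generation can be tuned with a few common flags; the values below are illustrative, not recommendations.

```bash
# The same prompt with a few common knobs:
#   -n 128   generate at most 128 tokens
#   -c 2048  context window size in tokens
#   -t 8     number of CPU threads to use
./main -m models/llama-7b.gguf --prompt "Hello, how are you?" -n 128 -c 2048 -t 8
```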
### Interactive Mode

```bash
./main -m models/llama-7b.gguf -i
```
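In interactive mode, a reverse prompt hands control back to you whenever the model emits a given string. The marker below is an assumption and should match whatever turn prefix your prompt template uses.

```bash
# Interactive chat that pauses for user input whenever the model
# emits the string "User:" (marker chosen for illustration).
./main -m models/llama-7b.gguf -i -r "User:"
```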
### Server Mode

```bash
./server -m models/llama-7b.gguf
```
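The server exposes an HTTP API, by default on port 8080. A minimal sketch of a completion request is shown below; the `/completion` endpoint and JSON fields match the server's documentation at the time of writing, but verify against your version.

```bash
# Ask the running server for a completion over HTTP.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "n_predict": 64}'
```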
## Supported Models

- Llama 2/3 series
- Code Llama
- Other Llama-compatible models in GGUF format; see the conversion sketch below
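Models distributed as Hugging Face checkpoints first need converting to GGUF. The repo ships a conversion script for this; the script name and flags have varied across versions (this sketch assumes `convert_hf_to_gguf.py`, and the paths are placeholders).

```bash
# Convert a Hugging Face checkpoint directory to a single GGUF file.
# Script name and options vary by llama.cpp version; check the repo docs.
python3 convert_hf_to_gguf.py models/My-Llama-Model --outfile models/my-llama-model.gguf
```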