Patryk Koscik 62ac2b1a31 Initial commit		2026-04-26 15:09:19 +02:00
llama-cpp-turboquant@11a241d0db
models
.gitignore
.gitmodules
build.sh
init_opencode.sh
LICENSE.txt
README.md
run.sh

README.md

pkoscik's Local OpenCode setup

A local LLM inference setup using llama.cpp (TurboQuant fork) with ROCm on AMD hardware, serving as an OpenCode backend.

Hardware

GPU: AMD Radeon RX 6800 XT (16 GB, gfx1030, RDNA2)
CPU: AMD Ryzen 7 7700X (integrated GPU is gfx1036 - hidden from llama.cpp)
RAM: 64 GB system
OS: Arch-based

Quick Start

# 1. Build llama.cpp TurboQuant fork
./build.sh

# 2. Download models and start server (default: fast / 35B-A3B)
./run.sh

# 3. Point OpenCode to http://127.0.0.1:8080/v1

Presets

run.sh has four built-in presets. Set MODE to select one:

Preset	Model	Context	Thinking	CPU-MoE	Batch / UB	Best for
`fast`	35B-A3B MoE	32k	off	28	4096 / 2048	Daily agent work
`smart`	27B dense	32k	on (2048 budget)	0	4096 / 2048	Hard one-shot questions
`bigctx`	27B dense	100k	off	0	2048 / 512	Reading large codebases
`custom`	(you set)	(you set)	(you set)	(you set)	2048 / 512	Experimenting

./run.sh                           # default: fast
MODE=smart ./run.sh                # 27B with thinking
MODE=bigctx ./run.sh               # 27B with 100k context
MODE=fast CTX=65536 ./run.sh       # override context
MODE=fast THINKING=on ./run.sh     # force thinking on
MODE=fast N_CPU_MOE=32 ./run.sh    # tweak expert offload
MODE=bigctx UB=256 ./run.sh        # tighter compute buffer if OOM
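
The override behaviour above (a preset supplies defaults, any variable set on the command line wins) can be sketched with bash default assignment. This is a hypothetical illustration, not the contents of run.sh: the function name `resolve_preset` is invented, and the exact token counts behind "32k" and "100k" are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of preset resolution; run.sh may be written differently.
# ": ${VAR:=default}" assigns only when VAR is unset or empty, so values
# passed on the command line (e.g. CTX=65536) take precedence.
resolve_preset() {
  local mode="${MODE:-fast}"
  case "$mode" in
    fast)   # 32k assumed to mean 32768 tokens
      : "${CTX:=32768}" "${THINKING:=off}" "${N_CPU_MOE:=28}" "${B:=4096}" "${UB:=2048}" ;;
    smart)
      : "${CTX:=32768}" "${THINKING:=on}" "${THINK_BUDGET:=2048}" \
        "${N_CPU_MOE:=0}" "${B:=4096}" "${UB:=2048}" ;;
    bigctx) # 100k read literally as 100000 tokens (assumption)
      : "${CTX:=100000}" "${THINKING:=off}" "${N_CPU_MOE:=0}" "${B:=2048}" "${UB:=512}" ;;
    custom) # everything except batch sizes is left to the caller
      : "${B:=2048}" "${UB:=512}" ;;
    *)
      echo "unknown MODE: $mode" >&2
      return 1 ;;
  esac
  echo "mode=$mode ctx=${CTX:-unset} thinking=${THINKING:-unset} n_cpu_moe=${N_CPU_MOE:-unset} b=$B ub=$UB"
}

# Print the resolved settings for whatever MODE is set in the environment.
resolve_preset
```

With this pattern, `MODE=bigctx UB=256` keeps every bigctx default except the micro-batch size.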

Setup

1. Install dependencies

sudo pacman -Syu

# ROCm SDK
sudo pacman -S rocm-hip-sdk rocm-hip-runtime rocm-opencl-runtime \
               hipblas rocblas rocsolver rocsparse rocwmma

# Build dependencies
sudo pacman -S base-devel cmake ninja git curl

# GPU permissions
sudo usermod -aG video,render $USER

Reboot or re-login for group changes to take effect.
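
After logging back in, a quick check like the following can confirm that the session actually picked up both groups. The helper name `has_group` is made up for this sketch.

```bash
#!/usr/bin/env bash
# has_group "<space-separated group list>" <group>
# Returns 0 if <group> appears as a whole word in the list.
has_group() {
  local groups_list="$1" wanted="$2" g
  for g in $groups_list; do
    [ "$g" = "$wanted" ] && return 0
  done
  return 1
}

# id -nG prints the group names of the current login session.
current_groups="$(id -nG 2>/dev/null)"
for grp in video render; do
  if has_group "$current_groups" "$grp"; then
    echo "ok: $grp"
  else
    echo "missing: $grp (log out and back in, or re-run usermod)"
  fi
done
```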

2. Verify ROCm

rocminfo | grep -E 'gfx|Name'
#   Name: gfx1030        - your RX 6800 XT
#   Name: gfx1036        - Ryzen iGPU (must be hidden at runtime)
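
One common way to hide the iGPU at runtime is the HIP_VISIBLE_DEVICES environment variable. Whether run.sh uses exactly this mechanism is not shown here, and the device index below is an assumption (the discrete card is assumed to enumerate first); `DGPU_INDEX` is a hypothetical knob for this sketch.

```bash
#!/usr/bin/env bash
# Assumption: the RX 6800 XT (gfx1030) is HIP device 0 and the iGPU is device 1.
# Check the actual order on your machine before relying on this.
DGPU_INDEX="${DGPU_INDEX:-0}"

# Expose only the discrete GPU to HIP applications started from this shell.
export HIP_VISIBLE_DEVICES="$DGPU_INDEX"

echo "HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES"
```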

3. Build the TurboQuant fork

./build.sh

This clones https://github.com/TheTom/llama-cpp-turboquant, checks out the feature/turboquant-kv-cache branch, and builds with ROCm.

GGML_HIP_ROCWMMA_FATTN=OFF is required for RDNA2 (the rocWMMA flash-attention path only exists on RDNA3+).
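
For orientation, a ROCm configure step for this GPU might look roughly like the sketch below. It is not the contents of build.sh: the source and build directories are placeholders, and apart from GGML_HIP_ROCWMMA_FATTN the option names (GGML_HIP, AMDGPU_TARGETS) are the ones upstream llama.cpp documents for HIP builds, assumed to carry over to the fork.

```bash
#!/usr/bin/env bash
# Sketch only: build.sh may use different options or paths.
SRC_DIR="llama-cpp-turboquant"   # assumed checkout location (the submodule path)
BUILD_DIR="$SRC_DIR/build"       # assumed build directory

CMAKE_ARGS=(
  -G Ninja
  -DCMAKE_BUILD_TYPE=Release
  -DGGML_HIP=ON                   # enable the ROCm/HIP backend
  -DAMDGPU_TARGETS=gfx1030        # RX 6800 XT (RDNA2)
  -DGGML_HIP_ROCWMMA_FATTN=OFF    # rocWMMA flash attention is RDNA3+ only
)

# Print the commands instead of running them, so the sketch is safe to paste.
echo cmake -S "$SRC_DIR" -B "$BUILD_DIR" "${CMAKE_ARGS[@]}"
echo cmake --build "$BUILD_DIR"
```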

4. Configure OpenCode

sudo pacman -S opencode

./init_opencode.sh

The model ID (qwen36) is just a label - llama-server serves whatever GGUF is loaded. No config change is needed when switching modes; just stop the server, switch mode, restart, and start a fresh session in OpenCode.
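
Because the model ID is only a label, it can be useful to ask the server what it actually has loaded after switching modes. The sketch below assumes the fork keeps upstream llama-server's OpenAI-compatible `/v1/models` endpoint; `BASE_URL`, `api_url`, and `list_models` are names invented for this example.

```bash
#!/usr/bin/env bash
# Endpoint from Quick Start; override BASE_URL if the server runs elsewhere.
BASE_URL="${BASE_URL:-http://127.0.0.1:8080/v1}"

# Build a full URL for an API path, tolerating a trailing slash in BASE_URL.
api_url() {
  printf '%s/%s\n' "${BASE_URL%/}" "$1"
}

# Ask the running server which model it is serving (requires curl and a live server).
list_models() {
  curl -fsS "$(api_url models)"
}

echo "To inspect the loaded model, run: curl -fsS $(api_url models)"
```

Calling `list_models` while the server is up should return a JSON document describing the loaded GGUF.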

Models

Two GGUFs are downloaded automatically by run.sh:

Model	File	Size	Best for
Qwen3.6-27B Q3_K_XL	`Qwen3.6-27B-UD-Q3_K_XL.gguf`	13.5 GB	Hard one-shot questions, reasoning
Qwen3.6-35B-A3B Q4_K_XL	`Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf`	22 GB	Agent loops, daily driver

The 27B is smarter but slower. The 35B-A3B is a MoE model that activates only about 3B parameters per token, which makes it faster; the expert weights that do not fit in 16 GB of VRAM are offloaded to system RAM.
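
A rough way to see why the 35B-A3B needs expert offload while the 27B does not is to subtract the file size from the 16 GB of VRAM. This is only a first-order estimate: it ignores the KV cache and compute buffers, which also live in VRAM. The helper name `vram_headroom` is hypothetical.

```bash
#!/usr/bin/env bash
# vram_headroom VRAM_GB MODEL_GB -> GB left over (negative: weights alone do not fit)
vram_headroom() {
  awk -v vram="$1" -v model="$2" 'BEGIN { printf "%.1f\n", vram - model }'
}

echo "27B Q3_K_XL:     $(vram_headroom 16 13.5) GB headroom"
echo "35B-A3B Q4_K_XL: $(vram_headroom 16 22) GB headroom"
```

A negative result means at least that many gigabytes of weights have to stay in system RAM.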

Tuning Knobs

Knob	What it does	Higher	Lower
`CTX`	Context window in tokens	Remembers more, slower, more VRAM	Snappier, less memory
`B` / `UB`	Batch / micro-batch for prompt eval	Faster ingest, more VRAM	Slower ingest, fits bigger contexts
`THINKING`	Internal reasoning before answering	Better one-shot quality, much slower	Faster, fine for agent loops
`THINK_BUDGET`	Max thinking tokens per turn	More deliberation	Avoids token spirals
`N_CPU_MOE`	Number of layers whose MoE expert weights stay in system RAM	Less VRAM, slightly slower	More VRAM, slightly faster
`--cache-type-k/v`	KV cache precision	turbo3 = 3-bit (fits more context)	f16/bf16 = full precision (safer)
`-ngl`	Layers on GPU (99 = all)	More on GPU = faster	More on CPU = slower, more system RAM
`-np`	Parallel conversation slots	Multiple clients	Single client gets full KV
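
As a rough illustration of how these knobs could end up on the llama-server command line, the sketch below assembles an argument list for the `fast` preset and prints it. The short flags are upstream llama-server options; `turbo3` comes from the table above; the model path and the helper name `server_args` are assumptions. THINKING and THINK_BUDGET are left out because their mapping to flags is not shown here.

```bash
#!/usr/bin/env bash
# Sketch: translate the tuning knobs into llama-server arguments.
MODEL="${MODEL:-models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf}"   # assumed location
CTX="${CTX:-32768}"
B="${B:-4096}"
UB="${UB:-2048}"
N_CPU_MOE="${N_CPU_MOE:-28}"

server_args() {
  local args=(
    -m "$MODEL"
    -c "$CTX"                 # context window
    -b "$B" -ub "$UB"         # batch / micro-batch
    -ngl 99                   # all layers on the GPU
    -np 1                     # one conversation slot
    --cache-type-k turbo3 --cache-type-v turbo3
    --host 127.0.0.1 --port 8080
  )
  # Only add expert offload when requested (0 keeps all experts on the GPU).
  if [ "$N_CPU_MOE" -gt 0 ]; then
    args+=(--n-cpu-moe "$N_CPU_MOE")
  fi
  printf '%s\n' "${args[*]}"
}

echo "llama-server $(server_args)"
```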

Quantization

We use Unsloth's Dynamic 2.0 (UD prefix) non-uniform quantization. Sizes below match the 27B model; the same quant of the 35B-A3B is larger (22 GB at UD-Q4_K_XL):

Quant	Size	Quality vs BF16	Note
UD-Q2_K_XL	~10 GB	~92%	Only if really squeezed
UD-Q3_K_XL	~13.5 GB	~99%	Sweet spot for 27B
UD-Q4_K_XL	~16.5 GB	~99.5%	35B-A3B default
UD-Q6_K	~22 GB	~99.9%	Too big without offload
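
To relate file size to quantization level, the average bits per weight can be estimated as size × 8 / parameter count. The sketch below assumes the sizes in the table are for the 27B model, that GB means 10^9 bytes, and that the parameter count is exactly 27 billion, so the results are approximate. The helper name `bits_per_weight` is hypothetical.

```bash
#!/usr/bin/env bash
# bits_per_weight SIZE_GB PARAMS_BILLIONS -> average bits stored per parameter
bits_per_weight() {
  awk -v gb="$1" -v params_b="$2" 'BEGIN { printf "%.2f\n", gb * 8 / params_b }'
}

# Sizes from the table above, all evaluated against 27B parameters.
for size in 10 13.5 16.5 22; do
  printf '%s GB -> %s bits/weight\n' "$size" "$(bits_per_weight "$size" 27)"
done
```

The UD quants mix precisions per tensor, so the average does not have to equal the number in the quant name.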