KYNETRA FOUNDRY · Build Playbook

Create Your Model (CYM)

From frozen base to governed, served, sovereign model — the Foundry playbook, wired to real open-source tooling and Kynetra Studio.

How to read this

The 50 Foundry names are an owned vocabulary over real, production OSS techniques. The techniques are sufficient to build/fine-tune real models; the wiki is the map, this doc is the route, the OSS tools are the engine.
You almost never pretrain from scratch. Start from an open base and adapt — full pretraining of a 7B+ needs a GPU cluster and six-to-seven figures. CYM starts at a base and earns capability through adaptation, alignment, and retrieval.
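Mind the licenses of the base model *and* every dataset. Apache-2.0 / MIT bases (Qwen2.5-7B, Mistral-7B) are safest for commercial/sovereign use; Llama is permissive-with-conditions. Terms can differ between sizes of the same model family, so check the model card of the exact checkpoint you pin.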
Hardware reality: QLoRA on a 7B fits in ~16–24 GB VRAM (one consumer GPU). Full fine-tuning and multi-node need FSDP/ZeRO and real clusters.
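The hardware note above can be sanity-checked with back-of-envelope arithmetic. The sketch below is only an estimate of model state: the 7.6e9 parameter count and the bytes-per-parameter figures are rule-of-thumb assumptions, and activations, KV cache, and quantization constants are ignored, so real usage will be higher.

```python
# Rough lower bounds on GPU memory for model state only (no activations or KV cache).
# All byte-per-parameter figures are rule-of-thumb assumptions, not measurements.

GIB = 1024 ** 3

def weights_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold n_params at the given storage width, in GiB."""
    return n_params * bytes_per_param / GIB

N_PARAMS = 7.6e9  # assumed parameter count for a "7B" base such as Qwen2.5-7B

estimates = {
    "4-bit weights (QLoRA base)": weights_gib(N_PARAMS, 0.5),
    "bf16 weights (inference)": weights_gib(N_PARAMS, 2.0),
    # bf16 weights + bf16 grads + fp32 master copy + two fp32 Adam moments
    "full fine-tune, mixed-precision Adam": weights_gib(N_PARAMS, 2 + 2 + 4 + 4 + 4),
}

for label, gib in estimates.items():
    print(f"{label:40s} ~{gib:6.1f} GiB")
```

Under these assumptions the 4-bit base takes only a few GiB, which leaves most of a 16–24 GB card for activations and adapter state, while full fine-tuning state alone exceeds a single 80 GB GPU and has to be sharded.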

Worked example

Kynetra-CYM-7B

A domain assistant fine-tuned from an open 7B base — the running example in every stage below.

Base: Qwen2.5-7B-Instruct (Apache-2.0)
Alts: Mistral-7B-v0.3 (Apache-2.0), Llama-3.1-8B (community license), Gemma-2-9B (Gemma Terms of Use)
Goal: A retrieval-grounded support/ops assistant, QLoRA-adapted, DPO-aligned, AWQ-quantized, served on vLLM with an OpenAI-compatible API.

Prerequisites

GPU: 1× 24 GB (RTX 4090 / A5000) for QLoRA on 7B. 2–8× A100/H100 for full FT, merging at scale, or 70B work.
Python 3.10+, CUDA 12.x, PyTorch 2.4+. A clean venv/conda per project.
Install (worked example): pip install unsloth trl peft datasets accelerate bitsandbytes · pip install vllm · pip install llmcompressor lm-eval ragas deepeval mergekit · pip install text-dedup distilabel sentence-transformers qdrant-client · git clone https://github.com/ggerganov/llama.cpp
Accounts/access: Hugging Face token (gated bases), a vector DB (local Qdrant via Docker is fine), and a Kynetra Studio workspace.
Kynetra Studio is the full-lifecycle home (aistudio.kynetra.dev): Data → Train → Align → Evaluate → Serve → Govern. Every stage below names the Studio surface that wraps the OSS step.

The open-source stack

Layer	OSS tooling	Foundry components
Data & dedup	HF Datasets · text-dedup (MinHash/LSH) · distilabel (synthetic)	Glyphsieve · Corpusmith · Strataweave
Tokenizer	HF Tokenizers · SentencePiece (reuse base tokenizer)	Latticescript
Fine-tune (PEFT)	Unsloth (1-GPU speed) · Axolotl (YAML/multi-GPU) · TRL+PEFT · TorchTune (FSDP2) · LLaMA-Factory	RankWeave · NibbleGraft · MagnitudeForge
Alignment	TRL — DPOTrainer / ORPO / KTO / PPO	PreferLoom · CreedForge · Concordance Lattice
Merge / distill	mergekit (TIES/DARE/soup) · TRL distillation	Mergespire · Fluxbroth · Distilflux
Quantize	llm-compressor (AWQ, replaces AutoAWQ) · GPTQModel (replaces AutoGPTQ) · llama.cpp (GGUF Q4_K_M)	Errorforge · Nanocrush · Latticeprune
Retrieval / RAG	sentence-transformers / BGE embeddings · Qdrant · pgvector · LanceDB · cross-encoder rerankers	Memvault Lattice · Echograph · Graftsieve
Serving	vLLM (GPU, AWQ-Marlin, PagedAttention, continuous batching) · llama.cpp / Ollama (local/edge)	Pagewright KV · Flowbatch Loom · Speccast Relay
Evaluation	lm-evaluation-harness · RAGAS (RAG) · DeepEval (CI) · LLM-as-judge	Proofgrid · Arbiter Lattice · Veracity Quotient
Governance	Llama Guard / guardrails · FSDP2 (TorchTune) · DeepSpeed ZeRO · checkpoint orchestration	Sentinel Weave · Wardgate · Shardbastion · Anchorpoint

The build — 10 stages

Curate

03 · Corpus & Data Engineering

Build the dataset. The model IS the data: curation decides the ceiling.

Foundry: Glyphsieve (MinHash dedup) · Corpusmith (synthetic instructions) · Strataweave (domain mixture) · Curriculord (curriculum) · Latticescript (tokenizer)

OSS: HF datasets · text-dedup (MinHash/LSH) · distilabel (Self/Evol-Instruct)

Gather raw domain docs + any seed instruction pairs into JSONL ({"messages":[...]} chat format).
Near-dedup with MinHash/LSH so you do not over-train on copies.
Synthesize & evolve instructions from a teacher model to expand coverage; filter for diversity and quality.
Set a domain mixture (Strataweave) and an easy→hard ordering (Curriculord). Reuse the base tokenizer — do not retrain one unless you change languages.

python

from datasets import load_dataset
# 1) load your raw + seed instruction data
ds = load_dataset("json", data_files="data/kynetra_raw.jsonl")["train"]

# 2) near-dedup (MinHash) — text-dedup CLI
#    python -m text_dedup.minhash --path data/kynetra_raw.jsonl \
#       --column text --threshold 0.7 --output data/kynetra_dedup.jsonl

# 3) synthesize instructions with distilabel (Self-Instruct / Evol-Instruct)
#    then save the final SFT set as chat-formatted JSONL:
ds.to_json("data/kynetra_sft.jsonl")   # {"messages":[{"role":"user",...},{"role":"assistant",...}]}
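To make the near-dedup step concrete, here is a minimal pure-Python sketch of MinHash over word shingles. It is not the text-dedup implementation: it compares every pair directly instead of using LSH buckets, and the function names, shingle size, and number of hash functions are illustrative assumptions.

```python
import hashlib
import re

def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-grams of a lowercased, punctuation-free version of the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per seeded hash function; seeds stand in for random permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return signature

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def near_dedup(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a document only if no already-kept document is a near-duplicate (O(n^2), no LSH)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

docs = [
    "To reset billing, open Settings and choose Billing, then click Reset.",
    "To reset billing, open Settings and choose Billing, then click Reset!",
    "Rotate API keys every ninety days from the Security page.",
]
deduped = near_dedup(docs)
print(deduped)
```

The first two documents differ only in punctuation, so they share every shingle and the second is dropped. For a real corpus, LSH banding (as in text-dedup) avoids the quadratic pairwise comparison.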

Kynetra Studio

Studio › Data — upload sources, run dedup/synthesis, preview the mixture, version the dataset.

Pretrain

08 · Pretraining & Foundation Architecture

Choose the foundation. For 99% of teams this means PICKING an open base, not training one.

Foundry: Basalt Core (decoder-only transformer) · Forgecurve Doctrine (scaling laws) · Attentryx Grip (GQA) · Helix Anchor (RoPE)

OSS: Hugging Face Hub · scaling-law intuition (Chinchilla)

Pick a permissively-licensed base sized to your budget — 7–9B is the sweet spot for single-GPU adaptation.
Prefer Apache-2.0/MIT (Qwen2.5, Mistral) for commercial/sovereign freedom.
Only pretrain from scratch if you have cluster budget and a genuine architecture/data reason — otherwise adaptation wins on cost and time.

python

from huggingface_hub import snapshot_download
# Worked example base (Apache-2.0):
snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="bases/qwen2.5-7b")
# Alternatives: mistralai/Mistral-7B-Instruct-v0.3, meta-llama/Llama-3.1-8B-Instruct
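The scaling-law intuition behind "pick, do not pretrain" can be put into numbers. The sketch below uses the common approximation of about 6 FLOPs per parameter per training token. The A100 peak figure, the 40% utilization, and the 2T-token budget are assumptions chosen for illustration, not measurements of any specific model.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

def gpu_hours(flops: float, peak_flops_per_s: float, utilization: float) -> float:
    """Wall-clock GPU-hours at a given peak throughput and achieved utilization."""
    return flops / (peak_flops_per_s * utilization) / 3600.0

N = 7e9                   # parameters
A100_BF16_PEAK = 312e12   # dense bf16 peak, FLOP/s
MFU = 0.40                # assumed model FLOPs utilization

scenarios = {
    "Chinchilla-style (~20 tokens/param)": 20 * N,
    "over-trained open base (2T tokens, assumed)": 2e12,
}
hours_by_scenario = {
    label: gpu_hours(training_flops(N, tokens), A100_BF16_PEAK, MFU)
    for label, tokens in scenarios.items()
}
for label, hours in hours_by_scenario.items():
    print(f"{label:45s} ~{hours:,.0f} A100-hours")
```

Under these assumptions the 2T-token run needs on the order of 190,000 A100-hours. At a hypothetical rate of a couple of dollars per GPU-hour that is already six figures for compute alone, before data work, ablations, and failed runs.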

Kynetra Studio

Studio › Base Models — browse, license-check, and pin the base your project builds on.

Adapt

01 · Adaptation & PEFT

Specialize the frozen base cheaply: train ~0.1–1% of params with QLoRA.

Foundry: RankWeave (LoRA) · NibbleGraft (QLoRA) · MagnitudeForge (DoRA)

OSS: Unsloth (fastest, 1 GPU) · Axolotl (YAML, multi-GPU) · TRL SFTTrainer + PEFT

Load the base in 4-bit (QLoRA) to fit a single GPU.
Attach LoRA adapters to attention + MLP projections (r=16, alpha=32 is a solid default).
SFT on your curated chat dataset; keep adapters small and swappable.
Merge adapters into the base when you are happy (GraftFold) for a single deployable checkpoint.

python

from unsloth import FastLanguageModel            # import Unsloth before TRL so its patches apply
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model, tok = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", max_seq_length=4096, load_in_4bit=True)   # NibbleGraft (QLoRA)
model = FastLanguageModel.get_peft_model(                                # RankWeave (LoRA)
    model, r=16, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"])

train_ds = load_dataset("json", data_files="data/kynetra_sft.jsonl")["train"]  # chat-formatted {"messages":[...]}
trainer = SFTTrainer(model=model, tokenizer=tok,   # newer TRL releases name this argument processing_class
    args=SFTConfig(output_dir="cym-sft", per_device_train_batch_size=2,
                   gradient_accumulation_steps=8, learning_rate=2e-4, num_train_epochs=2),
    train_dataset=train_ds)
trainer.train()
trainer.save_model("cym-sft")   # write the final LoRA adapter so later stages can load "cym-sft"
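The "~0.1–1% of params" claim can be checked by counting what LoRA adds at r=16 on the seven target projections. The layer shapes below are assumptions taken from the published Qwen2.5-7B config (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128), and the 7.6e9 total is an approximate parameter count.

```python
# Assumed Qwen2.5-7B shapes (from its published config): hidden 3584, MLP 18944,
# 28 layers, 4 key/value heads of dimension 128 (grouped-query attention).
HIDDEN, MLP, LAYERS = 3584, 18944, 28
KV_DIM = 4 * 128  # k_proj / v_proj project down to 512

# (in_features, out_features) for each adapted linear layer in one transformer block
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV_DIM), "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN), "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) per adapted layer: r * (in + out) weights."""
    per_block = sum(r * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_block * LAYERS

TOTAL_PARAMS = 7.6e9  # assumed total for the base model
trainable = lora_params(r=16)
fraction = trainable / TOTAL_PARAMS
print(f"trainable: {trainable:,} ({100 * fraction:.2f}% of the base)")
```

With these shapes the adapters come to roughly 40 million weights, about half a percent of the base. Doubling r doubles that count, which is why rank is the main lever on adapter size.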

Kynetra Studio

Studio › Fine-tune — pick base + dataset + recipe (QLoRA), launch the job, watch loss/throughput live.

Align

04 · Alignment & Preference

Teach taste, not just tokens: make the model prefer good answers.

Foundry: PreferLoom (DPO) · CreedForge (RLAIF / Constitutional) · Concordance Lattice (reward model) · Assayloom Cull (rejection sampling)

OSS: TRL DPOTrainer · ORPO / KTO (reference-free / unpaired variants)

Build preference pairs: {prompt, chosen, rejected} — from human ranks, an LLM judge, or best-of-N rejection sampling.
Run DPO from your SFT checkpoint (DPO needs no separate reward model — simpler and stable).
Use ORPO if you want to fold alignment into one SFT-style pass; KTO if your data is unpaired.
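Watch the preference margin AND the KL divergence from the reference model: a margin that keeps growing while KL also climbs usually signals over-optimization (reward hacking) rather than better answers.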

python

from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

pref = load_dataset("json", data_files="data/kynetra_pref.jsonl")["train"]  # prompt/chosen/rejected
DPOTrainer(model="cym-sft",
    args=DPOConfig(beta=0.1, output_dir="cym-dpo", learning_rate=5e-6, num_train_epochs=1),
    train_dataset=pref).train()
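To show what the trainer optimizes and where the "preference margin" comes from, here is a minimal sketch of the per-example DPO loss computed from summed sequence log-probabilities. The function name and the example log-probabilities are made up for illustration; TRL's DPOTrainer computes the same quantity in batched tensor form.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    Returns (loss, margin), where margin is the gap between the implicit
    rewards beta * (log pi - log pi_ref) of the chosen and rejected answers.
    """
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, margin

# Hypothetical log-probabilities under the frozen reference (the SFT checkpoint).
ref_chosen, ref_rejected = -42.0, -40.0

# At step 0 the policy equals the reference, so the margin is 0 and the loss is ln 2.
start_loss, start_margin = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# Later the policy raises the chosen answer and lowers the rejected one.
later_loss, later_margin = dpo_loss(-38.0, -45.0, ref_chosen, ref_rejected)

print(f"start: loss={start_loss:.4f} margin={start_margin:.2f}")
print(f"later: loss={later_loss:.4f} margin={later_margin:.2f}")
```

A larger beta penalizes drift from the reference more strongly for the same log-probability shift, which is why beta and the KL from the reference are worth tracking together.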

Kynetra Studio

Studio › Align — import preference pairs, choose DPO/ORPO/KTO, launch, track margin + KL.

Distill

05 · Distillation & Merging

Optional: fuse skills or pour a big model into a small one.

Foundry: Mergespire (TIES merge) · Fluxbroth (model soup) · Distilflux (logit KD)

OSS: mergekit (TIES/DARE/soup) · TRL distillation

Merge multiple domain adapters/checkpoints into one with TIES (sign-aware) when you need several skills in one model.
Average several runs of the SAME base (soup) for a flatter, more general checkpoint.
Distill a larger teacher into your 7B when you want its behavior at lower serving cost.
Skip this stage entirely if a single adapter already meets quality.

yaml

# mergekit TIES config (merge two domain checkpoints onto the base)
models:
  - model: cym-dpo
    parameters: { weight: 0.6, density: 0.7 }
  - model: cym-ops-adapter
    parameters: { weight: 0.4, density: 0.7 }
merge_method: ties
base_model: Qwen/Qwen2.5-7B-Instruct
dtype: bfloat16
# run:  mergekit-yaml merge.yml cym-merged
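What "sign-aware" means in a TIES merge is easier to see on a toy example. The sketch below applies the three TIES steps (trim, elect sign, disjoint mean) to flat lists of floats. It is a simplified illustration and not mergekit's code: it ignores the per-model `weight` values from the config above, and the function names are assumptions.

```python
def trim(task_vector, density):
    """Keep the top `density` fraction of entries by magnitude and zero the rest."""
    k = max(1, round(len(task_vector) * density))
    order = sorted(range(len(task_vector)), key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]

def ties_merge(base, finetuned_models, density=0.7, scale=1.0):
    """Unweighted TIES merge of several fine-tunes of the same base."""
    # 1) Task vectors: what each fine-tune changed relative to the base.
    task_vectors = [[w - b for w, b in zip(model, base)] for model in finetuned_models]
    # 2) Trim low-magnitude changes, which are mostly noise.
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for j, b in enumerate(base):
        column = [tv[j] for tv in trimmed]
        # 3) Elect a sign: the direction with the larger total magnitude wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 4) Disjoint mean: average only the entries that agree with the elected sign.
        agreeing = [v for v in column if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * delta)
    return merged

base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, -0.5, 0.02, 0.4]
model_b = [0.8, 0.6, 0.0, -0.1]
merged = ties_merge(base, [model_a, model_b], density=0.5)
print(merged)
```

In the second position the two models disagree in sign. A plain average would largely cancel them out, while TIES keeps only the change on the winning side.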

Kynetra Studio

Studio › Compose — merge/soup checkpoints or run a distillation job from a teacher endpoint.

Compress

02 · Quantization & Compression

Shrink for serving: 4-bit weights, a fraction of the memory, near-equal quality.

Foundry: Errorforge Calibrate (GPTQ/AWQ) · Nanocrush NF4 (GGUF) · Latticeprune Sparsefold (2:4 sparsity)

OSS: llm-compressor → AWQ (for vLLM; replaces deprecated AutoAWQ) · GPTQModel (replaces archived AutoGPTQ) · llama.cpp → GGUF Q4_K_M (local/edge)

For GPU serving: produce an AWQ build — vLLM runs it on the fast Marlin kernel.
For laptops/edge: convert to GGUF and quantize to Q4_K_M for llama.cpp / Ollama.
Calibrate on a small in-domain sample so quantization error lands where it matters least.

bash

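# A) AWQ for vLLM (llm-compressor). AWQ needs a modifier recipe plus tokenized in-domain calibration data;
#    argument names follow the llm-compressor AWQ example and may vary by version (calibration path is hypothetical).
python - <<'EOF'
from datasets import load_from_disk
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

oneshot(model="cym-merged", dataset=load_from_disk("data/kynetra_calib"), output_dir="cym-awq",
        recipe=[AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])])
EOF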

# B) GGUF Q4_K_M for local (llama.cpp)
python llama.cpp/convert_hf_to_gguf.py cym-merged --outfile cym.gguf
./llama.cpp/llama-quantize cym.gguf cym-Q4_K_M.gguf Q4_K_M
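For intuition about what 4-bit weights cost in accuracy, here is a minimal sketch of symmetric round-to-nearest group quantization. This is the plain baseline, not AWQ or GPTQ: those methods use calibration data to decide where rounding error is allowed to land. The group size and example weights are illustrative assumptions.

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest: one float scale per group, a small signed int per weight."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def quantize(weights, group_size=4, bits=4):
    """Split the weights into groups and quantize each group with its own scale."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]

weights = [0.12, -0.50, 0.33, 0.07, 0.01, -0.02, 0.015, 0.03]
packed = quantize(weights, group_size=4)
restored = [w for codes, scale in packed for w in dequantize_group(codes, scale)]

for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}  (error {abs(original - approx):.4f})")
```

Each group gets its own scale, so the small weights in the second group are not crushed by the large value in the first. The rounding error per weight is bounded by half of that group's scale.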

Kynetra Studio

Studio › Export — one-click AWQ (server) or GGUF (edge) artifacts, with a quick perplexity check.

Augment

06 · Retrieval, Memory & Long Context

Give the model the knowledge it was never trained on: ground answers in your data.

Foundry: Memvault Lattice (RAG) · Echograph Embeddings (bi-encoder) · Graftsieve Rerank (cross-encoder) · Riftspan Rotary (long context)

OSS: sentence-transformers / BGE embeddings · Qdrant · pgvector · LanceDB · cross-encoder reranker

Chunk and embed your knowledge base; store vectors in a DB.
At query time: retrieve top-k by vector similarity, rerank with a cross-encoder, then prepend the survivors to the prompt.
Retrieval recall is the ceiling — invest in chunking and reranking before bigger models.
Extend context (YaRN/RoPE) only with a short long-context fine-tune; raw interpolation degrades quality.

python

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models

emb = SentenceTransformer("BAAI/bge-base-en-v1.5")          # Echograph (768-dim vectors)
qc = QdrantClient(":memory:")                               # Memvault store
qc.create_collection("kb", vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE))
# index chunks -> qc.upsert("kb", points=[...]); at query time:
hits = qc.query_points("kb", query=emb.encode("how do I reset billing?").tolist(), limit=20).points
# rerank top-20 with a cross-encoder (Graftsieve), keep top-4, prepend to prompt
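The retrieve, rerank, and prepend flow can be sketched end to end without any model downloads. Everything here is a toy stand-in: `embed` is a bag-of-words vector instead of a BGE bi-encoder, `overlap_score` is a token-overlap heuristic instead of a cross-encoder, and the knowledge-base chunks are invented. Only the control flow is the point.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k):
    """Stage 1: cheap vector similarity over every chunk, keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def overlap_score(query, passage):
    """Toy stand-in for a cross-encoder, which would score the pair jointly."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    p_tokens = set(re.findall(r"\w+", passage.lower()))
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0

def rerank(query, candidates, score_pair, keep):
    """Stage 2: a more expensive pairwise scorer over the shortlist only."""
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)[:keep]

def build_prompt(query, passages):
    """Stage 3: prepend the surviving passages to the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"

chunks = [
    "To reset billing, open Settings, choose Billing, and click Reset.",
    "Billing invoices are emailed on the first day of each month.",
    "Rotate API keys every ninety days from the Security page.",
    "Password reset links expire after one hour.",
]
query = "how do I reset billing?"
candidates = retrieve(query, chunks, k=3)
passages = rerank(query, candidates, overlap_score, keep=2)
prompt = build_prompt(query, passages)
print(prompt)
```

The split matters for cost: the first stage must be cheap because it touches the whole index, while the reranker can be slower because it only sees the shortlist.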

Kynetra Studio

Studio › Knowledge — connect sources, build the index, tune retrieval/rerank, attach to the model.

Serve

07 · Inference & Serving Architecture

Tokens per dollar: expose the model behind a fast, OpenAI-compatible API.

Foundry: Pagewright KV (PagedAttention) · Flowbatch Loom (continuous batching) · Speccast Relay (speculative decoding)

OSS: vLLM (GPU, production) · llama.cpp / Ollama (local/edge)

Serve the AWQ build on vLLM — you get PagedAttention + continuous batching for free.
Add a small draft model for speculative decoding if latency matters and outputs are low-temperature.
For local/offline, run the GGUF build via Ollama or llama.cpp server.

bash

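# GPU production (OpenAI-compatible at :8000/v1). llm-compressor saves in compressed-tensors format, which
# vLLM detects from the checkpoint config; pass --quantization awq_marlin only for AutoAWQ-format checkpoints.
vllm serve cym-awq --max-model-len 8192 --port 8000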

# Local / edge
ollama create cym -f Modelfile   # FROM ./cym-Q4_K_M.gguf
ollama run cym
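Because the endpoint is OpenAI-compatible, a client needs nothing beyond an HTTP POST to `/v1/chat/completions`. The sketch below uses only the standard library. The helper names are illustrative, and it assumes the server from the vLLM command above is listening on localhost:8000 and serving the model under the name `cym-awq`.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000/v1", model="cym-awq",
                       temperature=0.2, max_tokens=256):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires a running server, so the call is left commented out:
# print(chat("How do I reset billing?"))
```

Any OpenAI SDK pointed at the same base URL should work the same way, which keeps application code independent of whether the model runs on vLLM or on a local llama.cpp or Ollama server.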

Kynetra Studio

Studio › Deploy — promote an artifact to a managed endpoint (scale, keys, rate limits, logs).

Evaluate

09 · Evaluation, Observability & Governance

If you can't measure it, you can't ship it. Score before and after every change.

Foundry: Proofgrid (eval harness) · Arbiter Lattice (LLM-as-judge) · Veracity Quotient (groundedness) · Driftwatch Sentinel (drift)

OSS: lm-evaluation-harness (academic benchmarks) · RAGAS (RAG faithfulness) · DeepEval (CI gates) + LLM-as-judge

Run standardized benchmarks against the served endpoint for a comparable baseline.
Use an LLM judge with rubric + position-swap for task-specific quality; RAGAS for retrieval faithfulness.
Wire DeepEval into CI so regressions block releases; monitor production drift continuously.

bash

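# Benchmark the live model through the vLLM OpenAI-compatible endpoint.
# local-completions expects the full completions URL; API backends do not support
# automatic batch sizing, so set request concurrency explicitly.
lm_eval --model local-completions \
  --model_args base_url=http://localhost:8000/v1/completions,model=cym-awq,num_concurrent=4 \
  --tasks mmlu,gsm8k,ifeval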

# RAG faithfulness / answer-relevancy with RAGAS, judge quality with DeepEval in CI.
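The position-swap trick for LLM judges can be sketched independently of any judge model. In the sketch below the judge is any callable that returns "first", "second", or "tie". The two stub judges are hypothetical and exist only to show how a position-biased judge gets neutralized.

```python
from typing import Callable

# (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]

def compare_with_swap(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if it survives swapping positions."""
    forward = judge(prompt, answer_a, answer_b)
    backward = judge(prompt, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"

def always_first(prompt, first, second):
    """Stub judge with pure position bias: it always prefers whatever it sees first."""
    return "first"

def prefers_longer(prompt, first, second):
    """Stub judge with a content-based (if crude) preference: the longer answer wins."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

biased = compare_with_swap(always_first, "q", "short", "a longer answer")
consistent = compare_with_swap(prefers_longer, "q", "short", "a longer answer")
print(biased, consistent)
```

The biased judge contradicts itself once the order flips, so its verdict collapses to a tie, while the consistent judge picks the same answer in both orders. In practice the callable would wrap a rubric prompt sent to the judge model.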

Kynetra Studio

Studio › Evaluate — benchmark suites, judge rubrics, RAG metrics, and a drift dashboard on the live endpoint.

Govern

10 · Safety, Sovereignty & Orchestration

Train on your terms, behind your walls: guardrails, sharding, sovereign deploy.

Foundry: Sentinel Weave (guardrails) · Wardgate (jailbreak defense) · Shardbastion (FSDP/ZeRO) · Sovryn Vault (air-gapped) · Anchorpoint (checkpointing)

OSS: Llama Guard / guardrail classifiers · FSDP2 (TorchTune) · DeepSpeed ZeRO · resumable checkpoint orchestration

Wrap the endpoint with input/output guardrails and jailbreak/prompt-injection defenses.
For big training, shard with FSDP/ZeRO-3 across GPUs; checkpoint with RNG + data-position state so runs resume exactly.
For regulated/sovereign work, run the whole pipeline air-gapped against a mirrored artifact registry.

bash

# Guardrail in front of the endpoint (Llama Guard pattern): classify -> allow/deny -> model
# Sharded training when you outgrow one GPU (TorchTune FSDP2):
tune run --nproc_per_node 8 full_finetune_distributed --config qwen2_5/7B_full
# Anchorpoint: enable resumable checkpoints (weights + optimizer + RNG + dataloader position).
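Why a resumable checkpoint needs RNG and data position, not just weights, shows up even in a toy loop. The sketch below is a stand-in for a real trainer: the "weights" are a single float, the stochastic update is a random multiplier, and the function name and checkpoint layout are assumptions made for illustration.

```python
import pickle
import random

def run(data, total_steps, checkpoint=None, stop_at=None):
    """Toy training loop; returns (weights, checkpoint_bytes).

    The "weights" depend on both data order and random draws, so an exact
    resume has to restore the step counter, the data position, and the RNG.
    """
    rng = random.Random(1234)
    weights, step = 0.0, 0
    if checkpoint is not None:
        state = pickle.loads(checkpoint)
        weights, step = state["weights"], state["step"]
        rng.setstate(state["rng"])            # continue the same random stream
    end = total_steps if stop_at is None else min(stop_at, total_steps)
    while step < end:
        sample = data[step % len(data)]       # step doubles as the dataloader position
        weights += sample * rng.random()      # stand-in for a stochastic update
        step += 1
    state = {"weights": weights, "step": step, "rng": rng.getstate()}
    return weights, pickle.dumps(state)

data = [0.5, -1.0, 2.0, 0.25]
uninterrupted, _ = run(data, total_steps=10)
_, ckpt = run(data, total_steps=10, stop_at=4)         # simulate a crash after step 4
resumed, _ = run(data, total_steps=10, checkpoint=ckpt)
print(resumed == uninterrupted)
```

If the RNG state were left out of the checkpoint, the resumed run would restart the random stream from its seed and drift away from the uninterrupted one. A real trainer has the same issue with dropout masks and shuffling order, plus optimizer state.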

Kynetra Studio

Studio › Govern — policies & guardrails, sovereign/air-gapped deploy targets, lineage and audit trail.

Recipe card · Kynetra-CYM-7B

base      Qwen2.5-7B-Instruct (Apache-2.0)
curate    HF datasets + text-dedup (MinHash) + distilabel synth
adapt     QLoRA r=16 a=32   (Unsloth + TRL SFTTrainer)
align     DPO beta=0.1      (TRL DPOTrainer)
compress  AWQ (vLLM)   +   GGUF Q4_K_M (local)
augment   BGE embeddings + Qdrant + cross-encoder rerank
serve     vLLM  (OpenAI-compatible :8000/v1)
evaluate  lm-eval-harness + RAGAS + DeepEval (CI)
govern    guardrails + FSDP2 + resumable checkpoints

Sources

Fine-tuning frameworks (Axolotl/Unsloth/TRL/TorchTune) · HF TRL (DPO/ORPO/KTO) · Unsloth · Axolotl · llm-compressor (AWQ; replaces AutoAWQ) · GPTQModel (replaces AutoGPTQ) · llama.cpp (GGUF) · vLLM serving · lm-evaluation-harness · RAGAS · mergekit