Create Your Model (CYM)
From frozen base to governed, served, sovereign model — the Foundry playbook, wired to real open-source tooling and Kynetra Studio.
How to read this
- The 50 Foundry names are an owned vocabulary over real, production OSS techniques. The techniques are sufficient to build/fine-tune real models; the wiki is the map, this doc is the route, the OSS tools are the engine.
- You almost never pretrain from scratch. Start from an open base and adapt — full pretraining of a 7B+ needs a GPU cluster and six-to-seven figures. CYM starts at a base and earns capability through adaptation, alignment, and retrieval.
- Mind the licenses — of the base model *and* every dataset. Apache-2.0 / MIT bases (Qwen2.5, Mistral) are safest for commercial/sovereign use; Llama is permissive-with-conditions.
- Hardware reality: QLoRA on a 7B fits in ~16–24 GB VRAM (one consumer GPU). Full fine-tuning and multi-node need FSDP/ZeRO and real clusters.
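The hardware bullet can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: a nominal 7e9 base parameters, roughly 40M LoRA parameters, and the common rule of thumb of 16 bytes per trainable parameter for mixed-precision Adam (2 for weights, 2 for gradients, 12 for fp32 master weights plus the two moments). Activations, KV cache, and quantization metadata are not modeled, which is why real QLoRA runs land well above the weights-only figure.

```python
GB = 1e9

def weights_gb(n_params, bits):
    """Memory for the weights alone at a given precision."""
    return n_params * bits / 8 / GB

def adam_training_gb(n_trainable, bytes_per_param=16):
    """Weights + grads + Adam state for the trainable params (rule of thumb)."""
    return n_trainable * bytes_per_param / GB

BASE_PARAMS = 7e9    # assumption: nominal "7B"
LORA_PARAMS = 40e6   # assumption: typical r=16 adapter on a 7B model

frozen_4bit = weights_gb(BASE_PARAMS, 4)        # frozen base under QLoRA
frozen_bf16 = weights_gb(BASE_PARAMS, 16)       # same base in bf16
adapter_train = adam_training_gb(LORA_PARAMS)   # trainable adapter state
full_finetune = adam_training_gb(BASE_PARAMS)   # every weight trainable

print(f"QLoRA frozen base (4-bit): {frozen_4bit:.2f} GB")
print(f"LoRA adapter training state: {adapter_train:.2f} GB")
print(f"bf16 weights only: {frozen_bf16:.2f} GB")
print(f"Full fine-tune with Adam: {full_finetune:.2f} GB")
```

Under these assumptions the frozen base plus adapter state is only a few GB, leaving the rest of a 24 GB card for activations, while full fine-tuning exceeds any single consumer GPU before activations are counted.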
Worked example
A domain assistant fine-tuned from an open 7B base — the running example in every stage below.
- Base: Qwen2.5-7B-Instruct (Apache-2.0)
- Alts: Mistral-7B-v0.3 (Apache-2.0), Llama-3.1-8B (community license), Gemma-2-9B
- Goal: A retrieval-grounded support/ops assistant, QLoRA-adapted, DPO-aligned, AWQ-quantized, served on vLLM with an OpenAI-compatible API.
Prerequisites
- GPU: 1× 24 GB (RTX 4090 / A5000) for QLoRA on 7B. 2–8× A100/H100 for full FT, merging at scale, or 70B work.
- Python 3.10+, CUDA 12.x, PyTorch 2.4+. A clean venv/conda per project.
- Install (worked example):
pip install unsloth trl peft datasets accelerate bitsandbytes
pip install vllm
pip install llmcompressor lm-eval ragas
pip install sentence-transformers qdrant-client
git clone https://github.com/ggerganov/llama.cpp
- Accounts/access: Hugging Face token (gated bases), a vector DB (local Qdrant via Docker is fine), and a Kynetra Studio workspace.
- Kynetra Studio is the full-lifecycle home (aistudio.kynetra.dev): Data → Train → Align → Evaluate → Serve → Govern. Every stage below names the Studio surface that wraps the OSS step.
The open-source stack
| Layer | OSS tooling | Foundry components |
|---|---|---|
| Data & dedup | HF Datasets · text-dedup (MinHash/LSH) · distilabel (synthetic) | Glyphsieve · Corpusmith · Strataweave |
| Tokenizer | HF Tokenizers · SentencePiece (reuse base tokenizer) | Latticescript |
| Fine-tune (PEFT) | Unsloth (1-GPU speed) · Axolotl (YAML/multi-GPU) · TRL+PEFT · TorchTune (FSDP2) · LLaMA-Factory | RankWeave · NibbleGraft · MagnitudeForge |
| Alignment | TRL — DPOTrainer / ORPO / KTO / PPO | PreferLoom · CreedForge · Concordance Lattice |
| Merge / distill | mergekit (TIES/DARE/soup) · TRL distillation | Mergespire · Fluxbroth · Distilflux |
| Quantize | llm-compressor (AWQ, replaces AutoAWQ) · GPTQModel (replaces AutoGPTQ) · llama.cpp (GGUF Q4_K_M) | Errorforge · Nanocrush · Latticeprune |
| Retrieval / RAG | sentence-transformers / BGE embeddings · Qdrant · pgvector · LanceDB · cross-encoder rerankers | Memvault Lattice · Echograph · Graftsieve |
| Serving | vLLM (GPU, AWQ-Marlin, PagedAttention, continuous batching) · llama.cpp / Ollama (local/edge) | Pagewright KV · Flowbatch Loom · Speccast Relay |
| Evaluation | lm-evaluation-harness · RAGAS (RAG) · DeepEval (CI) · LLM-as-judge | Proofgrid · Arbiter Lattice · Veracity Quotient |
| Governance | Llama Guard / guardrails · FSDP2 (TorchTune) · DeepSpeed ZeRO · checkpoint orchestration | Sentinel Weave · Wardgate · Shardbastion · Anchorpoint |
The build — 10 stages
Build the dataset. The model IS the data — curation decides the ceiling.
datasets · text-dedup (MinHash/LSH) · distilabel (Self/Evol-Instruct)
- Gather raw domain docs + any seed instruction pairs into JSONL ({"messages":[...]} chat format).
- Near-dedup with MinHash/LSH so you do not over-train on copies.
- Synthesize & evolve instructions from a teacher model to expand coverage; filter for diversity and quality.
- Set a domain mixture (Strataweave) and an easy→hard ordering (Curriculord). Reuse the base tokenizer — do not retrain one unless you change languages.
from datasets import load_dataset
# 1) load your raw + seed instruction data
ds = load_dataset("json", data_files="data/kynetra_raw.jsonl")["train"]
# 2) near-dedup (MinHash) — text-dedup CLI
# python -m text_dedup.minhash --path data/kynetra_raw.jsonl \
# --column text --threshold 0.7 --output data/kynetra_dedup.jsonl
# 3) synthesize instructions with distilabel (Self-Instruct / Evol-Instruct),
#    load the deduplicated + synthesized result back into `ds`,
#    then save the final SFT set as chat-formatted JSONL:
ds.to_json("data/kynetra_sft.jsonl")  # {"messages":[{"role":"user",...},{"role":"assistant",...}]}
Studio › Data: upload sources, run dedup/synthesis, preview the mixture, version the dataset.
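To make the near-dedup step concrete, here is a minimal standard-library sketch of MinHash. It is not the text-dedup implementation: it compares signatures pairwise instead of bucketing them with LSH, so it only suits small corpora. The shingle size, the 128 hash functions, and the sample documents are illustrative assumptions; the 0.7 threshold mirrors the CLI example above.

```python
import hashlib
import re

NUM_PERM = 128   # assumption: number of hash functions in the signature
SHINGLE = 5      # assumption: character 5-grams

def shingles(text, k=SHINGLE):
    """Lowercase, collapse whitespace, and return the set of character k-grams."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}

def _hash(shingle, seed):
    """One member of the hash family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def signature(text):
    """MinHash signature: the minimum hash of the shingle set per hash function."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(NUM_PERM)]

def est_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def near_dedup(docs, threshold=0.7):
    """Keep a document only if it is not too similar to any already kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = signature(doc)
        if all(est_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

docs = [
    "To reset billing, open Settings, choose Billing, and click Reset cycle. Changes apply at the next invoice.",
    "To reset billing, open Settings, choose Billing, and click Reset cycle.  Changes apply at the next invoice!",
    "Rotate API keys every ninety days and revoke any key that appears in logs or public repositories.",
]
unique = near_dedup(docs)
print(len(docs), "->", len(unique), "documents after near-dedup")
```

The second document differs from the first only in spacing and final punctuation, so it is dropped; the unrelated third document survives.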
Choose the foundation. For 99% of teams this means PICKING an open base, not training one.
- Pick a permissively-licensed base sized to your budget — 7–9B is the sweet spot for single-GPU adaptation.
- Prefer Apache-2.0/MIT (Qwen2.5, Mistral) for commercial/sovereign freedom.
- Only pretrain from scratch if you have cluster budget and a genuine architecture/data reason — otherwise adaptation wins on cost and time.
from huggingface_hub import snapshot_download
# Worked example base (Apache-2.0):
snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="bases/qwen2.5-7b")
# Alternatives: mistralai/Mistral-7B-Instruct-v0.3, meta-llama/Llama-3.1-8B-Instruct
Studio › Base Models: browse, license-check, and pin the base your project builds on.
Specialize the frozen base cheaply — train ~0.1–1% of params with QLoRA.
SFTTrainer + PEFT
- Load the base in 4-bit (QLoRA) to fit a single GPU.
- Attach LoRA adapters to attention + MLP projections (r=16, alpha=32 is a solid default).
- SFT on your curated chat dataset; keep adapters small and swappable.
- Merge adapters into the base when you are happy (GraftFold) for a single deployable checkpoint.
from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load the base in 4-bit so it fits on one GPU
model, tok = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", max_seq_length=4096, load_in_4bit=True)  # NibbleGraft (QLoRA)
# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(  # RankWeave (LoRA)
    model, r=16, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"])
# Supervised fine-tuning on the curated chat dataset
# (newer TRL releases name the tokenizer argument `processing_class`)
SFTTrainer(model=model, tokenizer=tok,
    args=SFTConfig(output_dir="cym-sft", per_device_train_batch_size=2,
        gradient_accumulation_steps=8, learning_rate=2e-4, num_train_epochs=2),
    train_dataset=load_dataset("json", data_files="data/kynetra_sft.jsonl")["train"]).train()
Studio › Fine-tune: pick base + dataset + recipe (QLoRA), launch the job, watch loss/throughput live.
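The "~0.1–1% of params" claim can be checked by counting adapter weights. A LoRA adapter on a linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters. The sketch below assumes layer dimensions believed to match the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 KV heads of dimension 128) and a nominal 7.6e9 base parameters; verify them against the model's config.json before relying on the numbers.

```python
# Assumed architecture values (check config.json of the base you actually use)
HIDDEN = 3584
INTERMEDIATE = 18944
LAYERS = 28
KV_DIM = 4 * 128      # 4 KV heads x head_dim 128 (grouped-query attention)
BASE_PARAMS = 7.6e9   # assumption: nominal total parameter count
R = 16                # LoRA rank from the recipe

# (d_in, d_out) for each projection targeted by the adapter
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in PROJECTIONS.values())
trainable = per_layer * LAYERS
fraction = trainable / BASE_PARAMS

print(f"{trainable:,} trainable LoRA params = {fraction:.2%} of the base")
```

Doubling the rank doubles the adapter size, so r is the main knob for trading capacity against memory.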
Teach taste, not just tokens — make the model prefer good answers.
DPOTrainer · ORPO / KTO (reference-free / unpaired variants)
- Build preference pairs: {prompt, chosen, rejected}, from human ranks, an LLM judge, or best-of-N rejection sampling.
- Run DPO from your SFT checkpoint (DPO needs no separate reward model, so it is simpler and more stable to run).
- Use ORPO if you want to fold alignment into one SFT-style pass; KTO if your data is unpaired.
- Watch the preference margin AND the KL divergence from the reference model: a high margin combined with high KL usually signals over-optimization (reward hacking), not genuine improvement.
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
pref = load_dataset("json", data_files="data/kynetra_pref.jsonl")["train"] # prompt/chosen/rejected
DPOTrainer(model="cym-sft",
    args=DPOConfig(beta=0.1, output_dir="cym-dpo", learning_rate=5e-6, num_train_epochs=1),
    train_dataset=pref).train()
Studio › Align: import preference pairs, choose DPO/ORPO/KTO, launch, track margin + KL.
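For intuition about what the trainer optimizes, the DPO objective for a single preference pair can be written in a few lines. This is a plain-Python sketch of the published loss, not TRL's implementation; the log-probabilities are made-up numbers standing in for summed token log-probs of each completion under the policy and the frozen reference.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (loss, margin) for one pair, given summed sequence log-probs."""
    chosen_reward = beta * (pi_chosen - ref_chosen)        # implicit reward of the chosen answer
    rejected_reward = beta * (pi_rejected - ref_rejected)  # implicit reward of the rejected answer
    margin = chosen_reward - rejected_reward
    loss = math.log1p(math.exp(-margin))                   # equals -log(sigmoid(margin))
    return loss, margin

# Illustrative log-probs (assumptions, not real model outputs)
# At initialization the policy equals the reference: zero margin, loss = ln 2
loss_start, margin_start = dpo_loss(-42.0, -40.0, -42.0, -40.0)

# After training, the policy raises the chosen answer and lowers the rejected one
loss_later, margin_later = dpo_loss(-38.0, -47.0, -42.0, -40.0)

print(f"start: margin={margin_start:.2f} loss={loss_start:.4f}")
print(f"later: margin={margin_later:.2f} loss={loss_later:.4f}")
```

The margin is the quantity to track during training. It grows as the policy moves away from the reference, which is why it should be read together with KL.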
Optional — fuse skills or pour a big model into a small one.
mergekit (TIES/DARE/soup) · TRL distillation
- Merge multiple domain adapters/checkpoints into one with TIES (sign-aware) when you need several skills in one model.
- Average several runs of the SAME base (soup) for a flatter, more general checkpoint.
- Distill a larger teacher into your 7B when you want its behavior at lower serving cost.
- Skip this stage entirely if a single adapter already meets quality.
# mergekit TIES config (merge two domain checkpoints onto the base)
models:
  - model: cym-dpo
    parameters: { weight: 0.6, density: 0.7 }
  - model: cym-ops-adapter
    parameters: { weight: 0.4, density: 0.7 }
merge_method: ties
base_model: Qwen/Qwen2.5-7B-Instruct
dtype: bfloat16
# run: mergekit-yaml merge.yml cym-merged
Studio › Compose: merge/soup checkpoints or run a distillation job from a teacher endpoint.
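What "sign-aware" means in TIES is easiest to see on toy vectors. The sketch below follows the three published steps (trim each task vector to its largest-magnitude entries, elect a sign per parameter, average only the values that agree with it). It is a simplified illustration: it ignores the per-model `weight` values and normalization options that mergekit applies, and the four-element "checkpoints" are invented.

```python
def trim(task_vector, density):
    """Keep the top `density` fraction of entries by magnitude, zero the rest."""
    k = max(1, round(len(task_vector) * density))
    keep = set(sorted(range(len(task_vector)),
                      key=lambda i: abs(task_vector[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]

def sign(x):
    return (x > 0) - (x < 0)

def ties_merge(base, checkpoints, density=0.7):
    # Task vector = fine-tuned weights minus base weights
    task_vectors = [[c - b for c, b in zip(ckpt, base)] for ckpt in checkpoints]
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        elected = sign(sum(column))                       # elect the dominant sign
        agreeing = [v for v in column if v != 0.0 and sign(v) == elected]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + delta)                          # add the merged update to the base
    return merged

base = [0.0, 0.0, 0.0, 0.0]
ckpt_a = [1.0, -0.1, 0.5, 0.0]    # toy stand-in for one fine-tuned checkpoint
ckpt_b = [0.8, 0.6, -0.7, 0.05]   # toy stand-in for another
merged = ties_merge(base, [ckpt_a, ckpt_b], density=0.5)
print(merged)
```

In the third position the two checkpoints disagree in sign; TIES keeps the larger-magnitude direction instead of averaging the two toward zero, which is the interference a plain average would suffer.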
Shrink for serving — 4-bit weights, a fraction of the memory, near-equal quality.
llm-compressor → AWQ (for vLLM; replaces deprecated AutoAWQ) · GPTQModel (replaces archived AutoGPTQ) · llama.cpp → GGUF Q4_K_M (local/edge)
- For GPU serving: produce an AWQ build; vLLM runs it on the fast Marlin kernel.
- For laptops/edge: convert to GGUF and quantize to Q4_K_M for llama.cpp / Ollama.
- Calibrate on a small in-domain sample so quantization error lands where it matters least.
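# A) AWQ for vLLM (llm-compressor)
# Note: recipe='awq' is shorthand here; llm-compressor normally expects an
# AWQ modifier recipe plus a calibration dataset. Check its AWQ example.
python -c "from llmcompressor import oneshot; \
  oneshot(model='cym-merged', recipe='awq', output_dir='cym-awq')"
# B) GGUF Q4_K_M for local (llama.cpp); build llama.cpp first, the binary path depends on your build
python llama.cpp/convert_hf_to_gguf.py cym-merged --outfile cym.gguf
./llama.cpp/llama-quantize cym.gguf cym-Q4_K_M.gguf Q4_K_M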
Studio › Export — one-click AWQ (server) or GGUF (edge) artifacts, with a quick perplexity check.
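The idea underneath every 4-bit format can be sketched with plain round-to-nearest group quantization. This is deliberately not AWQ or GPTQ (both use calibration data to decide where error is tolerable); it only shows how a group of weights becomes 4-bit codes plus a scale and offset, and how much error that introduces. The weight values and the group size of 128 used for the storage estimate are assumptions.

```python
def quantize_group(weights, bits=4):
    """Asymmetric min-max quantization of one weight group to integer codes."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

group = [0.12, -0.40, 0.33, 0.05, -0.18, 0.27, -0.02, 0.41]   # toy weights
codes, scale, offset = quantize_group(group)
restored = dequantize_group(codes, scale, offset)
max_err = max(abs(w - r) for w, r in zip(group, restored))

# Storage estimate, assuming an fp16 scale and fp16 offset per group of 128 weights
effective_bits = 4 + (16 + 16) / 128
size_gb = 7e9 * effective_bits / 8 / 1e9

print("codes:", codes)
print(f"max round-trip error: {max_err:.4f} (bound: {scale / 2:.4f})")
print(f"~{effective_bits} bits/weight -> ~{size_gb:.2f} GB for 7e9 weights")
```

Round-to-nearest error is bounded by half a quantization step per weight. Calibrated methods keep the same storage cost but shift that error onto weights the activations care about least.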
Give the model the knowledge it was never trained on — ground answers in your data.
sentence-transformers / BGE embeddings · Qdrant · pgvector · LanceDB · cross-encoder reranker
- Chunk and embed your knowledge base; store vectors in a DB.
- At query time: retrieve top-k by vector similarity, rerank with a cross-encoder, then prepend the survivors to the prompt.
- Retrieval recall is the ceiling — invest in chunking and reranking before bigger models.
- Extend context (YaRN/RoPE) only with a short long-context fine-tune; raw interpolation degrades quality.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
emb = SentenceTransformer("BAAI/bge-base-en-v1.5") # Echograph
qc = QdrantClient(":memory:") # Memvault store
# create the "kb" collection and index chunks -> qc.upsert(...); at query time:
# (newer qdrant-client releases prefer qc.query_points(...) over qc.search(...))
hits = qc.search("kb", emb.encode("how do I reset billing?").tolist(), limit=20)
# rerank top-20 with a cross-encoder (Graftsieve), keep top-4, prepend to prompt
Studio › Knowledge: connect sources, build the index, tune retrieval/rerank, attach to the model.
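The retrieve, rerank, prepend flow can be traced end to end without any services. In the sketch below, a bag-of-words cosine stands in for the BGE bi-encoder and a token-coverage score stands in for the cross-encoder; both are toy substitutes chosen so the example runs with the standard library only. The knowledge-base snippets and the prompt template are invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Toy stand-in for a bi-encoder embedding: a bag-of-words count vector."""
    return Counter(tokens(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rerank_score(query, chunk):
    """Toy stand-in for a cross-encoder: share of query tokens found in the chunk."""
    q = set(tokens(query))
    return len(q & set(tokens(chunk))) / len(q) if q else 0.0

def retrieve(query, chunks, top_k=3, keep=2):
    qv = embed(query)
    # Stage 1: cheap vector similarity over the whole store
    candidates = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:top_k]
    # Stage 2: more precise scoring on the short candidate list only
    return sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)[:keep]

def build_prompt(query, context_chunks):
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

KB = [
    "To reset billing, open Settings > Billing and click Reset cycle.",
    "Billing invoices are emailed on the first day of each month.",
    "API keys can be rotated from the Security page.",
    "Password reset links expire after 30 minutes.",
]
query = "how do I reset billing?"
top = retrieve(query, KB)
prompt = build_prompt(query, top)
print(prompt)
```

The two-stage shape is the point: the first stage must be fast because it sees everything, while the second can be slow because it sees only the shortlist.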
Tokens per dollar — expose the model behind a fast, OpenAI-compatible API.
- Serve the AWQ build on vLLM — you get PagedAttention + continuous batching for free.
- Add a small draft model for speculative decoding if latency matters and outputs are low-temperature.
- For local/offline, run the GGUF build via Ollama or llama.cpp server.
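# GPU production (OpenAI-compatible at :8000/v1)
# If the checkpoint was saved by llm-compressor (compressed-tensors format), vLLM
# typically reads the scheme from the model config and --quantization can be omitted.
vllm serve cym-awq --quantization awq_marlin --max-model-len 8192 --port 8000
# Local / edge
ollama create cym -f Modelfile   # Modelfile contains: FROM ./cym-Q4_K_M.gguf
ollama run cym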
Studio › Deploy — promote an artifact to a managed endpoint (scale, keys, rate limits, logs).
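Because the endpoint speaks the OpenAI chat-completions protocol, any HTTP client can call it. The sketch below uses only the standard library and assumes the server from the example is running locally on port 8000 with the model served under the name `cym-awq`; the system prompt and sampling values are illustrative. The request is built separately from the network call so the payload can be inspected offline.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: local vLLM server from the example

def build_chat_request(user_message, model="cym-awq", temperature=0.2, max_tokens=256):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a concise support assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(user_message):
    """Send the request and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(user_message), timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]

req = build_chat_request("How do I reset billing?")
print(req.get_method(), req.full_url)
```

Calling `chat("How do I reset billing?")` returns the model's reply once the server is up. Existing OpenAI SDK clients work the same way when pointed at this base URL.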
If you can't measure it, you can't ship it. Score before and after every change.
lm-evaluation-harness (academic benchmarks) · RAGAS (RAG faithfulness) · DeepEval (CI gates) + LLM-as-judge
- Run standardized benchmarks against the served endpoint for a comparable baseline.
- Use an LLM judge with rubric + position-swap for task-specific quality; RAGAS for retrieval faithfulness.
- Wire DeepEval into CI so regressions block releases; monitor production drift continuously.
# Benchmark the live model through its OpenAI-compatible endpoint
# (local-completions expects the full completions URL)
lm_eval --model local-completions \
  --model_args base_url=http://localhost:8000/v1/completions,model=cym-awq \
  --tasks mmlu,gsm8k,ifeval --batch_size auto
# RAG faithfulness / answer-relevancy with RAGAS, judge quality with DeepEval in CI.
Studio › Evaluate — benchmark suites, judge rubrics, RAG metrics, and a drift dashboard on the live endpoint.
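The position-swap trick for LLM judges is worth spelling out, because judges tend to favor whichever answer appears first. A minimal sketch: ask twice with the order flipped and only count a win when both verdicts agree. The two judge functions are hypothetical stand-ins for a real model call; one is deliberately position-biased and the other applies a trivial keyword rubric.

```python
def judge_with_swap(judge, prompt, answer_a, answer_b):
    """Run the judge in both orders; return 'A', 'B', or 'tie' if the verdicts conflict."""
    verdict_ab = judge(prompt, answer_a, answer_b)   # A shown first
    verdict_ba = judge(prompt, answer_b, answer_a)   # B shown first
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"   # the verdict flipped with position, so do not trust it

def position_biased_judge(prompt, first, second):
    """Hypothetical bad judge: always prefers whatever is shown first."""
    return "first"

def keyword_judge(prompt, first, second):
    """Hypothetical rubric judge: prefers the answer that names the Settings page."""
    def score(answer):
        return "settings" in answer.lower()
    return "first" if score(first) >= score(second) else "second"

question = "How do I reset billing?"
good = "Open Settings > Billing and click Reset cycle."
weak = "Please contact support."

biased_result = judge_with_swap(position_biased_judge, question, good, weak)
rubric_result = judge_with_swap(keyword_judge, question, good, weak)
print("biased judge:", biased_result)
print("rubric judge:", rubric_result)
```

A high tie rate across an evaluation set is itself a signal that the judge prompt or rubric needs work before its scores gate a release.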
Train on your terms, behind your walls — guardrails, sharding, sovereign deploy.
- Wrap the endpoint with input/output guardrails and jailbreak/prompt-injection defenses.
- For big training, shard with FSDP/ZeRO-3 across GPUs; checkpoint with RNG + data-position state so runs resume exactly.
- For regulated/sovereign work, run the whole pipeline air-gapped against a mirrored artifact registry.
# Guardrail in front of the endpoint (Llama Guard pattern): classify -> allow/deny -> model
# Sharded training when you outgrow one GPU (TorchTune FSDP2):
tune run --nproc_per_node 8 full_finetune_distributed --config qwen2_5/7B_full
# Anchorpoint: enable resumable checkpoints (weights + optimizer + RNG + dataloader position).
Studio › Govern — policies & guardrails, sovereign/air-gapped deploy targets, lineage and audit trail.
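Why a checkpoint needs RNG and data-position state, not just weights, can be shown with a toy loop. The sketch below simulates training by recording which example is drawn at each step plus one random number (standing in for dropout or augmentation noise). Model weights and optimizer state are omitted; the function name and checkpoint layout are illustrative, not any framework's format.

```python
import random

def run(examples, total_steps, state=None, seed=0):
    """Simulate training steps; return (log of steps taken, checkpoint dict)."""
    rng = random.Random(seed)
    if state is None:
        state = {"step": 0, "position": 0, "order": [], "rng": rng.getstate()}
    rng.setstate(state["rng"])                 # restore the exact RNG stream
    step, position, order = state["step"], state["position"], list(state["order"])
    log = []
    while step < total_steps:
        if position >= len(order):             # start of an epoch: reshuffle
            order = list(range(len(examples)))
            rng.shuffle(order)
            position = 0
        sample = examples[order[position]]
        noise = rng.random()                   # stand-in for dropout/augmentation randomness
        log.append((sample, noise))
        position += 1
        step += 1
    checkpoint = {"step": step, "position": position,
                  "order": order, "rng": rng.getstate()}
    return log, checkpoint

examples = ["a", "b", "c", "d", "e"]
uninterrupted, _ = run(examples, total_steps=12)

first_part, ckpt = run(examples, total_steps=7)               # job is pre-empted at step 7
second_part, _ = run(examples, total_steps=12, state=ckpt)    # job resumes from the checkpoint

print("resume is exact:", first_part + second_part == uninterrupted)
```

Dropping either the RNG state or the position from the checkpoint makes the resumed run see different data in a different order, which is what makes long runs hard to audit or reproduce.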
Recipe card · Kynetra-CYM-7B
- base: Qwen2.5-7B-Instruct (Apache-2.0)
- curate: HF datasets + text-dedup (MinHash) + distilabel synth
- adapt: QLoRA r=16 a=32 (Unsloth + TRL SFTTrainer)
- align: DPO beta=0.1 (TRL DPOTrainer)
- compress: AWQ (vLLM) + GGUF Q4_K_M (local)
- augment: BGE embeddings + Qdrant + cross-encoder rerank
- serve: vLLM (OpenAI-compatible :8000/v1)
- evaluate: lm-eval-harness + RAGAS + DeepEval (CI)
- govern: guardrails + FSDP2 + resumable checkpoints