KYNETRA FOUNDRY · Build Playbook

Create Your Model (CYM)

From frozen base to governed, served, sovereign model — the Foundry playbook, wired to real open-source tooling and Kynetra Studio.

How to read this

  • The 50 Foundry names are an owned vocabulary over real, production OSS techniques. The techniques are sufficient to build/fine-tune real models; the wiki is the map, this doc is the route, the OSS tools are the engine.
  • You almost never pretrain from scratch. Start from an open base and adapt: full pretraining of a 7B+ model needs a GPU cluster and a six-to-seven-figure budget. CYM starts at a base and earns capability through adaptation, alignment, and retrieval.
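  • Mind the licenses of the base model *and* every dataset. Apache-2.0 / MIT bases (Qwen2.5-7B, Mistral-7B) are safest for commercial/sovereign use; Llama ships under a community license with conditions. Licenses can differ between sizes of the same family (some Qwen2.5 sizes are not Apache-2.0), so check the model card of the exact checkpoint.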
  • Hardware reality: QLoRA on a 7B fits in ~16–24 GB VRAM (one consumer GPU). Full fine-tuning and multi-node need FSDP/ZeRO and real clusters.
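
The hardware figures above can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch under stated assumptions: about 7.6B parameters for the worked-example base, 0.5 bytes per parameter for 4-bit weights, and the common rule of thumb of roughly 16 bytes per parameter for mixed-precision full fine-tuning with Adam. Activations, KV cache, and framework overhead are not counted, so real usage is higher.

```python
# Back-of-envelope weight/optimizer memory. Every constant is an assumption.
GB = 1e9

def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for parameters (and per-parameter training state), in GB."""
    return n_params * bytes_per_param / GB

N = 7.6e9  # assumed parameter count for a "7B" base

qlora_base = weights_gb(N, 0.5)   # frozen base stored in 4-bit
bf16_base = weights_gb(N, 2.0)    # 16-bit weights (inference or plain LoRA)
# bf16 weights (2) + bf16 grads (2) + fp32 master copy (4) + Adam m and v (4 + 4)
full_ft = weights_gb(N, 16.0)

print(f"4-bit base:   {qlora_base:6.1f} GB")
print(f"bf16 base:    {bf16_base:6.1f} GB")
print(f"full FT Adam: {full_ft:6.1f} GB (before activations)")
```

The gap between the first and last line is the reason QLoRA fits one consumer GPU while full fine-tuning needs sharding across several.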

Worked example

Kynetra-CYM-7B

A domain assistant fine-tuned from an open 7B base — the running example in every stage below.

Base
Qwen2.5-7B-Instruct (Apache-2.0)
Alts
Mistral-7B-v0.3 (Apache-2.0), Llama-3.1-8B (Llama community license), Gemma-2-9B (Gemma Terms of Use)
Goal
A retrieval-grounded support/ops assistant, QLoRA-adapted, DPO-aligned, AWQ-quantized, served on vLLM with an OpenAI-compatible API.

Prerequisites

  • GPU: 1× 24 GB (RTX 4090 / A5000) for QLoRA on 7B. 2–8× A100/H100 for full FT, merging at scale, or 70B work.
  • Python 3.10+, CUDA 12.x, PyTorch 2.4+. A clean venv/conda per project.
  • Install (worked example): pip install unsloth trl peft datasets accelerate bitsandbytes · pip install vllm · pip install llmcompressor lm-eval ragas deepeval · pip install text-dedup distilabel mergekit torchtune sentence-transformers qdrant-client · git clone https://github.com/ggerganov/llama.cpp
  • Accounts/access: Hugging Face token (gated bases), a vector DB (local Qdrant via Docker is fine), and a Kynetra Studio workspace.
  • Kynetra Studio is the full-lifecycle home (aistudio.kynetra.dev): Data → Train → Align → Evaluate → Serve → Govern. Every stage below names the Studio surface that wraps the OSS step.

The open-source stack

Layer | OSS tooling | Foundry components
Data & dedup | HF Datasets · text-dedup (MinHash/LSH) · distilabel (synthetic) | Glyphsieve · Corpusmith · Strataweave
Tokenizer | HF Tokenizers · SentencePiece (reuse base tokenizer) | Latticescript
Fine-tune (PEFT) | Unsloth (1-GPU speed) · Axolotl (YAML/multi-GPU) · TRL+PEFT · TorchTune (FSDP2) · LLaMA-Factory | RankWeave · NibbleGraft · MagnitudeForge
Alignment | TRL (DPOTrainer / ORPO / KTO / PPO) | PreferLoom · CreedForge · Concordance Lattice
Merge / distill | mergekit (TIES/DARE/soup) · TRL distillation | Mergespire · Fluxbroth · Distilflux
Quantize | llm-compressor (AWQ, replaces AutoAWQ) · GPTQModel (replaces AutoGPTQ) · llama.cpp (GGUF Q4_K_M) | Errorforge · Nanocrush · Latticeprune
Retrieval / RAG | sentence-transformers / BGE embeddings · Qdrant · pgvector · LanceDB · cross-encoder rerankers | Memvault Lattice · Echograph · Graftsieve
Serving | vLLM (GPU, AWQ-Marlin, PagedAttention, continuous batching) · llama.cpp / Ollama (local/edge) | Pagewright KV · Flowbatch Loom · Speccast Relay
Evaluation | lm-evaluation-harness · RAGAS (RAG) · DeepEval (CI) · LLM-as-judge | Proofgrid · Arbiter Lattice · Veracity Quotient
Governance | Llama Guard / guardrails · FSDP2 (TorchTune) · DeepSpeed ZeRO · checkpoint orchestration | Sentinel Weave · Wardgate · Shardbastion · Anchorpoint

The build — 10 stages

01

Curate

03 · Corpus & Data Engineering ↗

Build the dataset. The model IS the data — curation decides the ceiling.

  1. Gather raw domain docs + any seed instruction pairs into JSONL ({"messages":[...]} chat format).
  2. Near-dedup with MinHash/LSH so you do not over-train on copies.
  3. Synthesize & evolve instructions from a teacher model to expand coverage; filter for diversity and quality.
  4. Set a domain mixture (Strataweave) and an easy→hard ordering (Curriculord). Reuse the base tokenizer — do not retrain one unless you change languages.
python
from datasets import load_dataset
# 1) load your raw + seed instruction data
ds = load_dataset("json", data_files="data/kynetra_raw.jsonl")["train"]

# 2) near-dedup (MinHash) — text-dedup CLI
#    python -m text_dedup.minhash --path data/kynetra_raw.jsonl \
#       --column text --threshold 0.7 --output data/kynetra_dedup.jsonl

# 3) synthesize instructions with distilabel (Self-Instruct / Evol-Instruct)
#    then save the final SFT set as chat-formatted JSONL:
ds.to_json("data/kynetra_sft.jsonl")   # {"messages":[{"role":"user",...},{"role":"assistant",...}]}
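
To make step 2 concrete, here is a minimal pure-Python sketch of MinHash near-dedup. It is an illustration of the idea, not the text-dedup implementation: the function names, the 3-word shingles, and the 64 hash functions are assumptions, and it compares every pair of signatures, whereas LSH banding is what lets real tools scale past a few thousand documents.

```python
import hashlib
import re

def shingles(text: str, n: int = 3) -> set[str]:
    """Lowercased word n-grams; punctuation is ignored."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(token: str, seed: int) -> int:
    """64-bit hash of a shingle under one of the seeded hash functions."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash(text: str, num_perm: int = 64) -> list[int]:
    """Signature: the minimum hash of the shingle set under each hash function."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_perm)]

def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_dedup(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Greedy filter: keep a document only if it is not too close to any kept one."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(similarity(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept

docs = [
    "How do I reset my billing cycle for the team plan?",
    "how do I reset my billing cycle for the team plan",   # same text, different case/punctuation
    "Rotate the API key from the security settings page.",
]
kept = near_dedup(docs)
print(len(docs), "->", len(kept))
```

The threshold plays the same role as the `--threshold 0.7` value in the commented CLI call: lower values drop more aggressively.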
Kynetra Studio

Studio › Data — upload sources, run dedup/synthesis, preview the mixture, version the dataset.

02

Pretrain

08 · Pretraining & Foundation Architecture ↗

Choose the foundation. For 99% of teams this means PICKING an open base, not training one.

  1. Pick a permissively-licensed base sized to your budget — 7–9B is the sweet spot for single-GPU adaptation.
  2. Prefer Apache-2.0/MIT (Qwen2.5, Mistral) for commercial/sovereign freedom.
  3. Only pretrain from scratch if you have cluster budget and a genuine architecture/data reason — otherwise adaptation wins on cost and time.
python
from huggingface_hub import snapshot_download
# Worked example base (Apache-2.0):
snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="bases/qwen2.5-7b")
# Alternatives: mistralai/Mistral-7B-Instruct-v0.3, meta-llama/Llama-3.1-8B-Instruct
Kynetra Studio

Studio › Base Models — browse, license-check, and pin the base your project builds on.

03

Adapt

01 · Adaptation & PEFT ↗

Specialize the frozen base cheaply — train ~0.1–1% of params with QLoRA.

OSS: Unsloth (fastest, 1 GPU) · Axolotl (YAML, multi-GPU) · TRL SFTTrainer + PEFT
  1. Load the base in 4-bit (QLoRA) to fit a single GPU.
  2. Attach LoRA adapters to attention + MLP projections (r=16, alpha=32 is a solid default).
  3. SFT on your curated chat dataset; keep adapters small and swappable.
  4. Merge adapters into the base when you are happy (GraftFold) for a single deployable checkpoint.
python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset   # needed for the train_dataset line below

model, tok = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", max_seq_length=4096, load_in_4bit=True)   # NibbleGraft (QLoRA)
model = FastLanguageModel.get_peft_model(                                # RankWeave (LoRA)
    model, r=16, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"])

SFTTrainer(model=model, tokenizer=tok,
    args=SFTConfig(output_dir="cym-sft", per_device_train_batch_size=2,
                   gradient_accumulation_steps=8, learning_rate=2e-4, num_train_epochs=2),
    train_dataset=load_dataset("json", data_files="data/kynetra_sft.jsonl")["train"]).train()
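
The "~0.1–1% of params" figure can be checked by counting what LoRA adds: each adapted linear layer of shape (d_in, d_out) gains r × (d_in + d_out) trainable weights. The sketch below does that count for the seven target modules. The layer dimensions are assumptions taken from the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128), and the 7.6B total is approximate; read `config.json` for your own base.

```python
# Count LoRA trainable parameters for r=16 on all attention + MLP projections.
# Dimensions below are assumed from the public Qwen2.5-7B config.
HIDDEN, INTERMEDIATE, LAYERS = 3584, 18944, 28
KV_DIM = 4 * 128          # grouped-query attention: 4 KV heads x head_dim 128
R = 16

# (in_features, out_features) of each targeted projection
targets = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in targets.values())
total = per_layer * LAYERS
BASE_PARAMS = 7.6e9       # assumed total parameter count of the base
fraction = total / BASE_PARAMS

print(f"trainable LoRA params: {total:,}")      # 40,370,176
print(f"share of base:         {fraction:.2%}")  # 0.53%
```

Doubling r doubles the adapter size, so r=64 on the same modules would still sit near 2% of the base.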
Kynetra Studio

Studio › Fine-tune — pick base + dataset + recipe (QLoRA), launch the job, watch loss/throughput live.

04

Align

04 · Alignment & Preference ↗

Teach taste, not just tokens — make the model prefer good answers.

  1. Build preference pairs: {prompt, chosen, rejected} — from human ranks, an LLM judge, or best-of-N rejection sampling.
  2. Run DPO from your SFT checkpoint. DPO needs no separate reward model, which makes it simpler and usually more stable than PPO-style RLHF.
  3. Use ORPO if you want to fold alignment into one SFT-style pass; KTO if your data is unpaired.
  4. Watch the preference margin AND the KL from the reference: a rising margin with a fast-growing KL usually signals over-optimization (reward hacking), not better answers.
python
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

pref = load_dataset("json", data_files="data/kynetra_pref.jsonl")["train"]  # prompt/chosen/rejected
DPOTrainer(model="cym-sft",
    args=DPOConfig(beta=0.1, output_dir="cym-dpo", learning_rate=5e-6, num_train_epochs=1),
    train_dataset=pref).train()
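
For intuition about what the trainer optimizes and what the "margin" in step 4 measures, here is a minimal sketch of the per-pair DPO loss computed from sequence log-probabilities. The function name and the example log-probabilities are made up for illustration; TRL computes the same quantity in batches over tokens.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    Returns (loss, margin), where margin is beta times the difference between
    the policy-vs-reference log-ratio of the chosen and rejected answers.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    loss = math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
    return loss, margin

# At the start the policy IS the reference, so the margin is zero.
loss_start, margin_start = dpo_loss(-14.0, -15.0, -14.0, -15.0)

# Later (illustrative numbers): chosen got more likely, rejected less likely.
loss_later, margin_later = dpo_loss(-12.0, -20.0, -14.0, -15.0)

print(f"start: margin={margin_start:.2f} loss={loss_start:.4f}")
print(f"later: margin={margin_later:.2f} loss={loss_later:.4f}")
```

Note that the later margin here comes mostly from pushing the rejected answer down by 5 nats, not from raising the chosen one; that kind of drift away from the reference is what the KL check in step 4 is meant to catch.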
Kynetra Studio

Studio › Align — import preference pairs, choose DPO/ORPO/KTO, launch, track margin + KL.

05

Distill

05 · Distillation & Merging ↗

Optional — fuse skills or pour a big model into a small one.

OSS: mergekit (TIES/DARE/soup) · TRL distillation
  1. Merge multiple domain adapters/checkpoints into one with TIES (sign-aware) when you need several skills in one model.
  2. Average several runs of the SAME base (soup) for a flatter, more general checkpoint.
  3. Distill a larger teacher into your 7B when you want its behavior at lower serving cost.
  4. Skip this stage entirely if a single adapter already meets quality.
yaml
# mergekit TIES config (merge two domain checkpoints onto the base)
# each entry should be a full checkpoint; merge a bare LoRA adapter into the base first
models:
  - model: cym-dpo
    parameters: { weight: 0.6, density: 0.7 }
  - model: cym-ops-adapter
    parameters: { weight: 0.4, density: 0.7 }
merge_method: ties
base_model: Qwen/Qwen2.5-7B-Instruct
dtype: bfloat16
# run:  mergekit-yaml merge.yml cym-merged
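
What "sign-aware" means in step 1 can be shown on a toy weight vector. The sketch below follows the three TIES steps (trim, elect sign, disjoint merge) on plain Python lists. It is a simplified illustration under assumptions, not mergekit's code: weights are applied before sign election, and the trim keeps the top `density` fraction of each task vector by magnitude.

```python
def ties_merge(base, finetuned, weights, density=0.7, lam=1.0):
    """Merge several fine-tuned weight vectors onto a shared base (toy TIES)."""
    # Task vectors: what each fine-tune changed relative to the base.
    taus = [[f - b for f, b in zip(model, base)] for model in finetuned]

    # 1) Trim: keep only the largest-magnitude entries of each task vector.
    trimmed = []
    for tau in taus:
        k = max(1, round(density * len(tau)))
        cutoff = sorted((abs(x) for x in tau), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= cutoff else 0.0 for x in tau])

    merged = []
    for j in range(len(base)):
        col = [(w * t[j], w) for w, t in zip(weights, trimmed)]
        # 2) Elect sign: the direction with more weighted mass wins.
        sign = 1.0 if sum(v for v, _ in col) >= 0 else -1.0
        # 3) Disjoint merge: average only the entries that agree with that sign.
        agree = [(v, w) for v, w in col if v * sign > 0]
        denom = sum(w for _, w in agree)
        delta = sum(v for v, _ in agree) / denom if denom else 0.0
        merged.append(base[j] + lam * delta)
    return merged

base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, -0.5, 0.1, 0.0]    # toy "cym-dpo" weights
model_b = [0.8, 0.6, -0.05, 0.2]   # toy second domain checkpoint
merged = ties_merge(base, [model_a, model_b], weights=[0.6, 0.4], density=0.5)
print([round(x, 3) for x in merged])
```

In the second coordinate the two models disagree in sign; a plain average would shrink both toward zero, while TIES keeps the change from the side with more weighted mass.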
Kynetra Studio

Studio › Compose — merge/soup checkpoints or run a distillation job from a teacher endpoint.

06

Compress

02 · Quantization & Compression ↗

Shrink for serving — 4-bit weights, a fraction of the memory, near-equal quality.

OSS: llm-compressor → AWQ (for vLLM; replaces deprecated AutoAWQ) · GPTQModel (replaces archived AutoGPTQ) · llama.cpp → GGUF Q4_K_M (local/edge)
  1. For GPU serving: produce an AWQ build — vLLM runs it on the fast Marlin kernel.
  2. For laptops/edge: convert to GGUF and quantize to Q4_K_M for llama.cpp / Ollama.
  3. Calibrate on a small in-domain sample so quantization error lands where it matters least.
bash
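# A) AWQ for vLLM (llm-compressor). oneshot() takes a recipe of modifiers plus calibration
#    data, not the string 'awq'; names follow recent llm-compressor releases, so verify locally.
python - <<'EOF'
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"])]
oneshot(model="cym-merged", dataset="open_platypus", recipe=recipe,  # swap in an in-domain set
        num_calibration_samples=256, max_seq_length=2048, output_dir="cym-awq")
EOF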

# B) GGUF Q4_K_M for local (llama.cpp)
python llama.cpp/convert_hf_to_gguf.py cym-merged --outfile cym.gguf
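# build llama.cpp first (cmake -B build && cmake --build build); binaries land in build/bin
./llama.cpp/build/bin/llama-quantize cym.gguf cym-Q4_K_M.gguf Q4_K_M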
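
To show where quantization error comes from, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization of one weight group. This is plain round-to-nearest, not AWQ or Q4_K_M: those methods add activation-aware scaling or more elaborate block formats on top of the same store-integers-plus-a-scale idea. The function names and sample weights are illustrative.

```python
def quantize_group(ws, bits=4):
    """Symmetric round-to-nearest: one float scale per group, ints in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in ws) / qmax or 1.0   # avoid a zero scale for all-zero groups
    q = [max(-qmax, min(qmax, round(w / scale))) for w in ws]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate weights from the integers and the group scale."""
    return [v * scale for v in q]

weights = [0.70, -0.42, 0.13, 0.02]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("ints:    ", q)
print("restored:", [round(r, 3) for r in restored])
print(f"max error {max_err:.3f} (bounded by scale/2 = {scale / 2:.3f})")

# Storage cost with an assumed group size of 128 and one fp16 scale per group:
bits_per_weight = 4 + 16 / 128
print(f"about {bits_per_weight:.3f} bits per weight")
```

One large weight in a group inflates the scale and coarsens every other weight in that group, which is why step 3 calibrates on in-domain data: it tells the quantizer which weights matter most.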
Kynetra Studio

Studio › Export — one-click AWQ (server) or GGUF (edge) artifacts, with a quick perplexity check.

07

Augment

06 · Retrieval, Memory & Long Context ↗

Give the model the knowledge it was never trained on — ground answers in your data.

OSS: sentence-transformers / BGE embeddings · Qdrant · pgvector · LanceDB · cross-encoder reranker
  1. Chunk and embed your knowledge base; store vectors in a DB.
  2. At query time: retrieve top-k by vector similarity, rerank with a cross-encoder, then prepend the survivors to the prompt.
  3. Retrieval recall is the ceiling — invest in chunking and reranking before bigger models.
  4. Extend context (YaRN/RoPE) only with a short long-context fine-tune; raw interpolation degrades quality.
python
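from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

emb = SentenceTransformer("BAAI/bge-base-en-v1.5")          # Echograph (768-dim vectors)
qc = QdrantClient(":memory:")                               # Memvault store
chunks = ["To reset billing, open Settings > Billing."]     # placeholder; load your real chunks
qc.create_collection("kb", vectors_config=VectorParams(size=768, distance=Distance.COSINE))
qc.upsert("kb", points=[PointStruct(id=i, vector=emb.encode(c).tolist(), payload={"text": c})
                        for i, c in enumerate(chunks)])
# at query time (query_points is the current API; older clients used qc.search):
hits = qc.query_points("kb", query=emb.encode("how do I reset billing?").tolist(), limit=20).points
# rerank top-20 with a cross-encoder (Graftsieve), keep top-4, prepend to prompt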
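
The retrieve, rerank, prepend flow from step 2 can be sketched end to end without any model downloads. Everything here is a stand-in: the 2-dimensional vectors replace real embeddings, `token_overlap` is a hypothetical placeholder for a cross-encoder score, and the prompt template is an assumption.

```python
import math
import re

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k):
    """index is a list of (text, vector); return the k texts closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def token_overlap(query, passage):
    """Hypothetical stand-in for a cross-encoder: count shared lowercase tokens."""
    tokens = lambda s: set(re.findall(r"\w+", s.lower()))
    return float(len(tokens(query) & tokens(passage)))

def rerank(query, candidates, score, keep):
    """Re-score each (query, passage) pair jointly and keep the best few."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:keep]

def build_prompt(query, contexts):
    """Prepend the surviving passages to the user question."""
    block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Answer using only the context below.\n\nContext:\n{block}\n\nQuestion: {query}"

index = [
    ("Reset billing from Settings > Billing.", [1.0, 0.0]),
    ("Rotate API keys under Security.", [0.0, 1.0]),
    ("Billing invoices are emailed monthly.", [0.8, 0.6]),
]
query = "how do I reset billing?"
candidates = retrieve([1.0, 0.1], index, k=2)          # wide, cheap first pass
survivors = rerank(query, candidates, token_overlap, keep=1)
prompt = build_prompt(query, survivors)
print(prompt)
```

The first pass is deliberately wider than the final context (k=20 then top-4 in the text above): the reranker can only promote passages that retrieval already surfaced, which is why recall is the ceiling.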
Kynetra Studio

Studio › Knowledge — connect sources, build the index, tune retrieval/rerank, attach to the model.

08

Serve

07 · Inference & Serving Architecture ↗

Tokens per dollar — expose the model behind a fast, OpenAI-compatible API.

  1. Serve the AWQ build on vLLM — you get PagedAttention + continuous batching for free.
  2. Add a small draft model for speculative decoding if latency matters and outputs are low-temperature.
  3. For local/offline, run the GGUF build via Ollama or llama.cpp server.
bash
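# GPU production (OpenAI-compatible at :8000/v1). vLLM reads the quantization method
# from the checkpoint config, so llm-compressor output needs no --quantization flag;
# pass --quantization awq_marlin only for AutoAWQ-format checkpoints.
vllm serve cym-awq --max-model-len 8192 --port 8000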

# Local / edge
ollama create cym -f Modelfile   # FROM ./cym-Q4_K_M.gguf
ollama run cym
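
Because the endpoint speaks the OpenAI chat-completions protocol, any HTTP client can call it. A minimal standard-library sketch follows, assuming the server from the `vllm serve` command is listening on localhost:8000 and the model is served under the name `cym-awq`; the helper names and the system prompt are illustrative.

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000/v1",
                       model: str = "cym-awq",
                       temperature: float = 0.2) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": "You are a support/ops assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(req: urllib.request.Request) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

req = build_chat_request("How do I reset billing?")
print(req.get_method(), req.full_url)
# With the server running:  print(chat(req))
```

The same request works against Ollama's OpenAI-compatible route by changing `base_url` and `model`, which keeps client code identical across the GPU and edge builds.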
Kynetra Studio

Studio › Deploy — promote an artifact to a managed endpoint (scale, keys, rate limits, logs).

09

Evaluate

09 · Evaluation, Observability & Governance ↗

If you can't measure it, you can't ship it. Score before and after every change.

OSS: lm-evaluation-harness (academic benchmarks) · RAGAS (RAG faithfulness) · DeepEval (CI gates) + LLM-as-judge
  1. Run standardized benchmarks against the served endpoint for a comparable baseline.
  2. Use an LLM judge with rubric + position-swap for task-specific quality; RAGAS for retrieval faithfulness.
  3. Wire DeepEval into CI so regressions block releases; monitor production drift continuously.
bash
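# Benchmark the live model through the OpenAI-compatible completions endpoint
# (base_url must be the full /v1/completions URL; API backends ignore --batch_size auto)
lm_eval --model local-completions \
  --model_args base_url=http://localhost:8000/v1/completions,model=cym-awq,num_concurrent=4 \
  --tasks mmlu,gsm8k,ifeval --batch_size 8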

# RAG faithfulness / answer-relevancy with RAGAS, judge quality with DeepEval in CI.
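
The position-swap rule from step 2 is simple to state in code: ask the judge twice with the answers in both orders and count a win only when the verdicts agree. In this sketch the judge is any callable with the assumed signature `(prompt, first, second) -> "first" | "second" | "tie"`; the two toy judges are stand-ins for a real LLM call.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]   # returns "first", "second", or "tie"

def pairwise_verdict(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie"; a win must survive swapping the positions."""
    v1 = judge(prompt, a, b)   # A shown first
    v2 = judge(prompt, b, a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"               # inconsistent verdicts count for neither side

def win_rate(judge: Judge, items: list[tuple[str, str, str]]) -> float:
    """Share of (prompt, a, b) items where A wins consistently."""
    wins = sum(pairwise_verdict(judge, p, a, b) == "A" for p, a, b in items)
    return wins / len(items)

# Toy judges for illustration only.
def longer_is_better(prompt, first, second):
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

def always_first(prompt, first, second):
    return "first"             # a purely position-biased judge

items = [("q1", "a detailed answer", "short"), ("q2", "ok", "a longer reply")]
print(win_rate(longer_is_better, items))   # content-based judge
print(win_rate(always_first, items))       # position bias is neutralized
```

A position-biased judge scores every comparison as a tie under this rule, so its bias cannot inflate either model's win rate.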
Kynetra Studio

Studio › Evaluate — benchmark suites, judge rubrics, RAG metrics, and a drift dashboard on the live endpoint.

10

Govern

10 · Safety, Sovereignty & Orchestration ↗

Train on your terms, behind your walls — guardrails, sharding, sovereign deploy.

OSS: Llama Guard / guardrail classifiers · FSDP2 (TorchTune) · DeepSpeed ZeRO · resumable checkpoint orchestration
  1. Wrap the endpoint with input/output guardrails and jailbreak/prompt-injection defenses.
  2. For big training, shard with FSDP/ZeRO-3 across GPUs; checkpoint with RNG + data-position state so runs resume exactly.
  3. For regulated/sovereign work, run the whole pipeline air-gapped against a mirrored artifact registry.
bash
# Guardrail in front of the endpoint (Llama Guard pattern): classify -> allow/deny -> model
# Sharded training when you outgrow one GPU (TorchTune FSDP2):
tune run --nproc_per_node 8 full_finetune_distributed --config qwen2_5/7B_full
# Anchorpoint: enable resumable checkpoints (weights + optimizer + RNG + dataloader position).
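
Why a checkpoint needs RNG and data-position state, not just weights, can be shown with a toy loop. In this sketch a single float stands in for the model, `rng.gauss` stands in for stochastic layers such as dropout, and the checkpoint dictionary layout is an assumption; real runs also save optimizer state and save per rank.

```python
import pickle
import random

def train(data, total_steps, state=None, stop_at=None):
    """Toy training loop that can resume exactly from a saved state."""
    rng = random.Random()
    if state is None:
        rng.seed(0)
        state = {"step": 0, "weight": 0.0, "order": None, "pos": 0}
    else:
        rng.setstate(state["rng"])            # restore the random stream
    step, w = state["step"], state["weight"]
    order, pos = state["order"], state["pos"]

    while step < total_steps:
        if order is None or pos == len(order):   # new epoch: reshuffle
            order = list(range(len(data)))
            rng.shuffle(order)
            pos = 0
        x = data[order[pos]]
        w += 0.1 * (x - w) + rng.gauss(0, 0.01)  # noisy update
        pos += 1
        step += 1
        if step == stop_at:                      # simulate a preemption
            break
    return {"step": step, "weight": w, "order": order, "pos": pos,
            "rng": rng.getstate()}

data = [1.0, 2.0, 3.0, 4.0, 5.0]
uninterrupted = train(data, total_steps=10)

checkpoint = pickle.dumps(train(data, total_steps=10, stop_at=4))  # "crash" at step 4
resumed = train(data, total_steps=10, state=pickle.loads(checkpoint))

print(resumed["weight"] == uninterrupted["weight"])
```

Dropping either the RNG state or the shuffle order and position from the checkpoint makes the resumed run diverge from the uninterrupted one, even though both start from the same weights.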
Kynetra Studio

Studio › Govern — policies & guardrails, sovereign/air-gapped deploy targets, lineage and audit trail.

Recipe card · Kynetra-CYM-7B

base      Qwen2.5-7B-Instruct (Apache-2.0)
curate    HF datasets + text-dedup (MinHash) + distilabel synth
adapt     QLoRA r=16 a=32   (Unsloth + TRL SFTTrainer)
align     DPO beta=0.1      (TRL DPOTrainer)
compress  AWQ (vLLM)   +   GGUF Q4_K_M (local)
augment   BGE embeddings + Qdrant + cross-encoder rerank
serve     vLLM  (OpenAI-compatible :8000/v1)
evaluate  lm-eval-harness + RAGAS + DeepEval (CI)
govern    guardrails + FSDP2 + resumable checkpoints
