PR #691

Status: open

PR #414 + 30-Epoch Cosine TTT (1.0988 BPB)

val_bpb: 1.0988
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,900,191 bytes

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
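For illustration, symmetric round-to-nearest quantization onto the signed int6 grid looks like the sketch below. GPTQ-lite additionally compensates rounding error as it quantizes, which this sketch omits; only the 6-bit grid comes from the record.

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest quantization onto the signed int6 grid.

    Illustrative only: GPTQ-lite also applies error compensation while
    rounding; this shows just the 6-bit target grid ({"bits": 6}).
    """
    qmax = 2 ** (6 - 1) - 1                       # 31, largest signed int6
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    return [v * scale for v in q]
```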
Architecture
SmearGate
Gating mechanism added to the PR #414 stack
parameters: null
BigramHash
Hash-based bigram feature component with 2048 buckets
parameters: {"buckets":2048}
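A minimal sketch of the bucketed bigram idea: each (previous, current) token-id pair hashes into one of 2048 feature buckets. The record fixes only the bucket count; the mixing constant and BOS handling below are hypothetical.

```python
NUM_BUCKETS = 2048  # from parameters: {"buckets": 2048}

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    # The multiplier is a hypothetical mixing constant; the PR specifies
    # only the bucket count, not the hash function.
    return (prev_id * 1_000_003 + cur_id) % NUM_BUCKETS

def bigram_buckets(token_ids, bos_id=0):
    """One bucket index per position; position 0 pairs with a BOS id."""
    prev, out = bos_id, []
    for t in token_ids:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```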
XSA
Applied XSA in the last 4 layers
parameters: {"layers":4}
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
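With 8 query heads over 4 KV heads, grouped-query attention assigns each KV head to a contiguous pair of query heads; the mapping can be sketched as:

```python
HEADS, KV_HEADS = 8, 4          # from {"heads": 8, "kv_heads": 4}
GROUP = HEADS // KV_HEADS       # 2 query heads share each K/V projection

def kv_head_for(query_head: int) -> int:
    """Grouped-query attention: query head h attends using KV head h // GROUP."""
    return query_head // GROUP
```

Relative to full multi-head attention, this halves the K/V projections and the KV cache while keeping all 8 query heads.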
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
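The EMA half of the weight averaging, sketched with the reported decay of 0.997 (whether updates are applied per step or per epoch is not stated in the record):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]
```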
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
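Stride-64 sliding-window evaluation can be sketched as below: each window advances 64 tokens and only the not-yet-scored suffix is scored, so every token is scored exactly once with long left context. The window length of 1024 is a hypothetical choice; the record fixes only the stride.

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    """Yield (start, end, score_from): feed tokens[start:end] to the model,
    but score only positions [score_from, end)."""
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
```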
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunk_tokens":32768,"learning_rate":0.002}
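The score-first protocol in runnable form: each validation chunk is scored under the current weights before any gradient steps are taken on it (3 epochs per chunk in the reported run), so no chunk's score ever reflects training on that same chunk. `score` and `train` stand in for the real loss and update functions.

```python
def score_first_ttt(chunks, score, train, epochs=3):
    """Score chunk i, *then* adapt on it, then move to chunk i + 1."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # evaluated with pre-update weights
        for _ in range(epochs):       # TTT updates on the just-scored chunk
            train(chunk)
    return losses
```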
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"base_lr":0.0005,"per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5,"others":1}}
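The per-layer LR grouping amounts to scaling the base LR of 5e-4 by a per-name multiplier; matching groups by substring on parameter names is an assumption here, taken from the reported `per_layer_lr_groups`.

```python
BASE_LR = 5e-4                                 # from {"base_lr": 0.0005}
LR_MULT = {"mlp.proj": 3.0, "mlp.fc": 0.5}     # all other params: 1.0

def lr_for(param_name: str) -> float:
    """Learning rate for one parameter tensor under the group multipliers."""
    for key, mult in LR_MULT.items():
        if key in param_name:                  # substring match is assumed
            return BASE_LR * mult
    return BASE_LR
```

With torch this would translate into per-parameter-group `lr` entries passed to the AdamW constructor.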
LR Schedule
cosine decay
parameters: {"epochs":30}
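The 30-epoch cosine schedule, assuming decay from the base LR to zero with no warmup (both assumptions; the record says only "cosine decay" over 30 epochs):

```python
import math

def cosine_lr(epoch, base_lr=5e-4, total_epochs=30):
    """LR at the start of `epoch`, decaying from base_lr to 0 over 30 epochs."""
    t = min(epoch, total_epochs) / total_epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```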
Regularization
gradient clipping
parameters: {"clip_norm":1}
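Gradient clipping at clip_norm = 1 rescales the whole gradient when its global L2 norm exceeds 1, sketched here over a flat gradient vector:

```python
import math

def clip_global_norm(grads, clip_norm=1.0):
    """Scale gradients so their global L2 norm is at most clip_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= clip_norm:
        return list(grads)
    scale = clip_norm / total
    return [g * scale for g in grads]
```

In torch the equivalent operation is `torch.nn.utils.clip_grad_norm_`.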

Novel Contributions

  • 30-epoch cosine pre-eval test-time training on the PR #414 consensus stack
  • Legal score-first TTT protocol that scores each validation chunk before training on it
  • Per-layer learning-rate grouping during TTT
  • Sliding-window evaluation with stride 64 after TTT
  • Use of GPTQ-lite int6 quantization with zstd-22 compression