PR #691

Status: open

PR #414 + 30-Epoch Cosine TTT (1.0988 BPB)

val_bpb: 1.0988
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,900,191 bytes

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
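For illustration, symmetric round-to-nearest quantization onto the signed int6 grid looks like the sketch below. GPTQ-lite additionally compensates rounding error as it quantizes, which this sketch omits; only the 6-bit grid comes from the record.

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest quantization onto the signed int6 grid.

    Illustrative only: GPTQ-lite also applies error compensation while
    rounding; this shows just the 6-bit target grid ({"bits": 6}).
    """
    qmax = 2 ** (6 - 1) - 1                       # 31, largest signed int6
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    return [v * scale for v in q]
```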
Architecture
SmearGate
Gating mechanism added to the PR #414 stack
parameters: null
BigramHash
Hash-based bigram feature component with 2048 buckets
parameters: {"buckets":2048}
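A minimal sketch of the bucketed bigram idea: each (previous, current) token-id pair hashes into one of 2048 feature buckets. The record fixes only the bucket count; the mixing constant and BOS handling below are hypothetical.

```python
NUM_BUCKETS = 2048  # from parameters: {"buckets": 2048}

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    # The multiplier is a hypothetical mixing constant; the PR specifies
    # only the bucket count, not the hash function.
    return (prev_id * 1_000_003 + cur_id) % NUM_BUCKETS

def bigram_buckets(token_ids, bos_id=0):
    """One bucket index per position; position 0 pairs with a BOS id."""
    prev, out = bos_id, []
    for t in token_ids:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```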
XSA
Applied XSA in the last 4 layers
parameters: {"layers":4}
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
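With 8 query heads over 4 KV heads, grouped-query attention assigns each KV head to a contiguous pair of query heads; the mapping can be sketched as:

```python
HEADS, KV_HEADS = 8, 4          # from {"heads": 8, "kv_heads": 4}
GROUP = HEADS // KV_HEADS       # 2 query heads share each K/V projection

def kv_head_for(query_head: int) -> int:
    """Grouped-query attention: query head h attends using KV head h // GROUP."""
    return query_head // GROUP
```

Relative to full multi-head attention, this halves the K/V projections and the KV cache while keeping all 8 query heads.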
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
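The EMA half of the weight averaging, sketched with the reported decay of 0.997 (whether updates are applied per step or per epoch is not stated in the record):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]
```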
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
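Stride-64 sliding-window evaluation can be sketched as below: each window advances 64 tokens and only the not-yet-scored suffix is scored, so every token is scored exactly once with long left context. The window length of 1024 is a hypothetical choice; the record fixes only the stride.

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    """Yield (start, end, score_from): feed tokens[start:end] to the model,
    but score only positions [score_from, end)."""
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
```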
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunk_tokens":32768,"learning_rate":0.002}
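The score-first protocol in runnable form: each validation chunk is scored under the current weights before any gradient steps are taken on it (3 epochs per chunk in the reported run), so no chunk's score ever reflects training on that same chunk. `score` and `train` stand in for the real loss and update functions.

```python
def score_first_ttt(chunks, score, train, epochs=3):
    """Score chunk i, *then* adapt on it, then move to chunk i + 1."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # evaluated with pre-update weights
        for _ in range(epochs):       # TTT updates on the just-scored chunk
            train(chunk)
    return losses
```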
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"base_lr":0.0005,"per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5,"others":1}}
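The per-layer LR grouping amounts to scaling the base LR of 5e-4 by a per-name multiplier; matching groups by substring on parameter names is an assumption here, taken from the reported `per_layer_lr_groups`.

```python
BASE_LR = 5e-4                                 # from {"base_lr": 0.0005}
LR_MULT = {"mlp.proj": 3.0, "mlp.fc": 0.5}     # all other params: 1.0

def lr_for(param_name: str) -> float:
    """Learning rate for one parameter tensor under the group multipliers."""
    for key, mult in LR_MULT.items():
        if key in param_name:                  # substring match is assumed
            return BASE_LR * mult
    return BASE_LR
```

With torch this would translate into per-parameter-group `lr` entries passed to the AdamW constructor.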
LR Schedule
cosine decay
parameters: {"epochs":30}
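The 30-epoch cosine schedule, assuming decay from the base LR to zero with no warmup (both assumptions; the record says only "cosine decay" over 30 epochs):

```python
import math

def cosine_lr(epoch, base_lr=5e-4, total_epochs=30):
    """LR at the start of `epoch`, decaying from base_lr to 0 over 30 epochs."""
    t = min(epoch, total_epochs) / total_epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```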
Regularization
gradient clipping
parameters: {"clip_norm":1}
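Gradient clipping at clip_norm = 1 rescales the whole gradient when its global L2 norm exceeds 1, sketched here over a flat gradient vector:

```python
import math

def clip_global_norm(grads, clip_norm=1.0):
    """Scale gradients so their global L2 norm is at most clip_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= clip_norm:
        return list(grads)
    scale = clip_norm / total
    return [g * scale for g in grads]
```

In torch the equivalent operation is `torch.nn.utils.clip_grad_norm_`.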

Novel Contributions

  • 30-epoch cosine pre-eval test-time training on the PR #414 consensus stack
  • Legal score-first TTT protocol that scores each validation chunk before training on it
  • Per-layer learning-rate grouping during TTT
  • Sliding-window evaluation with stride 64 after TTT
  • Use of GPTQ-lite int6 quantization with zstd-22 compression