val_bpb: 1.0988
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,900,191 bytes
Training Techniques

Quantization
- GPTQ-lite (bits: 6, scope: all)
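The record only states 6-bit quantization over all weights; the exact GPTQ-lite procedure is not documented. As a minimal sketch of what "bits: 6" implies for the storage format, here is plain symmetric round-to-nearest int6 quantization (GPTQ proper would add Hessian-based error compensation on top, which this omits):

```python
# Minimal sketch of symmetric round-to-nearest int6 quantization.
# This only illustrates the 6-bit code range implied by "bits: 6";
# the GPTQ-lite error-correction step is not reproduced here.

def quantize_int6(weights):
    """Quantize a list of floats to int6 codes plus one shared scale."""
    qmax = 2 ** (6 - 1) - 1          # 31: symmetric 6-bit range is [-31, 31]
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int6(codes, scale):
    """Recover approximate float weights from int6 codes."""
    return [c * scale for c in codes]

codes, scale = quantize_int6([0.12, -0.5, 0.31, 0.007])
restored = dequantize_int6(codes, scale)
```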
Architecture
- SmearGate: gating mechanism added in the PR #414 stack (parameters: none)
- BigramHash: hash-based bigram feature component with 2048 buckets
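The record specifies only the bucket count (2048) for the BigramHash component, not the hash function. A sketch of the general idea, with an illustrative multiplicative hash standing in for whatever the model actually uses: each adjacent token pair is hashed into one of 2048 buckets, yielding an extra feature index per position.

```python
# Hypothetical sketch of a hash-based bigram feature. Each (prev, current)
# token pair is hashed into one of 2048 buckets; the resulting id could
# index an auxiliary embedding table. The multiplicative hash constants
# here are illustrative, not taken from the model.

BUCKETS = 2048

def bigram_bucket(prev_token: int, token: int, buckets: int = BUCKETS) -> int:
    # Mix the pair with two odd 32-bit multipliers, then reduce mod table size.
    h = (prev_token * 0x9E3779B1 + token * 0x85EBCA77) & 0xFFFFFFFF
    return h % buckets

def bigram_features(tokens):
    """Return one bucket id per position (position 0 pairs with BOS id 0)."""
    prev = 0
    out = []
    for t in tokens:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```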
- XSA: applied in the last 4 layers
- KV head count: grouped-query attention with 8 attention heads and 4 KV heads
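The 8-head / 4-KV-head configuration means each KV head is shared by a group of 2 query heads, halving the KV cache relative to full multi-head attention. The index mapping a GQA implementation would use:

```python
# Grouped-query attention head layout from the record: 8 query heads
# share 4 KV heads, so each KV head serves a group of 2 query heads.

HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head: int) -> int:
    """KV head index used by a given query head."""
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]
```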
Weight Averaging
- EMA (decay: 0.997)
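The EMA entry records only the decay (0.997). The standard update it implies: after each optimizer step the shadow weights move a small fraction toward the live weights, and the shadow copy is what gets evaluated.

```python
# EMA weight averaging with the recorded decay of 0.997. Parameters are
# shown as flat lists of floats for simplicity.

DECAY = 0.997

def ema_update(shadow, live, decay=DECAY):
    """In-place EMA update: shadow <- decay * shadow + (1 - decay) * live."""
    for i, w in enumerate(live):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * w
    return shadow

shadow = ema_update([0.0, 0.0], [1.0, 2.0])
# shadow ~= [0.003, 0.006]
```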
- SWA (type: Tight SWA)
Compression
- zstd (level: 22)
Evaluation
- sliding window eval (stride: 64)
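Sliding-window evaluation advances the context by the stride and scores only the newly exposed tokens, so most tokens are evaluated with long left context. The window length is not given in the record; the sketch below treats it as a parameter (128 in the example is an assumption).

```python
# Sketch of sliding-window evaluation with stride 64. The context/window
# length is not recorded, so it is a parameter here. Each step the window
# advances by `stride` and only the tokens in [score_from, end) are scored,
# so every token is scored exactly once with near-maximal left context.

def sliding_windows(n_tokens, window, stride=64):
    """Yield (start, end, score_from) spans covering all n_tokens once."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        yield start, end, pos  # score tokens in [pos, end)
        pos = end
```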
Test-Time Training
- score-first TTT (epochs: 3, chunk_tokens: 32768, learning_rate: 0.002)
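The score-first protocol means each validation chunk is scored with the current weights before the model trains on it, so no chunk's reported loss benefits from having been trained on. A minimal loop sketch, with `score` and `train_step` as hypothetical stand-ins for model evaluation and an optimizer step:

```python
# Sketch of the score-first test-time-training loop: evaluate each chunk
# *before* adapting on it, then train on that same chunk for `epochs`
# passes. `score` and `train_step` are caller-supplied callables.

def score_first_ttt(chunks, score, train_step, epochs=3):
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # score with pre-adaptation weights
        for _ in range(epochs):       # then adapt on the same chunk
            train_step(chunk)
    return losses
```

With chunks of 32,768 tokens and 3 epochs per chunk as recorded, the loop above preserves the "legal" property: the loss appended for a chunk never reflects training on that chunk.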
Optimizer
- AdamW (weight_decay: 0, momentum: n/a, base_lr: 0.0005; per-layer LR multipliers: mlp.proj ×3, mlp.fc ×0.5, others ×1)
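The per-layer LR groups map parameter-name patterns to multipliers of the 5e-4 base rate: `mlp.proj` gets 3x, `mlp.fc` gets 0.5x, everything else 1x. A sketch of building such groups from parameter names (the names are illustrative; in a PyTorch setup these dicts would be passed to AdamW as param groups):

```python
# Sketch of the recorded per-layer learning-rate grouping: base LR 5e-4,
# with name-pattern multipliers mlp.proj -> 3x, mlp.fc -> 0.5x, others -> 1x.
# Parameter names are hypothetical examples.

BASE_LR = 5e-4
MULTIPLIERS = {"mlp.proj": 3.0, "mlp.fc": 0.5}

def lr_for(param_name: str) -> float:
    """Learning rate for one parameter, by substring match on its name."""
    for pattern, mult in MULTIPLIERS.items():
        if pattern in param_name:
            return BASE_LR * mult
    return BASE_LR  # "others": 1x

def build_groups(param_names):
    """Bucket parameter names into optimizer-style {"params", "lr"} groups."""
    groups = {}
    for name in param_names:
        groups.setdefault(lr_for(name), []).append(name)
    return [{"params": names, "lr": lr} for lr, names in groups.items()]
```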
LR Schedule
- cosine decay (epochs: 30)
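The record states only cosine decay over 30 epochs; whether warmup or a floor LR was used is not stated. The plain cosine form, decaying from the base rate to zero:

```python
# Plain cosine LR decay over 30 epochs, from base_lr down to 0.
# No warmup or minimum LR is assumed, since the record specifies neither.
import math

def cosine_lr(epoch: int, base_lr: float = 5e-4, total_epochs: int = 30) -> float:
    """LR at a given epoch under cosine decay: base_lr * (1 + cos(pi*t)) / 2."""
    progress = min(epoch, total_epochs) / total_epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```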
Regularization
- gradient clipping (clip_norm: 1)
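Gradient clipping at clip_norm = 1 is the standard global-norm form: if the joint L2 norm of all gradients exceeds 1, every gradient is scaled down by the same factor. A minimal sketch over a flat gradient list:

```python
# Global-norm gradient clipping at the recorded clip_norm of 1: gradients
# are rescaled jointly so their combined L2 norm never exceeds the bound.
import math

def clip_grad_norm(grads, clip_norm=1.0):
    """Return grads scaled (if needed) so their L2 norm is <= clip_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > clip_norm:
        scale = clip_norm / total
        grads = [g * scale for g in grads]
    return grads
```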
Novel Contributions
- 30-epoch cosine pre-eval test-time training on the PR #414 consensus stack
- Legal score-first TTT protocol that scores each validation chunk before training on it
- Per-layer learning-rate grouping during TTT
- Sliding-window evaluation with stride 64 after TTT
- Use of GPTQ-lite int6 quantization with zstd-22 compression