PR #330

open

Non-record: 11L Int6 + Online Logit Bias (val_bpb=1.1609)

by bopmite
val_bpb: 1.1609
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,977,633 bytes

Training Techniques

Quantization
  • int6 (bits: 6; scope: all weights, per-row)
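The int6 entry above (6 bits, per-row over all weights) can be sketched as symmetric per-row quantization. The function names and the symmetric-range choice are assumptions for illustration, not taken from the PR:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization: each row gets its own scale.

    A symmetric 6-bit range is [-31, 31]. Returns integer codes plus the
    per-row scales needed to dequantize.
    """
    qmax = 2 ** (6 - 1) - 1                   # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)
```

Per-row scales keep large-magnitude rows from inflating the quantization error of small-magnitude ones; the rounding error per element stays within half a scale step.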
Architecture
  • MLP3x: 3x MLP with 1536 hidden size (hidden_size: 1536)
  • GQA: grouped-query attention with 8 query / 4 KV heads (query_heads: 8, kv_heads: 4)
  • Tied embeddings: input and output embeddings are tied
  • SmearGate: custom gating mechanism used in the model
  • BigramHash: bigram hash feature module (size: 2048x128)
  • RoPE: rotary positional embeddings with NTK scaling (sequence_length: 2048)
  • Partial RoPE: applies RoPE to only part of the dimensions (dimensions: 16/64)
  • XSA: XSA applied on the last 4 layers (layers: 4)
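The 8/4-head GQA above maps each key/value head to a group of query heads. A minimal sketch of the score computation, assuming consecutive query heads share a KV head (shapes and the function name are illustrative):

```python
import numpy as np

def gqa_scores(q, k, query_heads=8, kv_heads=4):
    """Grouped-query attention scores (sketch).

    q: (query_heads, T, d), k: (kv_heads, T, d). Each KV head serves
    query_heads // kv_heads consecutive query heads, so K is repeated
    along the head axis before the usual scaled dot-product.
    """
    group = query_heads // kv_heads              # 2 query heads per KV head
    k_exp = np.repeat(k, group, axis=0)          # (query_heads, T, d)
    d = q.shape[-1]
    return q @ k_exp.transpose(0, 2, 1) / np.sqrt(d)  # (query_heads, T, T)

q = np.random.randn(8, 16, 32)
k = np.random.randn(4, 16, 32)
scores = gqa_scores(q, k)
```

Halving the KV heads halves the KV cache while leaving the query-side capacity at 8 heads, which is the usual motivation for GQA in small-artifact submissions like this one.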
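Partial RoPE (dimensions: 16/64) rotates only 16 of the 64 head dimensions and passes the rest through unchanged. A sketch under the common half-split rotation convention; the base frequency 10000 and the split layout are assumptions not stated in the PR:

```python
import numpy as np

def partial_rope(x, rot_dims=16):
    """Apply RoPE to the first `rot_dims` dimensions only.

    x: (T, d) with d >= rot_dims; rot_dims must be even. The remaining
    d - rot_dims dimensions are returned untouched.
    """
    T, _ = x.shape
    half = rot_dims // 2
    inv_freq = 1.0 / (10000 ** (np.arange(half) / half))  # rotation frequencies
    ang = np.outer(np.arange(T), inv_freq)                # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.randn(8, 64)
y = partial_rope(x)
```

At position 0 the rotation angle is zero, so the output equals the input there; only later positions pick up the position-dependent rotation on the first 16 dims.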
Initialization
  • OrthoInit: orthogonal initialization combined with muP
Optimizer
  • Muon (weight_decay: 0.04; momentum: null; other_params: null)
Weight Averaging
  • EMA (decay: 0.997)
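The EMA entry (decay 0.997) maintains an exponential moving average of the weights alongside the trained ones. A minimal sketch over a dict of arrays; the function name and dict layout are hypothetical:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.

    Applied after each optimizer step; the averaged weights, not the raw
    ones, are typically used for evaluation.
    """
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in params}

params = {"w": np.ones(3)}
avg = {"w": np.zeros(3)}
for _ in range(100):
    avg = ema_update(avg, params)
```

With constant targets the average converges geometrically: after t steps from zero toward 1 it sits at 1 - 0.997**t.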
Regularization
  • LN Scale
  • Weight decay (value: 0.04)
Evaluation
  • Sliding window eval (stride: 64)
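Sliding-window evaluation with stride 64 re-scores the text in overlapping 2048-token windows so each token sees long left context. The PR does not spell out the scoring policy; this sketch follows the common convention of scoring only tokens not covered by an earlier window:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (begin, end, n_scored) spans for sliding-window evaluation.

    Windows advance by `stride`; within each window, only the tokens not
    already scored by an earlier window contribute to the loss, so every
    token is scored exactly once with up to `window - stride` tokens of
    extra left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(4096)
```

A small stride like 64 trades compute (many overlapping forward passes) for near-maximal context per scored token, which usually lowers measured bpb.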
Test-Time Training
  • Online logit bias (learning_rate: 0.1, enabled: false)
Sequence Length
  • train_length: 2048; eval_length: null
Compression
  • zstd (level: null)

Novel Contributions

  • Online logit bias (OLB) evaluation technique that updates a per-token bias vector during sliding-window evaluation using the exact cross-entropy gradient
  • Int6 per-row quantized model with zstd compression
  • Sliding-window evaluation with stride 64
  • Custom 11-layer architecture with SmearGate, BigramHash, XSA, partial RoPE, and tied embeddings
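The first bullet describes the core of OLB: a per-token bias added to the logits and updated with the exact cross-entropy gradient during sliding-window evaluation. A minimal sketch of one update step, assuming a vocabulary-sized bias and plain SGD at the listed learning rate of 0.1 (function names hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def olb_step(logits, target, bias, lr=0.1):
    """One online-logit-bias update during evaluation.

    The bias (one entry per vocabulary token) is added to the model's
    logits; after scoring a position, it is updated with the exact
    cross-entropy gradient w.r.t. the bias:
        d(loss)/d(bias) = softmax(logits + bias) - onehot(target).
    Returns the per-position loss and the updated bias.
    """
    p = softmax(logits + bias)
    loss = -np.log(p[target])
    grad = p.copy()
    grad[target] -= 1.0          # exact CE gradient w.r.t. the bias
    return loss, bias - lr * grad

rng = np.random.default_rng(0)
vocab = 32
bias = np.zeros(vocab)
logits = rng.normal(size=vocab)
losses = []
for _ in range(50):              # repeated targets drive their bias up
    loss, bias = olb_step(logits, 3, bias)
    losses.append(loss)
```

Because the gradient is available in closed form, the update costs one softmax per position and adapts the evaluator to local token statistics without touching the model weights.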