PR #545
Closed
Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179)
by EthanYangTW
val_bpb: 1.1179
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.53 MB
Training Techniques
Quantization
- GPTQ: 5 bits, all weights
- QAT: 5 bits, all weights
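The QAT entry above can be illustrated with a minimal fake-quantization sketch. This is an assumption-laden illustration, not the submission's code: it shows naive symmetric per-row int5 rounding, whereas the actual pipeline additionally uses GPTQ error compensation (see Novel Contributions).

```python
import numpy as np

def fake_quant_int5(w: np.ndarray) -> np.ndarray:
    """Symmetric per-row int5 fake quantization (illustrative only; the
    submission layers GPTQ error compensation on top of plain rounding)."""
    qmax = 2 ** (5 - 1) - 1                                   # int5 -> [-15, 15]
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)             # snap to integer grid
    return q * scale                                          # dequantize for QAT pass
```

In QAT, a function like this runs in the forward pass while gradients flow through to the full-precision weights (straight-through estimator).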
Architecture
- XSA: applied to all layers; parameters: {"layers":11}
- BigramHash: hashed bigram token feature module; parameters: {"dimensions":8192}
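A hypothetical sketch of a BigramHash feature: hash each (previous, current) token pair into one of 8192 buckets (the listed "dimensions") and look up a learned embedding row per bucket. The exact hash function and padding convention are assumptions, not taken from the PR.

```python
import numpy as np

TABLE_SIZE = 8192  # matches {"dimensions": 8192} above

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative mixing hash; the submission's hash is unspecified.
    return (prev_tok * 1_000_003 + cur_tok) % TABLE_SIZE

def bigram_features(tokens: list, table: np.ndarray) -> np.ndarray:
    # table: (TABLE_SIZE, d). Position 0 uses a padding predecessor of 0.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]
```

The hashed table keeps the bigram feature's parameter count fixed at 8192 rows regardless of vocabulary size, which matters under a 16 MB artifact budget.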
- Partial RoPE: partial rotary positional embeddings; parameters: {"train_length":null,"eval_length":null}
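Partial RoPE rotates only a prefix of each head's dimensions and leaves the rest position-independent. The rotated fraction is not stated in the PR, so `rot_dims` below is a free parameter in this sketch.

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first rot_dims (even) dims of x: (seq, head_dim);
    the remaining dims pass through unrotated."""
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]
    freqs = pos / base ** (np.arange(half) / half)     # (seq, half) angles
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[:, :half], x[:, half:rot_dims]          # paired channels
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```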
- KV head count: 8 attention heads / 8 KV heads; parameters: {"heads":8,"kv_heads":8}
- MLP3.5x: MLP width expanded to 3.5x the hidden size; parameters: {"hidden_size":512,"mlp_size":1792}
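The listed widths check out: 1792 = 3.5 × 512. A minimal sketch of such an MLP block follows; a plain tanh-approximated GELU is assumed, since the PR does not state the activation or any gating.

```python
import numpy as np

D_MODEL, D_MLP = 512, 1792  # 1792 = 3.5 * 512, per the listed parameters

def mlp(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    h = x @ w_in                                                  # (n, 512) -> (n, 1792)
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU
    return h @ w_out                                              # (n, 1792) -> (n, 512)
```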
Optimizer
- AdamW: weight_decay: 0, momentum: null, other_params: {"learning_rate":0.0001}
Weight Averaging
- EMA: parameters: {"decay":0.997}
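EMA weight averaging with the listed decay of 0.997 maintains a shadow copy of the weights that is blended after every optimizer step and used for evaluation. A minimal sketch (the dict-of-arrays representation is illustrative):

```python
def ema_update(shadow: dict, model: dict, decay: float = 0.997) -> None:
    """Blend live model weights into the EMA shadow copy in place."""
    for name, w in model.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * w
```

With decay 0.997, the shadow averages over an effective window of roughly 1/(1-0.997) ≈ 333 recent steps.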
Compression
- zstd: level 22
Evaluation
- sliding window eval: parameters: {"stride":32}
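Sliding-window evaluation scores a long sequence with a fixed context window advanced by the listed stride of 32, counting loss only on the newly exposed tokens so each token is scored once with near-maximal left context. A sketch of the window arithmetic (the window length is an assumed example, not from the PR):

```python
def sliding_windows(n_tokens: int, window: int = 512, stride: int = 32):
    """Yield (start, end, n_scored) triples covering every token exactly once."""
    pos = 0
    while pos < n_tokens:
        # First window scores all its tokens; later windows score only `stride` new ones.
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        start = max(0, end - window)   # slide the left edge to keep full context
        yield start, end, end - pos
        pos = end
```

Smaller strides give each token more context at the cost of proportionally more forward passes.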
Test-Time Training
- score-first TTT: parameters: {"learning_rate":0.0001,"chunk_tokens":131072,"freeze_blocks":2,"optimizer":"AdamW"}
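The "score-first" ordering is what makes this TTT legal per the Novel Contributions: each 131072-token chunk is scored with the current weights before any gradient step touches it, so no token is ever predicted by a model that already trained on it. A sketch of the control flow (`score_fn`/`update_fn` are illustrative names; freezing the listed 2 blocks would live inside `update_fn`):

```python
def score_first_ttt(chunks, score_fn, update_fn, lr=1e-4):
    """Score each chunk BEFORE adapting on it; return mean per-token loss."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:                  # e.g. 131072 tokens per chunk
        loss, n = score_fn(chunk)         # 1) evaluate with current weights
        total_loss += loss * n
        total_tokens += n
        update_fn(chunk, lr)              # 2) only then take gradient steps
    return total_loss / total_tokens
```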
Initialization
- OrthoInit: used for model initialization
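"OrthoInit" presumably means orthogonal weight initialization; the standard recipe is QR decomposition of a Gaussian matrix. A sketch under that assumption:

```python
import numpy as np

def orthogonal_init(shape, rng=None):
    """Return a (rows, cols) matrix with orthonormal columns (or rows if rows < cols)."""
    if rng is None:
        rng = np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # fix QR sign ambiguity for a uniform draw
    return q[:rows, :cols] if rows >= cols else q[:cols, :rows].T
```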
Regularization
- layerwise LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
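The listed scale 1/sqrt(layer+1) shrinks each layer's LayerNorm gain with depth, damping the contribution of deeper blocks. A one-line sketch, assuming 0-indexed layers:

```python
import math

def ln_scale(layer: int) -> float:
    """LayerNorm gain multiplier for 0-indexed layer: 1/sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer + 1)
```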
LR Schedule
- cosine decay: parameters: {"across_chunks":true}
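With {"across_chunks": true}, one cosine schedule spans the entire token stream rather than restarting for each TTT chunk. A sketch, taking the peak rate from the 1e-4 listed above:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-4) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps, with total_steps
    counted across all chunks (one global schedule, no per-chunk restart)."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))
```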
Novel Contributions
- First submission to achieve int5 quantization on a 33.6M model within the artifact size limit
- GPTQ error compensation enabling near-lossless int5 quantization
- Legal score-first test-time training, in which tokens are scored before any gradient update on them
- 33.6M-parameter architecture with full attention and BigramHash fitting under the 16 MB artifact limit