PR #1438 (open)

Non-record: Mixed-Int6 LZMA9 B3072 Warm5000

by sabdulmajid
val_bpb: 1.2029
Architecture: Transformer
Optimizer:
Artifact Size: 15,991,188 bytes

Training Techniques

Weight Averaging: EMA
parameters: {"decay":0.997}
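The EMA entry uses the standard exponential-moving-average update on the weights. A minimal sketch, with plain Python dicts of float lists standing in for the model's tensors (the decay value comes from the submission's parameters; everything else here is illustrative):

```python
DECAY = 0.997  # from the submission's {"decay": 0.997}

def ema_update(ema, params, decay=DECAY):
    """In-place EMA update: ema <- decay * ema + (1 - decay) * params."""
    for name, p in params.items():
        e = ema[name]
        for i, v in enumerate(p):
            e[i] = decay * e[i] + (1.0 - decay) * v
    return ema
```

After each optimizer step the shadow weights drift slowly toward the live weights; the EMA copy is what gets exported for evaluation.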
Architecture: XSA (applied to the last 4 layers)
parameters: {"layers":4}
Architecture: BigramHash (bigram hash embedding stack)
parameters: {"vocab_size":3072,"dimensions":128}
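A bigram hash embedding maps each adjacent token pair into a fixed number of buckets and looks up a learned vector per bucket. The bucket count (3072) and dimension (128) come from the parameters above; the hash function, the BOS handling, and how the result is combined with the unigram embedding are not specified in the submission, so those parts of this sketch are assumptions:

```python
import numpy as np

VOCAB_BUCKETS = 3072   # {"vocab_size": 3072}
DIM = 128              # {"dimensions": 128}

rng = np.random.default_rng(0)
# Stand-in for a learned table; in training this would be a parameter.
bigram_table = rng.standard_normal((VOCAB_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Assumed multiplicative hash of the (previous, current) token pair;
    # the submission does not specify the exact hash.
    return ((prev_tok * 1000003) ^ tok) % VOCAB_BUCKETS

def bigram_embed(tokens):
    """Return one 128-d hashed-bigram embedding per position."""
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    prev = 0  # assume a BOS id of 0 for the first position
    for i, t in enumerate(tokens):
        out[i] = bigram_table[bigram_bucket(prev, t)]
        prev = t
    return out
```

Hashing keeps the table small (3072 rows instead of a full vocab-squared bigram table) at the cost of collisions between unrelated bigrams.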
Architecture: LeakyReLU (squared LeakyReLU activation)
parameters: {"squared":true,"slope":0.5}
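One plausible reading of "LeakyReLU squared" with slope 0.5 is an elementwise square of the LeakyReLU output; note this makes the negative branch non-negative, and the submission does not say whether the sign is preserved instead, so treat this as a sketch of one interpretation:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU followed by an elementwise square:
    #   x > 0:  x**2
    #   x <= 0: (slope * x)**2
    y = x if x > 0 else slope * x
    return y * y
```

With slope 0.5, an input of -2.0 maps to -1.0 under LeakyReLU and then to 1.0 after squaring.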
Quantization: mixed int6 (bits: 6, scope: mlp;attn;embed)
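The int6 export covers the MLP, attention, and embedding weights. A common way to do this is symmetric per-tensor quantization into the 6-bit range [-32, 31]; the submission says "mixed" without detailing how tensors differ, so this sketch shows only the basic per-tensor scheme:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization sketch (values in [-32, 31])."""
    peak = np.abs(w).max()
    scale = peak / 31.0 if peak > 0 else 1.0
    # 6-bit codes stored in int8; the extra 2 bits would be packed away
    # in a real artifact to hit the size cap.
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2) per element.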
Compression: lzma (level: 9)
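The final artifact is LZMA-compressed at the maximum preset. In Python's standard library this is `lzma.compress` with `preset=9`; the "extreme" variant mentioned in the contributions adds the `PRESET_EXTREME` flag. Function names here are illustrative:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # Preset 9 with the extreme flag trades compression time for ratio,
    # which matters when squeezing under a 16MB artifact cap.
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
```

Decompression needs no preset; `lzma.decompress` recovers the original bytes.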
Evaluation: sliding window eval
parameters: {"seed":42}
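Sliding-window evaluation typically scores a long sequence in strides, giving each scored chunk up to a full window of left context. The submission only records a seed, so the window and stride values below are illustrative, not the submission's settings:

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 512):
    """Yield (context_start, score_start, score_end) index triples so that
    every token is scored exactly once with up to `window` tokens of context."""
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start + stride - window)
        spans.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return spans
```

Each span is fed to the model as `tokens[context_start:score_end]`, with the loss accumulated only over `[score_start, score_end)`.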
Sequence Length: train_length: 2048, eval_length: null
LR Schedule: warmdown
parameters: {"warmdown_steps":5000}
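A warmdown schedule usually holds the learning rate constant and then decays it linearly to zero over the final `warmdown_steps`. The 5000-step warmdown and 16000 total steps come from the submission's parameters; the base learning rate and the linear decay shape are assumptions:

```python
def lr_at(step: int, base_lr: float = 1e-3,
          total_steps: int = 16000, warmdown_steps: int = 5000) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps  # step 11000 here
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```

Halfway through the warmdown (step 13500) the rate is half the base value, reaching zero at step 16000.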
Other: long single-GPU unlimited-compute training run
parameters: {"steps":16000,"training_time_seconds":57979.039}

Novel Contributions

  • Non-record unlimited-compute 16MB submission
  • Longer single-GPU training of the EMA + XSA(last-4) + BigramHash3072 + LeakyReLU^2 flat-transformer stack
  • Broad mixed-int6 export over mlp;attn;embed
  • LZMA9 extreme compression to fit the 16MB artifact cap
  • Preserved raw checkpoint and re-exported it independently for evaluation