PR #1620

open

Submit Lim Shiaw Yong: 1.66 BPB 12MB Squeeze Architecture

by shiawyonglim
val_bpb
1.6644
Architecture
Transformer
Optimizer
Artifact Size
11.74 MB

Training Techniques

Architecture
depth recurrence
Physically instantiates 6 unique Transformer blocks and routes data through them in a palindrome loop to simulate 12 logical layers.
parameters: {"unique_blocks":6,"logical_layers":12}
parallel residuals
Computes attention and MLP branches simultaneously and injects them into the residual stream together to help gradient flow.
parameters: null
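
One common way to realise this parallel-residual layout (GPT-J/PaLM-style blocks): attention and MLP read the same normalized input and their outputs are summed into the residual stream in a single update. The module choices and dimensions below are illustrative, not taken from the submission.

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP branches computed from the same input, added together."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Both branches are injected into the residual stream in one step,
        # so gradients reach the input through two parallel shortcut paths.
        return x + attn_out + self.mlp(h)
```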
Test-Time Training
full TTT
parameters: {"micro_batching":true}
Quantization
QAT
bits: 6
scope: all
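
A minimal fake-quantization sketch of 6-bit QAT applied to a weight tensor; the per-tensor symmetric scheme and the straight-through estimator are assumptions, since the entry only specifies bits=6 and scope=all.

```python
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                          # 31 for 6-bit signed
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through: forward uses quantized values, backward sees identity.
    return w + (q - w).detach()

class QATLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to 6 bits during training."""
    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight, bits=6), self.bias)
```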
Compression
zlib
level: null
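
Since the compression level is not reported, the sketch below only shows the packaging step: serializing the (quantized) checkpoint and running it through zlib at the library default. File names are placeholders.

```python
import io
import zlib
import torch

def compress_checkpoint(model, path="artifact.bin.zlib"):
    """Serialize the state dict and zlib-compress it (compression level unspecified)."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    compressed = zlib.compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed) / 1e6                        # artifact size in MB

def load_checkpoint(model, path="artifact.bin.zlib"):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    model.load_state_dict(torch.load(io.BytesIO(raw)))
    return model
```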
Sequence Length
sequence_length
train_length: 65536
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}

Novel Contributions

  • Symmetrical modulo routing with palindrome depth recurrence
  • Parallel residual computation for improved gradient flow
  • TTT micro-batching during evaluation
  • 6-bit quantization-aware training for artifact size reduction