PR #228

Status: open

Record: 10-Layer 4xMLP (val_bpb: 1.4444)

val_bpb: 1.4444
Architecture: Transformer
Optimizer:
Artifact Size: 14.68 MB

Training Techniques

Architecture: 10-layer 4xMLP
  Expanded the standard 9-layer architecture to 10 layers and increased the MLP multiplier from 2x to 4x.
  parameters: {"layers":10,"mlp_multiplier":4}
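As a rough illustration of what the two changes do to parameter count, here is a sketch; only layers=10 and mlp_multiplier=4 come from the PR, while the hidden width `d_model = 384` is an assumption for illustration.

```python
# Rough parameter-count sketch for the layer/MLP changes. The hidden
# width is assumed; only the layer count and MLP multiplier are from
# the PR. Embeddings and norms are ignored for simplicity.
d_model = 384      # assumed hidden width
n_layers = 10      # PR: expanded from 9 layers
mlp_mult = 4       # PR: raised from 2x

def layer_params(d: int, mult: int) -> int:
    attn = 4 * d * d          # Q, K, V, and output projections
    mlp = 2 * d * (mult * d)  # up- and down-projection matrices
    return attn + mlp

old = 9 * layer_params(d_model, 2)          # baseline: 9 layers, 2x MLP
new = n_layers * layer_params(d_model, mlp_mult)
print(old, new)
```

Under this assumed width, the block-stack parameter count grows by roughly two thirds, which is why the quantization and compression steps below matter for the artifact budget.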
Quantization: int8
  bits: 8
  scope: all weights
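A minimal sketch of the per-row int8 post-training quantization named in the contributions below; the absmax scale choice is the standard one, but the exact scheme used in the PR is an assumption.

```python
import numpy as np

# Per-row int8 post-training quantization sketch: each weight row gets
# its own scale so that its absmax maps to 127. The scale choice is an
# assumption; the PR only states int8, per-row, all weights.
np.random.seed(0)

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()   # worst-case rounding error
```

The worst-case per-element error is half a quantization step, i.e. each row's absmax divided by 254.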
Compression: zlib
  level: null
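The compression step is a lossless zlib round trip over the quantized weight bytes; `level: null` is read here as the library default, which is an interpretation rather than something the PR states.

```python
import zlib
import numpy as np

# Sketch: zlib round trip over an int8 weight blob at the default
# compression level ("level: null" is taken to mean unspecified/default,
# an assumption). zlib is lossless, so the artifact decompresses to the
# exact original bytes.
np.random.seed(0)
w = np.clip(np.round(np.random.randn(256, 256) * 20), -127, 127).astype(np.int8)

blob = zlib.compress(w.tobytes())    # default level
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
ratio = len(blob) / w.nbytes         # <1 when the weight bytes are redundant
```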
Evaluation: sliding window eval
  parameters: {"overlapping":true}
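An overlapping sliding-window evaluation re-feeds already-scored tokens as context but only counts each token's loss once. The loop below sketches that bookkeeping; `model_nll` is a hypothetical stand-in for the real model's per-token negative log-likelihood.

```python
# Overlapping sliding-window evaluation sketch: each window of length
# `window` advances by `stride`, re-feeding up to window - stride tokens
# of context, but only the not-yet-scored positions contribute to the
# total, so every token is counted exactly once. `model_nll` is a
# hypothetical per-position NLL function, not part of the PR.
def sliding_window_nll(tokens, model_nll, window=8, stride=4):
    total, counted = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = start + stride
        chunk = tokens[max(0, end - window):end]  # context + new tokens
        nlls = model_nll(chunk)                   # one NLL per position
        n_new = len(tokens[start:end])            # positions to score now
        total += sum(nlls[-n_new:])
        counted += n_new
    return total / counted

# toy model: a constant 1.0 nats per token, so the average must be 1.0
avg = sliding_window_nll(list(range(10)), lambda c: [1.0] * len(c))
```

Compared with a non-overlapping chunked eval, this gives every scored token close to a full window of context, which typically lowers the measured bits per byte.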
Test-Time Training: LoRA TTT
  parameters: {"batched":true}
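A toy sketch of the batched LoRA test-time-training idea: the base weight stays frozen and only a low-rank update B @ A is fitted on a batch of test-time examples. The least-squares objective, shapes, and learning rate are illustrative assumptions, not details from the PR.

```python
import numpy as np

# Toy batched LoRA test-time training: W is frozen; only the low-rank
# factors A and B are updated on one batch of test-time data. Objective,
# dimensions, and learning rate are assumptions for illustration.
np.random.seed(0)
d, r, n = 16, 2, 32
W = np.random.randn(d, d) * 0.1             # frozen base weight
A = np.random.randn(r, d) * 0.1             # trainable low-rank factor
B = np.zeros((d, r))                        # LoRA convention: B starts at 0

x = np.random.randn(n, d)                   # one batch of test-time inputs
y = x @ (W + np.random.randn(d, d) * 0.05)  # behaviour to adapt toward

def loss(B, A):
    return np.mean((x @ (W + B @ A) - y) ** 2)

initial = loss(B, A)
lr = 0.1
for _ in range(200):
    err = (x @ (W + B @ A) - y) / n  # batch-averaged residual
    gB = (x.T @ err) @ A.T           # gradients touch only A and B;
    gA = B.T @ (x.T @ err)           # W itself stays frozen
    B -= lr * gB
    A -= lr * gA
final = loss(B, A)
```

Because B is initialized to zero, the adapted model starts out identical to the frozen one, and the batched update then pulls W + B @ A toward the test-time data.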

Novel Contributions

  • Expanded the baseline architecture from 9 layers to 10 layers
  • Increased the MLP multiplier from 2x to 4x
  • Used standard INT8 per-row post-training quantization
  • Applied zlib compression to fit within the 16MB limit
  • Evaluated with an overlapping sliding window
  • Used batched LoRA test-time training