PR #433

Status: open

EBLS Learned Sharing (10min/16MB)

by Robby955
val_bpb
1.3441
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,224,826 bytes

Training Techniques

Architecture
weight tying
Empirical Bayes Layer Sharing: a stack of 3 shared transformer blocks is applied 3 times to create 9 effective virtual layers, with per-virtual-layer LoRA deviations gated by learned shrinkage factors.
parameters: {"shared_blocks":3,"virtual_layers":9,"lora_rank":8}
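A minimal sketch of the EBLS weight construction, assuming the scheme described above: 3 shared blocks reused cyclically as 9 virtual layers, each adding a rank-8 LoRA deviation scaled by a learned shrinkage factor gamma_i. All dimensions, initializations, and the cyclic block-to-layer mapping are illustrative assumptions, not taken from the PR's code.

```python
# Sketch of Empirical Bayes Layer Sharing (EBLS). Assumed scheme: virtual
# layer i reuses shared block (i mod 3) and adds gamma_i * (B_i @ A_i),
# a rank-8 LoRA deviation gated by a learned shrinkage factor gamma_i.
import random

random.seed(0)
D, RANK, SHARED, VIRTUAL = 16, 8, 3, 9

def matmul(A, B):
    """Plain-Python matrix multiply, adequate for this toy sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

# One shared weight matrix per shared block (a real block holds several).
W_shared = [rand_matrix(D, D) for _ in range(SHARED)]
# Per-virtual-layer LoRA factors and shrinkage gates (stand-ins for
# learned values; in training gamma_i would be optimized).
A = [rand_matrix(RANK, D) for _ in range(VIRTUAL)]
B = [rand_matrix(D, RANK) for _ in range(VIRTUAL)]
gamma = [0.0 if i % SHARED == 0 else 0.5 for i in range(VIRTUAL)]

def effective_weight(i):
    """W_i = W_shared[i mod 3] + gamma_i * (B_i @ A_i)."""
    base = W_shared[i % SHARED]
    delta = matmul(B[i], A[i])
    return [[base[r][c] + gamma[i] * delta[r][c] for c in range(D)]
            for r in range(D)]

# When gamma_i shrinks to 0, the virtual layer collapses exactly onto
# the shared weights -- this is what makes the sharing "automatic".
assert effective_weight(0) == W_shared[0]
```

The appeal for the 16MB budget: only the 3 shared blocks are stored at full size, while the 9 deviations cost just rank-8 factors plus one scalar gate each.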
SmearGate
Custom gating mechanism included as part of the architecture.
parameters: null
BigramHash
Bigram hashing feature with a hash space of 10,240 buckets.
parameters: {"size":10240}
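A sketch of a bigram-hash feature under the common construction: hash the (previous token, current token) pair into a fixed table of 10240 buckets, then use the bucket id to index an auxiliary embedding table. The hashing constants below are illustrative; the PR may use a different hash function.

```python
# Bigram-hash sketch: map each token bigram to one of 10240 buckets.
# The multiplicative-mix constants are illustrative, not from the PR.
HASH_SIZE = 10240

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Map a (prev, cur) token pair to a bucket in [0, HASH_SIZE)."""
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h & 0xFFFFFFFF) % HASH_SIZE

# Each position would look up an extra embedding at its bucket id.
buckets = [bigram_bucket(p, c) for p, c in [(0, 1), (5, 7), (4093, 17)]]
assert all(0 <= b < HASH_SIZE for b in buckets)
```

A 10,240-entry table gives the model cheap access to bigram statistics without storing a full vocab-squared bigram matrix, at the cost of hash collisions.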
MLP3x
Uses a 3x expansion MLP with ReLU² activation.
parameters: {"expansion":3}
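The MLP variant above can be sketched directly: hidden width is 3x the model dimension (rather than the usual 4x) and the activation is ReLU squared. Dimensions and initialization here are illustrative.

```python
# Sketch of a 3x-expansion MLP with ReLU^2 activation.
import random

random.seed(1)
D = 8
HIDDEN = 3 * D  # 3x expansion instead of the conventional 4x

W1 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(D)]

def relu2(v):
    """ReLU^2 activation: max(0, v) squared."""
    return max(0.0, v) ** 2

def mlp(x):
    h = [relu2(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

y = mlp([1.0] * D)
assert len(y) == D
```

The narrower 3x expansion trades a little capacity for a smaller parameter count, which matters under a 16MB artifact budget.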
KV head count
Grouped-query attention with 16 query heads and 4 key/value heads.
parameters: {"q_heads":16,"kv_heads":4}
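The grouped-query layout implied by these counts: 16 query heads share 4 key/value heads, so each group of 4 consecutive query heads reads the same KV head. The consecutive-grouping convention below is the usual one; the PR could group differently.

```python
# Grouped-query attention head mapping: 16 query heads, 4 KV heads.
Q_HEADS, KV_HEADS = 16, 4
GROUP = Q_HEADS // KV_HEADS  # 4 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """KV head serving a given query head (consecutive grouping assumed)."""
    return q_head // GROUP

assert [kv_head_for(q) for q in range(Q_HEADS)] == [
    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
]
```

Sharing KV projections across query-head groups cuts KV parameter and cache size to a quarter of full multi-head attention.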
U-Net skip connections
Adds U-Net style skip connections to the transformer blocks.
parameters: null
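A wiring sketch of U-Net style skips across a transformer stack, assuming the common pattern: activations from the first half of the layers are saved and added back into the mirrored layers of the second half. The block bodies here are trivial placeholders purely to show the routing; the PR's blocks are real transformer layers, and its exact skip placement is not specified.

```python
# U-Net skip wiring across a 6-layer stack (placeholder block bodies).
N_LAYERS = 6

def block(i, x):
    return x + 1.0  # stand-in for a full transformer block

def forward(x):
    skips = []
    for i in range(N_LAYERS // 2):            # first half: record outputs
        x = block(i, x)
        skips.append(x)
    for i in range(N_LAYERS // 2, N_LAYERS):  # second half: add mirrored skip
        x = block(i, x + skips.pop())
    return x

# Layer 3 receives layer 2's output, layer 4 layer 1's, layer 5 layer 0's.
assert forward(0.0) == 12.0
```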
Quantization
STE QAT
bits: 6
scope: all
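Quantization-aware training with a straight-through estimator can be sketched as follows: the forward pass snaps weights to the 64-level int6 grid, while the backward pass treats the quantizer as the identity. This is shown without an autograd framework, since the gradient rule is just "pass through unchanged"; the scale handling is illustrative.

```python
# Sketch of 6-bit STE QAT: fake-quantize forward, identity backward.
LEVELS = 2 ** 6  # int6 grid: integer levels in [-32, 31]

def quantize6(w: float, scale: float) -> float:
    """Forward pass: round to the int6 grid, clip, then dequantize."""
    q = max(-LEVELS // 2, min(LEVELS // 2 - 1, round(w / scale)))
    return q * scale

def ste_grad(upstream: float) -> float:
    """Backward pass: straight-through estimator, gradient unchanged."""
    return upstream

scale = 0.01
assert abs(quantize6(0.123, scale) - 0.12) < 1e-9  # snapped to grid
assert abs(quantize6(9.0, scale) - 0.31) < 1e-9    # clipped to top level
```

Training against the quantized forward pass lets the network adapt to int6 precision before the weights are packed for the 16MB artifact.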
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adam_used_for":"LoRA, embeddings, scalars"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"LoRA, embeddings, scalars"}
Weight Averaging
SWA
parameters: null
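Stochastic weight averaging amounts to keeping a running mean of checkpoint weights late in training and shipping the mean as the final model. A minimal sketch of the standard incremental update (checkpoint cadence and window are not specified in the PR):

```python
# SWA sketch: incremental mean over a sequence of weight checkpoints.
def swa_update(avg, w, n_averaged):
    """Fold checkpoint w into the running average of n_averaged snapshots."""
    return [(a * n_averaged + x) / (n_averaged + 1) for a, x in zip(avg, w)]

avg = None
for n, ckpt in enumerate([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]):
    avg = ckpt if n == 0 else swa_update(avg, ckpt, n)

assert avg == [3.0, 4.0]  # mean of the three checkpoints
```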
Compression
zstd
level: 22
Other
other
Empirical Bayes Layer Sharing with learned shrinkage factors gamma_i to automatically determine how much each virtual layer deviates from shared weights.
parameters: {"shrinkage_gated_lora_rank":8}

Novel Contributions

  • Empirical Bayes Layer Sharing (EBLS) with learned shrinkage factors for automatic layer sharing
  • 3 shared transformer blocks reused as 9 effective virtual layers
  • Per-virtual-layer rank-8 LoRA deviations gated by learned gamma shrinkage
  • Evidence that MLP layers can be fully shared while attention specializes only minimally in early layers
  • Combination of SmearGate, BigramHash, and U-Net skip connections in a compact transformer
  • Int6 STE QAT with zstd-22 compression to fit the 16MB budget