PR #433

Status: open

EBLS Learned Sharing (10min/16MB)

by Robby955
val_bpb
1.3441
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,224,826 bytes

Training Techniques

Architecture
weight tying
Empirical Bayes Layer Sharing: a stack of 3 shared transformer blocks is applied 3 times to create 9 effective virtual layers, with per-virtual-layer LoRA deviations gated by learned shrinkage factors.
parameters: {"shared_blocks":3,"virtual_layers":9,"lora_rank":8}
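A minimal sketch of the EBLS weight construction, assuming the scheme described above: 3 shared blocks reused cyclically as 9 virtual layers, each adding a rank-8 LoRA deviation scaled by a learned shrinkage factor gamma_i. All dimensions, initializations, and the cyclic block-to-layer mapping are illustrative assumptions, not taken from the PR's code.

```python
# Sketch of Empirical Bayes Layer Sharing (EBLS). Assumed scheme: virtual
# layer i reuses shared block (i mod 3) and adds gamma_i * (B_i @ A_i),
# a rank-8 LoRA deviation gated by a learned shrinkage factor gamma_i.
import random

random.seed(0)
D, RANK, SHARED, VIRTUAL = 16, 8, 3, 9

def matmul(A, B):
    """Plain-Python matrix multiply, adequate for this toy sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

# One shared weight matrix per shared block (a real block holds several).
W_shared = [rand_matrix(D, D) for _ in range(SHARED)]
# Per-virtual-layer LoRA factors and shrinkage gates (stand-ins for
# learned values; in training gamma_i would be optimized).
A = [rand_matrix(RANK, D) for _ in range(VIRTUAL)]
B = [rand_matrix(D, RANK) for _ in range(VIRTUAL)]
gamma = [0.0 if i % SHARED == 0 else 0.5 for i in range(VIRTUAL)]

def effective_weight(i):
    """W_i = W_shared[i mod 3] + gamma_i * (B_i @ A_i)."""
    base = W_shared[i % SHARED]
    delta = matmul(B[i], A[i])
    return [[base[r][c] + gamma[i] * delta[r][c] for c in range(D)]
            for r in range(D)]

# When gamma_i shrinks to 0, the virtual layer collapses exactly onto
# the shared weights -- this is what makes the sharing "automatic".
assert effective_weight(0) == W_shared[0]
```

The appeal for the 16MB budget: only the 3 shared blocks are stored at full size, while the 9 deviations cost just rank-8 factors plus one scalar gate each.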
SmearGate
Custom gating mechanism included as part of the architecture.
parameters: null
BigramHash
Bigram hashing feature with a hash space of 10,240 buckets.
parameters: {"size":10240}
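A sketch of a bigram-hash feature under the common construction: hash the (previous token, current token) pair into a fixed table of 10240 buckets, then use the bucket id to index an auxiliary embedding table. The hashing constants below are illustrative; the PR may use a different hash function.

```python
# Bigram-hash sketch: map each token bigram to one of 10240 buckets.
# The multiplicative-mix constants are illustrative, not from the PR.
HASH_SIZE = 10240

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Map a (prev, cur) token pair to a bucket in [0, HASH_SIZE)."""
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h & 0xFFFFFFFF) % HASH_SIZE

# Each position would look up an extra embedding at its bucket id.
buckets = [bigram_bucket(p, c) for p, c in [(0, 1), (5, 7), (4093, 17)]]
assert all(0 <= b < HASH_SIZE for b in buckets)
```

A 10,240-entry table gives the model cheap access to bigram statistics without storing a full vocab-squared bigram matrix, at the cost of hash collisions.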
MLP3x
Uses a 3x expansion MLP with ReLU² activation.
parameters: {"expansion":3}
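The MLP variant above can be sketched directly: hidden width is 3x the model dimension (rather than the usual 4x) and the activation is ReLU squared. Dimensions and initialization here are illustrative.

```python
# Sketch of a 3x-expansion MLP with ReLU^2 activation.
import random

random.seed(1)
D = 8
HIDDEN = 3 * D  # 3x expansion instead of the conventional 4x

W1 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(D)]

def relu2(v):
    """ReLU^2 activation: max(0, v) squared."""
    return max(0.0, v) ** 2

def mlp(x):
    h = [relu2(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

y = mlp([1.0] * D)
assert len(y) == D
```

The narrower 3x expansion trades a little capacity for a smaller parameter count, which matters under a 16MB artifact budget.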
KV head count
Grouped-query attention with 16 query heads and 4 key/value heads.
parameters: {"q_heads":16,"kv_heads":4}
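The grouped-query layout implied by these counts: 16 query heads share 4 key/value heads, so each group of 4 consecutive query heads reads the same KV head. The consecutive-grouping convention below is the usual one; the PR could group differently.

```python
# Grouped-query attention head mapping: 16 query heads, 4 KV heads.
Q_HEADS, KV_HEADS = 16, 4
GROUP = Q_HEADS // KV_HEADS  # 4 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """KV head serving a given query head (consecutive grouping assumed)."""
    return q_head // GROUP

assert [kv_head_for(q) for q in range(Q_HEADS)] == [
    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
]
```

Sharing KV projections across query-head groups cuts KV parameter and cache size to a quarter of full multi-head attention.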
U-Net skip connections
Adds U-Net style skip connections to the transformer blocks.
parameters: null
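A wiring sketch of U-Net style skips across a transformer stack, assuming the common pattern: activations from the first half of the layers are saved and added back into the mirrored layers of the second half. The block bodies here are trivial placeholders purely to show the routing; the PR's blocks are real transformer layers, and its exact skip placement is not specified.

```python
# U-Net skip wiring across a 6-layer stack (placeholder block bodies).
N_LAYERS = 6

def block(i, x):
    return x + 1.0  # stand-in for a full transformer block

def forward(x):
    skips = []
    for i in range(N_LAYERS // 2):            # first half: record outputs
        x = block(i, x)
        skips.append(x)
    for i in range(N_LAYERS // 2, N_LAYERS):  # second half: add mirrored skip
        x = block(i, x + skips.pop())
    return x

# Layer 3 receives layer 2's output, layer 4 layer 1's, layer 5 layer 0's.
assert forward(0.0) == 12.0
```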
Quantization
STE QAT
bits: 6
scope: all
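Quantization-aware training with a straight-through estimator can be sketched as follows: the forward pass snaps weights to the 64-level int6 grid, while the backward pass treats the quantizer as the identity. This is shown without an autograd framework, since the gradient rule is just "pass through unchanged"; the scale handling is illustrative.

```python
# Sketch of 6-bit STE QAT: fake-quantize forward, identity backward.
LEVELS = 2 ** 6  # int6 grid: integer levels in [-32, 31]

def quantize6(w: float, scale: float) -> float:
    """Forward pass: round to the int6 grid, clip, then dequantize."""
    q = max(-LEVELS // 2, min(LEVELS // 2 - 1, round(w / scale)))
    return q * scale

def ste_grad(upstream: float) -> float:
    """Backward pass: straight-through estimator, gradient unchanged."""
    return upstream

scale = 0.01
assert abs(quantize6(0.123, scale) - 0.12) < 1e-9  # snapped to grid
assert abs(quantize6(9.0, scale) - 0.31) < 1e-9    # clipped to top level
```

Training against the quantized forward pass lets the network adapt to int6 precision before the weights are packed for the 16MB artifact.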
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adam_used_for":"LoRA, embeddings, scalars"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"LoRA, embeddings, scalars"}
Weight Averaging
SWA
parameters: null
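Stochastic weight averaging amounts to keeping a running mean of checkpoint weights late in training and shipping the mean as the final model. A minimal sketch of the standard incremental update (checkpoint cadence and window are not specified in the PR):

```python
# SWA sketch: incremental mean over a sequence of weight checkpoints.
def swa_update(avg, w, n_averaged):
    """Fold checkpoint w into the running average of n_averaged snapshots."""
    return [(a * n_averaged + x) / (n_averaged + 1) for a, x in zip(avg, w)]

avg = None
for n, ckpt in enumerate([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]):
    avg = ckpt if n == 0 else swa_update(avg, ckpt, n)

assert avg == [3.0, 4.0]  # mean of the three checkpoints
```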
Compression
zstd
level: 22
Other
other
Empirical Bayes Layer Sharing with learned shrinkage factors gamma_i to automatically determine how much each virtual layer deviates from shared weights.
parameters: {"shrinkage_gated_lora_rank":8}

Novel Contributions

  • Empirical Bayes Layer Sharing (EBLS) with learned shrinkage factors for automatic layer sharing
  • 3 shared transformer blocks reused as 9 effective virtual layers
  • Per-virtual-layer rank-8 LoRA deviations gated by learned gamma shrinkage
  • Evidence that MLP layers can be fully shared while attention specializes only minimally in early layers
  • Combination of SmearGate, BigramHash, and U-Net skip connections in a compact transformer
  • Int6 STE QAT with zstd-22 compression to fit the 16MB budget