PR #211

open

Add WaveletWeightedWidenet submission directory with README and metadata

by dubthecat
val_bpb
1.1719
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,367,830 bytes (~14.7MB)

Training Techniques

Architecture
tied embeddings
Uses tied FP16 token embeddings.
parameters: null
KV head count
12-layer transformer with 8 attention heads and 4 KV heads (GQA).
parameters: {"layers":12,"heads":8,"kv_heads":4,"dim":512}
encoder-decoder skip connections
U-Net-style skip connections between the first 6 (encoder) layers and the last 6 (decoder) layers.
parameters: {"layers":12}
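The skip wiring can be sketched as pairing encoder layer i with decoder layer 11-i. A minimal scalar sketch, assuming the skip is combined by simple addition (the PR does not state the combination rule):

```python
def forward_with_skips(x, layers):
    """Run `layers` (a list of callables) with U-Net-style skips:
    the output of encoder layer i is added to the input of the
    mirrored decoder layer (len-1-i). Pure-Python scalar sketch."""
    n = len(layers)
    half = n // 2
    saved = []
    for i in range(half):            # encoder half: save activations
        x = layers[i](x)
        saved.append(x)
    for i in range(half, n):         # decoder half: add the mirrored skip
        x = layers[i](x + saved[n - 1 - i])
    return x
```

With 12 identity-plus-one layers this pairs layers (0,11), (1,10), ..., (5,6) as described above.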
phase-transition residual mixing
Uses sigmoid-scheduled residual mixing per layer.
parameters: null
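One plausible reading of sigmoid-scheduled residual mixing is a per-layer mixing coefficient that follows a sigmoid over depth, so early layers mix the block output in weakly and late layers strongly. Whether the schedule runs over depth or training step is not specified in the PR, so the sketch below is an assumption:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def residual_mix(x, fx, layer, layers=12, sharpness=1.0):
    """Mix residual stream x with block output fx using a
    sigmoid-scheduled per-layer coefficient (illustrative only;
    the 'phase transition' may instead be over training steps)."""
    t = layer - (layers - 1) / 2.0       # center the schedule mid-stack
    alpha = sigmoid(sharpness * t)       # ~0 for first layers, ~1 for last
    return x + alpha * fx
```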
logit softcap
Applies logit softcap at 30.0.
parameters: {"softcap":30}
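Logit softcapping at 30.0 is conventionally a scaled tanh, which bounds logits to (-30, 30) while staying near-identity for small values. A minimal sketch (function name is illustrative, not from the submission code):

```python
import math

def softcap_logits(logits, cap=30.0):
    """Soft-cap logits to (-cap, cap) via cap * tanh(x / cap).
    cap=30.0 matches the softcap parameter listed above."""
    return [cap * math.tanh(x / cap) for x in logits]
```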
Quantization
ternary VQ
bits: 1
scope: MLP
int8
bits: 8
scope: attention
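The hybrid scheme above quantizes MLP weights to ternary values and attention weights to int8. A sketch of both, assuming a threshold-based ternarizer (the 0.7 threshold ratio is a common choice from the ternary-weight literature, not stated in the PR) and symmetric per-tensor int8:

```python
def ternarize(weights, threshold_ratio=0.7):
    """Quantize weights to codes in {-1, 0, +1} plus one FP scale,
    i.e. roughly 1 bit/param after entropy coding. threshold_ratio
    is an assumption, not from the submission."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    t = threshold_ratio * mean_abs
    codes = [0 if abs(w) < t else (1 if w > 0 else -1) for w in weights]
    nz = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(nz) / len(nz) if nz else 0.0
    return codes, scale

def int8_quantize(weights):
    """Symmetric per-tensor int8 quantization (attention scope above).
    Assumes at least one nonzero weight."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale
```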
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"warmup_momentum":0.85,"warmup_steps":500}
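The Muon momentum warmup (0.85 to 0.95 over 500 steps, per other_params) can be sketched as a simple interpolation; linear interpolation is an assumption, the PR only gives the endpoints:

```python
def muon_momentum(step, warmup_steps=500, start=0.85, end=0.95):
    """Momentum warmup from warmup_momentum to the final momentum
    over warmup_steps (values from other_params; the linear shape
    is an assumption)."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```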
Adam
weight_decay: 0.01
momentum: null
other_params: {"parameter_groups":["token embedding","ternary MLP weights","scalar/control params"],"learning_rates":{"token_embedding":0.6,"ternary_mlp_weights":0.02,"scalar_control_params":0.04}}
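The four-way optimizer split can be sketched as routing parameters into groups by name. The name substrings below are illustrative assumptions; the learning rates and weight decay are the values stated above (Muon for the remaining matrices, Adam for the three listed groups):

```python
def build_param_groups(named_params):
    """Route (name, param) pairs into the four optimizer groups.
    Name-substring matching is a hypothetical routing rule."""
    groups = {
        "muon_matrices":  {"opt": "Muon", "momentum": 0.95, "params": []},
        "token_embedding": {"opt": "Adam", "lr": 0.6, "params": []},
        "ternary_mlp":    {"opt": "Adam", "lr": 0.02, "weight_decay": 0.01, "params": []},
        "scalar_control": {"opt": "Adam", "lr": 0.04, "params": []},
    }
    for name, p in named_params:
        if "embed" in name:
            groups["token_embedding"]["params"].append(p)
        elif "mlp" in name:
            groups["ternary_mlp"]["params"].append(p)
        elif "scale" in name or "gate" in name:
            groups["scalar_control"]["params"].append(p)
        else:  # remaining matrices (e.g. attention) go to Muon
            groups["muon_matrices"]["params"].append(p)
    return groups
```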
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
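Sliding-window evaluation with stride 64 typically advances a fixed context window in 64-token steps and scores only the newly exposed tokens, so every token beyond the first window gets near-full context. A sketch of the window/score-span bookkeeping, assuming the window equals the 1024-token train length:

```python
def sliding_window_targets(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans: each window scores only
    tokens from score_from to end; earlier positions are context.
    The first window scores everything. Illustrative sketch."""
    spans = [(0, min(window, n_tokens), 0)]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((end - window, end, pos))
        pos = end
    return spans
```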
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["Q","V","LM head"],"chunk_size":256,"batch_size":64}
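LoRA adapts a frozen weight matrix W by adding a trainable low-rank product A@B (rank 8 here, on Q, V, and the LM head). A dependency-free sketch of the weight update; the test-time training loop itself is omitted:

```python
def matmul(X, Y):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def apply_lora(W, A, B):
    """Effective weight W + A@B, where A is (d x r) and B is (r x d):
    the rank r inner dimension (8 in this submission) is what TTT
    trains while W stays frozen. Illustrative sketch."""
    delta = matmul(A, B)
    return [[w + dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]
```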
Initialization
spectral init
Tied FP16 embeddings use overtone spectral initialization.
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":2500}
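A warmup-then-warmdown schedule is commonly trapezoidal: linear warmup over 20 steps, a constant plateau, then linear decay to zero over the final 2500 iterations. The trapezoid shape and the total step count are assumptions; only the two parameters above are given:

```python
def lr_schedule(step, total_steps, base_lr=1.0,
                warmup_steps=20, warmdown_iters=2500):
    """Trapezoidal LR: linear warmup, constant plateau, linear
    warmdown to 0 over the last warmdown_iters steps (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr
```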
Regularization
weight decay
parameters: {"value":0.01,"applied_to":"ternary weights"}
Other
other
Straight-Through Estimator used for ternary weight quantization during training.
parameters: null
other
ReLU² (squared ReLU) activation between ternary linear layers.
parameters: null
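Both "Other" items are simple in isolation: relu² squares the positive part of the input, and the Straight-Through Estimator forwards quantized weights but backpropagates to the latent FP weights as if quantization were the identity. A minimal sketch; the gradient clipping radius is an assumption (a common STE variant), not stated in the PR:

```python
def relu2(x):
    """relu² activation: max(0, x) squared."""
    return max(0.0, x) ** 2

def ste_backward(grad_out, latent_w, clip=1.0):
    """Straight-Through Estimator gradient for a quantized weight:
    pass grad_out through to the latent FP weight, zeroed outside
    a clip radius (clip=1.0 is an assumed variant)."""
    return grad_out if abs(latent_w) <= clip else 0.0
```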

Novel Contributions

  • 12-layer transformer with wider MLPs and U-Net style encoder-decoder skip connections
  • Ternary MLP weights compressed with vector quantization to about 1 bit per parameter
  • Hybrid compression scheme combining ternary VQ for MLPs and int8 for attention layers
  • Four-way optimizer split across embeddings, attention, ternary MLP weights, and scalar/control parameters
  • Sliding-window evaluation and optional TTT LoRA adaptation