PR #211

open

Add WaveletWeightedWidenet submission directory with README and metadata

by dubthecat
val_bpb
1.1719
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,367,830 bytes (~14.7MB)

Training Techniques

Architecture
tied embeddings
Uses tied FP16 token embeddings.
parameters: null
KV head count
12-layer transformer with 8 attention heads and 4 KV heads (GQA).
parameters: {"layers":12,"heads":8,"kv_heads":4,"dim":512}
encoder-decoder skip connections
U-Net-style skip connections between the first 6 (encoder) layers and the last 6 (decoder) layers.
parameters: {"layers":12}
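The skip wiring can be sketched as pairing encoder layer i with decoder layer 11-i. A minimal scalar sketch, assuming the skip is combined by simple addition (the PR does not state the combination rule):

```python
def forward_with_skips(x, layers):
    """Run `layers` (a list of callables) with U-Net-style skips:
    the output of encoder layer i is added to the input of the
    mirrored decoder layer (len-1-i). Pure-Python scalar sketch."""
    n = len(layers)
    half = n // 2
    saved = []
    for i in range(half):            # encoder half: save activations
        x = layers[i](x)
        saved.append(x)
    for i in range(half, n):         # decoder half: add the mirrored skip
        x = layers[i](x + saved[n - 1 - i])
    return x
```

With 12 identity-plus-one layers this pairs layers (0,11), (1,10), ..., (5,6) as described above.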
phase-transition residual mixing
Uses sigmoid-scheduled residual mixing per layer.
parameters: null
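One plausible reading of sigmoid-scheduled residual mixing is a per-layer mixing coefficient that follows a sigmoid over depth, so early layers mix the block output in weakly and late layers strongly. Whether the schedule runs over depth or training step is not specified in the PR, so the sketch below is an assumption:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def residual_mix(x, fx, layer, layers=12, sharpness=1.0):
    """Mix residual stream x with block output fx using a
    sigmoid-scheduled per-layer coefficient (illustrative only;
    the 'phase transition' may instead be over training steps)."""
    t = layer - (layers - 1) / 2.0       # center the schedule mid-stack
    alpha = sigmoid(sharpness * t)       # ~0 for first layers, ~1 for last
    return x + alpha * fx
```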
logit softcap
Applies logit softcap at 30.0.
parameters: {"softcap":30}
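Logit softcapping at 30.0 is conventionally a scaled tanh, which bounds logits to (-30, 30) while staying near-identity for small values. A minimal sketch (function name is illustrative, not from the submission code):

```python
import math

def softcap_logits(logits, cap=30.0):
    """Soft-cap logits to (-cap, cap) via cap * tanh(x / cap).
    cap=30.0 matches the softcap parameter listed above."""
    return [cap * math.tanh(x / cap) for x in logits]
```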
Quantization
ternary VQ
bits: 1
scope: MLP
int8
bits: 8
scope: attention
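The hybrid scheme above quantizes MLP weights to ternary values and attention weights to int8. A sketch of both, assuming a threshold-based ternarizer (the 0.7 threshold ratio is a common choice from the ternary-weight literature, not stated in the PR) and symmetric per-tensor int8:

```python
def ternarize(weights, threshold_ratio=0.7):
    """Quantize weights to codes in {-1, 0, +1} plus one FP scale,
    i.e. roughly 1 bit/param after entropy coding. threshold_ratio
    is an assumption, not from the submission."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    t = threshold_ratio * mean_abs
    codes = [0 if abs(w) < t else (1 if w > 0 else -1) for w in weights]
    nz = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(nz) / len(nz) if nz else 0.0
    return codes, scale

def int8_quantize(weights):
    """Symmetric per-tensor int8 quantization (attention scope above).
    Assumes at least one nonzero weight."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale
```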
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"warmup_momentum":0.85,"warmup_steps":500}
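The Muon momentum warmup (0.85 to 0.95 over 500 steps, per other_params) can be sketched as a simple interpolation; linear interpolation is an assumption, the PR only gives the endpoints:

```python
def muon_momentum(step, warmup_steps=500, start=0.85, end=0.95):
    """Momentum warmup from warmup_momentum to the final momentum
    over warmup_steps (values from other_params; the linear shape
    is an assumption)."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```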
Adam
weight_decay: 0.01
momentum: null
other_params: {"parameter_groups":["token embedding","ternary MLP weights","scalar/control params"],"learning_rates":{"token_embedding":0.6,"ternary_mlp_weights":0.02,"scalar_control_params":0.04}}
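The four-way optimizer split can be sketched as routing parameters into groups by name. The name substrings below are illustrative assumptions; the learning rates and weight decay are the values stated above (Muon for the remaining matrices, Adam for the three listed groups):

```python
def build_param_groups(named_params):
    """Route (name, param) pairs into the four optimizer groups.
    Name-substring matching is a hypothetical routing rule."""
    groups = {
        "muon_matrices":  {"opt": "Muon", "momentum": 0.95, "params": []},
        "token_embedding": {"opt": "Adam", "lr": 0.6, "params": []},
        "ternary_mlp":    {"opt": "Adam", "lr": 0.02, "weight_decay": 0.01, "params": []},
        "scalar_control": {"opt": "Adam", "lr": 0.04, "params": []},
    }
    for name, p in named_params:
        if "embed" in name:
            groups["token_embedding"]["params"].append(p)
        elif "mlp" in name:
            groups["ternary_mlp"]["params"].append(p)
        elif "scale" in name or "gate" in name:
            groups["scalar_control"]["params"].append(p)
        else:  # remaining matrices (e.g. attention) go to Muon
            groups["muon_matrices"]["params"].append(p)
    return groups
```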
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
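Sliding-window evaluation with stride 64 typically advances a fixed context window in 64-token steps and scores only the newly exposed tokens, so every token beyond the first window gets near-full context. A sketch of the window/score-span bookkeeping, assuming the window equals the 1024-token train length:

```python
def sliding_window_targets(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans: each window scores only
    tokens from score_from to end; earlier positions are context.
    The first window scores everything. Illustrative sketch."""
    spans = [(0, min(window, n_tokens), 0)]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((end - window, end, pos))
        pos = end
    return spans
```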
Test-Time Training
LoRA TTT
parameters: {"rank":8,"targets":["Q","V","LM head"],"chunk_size":256,"batch_size":64}
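LoRA adapts a frozen weight matrix W by adding a trainable low-rank product A@B (rank 8 here, on Q, V, and the LM head). A dependency-free sketch of the weight update; the test-time training loop itself is omitted:

```python
def matmul(X, Y):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def apply_lora(W, A, B):
    """Effective weight W + A@B, where A is (d x r) and B is (r x d):
    the rank r inner dimension (8 in this submission) is what TTT
    trains while W stays frozen. Illustrative sketch."""
    delta = matmul(A, B)
    return [[w + dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]
```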
Initialization
spectral init
Tied FP16 embeddings use overtone spectral initialization.
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":2500}
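A warmup-then-warmdown schedule is commonly trapezoidal: linear warmup over 20 steps, a constant plateau, then linear decay to zero over the final 2500 iterations. The trapezoid shape and the total step count are assumptions; only the two parameters above are given:

```python
def lr_schedule(step, total_steps, base_lr=1.0,
                warmup_steps=20, warmdown_iters=2500):
    """Trapezoidal LR: linear warmup, constant plateau, linear
    warmdown to 0 over the last warmdown_iters steps (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr
```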
Regularization
weight decay
parameters: {"value":0.01,"applied_to":"ternary weights"}
Other
other
Straight-Through Estimator used for ternary weight quantization during training.
parameters: null
other
ReLU² (squared ReLU) activation between ternary linear layers.
parameters: null
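Both "Other" items are simple in isolation: relu² squares the positive part of the input, and the Straight-Through Estimator forwards quantized weights but backpropagates to the latent FP weights as if quantization were the identity. A minimal sketch; the gradient clipping radius is an assumption (a common STE variant), not stated in the PR:

```python
def relu2(x):
    """relu² activation: max(0, x) squared."""
    return max(0.0, x) ** 2

def ste_backward(grad_out, latent_w, clip=1.0):
    """Straight-Through Estimator gradient for a quantized weight:
    pass grad_out through to the latent FP weight, zeroed outside
    a clip radius (clip=1.0 is an assumed variant)."""
    return grad_out if abs(latent_w) <= clip else 0.0
```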

Novel Contributions

  • 12-layer transformer with wider MLPs and U-Net style encoder-decoder skip connections
  • Ternary MLP weights compressed with vector quantization to about 1 bit per parameter
  • Hybrid compression scheme combining ternary VQ for MLPs and int8 for attention layers
  • Four-way optimizer split across embeddings, attention, ternary MLP weights, and scalar/control parameters
  • Sliding-window evaluation and optional TTT LoRA adaptation