val_bpb: 1.1691
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.04 MB
Training Techniques
Architecture
Tensor-Train attention
TT/MPS decomposition applied to attention.c_q and attention.proj to compress their square 512x512 weight matrices.
parameters: {"layers":13,"d_model":512,"rank":8,"mode_shape":[8,8,8]}
BigramHash
Doubled bigram vocabulary for token features.
parameters: {"vocab":8192}
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
MLP3x
Expanded MLP hidden width to 3x the model dimension; a combined sketch appears after the LeakyReLU entry below.
parameters: {"multiplier":3}
U-Net skip connections
Encoder-decoder style skip connections with learnable skip weights.
parameters: {"encoder_layers":6,"decoder_layers":7}
SmearGate
Learnable temporal mixing on input embeddings.
parameters: null
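No parameters are reported; a minimal sketch of a per-channel "smear" gate that mixes each embedding with its predecessor (gate shape and init are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(d_model))  # sigmoid(0) = 0.5 mix

    def forward(self, x):                      # x: (B, T, D) input embeddings
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]  # x shifted right by one step
        return x + torch.sigmoid(self.gate) * prev
```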
XSA
Exclusive Self-Attention applied to the last 4 layers.
parameters: {"layers":4}
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"negative_slope":0.5}
Partial RoPE
Rotary positional encoding applied to only 16 of the 64 dimensions per head.
parameters: {"numerator":16,"denominator":64}
Quantization
int6
bits: 6
scope: model artifact
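A minimal symmetric per-tensor 6-bit quantizer for the artifact; grouping and scale storage are assumptions. The signed 6-bit range is [-32, 31]:

```python
import torch

def quantize_int6(w):
    scale = w.abs().max() / 31.0
    # int8 container, since PyTorch has no native int6 dtype
    q = torch.clamp(torch.round(w / scale), -32, 31).to(torch.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.to(torch.float32) * scale
```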
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Newton-Schulz orthogonalization":true,"TT core reshape":true}
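The orthogonalization below follows the public Muon reference's quintic Newton-Schulz iteration; the TT-core reshape comment is an assumption about how 3D cores are flattened to matrices first:

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the update matrix (Muon's core step)."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic coefficients from Muon
    X = G / (G.norm() + eps)
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X

# "TT core reshape": flatten a 3D core (r1, n, r2) to a matrix before the
# orthogonalization step, e.g. core.reshape(r1 * n, r2) (assumed layout).
```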
Weight Averaging
EMA
parameters: null
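Parameters are unreported; a generic EMA weight update with an assumed decay, for illustration:

```python
import torch

@torch.no_grad()
def ema_update(avg_params, model_params, decay=0.999):  # decay is an assumption
    for a, p in zip(avg_params, model_params):
        a.mul_(decay).add_(p, alpha=1.0 - decay)
```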
Compression
zstd
level: 22
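Compressing the serialized checkpoint at zstd's maximum standard level, here via the zstandard Python bindings (file names illustrative):

```python
import zstandard as zstd

def compress_artifact(src="model.bin", dst="model.bin.zst"):
    cctx = zstd.ZstdCompressor(level=22)   # 22 is the maximum standard level
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        cctx.copy_stream(fin, fout)
```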
Evaluation
sliding window eval
parameters: {"stride":64}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Initialization
OrthoInit
TT-SVD initialization from a freshly orthogonal-initialized weight matrix.
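Combining orthogonal initialization with the TT-SVD sketch from the Tensor-Train entry above:

```python
import torch
import torch.nn as nn

def ortho_tt_init(d=512, mode=(8, 8, 8), rank=8):
    W = torch.empty(d, d)
    nn.init.orthogonal_(W)                        # fresh orthogonal weight matrix
    return matrix_to_tt(W, mode=mode, rank=rank)  # TT-SVD sketch defined above
```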
Novel Contributions
- Tensor-Train (TT/MPS) decomposition applied to attention layers for parameter compression.
- Reinvesting TT parameter savings into +2 transformer layers and a larger BigramHash vocabulary.
- TT-SVD initialization and Muon adaptation for 3D TT cores.
- State-dict export via ParameterList to avoid memory overflow from materializing the full dense matrices.
- Sliding window evaluation with stride 64 for improved validation BPB.