PR #1226
open
Non-record: 4090 single-GPU ablations on ValCalib GPTQ + XSA stack (partial logs)
by Wolfie8935
val_bpb
1.1428
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,965,978 bytes
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
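The title names GPTQ with validation-set calibration, but as a minimal sketch of the mixed-bit-width idea itself, here is a plain round-to-nearest per-channel symmetric quantizer (not GPTQ's Hessian-corrected updates; the 512-dim shapes are illustrative assumptions):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest per-channel symmetric quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                       # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized weights

rng = np.random.default_rng(0)
mlp_w = rng.standard_normal((512, 1536))
attn_w = rng.standard_normal((512, 512))

mlp_deq = quantize_symmetric(mlp_w, bits=5)    # int5 on MLP weights
attn_deq = quantize_symmetric(attn_w, bits=6)  # int6 on attention weights
```

Spending the extra bit on attention while squeezing the (larger) MLP matrices is what frees the parameter budget cited under Novel Contributions.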
Architecture
BigramHash
Hash-based bigram embedding over consecutive token pairs with learned projection to model dimension.
parameters: {"buckets":10240,"dim":128}
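A minimal sketch of the hashed-bigram embedding, using the bucket/dim values from the listing; the specific hash function and the pad token for the first position are assumptions, not taken from the PR:

```python
import numpy as np

BUCKETS, DIM = 10240, 128   # parameters from this PR

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    """Hash a consecutive token pair into one of `buckets` embedding slots.
    Illustrative multiplicative-mixing hash; the PR's actual hash is unspecified."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % buckets

rng = np.random.default_rng(0)
bigram_emb = rng.standard_normal((BUCKETS, DIM)) * 0.02   # learned table
proj = rng.standard_normal((DIM, DIM)) * 0.02             # learned projection

def bigram_features(tokens):
    """One hashed-bigram vector per position (first pairs with a pad id of 0)."""
    ids = [bigram_bucket(p, t) for p, t in zip([0] + tokens[:-1], tokens)]
    return bigram_emb[ids] @ proj                          # (len(tokens), DIM)

feats = bigram_features([5, 17, 17, 942])
```

With 10240 buckets, distinct token pairs collide less often than in a smaller table, which is the motivation for the enlargement listed under Novel Contributions.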
weight tying
Tied embeddings between input and output representations.
parameters: null
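Weight tying shares one matrix between the input embedding and the output unembedding, roughly halving embedding parameters. A minimal sketch (vocab/dim sizes assumed for illustration):

```python
import numpy as np

vocab, dim = 1000, 128
embedding = np.random.default_rng(0).standard_normal((vocab, dim)) * 0.02

def embed(token_ids):
    """Input side: look up token embeddings."""
    return embedding[token_ids]

def logits(hidden):
    """Output side: reuse the same matrix, transposed, as the unembedding."""
    return hidden @ embedding.T
```

Any gradient update to `embedding` moves both representations at once, which is the point of tying.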
SmearGate
Learned gate that blends each token's representation with the preceding token's ("smearing" across adjacent positions).
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
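A numpy sketch of grouped query attention with the listed head counts (8 query heads over 4 KV heads, so each pair of query heads reads the same KV head):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (T, 8, d) query heads; k, v: (T, 4, d) shared KV heads.
    Each group of heads/kv_heads query heads attends to one KV head."""
    heads, kv_heads = q.shape[1], k.shape[1]
    group = heads // kv_heads
    k = np.repeat(k, group, axis=1)                  # share KV across each group
    v = np.repeat(v, group, axis=1)
    T, d = q.shape[0], q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    # causal mask: position t may only attend to positions s <= t
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v)
```

Halving the KV heads halves the KV cache at eval time while keeping the full set of query heads.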
MLP3x
Three-times expansion MLP.
parameters: {"hidden":1536}
ReLU²
Squared ReLU activation in the MLP.
parameters: null
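The MLP3x and ReLU² entries combine into one small block: a 3x hidden expansion with a squared-ReLU activation. A sketch using hidden=1536 from the listing (which implies a model dim of 512, an inference rather than a stated value):

```python
import numpy as np

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """3x-expansion MLP: dim -> 3*dim -> dim with squared-ReLU activation."""
    return relu2(x @ w_in) @ w_out

dim, hidden = 512, 1536                  # hidden = 3 * dim, per this PR
rng = np.random.default_rng(0)
w_in = rng.standard_normal((dim, hidden)) * 0.02
w_out = rng.standard_normal((hidden, dim)) * 0.02
y = mlp3x(rng.standard_normal((4, dim)), w_in, w_out)
```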
U-Net skip connections
Skip connections inspired by U-Net added to the transformer stack.
parameters: null
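One way U-Net-style skips are commonly wired into a transformer stack: save the output of each layer in the first half and add it to the input of its mirror layer in the second half. The exact pairing and any learned skip weighting in this PR are not specified; this is a structural sketch:

```python
import numpy as np

def unet_transformer(x, layers):
    """Run `layers` with U-Net skips: encoder-half outputs are added to the
    inputs of their mirrored decoder-half layers (last saved, first consumed)."""
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i >= n // 2 and skips:
            x = x + skips.pop()          # consume the mirrored encoder output
        x = layer(x)
        if i < n // 2:
            skips.append(x)              # save for the mirrored decoder layer
    return x
```

The long skips give later layers a direct path back to early-layer features, shortening gradient paths through the 10-layer stack.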
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
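A numpy sketch of orthogonal initialization via QR of a Gaussian matrix, with a muP-style down-scaled gain on residual output projections. The exact scale factor used by the PR is not stated; 1/sqrt(2 * n_layers) is a common convention assumed here:

```python
import numpy as np

def ortho_init(shape, rng, gain=1.0):
    """Orthogonal init: QR-decompose a Gaussian matrix and keep Q."""
    a = rng.standard_normal(shape)
    tall = shape[0] >= shape[1]
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))          # fix QR's sign ambiguity
    return gain * (q if tall else q.T)

rng = np.random.default_rng(0)
n_layers, dim = 10, 512                  # 10 layers, per this PR
w_attn = ortho_init((dim, dim), rng)
# muP-style shrinkage of residual output projections (assumed factor):
w_proj = ortho_init((dim, dim), rng, gain=1.0 / np.sqrt(2 * n_layers))
```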
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"scope":"embeddings/scalars"}
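The two optimizer entries describe a split: Muon on the matrix weights, AdamW on embeddings and scalars. A sketch of the grouping logic, with hyperparameters copied from the listing; the name-based routing rule itself is an assumption:

```python
import numpy as np

def split_param_groups(named_params):
    """Route 2-D matrix weights to Muon and everything else (embeddings,
    scalars) to AdamW, mirroring this PR's optimizer split."""
    muon, adamw = [], []
    for name, p in named_params:
        (muon if p.ndim == 2 and 'embed' not in name else adamw).append(name)
    return (
        {'opt': 'Muon', 'lr': 0.02, 'momentum': 0.99,
         'weight_decay': 0.04, 'params': muon},
        {'opt': 'AdamW', 'weight_decay': 0.04, 'params': adamw},
    )

groups = split_param_groups([
    ('blocks.0.attn.w_q', np.zeros((512, 512))),
    ('embed.weight',      np.zeros((50304, 512))),
    ('smear_gate.scale',  np.zeros(())),
])
```

Muon's orthogonalized updates only make sense for matrix-shaped parameters, which is why vectors, scalars, and embeddings fall back to AdamW.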
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50,"checkpoints":24}
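Stochastic Weight Averaging with `start_frac=0.4` means only the trailing, most-converged checkpoints enter the average (the PR saves one every 50 steps, 24 in total). A minimal sketch over state dicts:

```python
import numpy as np

def swa_average(checkpoints, start_frac=0.4):
    """Average the trailing checkpoints, skipping the first `start_frac`
    of the run (0.4 here, matching the PR's listing)."""
    tail = checkpoints[int(len(checkpoints) * start_frac):]
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in tail[0].items()}
    for ckpt in tail:
        for k, v in ckpt.items():
            avg[k] += v / len(tail)
    return avg
```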
Evaluation
sliding window eval
parameters: {"stride":64}
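Sliding-window evaluation with stride 64 slides the context window forward 64 tokens at a time and scores only the newest 64 tokens of each window, so every scored token sees close to a full window of left context. A sketch, where `per_token_nll` is a hypothetical model call returning one loss per position:

```python
import numpy as np

def sliding_window_eval(per_token_nll, tokens, window=2048, stride=64):
    """Stride-based eval: score the whole first window, then only the
    newest `stride` tokens of each subsequent window."""
    losses = list(per_token_nll(tokens[:window]))      # first window: score all
    for begin in range(stride, len(tokens) - window + 1, stride):
        ctx = tokens[begin:begin + window]
        losses.extend(per_token_nll(ctx)[-stride:])    # score only the new tail
    return float(np.mean(losses))
```

This trades roughly window/stride extra forward passes for a loss measured under near-full context at every position.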
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"sparsity":"3%"}
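Magnitude pruning at 3% sparsity zeroes the smallest-magnitude 3% of weights. A minimal sketch (ties at the threshold may zero slightly more than the target fraction):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero out the smallest-magnitude `sparsity` fraction of weights
    (3%, per this PR's listing)."""
    k = int(round(w.size * sparsity))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

Pruned weights quantize and compress better, which dovetails with the int5/int6 scheme and zstd artifact compression above.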
Compression
zstd
level: null
Novel Contributions
- Mixed int5/int6 quantization with int5 applied to MLP weights and int6 to attention weights
- BigramHash enlarged to 10240 buckets to reduce token-pair collisions
- SWA with start_frac=0.4 using only the most converged checkpoints
- 10-layer model enabled by savings from int5 MLP quantization
- Single-GPU 4090 ablation documentation for the ValCalib GPTQ + XSA stack