PR #682 (open)
[WIP] Non-record: Local Ablation Pipeline — EMA + Int6 + Partial RoPE (GTX 1650)
by gthgomez
val_bpb: 1.1233
Architecture: Transformer
Optimizer: Muon
Artifact Size: 6.7 MB
Training Techniques
Weight Averaging
EMA
parameters: {"decay":0.997}
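The EMA update with decay 0.997 can be sketched as a minimal shadow-parameter helper; this is an illustrative implementation, not the PR's code, and it uses plain Python lists in place of tensors:

```python
class EMA:
    """Exponential moving average of parameters: shadow = d*shadow + (1-d)*param.

    Sketch only; decay=0.997 matches the PR's setting, the class name and
    list-of-floats representation are assumptions for illustration.
    """

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # independent copy of the initial parameters

    def update(self, params):
        d = self.decay
        # Blend current parameters into the shadow copy.
        self.shadow = [d * s + (1.0 - d) * p for s, p in zip(self.shadow, params)]
```

At evaluation time the shadow values would be swapped in place of the live parameters.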
Quantization
GPTQ-lite
bits: 6
scope: large 2-D tensors / model weights
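The 6-bit clip-search idea (per-row percentile search, per the contributions list) can be sketched as follows; the candidate clip fractions and the MSE selection criterion are assumptions, and real weights would be NumPy/Torch tensors rather than Python lists:

```python
def quantize_int6_row(row, clip_fracs=(0.999, 0.9999, 1.0)):
    """Per-row clip search + symmetric 6-bit quantization (sketch).

    For each candidate clip percentile, quantize to the 6-bit symmetric
    levels [-31, 31] and keep the clip that minimizes round-trip MSE.
    The search grid `clip_fracs` is an assumption, not the PR's values.
    Returns (quantized_ints, scale).
    """
    mags = sorted(abs(x) for x in row)
    best = None
    for f in clip_fracs:
        idx = min(len(mags) - 1, int(f * (len(mags) - 1)))
        clip = mags[idx] or 1.0          # avoid a zero scale on all-zero rows
        scale = clip / 31.0
        q = [max(-31, min(31, round(x / scale))) for x in row]
        err = sum((qi * scale - x) ** 2 for qi, x in zip(q, row))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

The quantized integers plus one scale per row are what would then be serialized and compressed for the artifact-size comparison.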
Architecture
Partial RoPE
Rotary positional embedding applied only to the first subset of head dimensions, with the remaining dimensions passed through unchanged.
parameters: {"dimensions":16}
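Rotating only the first 16 head dimensions can be sketched like this; the (even, odd) pairing convention and the base of 10000 are common defaults assumed here, not confirmed by the PR:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries of a
    per-head vector `x`; the remaining dimensions pass through unchanged.

    Sketch of Partial RoPE: pairing convention and `base` are assumptions.
    """
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s      # rotate each (even, odd) pair by theta
        out[2 * i + 1] = a * s + b * c
    return out                           # dims >= rot_dims are untouched
```

With a 64-dimensional head, dimensions 16..63 carry no positional rotation at all.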
MLP3x
Float-valued MLP width multiplier enabling a 3.0x hidden expansion.
parameters: {"mlp_mult":3}
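The float-valued multiplier amounts to computing the hidden width from `d_model` at configuration time; a minimal sketch (the rounding policy and `multiple_of` knob are assumptions):

```python
def mlp_hidden_dim(d_model, mlp_mult=3.0, multiple_of=1):
    """Hidden width from a float multiplier (sketch of MLP3x).

    Rounds d_model * mlp_mult to the nearest multiple of `multiple_of`;
    the rounding policy is an assumption, not the PR's exact code.
    """
    return int(round(d_model * mlp_mult / multiple_of) * multiple_of)
```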
Regularization
Layerwise LN scale
LayerNorm output scaled by 1/sqrt(layer_idx+1), damping contributions from deeper layers.
parameters: {"scale":"1/sqrt(layer_idx+1)"}
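The per-layer scale is a one-line function of depth; a sketch of how it would multiply each layer's normalized output (the function name is an assumption):

```python
import math

def ln_scale(layer_idx):
    """Layerwise LayerNorm output scale: 1/sqrt(layer_idx + 1).

    Layer 0 is unscaled; deeper layers contribute progressively less,
    acting as a depth-dependent regularizer.
    """
    return 1.0 / math.sqrt(layer_idx + 1)
```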
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"token-embedding and scalar parameters"}
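The two-optimizer split described above (Muon for weight matrices, AdamW for token-embedding and scalar parameters) can be sketched as a parameter-grouping rule; the name-based embedding test is an assumption for illustration, and shapes stand in for real tensors:

```python
def split_param_groups(named_params):
    """Partition parameters into a Muon group and an AdamW group (sketch).

    `named_params` maps parameter name -> shape tuple. Rule assumed here:
    2-D weight matrices go to Muon; embeddings and non-matrix (scalar,
    vector) parameters go to AdamW, matching the PR's description.
    """
    muon, adamw = [], []
    for name, shape in named_params.items():
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

Both groups use weight_decay=0.04 per the settings above, with Muon additionally using momentum 0.99.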
Other
GTX 1650 compatibility patches including NO_COMPILE, math SDP fallback, and MAX_VAL_SEQS cap.
parameters: {"no_compile":true,"max_val_seqs":256}
Compression
zlib
level: 9
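Exporting through zlib at level 9 is a direct call to the standard library; the helper name is an assumption, and the reported 6.7 MB artifact size would be measured on the compressed payload:

```python
import zlib

def compress_export(blob: bytes, level: int = 9) -> bytes:
    """Compress the serialized quantized-weight export with zlib level 9
    (maximum compression), as configured in the PR. Sketch helper."""
    return zlib.compress(blob, level)
```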
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
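A warmdown schedule holds the learning rate constant and then decays it over the final steps; a sketch with warmdown_steps=3500, where the linear decay shape is the usual convention for this schedule and an assumption here:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Warmdown LR schedule (sketch): multiplier is 1.0 until the last
    `warmdown_steps`, then decays linearly to 0 at `total_steps`."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps
```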
Sequence Length
train_length: 2048
eval_length: 64
Novel Contributions
- GTX 1650 compatibility patches for running the pipeline on constrained hardware
- EMA implementation with decay settings calibrated for both competition-scale and local runs
- Int6 clip-search quantizer with per-row percentile search and zlib-compressed export comparison
- Partial RoPE applied to only the first 16 of 64 head dimensions
- Layerwise LN scaling by 1/sqrt(layer_idx+1)
- Muon with decoupled weight decay, plus AdamW for token-embedding and scalar parameters
- Float-valued MLP multiplier enabling MLP_MULT=3.0
- Local ablation pipeline documenting export size and bpb tradeoffs