PR #1456
Non-Record: HybridQuantGPT v6.1 H100 + Aggressive SLOT (steps=100, 3-seed 1.146523)
by sisegod
val_bpb
1.1465
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.13 MB
Training Techniques
Quantization
mixed int6/int5/int4/fp16
bits: null
scope: Q/K 6-bit, V/O 5-bit, MLP up ~2.3-bit, MLP down 4-bit, embeddings fp16
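A minimal symmetric per-tensor quantization sketch of the basic idea. This is illustrative only: the record's actual mixed int6/int5/int4 assignment, the ~2.3-bit MLP-up coding, and the rANS-packed checkpoint are not reproduced here.

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Per-tensor symmetric quantization: map weights onto a signed integer grid.
    qmax = 2 ** (bits - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate fp weights from the integer grid.
    return q * scale
```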
Architecture
U-Net skip connections
Encoder-decoder Transformer with learned skip connections
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
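A structural sketch of the encoder-decoder layout with learned skip connections. Only the 11-layer count comes from the card; the 5/1/5 encoder-middle-decoder split and the additive, scalar-weighted skips are assumptions.

```python
def unet_skips(blocks, x, skip_weights):
    # Run the encoder half, saving each output; run the middle block(s);
    # then add learned-weighted skips into the mirrored decoder half.
    n = len(blocks) // 2
    saved = []
    for block in blocks[:n]:                      # encoder half
        x = block(x)
        saved.append(x)
    for block in blocks[n:len(blocks) - n]:       # middle (odd layer count)
        x = block(x)
    for i, block in enumerate(blocks[-n:]):       # decoder half with skips
        x = x + skip_weights[i] * saved[n - 1 - i]
        x = block(x)
    return x
```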
XSA
Cross-Self Attention removing self-value projection from attention output
parameters: null
Value Residual
First-layer value propagated to later layers via learned lambda
parameters: null
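A minimal sketch of value residual, assuming the usual convex-mix form where a learned per-layer scalar `lam` blends the current layer's value with the first layer's:

```python
import numpy as np

def value_residual(v_layer, v_first, lam):
    # Convex mix of this layer's value with the first layer's value;
    # lam is a learned per-layer scalar (a plain float here).
    return lam * v_layer + (1.0 - lam) * v_first
```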
SmearGate
Blends each token with the previous token via a learned gate
parameters: null
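A sketch of SmearGate under the assumption of an additive blend with a scalar learned gate; the first token, having no predecessor, is left unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logit):
    # Add a gated copy of the previous token's activation to each token.
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # the first token has no predecessor
    return x + sigmoid(gate_logit) * prev
```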
BigramHash
Hash-based bigram embedding
parameters: {"vocab":2048,"dim":128}
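A sketch of a hash-based bigram embedding using the listed vocab=2048, dim=128 table. The hash multiplier and the zero-padding of the first position are hypothetical choices:

```python
import numpy as np

VOCAB, DIM = 2048, 128                  # from the listed parameters

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(VOCAB, DIM))

def bigram_hash_embed(tokens):
    # Hash each (previous, current) token pair into one of 2048 buckets
    # and look up its embedding.
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])
    buckets = (prev * 1000003 + toks) % VOCAB
    return bigram_table[buckets]
```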
VE128
Token identity re-injection at later layers
parameters: {"layers":[9,10]}
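VE128 can be sketched as re-injecting a per-token identity embedding at the listed late layers. Only the layer indices [9, 10] come from the card; additive mixing, the 128-dim table, and the vocab size are assumptions:

```python
import numpy as np

VE_DIM = 128
rng = np.random.default_rng(0)
ve_table = rng.normal(0.0, 0.02, size=(2048, VE_DIM))  # hypothetical vocab

def reinject_identity(layer_idx, h, tokens, inject_layers=(9, 10)):
    # Add a token-identity embedding back into the stream at late layers only.
    if layer_idx in inject_layers:
        return h + ve_table[np.asarray(tokens)]
    return h
```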
Partial RoPE
Rotary positional encoding applied to only part of head dimensions
parameters: {"rope_dims":16,"head_dims":64}
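A sketch of partial RoPE with the listed 16-of-64 dims rotated; the half-split pairing convention (rather than interleaved pairs) is an assumption:

```python
import numpy as np

def partial_rope(q, pos=None, rope_dims=16, base=10000.0):
    # Rotate only the first rope_dims of each head; pass the rest through.
    seq, head_dims = q.shape
    half = rope_dims // 2
    if pos is None:
        pos = np.arange(seq)
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(pos, inv_freq)                      # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, q[:, rope_dims:]], axis=-1)
```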
LeakyReLU
LeakyReLU squared activation in the MLP
parameters: {"negative_slope":0.5}
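A sketch of the squared-LeakyReLU activation with the listed slope 0.5. Whether the record preserves the sign of the negative branch after squaring is not specified; this version does not:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU followed by an elementwise square.
    y = np.where(x > 0.0, x, negative_slope * x)
    return y * y
```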
Regularization
LN scale
parameters: {"scale_rule":"1/sqrt(layer+1)"}
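The listed scale rule, made concrete (0-indexed layers assumed; the 11-layer count comes from the architecture entry, and whether the scale is an init or a frozen value is not stated):

```python
import math

def ln_scales(num_layers=11):
    # Per-layer LayerNorm scale under the 1/sqrt(layer+1) rule.
    return [1.0 / math.sqrt(layer + 1) for layer in range(num_layers)]
```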
logit softcap
parameters: {"value":15}
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"warmup_from":0.85,"warmup_steps":500}
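The momentum warmup made concrete: Muon's momentum ramps from 0.85 to its final 0.95 over the first 500 steps. A linear ramp is an assumption; the card lists only the endpoints.

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=500):
    # Linearly warm momentum from `start` to `end` over `warmup_steps`.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```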
Weight Averaging
SWA
parameters: {"snapshots":7,"start_step":9700,"end_step":10000,"interval":50}
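The listed SWA schedule, made concrete: 7 snapshots taken every 50 steps from 9700 through 10000, then averaged (uniform weighting assumed):

```python
def swa_snapshot_steps(start=9700, end=10000, interval=50):
    # Checkpoint steps averaged for SWA: 9700, 9750, ..., 10000.
    return list(range(start, end + 1, interval))

def average_weights(snapshots):
    # Uniform average of parameter dicts.
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}
```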
EMA
parameters: {"decay":0.997,"type":"HMA"}
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
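A sketch of stride-64 sliding-window evaluation: each 1024-token window (from the sequence-length entry) scores only its last 64 tokens, so every token is predicted with near-full left context. The batch_seqs=32 batching is omitted here.

```python
def sliding_window_spans(n_tokens, context=1024, stride=64):
    # Each window scores only its last `stride` tokens; the first window
    # scores everything it covers.
    spans = []
    for end in range(context, n_tokens + 1, stride):
        start = end - context
        n_scored = context if start == 0 else stride
        spans.append((start, end, n_scored))
    return spans
```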
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2}
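The score-first TTT loop, sketched structurally: each chunk is scored with the current weights *before* the model adapts on it, so no token is ever scored by weights that already trained on it. `score_fn`, `adapt_fn`, and `state` are hypothetical stand-ins; details like freezing the first 2 blocks and the 0.002 learning rate would live inside `adapt_fn`.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, state, epochs=3):
    # Score each chunk first, then adapt on it for `epochs` passes.
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score_fn(state, chunk)        # evaluate with current weights
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs):                 # then adapt on the same chunk
            state = adapt_fn(state, chunk)
    return total_loss / total_tokens            # token-weighted mean loss
```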
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_ratio":0.175}
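The warmdown schedule made concrete: a constant learning-rate multiplier, then a linear decay to zero over the final 17.5% of training. The 10000-step total is taken from the SWA end_step above.

```python
def lr_multiplier(step, total_steps=10000, warmdown_ratio=0.175):
    # Constant LR, then linear warmdown to 0 over the final fraction of steps.
    warmdown_start = total_steps * (1.0 - warmdown_ratio)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```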
Novel Contributions
- Increased SLOT optimization steps from 20 to 100 via a one-line default change
- Showed that SLOT performance improves monotonically up to 100 steps under full stride-64 evaluation
- Re-evaluated prior diminishing-returns conclusions using full evaluation instead of stride-256 quick eval
- Verified the improvement across three seeds and multiple SLOT step counts
- Reused the exact same training artifacts and rANS checkpoint, changing only the evaluation recipe