PR #1456
Non-Record: HybridQuantGPT v6.1 H100 + Aggressive SLOT (steps=100, 3-seed 1.146523)
by sisegod
val_bpb
1.1465
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.13 MB
Training Techniques
Quantization
mixed int6/int5/int4/fp16
bits: null
scope: Q/K 6-bit, V/O 5-bit, MLP up ~2.3-bit, MLP down 4-bit, embeddings fp16
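A minimal symmetric per-tensor quantization sketch of the basic idea. This is illustrative only: the record's actual mixed int6/int5/int4 assignment, the ~2.3-bit MLP-up coding, and the rANS-packed checkpoint are not reproduced here.

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Per-tensor symmetric quantization: map weights onto a signed integer grid.
    qmax = 2 ** (bits - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate fp weights from the integer grid.
    return q * scale
```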
Architecture
U-Net skip connections
Encoder-decoder Transformer with learned skip connections
parameters: {"layers":11,"dim":512,"heads":8,"kv_heads":4}
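A structural sketch of the encoder-decoder layout with learned skip connections. Only the 11-layer count comes from the card; the 5/1/5 encoder-middle-decoder split and the additive, scalar-weighted skips are assumptions.

```python
def unet_skips(blocks, x, skip_weights):
    # Run the encoder half, saving each output; run the middle block(s);
    # then add learned-weighted skips into the mirrored decoder half.
    n = len(blocks) // 2
    saved = []
    for block in blocks[:n]:                      # encoder half
        x = block(x)
        saved.append(x)
    for block in blocks[n:len(blocks) - n]:       # middle (odd layer count)
        x = block(x)
    for i, block in enumerate(blocks[-n:]):       # decoder half with skips
        x = x + skip_weights[i] * saved[n - 1 - i]
        x = block(x)
    return x
```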
XSA
Cross-Self Attention removing self-value projection from attention output
parameters: null
Value Residual
First-layer value propagated to later layers via learned lambda
parameters: null
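A minimal sketch of value residual, assuming the usual convex-mix form where a learned per-layer scalar `lam` blends the current layer's value with the first layer's:

```python
import numpy as np

def value_residual(v_layer, v_first, lam):
    # Convex mix of this layer's value with the first layer's value;
    # lam is a learned per-layer scalar (a plain float here).
    return lam * v_layer + (1.0 - lam) * v_first
```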
SmearGate
Blends each token with the previous token via a learned gate
parameters: null
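A sketch of SmearGate under the assumption of an additive blend with a scalar learned gate; the first token, having no predecessor, is left unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logit):
    # Add a gated copy of the previous token's activation to each token.
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # the first token has no predecessor
    return x + sigmoid(gate_logit) * prev
```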
BigramHash
Hash-based bigram embedding
parameters: {"vocab":2048,"dim":128}
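A sketch of a hash-based bigram embedding using the listed vocab=2048, dim=128 table. The hash multiplier and the zero-padding of the first position are hypothetical choices:

```python
import numpy as np

VOCAB, DIM = 2048, 128                  # from the listed parameters

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(VOCAB, DIM))

def bigram_hash_embed(tokens):
    # Hash each (previous, current) token pair into one of 2048 buckets
    # and look up its embedding.
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])
    buckets = (prev * 1000003 + toks) % VOCAB
    return bigram_table[buckets]
```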
VE128
Token identity re-injection at later layers
parameters: {"layers":[9,10]}
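VE128 can be sketched as re-injecting a per-token identity embedding at the listed late layers. Only the layer indices [9, 10] come from the card; additive mixing, the 128-dim table, and the vocab size are assumptions:

```python
import numpy as np

VE_DIM = 128
rng = np.random.default_rng(0)
ve_table = rng.normal(0.0, 0.02, size=(2048, VE_DIM))  # hypothetical vocab

def reinject_identity(layer_idx, h, tokens, inject_layers=(9, 10)):
    # Add a token-identity embedding back into the stream at late layers only.
    if layer_idx in inject_layers:
        return h + ve_table[np.asarray(tokens)]
    return h
```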
Partial RoPE
Rotary positional encoding applied to only part of head dimensions
parameters: {"rope_dims":16,"head_dims":64}
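A sketch of partial RoPE with the listed 16-of-64 dims rotated; the half-split pairing convention (rather than interleaved pairs) is an assumption:

```python
import numpy as np

def partial_rope(q, pos=None, rope_dims=16, base=10000.0):
    # Rotate only the first rope_dims of each head; pass the rest through.
    seq, head_dims = q.shape
    half = rope_dims // 2
    if pos is None:
        pos = np.arange(seq)
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(pos, inv_freq)                      # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, q[:, rope_dims:]], axis=-1)
```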
LeakyReLU
LeakyReLU squared activation in the MLP
parameters: {"negative_slope":0.5}
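A sketch of the squared-LeakyReLU activation with the listed slope 0.5. Whether the record preserves the sign of the negative branch after squaring is not specified; this version does not:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU followed by an elementwise square.
    y = np.where(x > 0.0, x, negative_slope * x)
    return y * y
```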
Regularization
LN scale
parameters: {"scale_rule":"1/sqrt(layer+1)"}
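The listed scale rule, made concrete (0-indexed layers assumed; the 11-layer count comes from the architecture entry, and whether the scale is an init or a frozen value is not stated):

```python
import math

def ln_scales(num_layers=11):
    # Per-layer LayerNorm scale under the 1/sqrt(layer+1) rule.
    return [1.0 / math.sqrt(layer + 1) for layer in range(num_layers)]
```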
logit softcap
parameters: {"value":15}
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"warmup_from":0.85,"warmup_steps":500}
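The momentum warmup made concrete: Muon's momentum ramps from 0.85 to its final 0.95 over the first 500 steps. A linear ramp is an assumption; the card lists only the endpoints.

```python
def muon_momentum(step, start=0.85, end=0.95, warmup_steps=500):
    # Linearly warm momentum from `start` to `end` over `warmup_steps`.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```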
Weight Averaging
SWA
parameters: {"snapshots":7,"start_step":9700,"end_step":10000,"interval":50}
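The listed SWA schedule, made concrete: 7 snapshots taken every 50 steps from 9700 through 10000, then averaged (uniform weighting assumed):

```python
def swa_snapshot_steps(start=9700, end=10000, interval=50):
    # Checkpoint steps averaged for SWA: 9700, 9750, ..., 10000.
    return list(range(start, end + 1, interval))

def average_weights(snapshots):
    # Uniform average of parameter dicts.
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}
```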
EMA
parameters: {"decay":0.997,"type":"HMA"}
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
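A sketch of stride-64 sliding-window evaluation: each 1024-token window (from the sequence-length entry) scores only its last 64 tokens, so every token is predicted with near-full left context. The batch_seqs=32 batching is omitted here.

```python
def sliding_window_spans(n_tokens, context=1024, stride=64):
    # Each window scores only its last `stride` tokens; the first window
    # scores everything it covers.
    spans = []
    for end in range(context, n_tokens + 1, stride):
        start = end - context
        n_scored = context if start == 0 else stride
        spans.append((start, end, n_scored))
    return spans
```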
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2}
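The score-first TTT loop, sketched structurally: each chunk is scored with the current weights *before* the model adapts on it, so no token is ever scored by weights that already trained on it. `score_fn`, `adapt_fn`, and `state` are hypothetical stand-ins; details like freezing the first 2 blocks and the 0.002 learning rate would live inside `adapt_fn`.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, state, epochs=3):
    # Score each chunk first, then adapt on it for `epochs` passes.
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = score_fn(state, chunk)        # evaluate with current weights
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs):                 # then adapt on the same chunk
            state = adapt_fn(state, chunk)
    return total_loss / total_tokens            # token-weighted mean loss
```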
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_ratio":0.175}
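The warmdown schedule made concrete: a constant learning-rate multiplier, then a linear decay to zero over the final 17.5% of training. The 10000-step total is taken from the SWA end_step above.

```python
def lr_multiplier(step, total_steps=10000, warmdown_ratio=0.175):
    # Constant LR, then linear warmdown to 0 over the final fraction of steps.
    warmdown_start = total_steps * (1.0 - warmdown_ratio)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```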
Novel Contributions
- Increased SLOT optimization steps from 20 to 100 via a one-line default change
- Showed that SLOT performance improves monotonically up to 100 steps under full stride-64 evaluation
- Re-evaluated prior diminishing-returns conclusions using full evaluation instead of stride-256 quick eval
- Verified the improvement across three seeds and multiple SLOT step counts
- Reused the exact same training artifacts and rANS checkpoint, changing only the evaluation recipe