PR #682 (open)
[WIP] Non-record: Local Ablation Pipeline — EMA + Int6 + Partial RoPE (GTX 1650)
by gthgomez
val_bpb: 1.1233
Architecture: Transformer
Optimizer: Muon
Artifact Size: 6.7 MB
Training Techniques
Weight Averaging
EMA
parameters: {"decay":0.997}
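The EMA update with decay 0.997 can be sketched as a minimal shadow-parameter helper; this is an illustrative implementation, not the PR's code, and it uses plain Python lists in place of tensors:

```python
class EMA:
    """Exponential moving average of parameters: shadow = d*shadow + (1-d)*param.

    Sketch only; decay=0.997 matches the PR's setting, the class name and
    list-of-floats representation are assumptions for illustration.
    """

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # independent copy of the initial parameters

    def update(self, params):
        d = self.decay
        # Blend current parameters into the shadow copy.
        self.shadow = [d * s + (1.0 - d) * p for s, p in zip(self.shadow, params)]
```

At evaluation time the shadow values would be swapped in place of the live parameters.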
Quantization
GPTQ-lite
bits: 6
scope: large 2-D tensors / model weights
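The 6-bit clip-search idea (per-row percentile search, per the contributions list) can be sketched as follows; the candidate clip fractions and the MSE selection criterion are assumptions, and real weights would be NumPy/Torch tensors rather than Python lists:

```python
def quantize_int6_row(row, clip_fracs=(0.999, 0.9999, 1.0)):
    """Per-row clip search + symmetric 6-bit quantization (sketch).

    For each candidate clip percentile, quantize to the 6-bit symmetric
    levels [-31, 31] and keep the clip that minimizes round-trip MSE.
    The search grid `clip_fracs` is an assumption, not the PR's values.
    Returns (quantized_ints, scale).
    """
    mags = sorted(abs(x) for x in row)
    best = None
    for f in clip_fracs:
        idx = min(len(mags) - 1, int(f * (len(mags) - 1)))
        clip = mags[idx] or 1.0          # avoid a zero scale on all-zero rows
        scale = clip / 31.0
        q = [max(-31, min(31, round(x / scale))) for x in row]
        err = sum((qi * scale - x) ** 2 for qi, x in zip(q, row))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

The quantized integers plus one scale per row are what would then be serialized and compressed for the artifact-size comparison.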
Architecture
Partial RoPE
Rotary positional embedding applied only to the first subset of head dimensions, with the remaining dimensions passed through unchanged.
parameters: {"dimensions":16}
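Rotating only the first 16 head dimensions can be sketched like this; the (even, odd) pairing convention and the base of 10000 are common defaults assumed here, not confirmed by the PR:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries of a
    per-head vector `x`; the remaining dimensions pass through unchanged.

    Sketch of Partial RoPE: pairing convention and `base` are assumptions.
    """
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s      # rotate each (even, odd) pair by theta
        out[2 * i + 1] = a * s + b * c
    return out                           # dims >= rot_dims are untouched
```

With a 64-dimensional head, dimensions 16..63 carry no positional rotation at all.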
MLP3x
Float-valued MLP width multiplier enabling a 3.0x hidden expansion.
parameters: {"mlp_mult":3}
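The float-valued multiplier amounts to computing the hidden width from `d_model` at configuration time; a minimal sketch (the rounding policy and `multiple_of` knob are assumptions):

```python
def mlp_hidden_dim(d_model, mlp_mult=3.0, multiple_of=1):
    """Hidden width from a float multiplier (sketch of MLP3x).

    Rounds d_model * mlp_mult to the nearest multiple of `multiple_of`;
    the rounding policy is an assumption, not the PR's exact code.
    """
    return int(round(d_model * mlp_mult / multiple_of) * multiple_of)
```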
Regularization
Layerwise LN scale
LayerNorm output scaled by 1/sqrt(layer_idx+1), damping contributions from deeper layers.
parameters: {"scale":"1/sqrt(layer_idx+1)"}
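The per-layer scale is a one-line function of depth; a sketch of how it would multiply each layer's normalized output (the function name is an assumption):

```python
import math

def ln_scale(layer_idx):
    """Layerwise LayerNorm output scale: 1/sqrt(layer_idx + 1).

    Layer 0 is unscaled; deeper layers contribute progressively less,
    acting as a depth-dependent regularizer.
    """
    return 1.0 / math.sqrt(layer_idx + 1)
```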
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"token-embedding and scalar parameters"}
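The two-optimizer split described above (Muon for weight matrices, AdamW for token-embedding and scalar parameters) can be sketched as a parameter-grouping rule; the name-based embedding test is an assumption for illustration, and shapes stand in for real tensors:

```python
def split_param_groups(named_params):
    """Partition parameters into a Muon group and an AdamW group (sketch).

    `named_params` maps parameter name -> shape tuple. Rule assumed here:
    2-D weight matrices go to Muon; embeddings and non-matrix (scalar,
    vector) parameters go to AdamW, matching the PR's description.
    """
    muon, adamw = [], []
    for name, shape in named_params.items():
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

Both groups use weight_decay=0.04 per the settings above, with Muon additionally using momentum 0.99.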
Other
GTX 1650 compatibility patches including NO_COMPILE, math SDP fallback, and MAX_VAL_SEQS cap.
parameters: {"no_compile":true,"max_val_seqs":256}
Compression
zlib
level: 9
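Exporting through zlib at level 9 is a direct call to the standard library; the helper name is an assumption, and the reported 6.7 MB artifact size would be measured on the compressed payload:

```python
import zlib

def compress_export(blob: bytes, level: int = 9) -> bytes:
    """Compress the serialized quantized-weight export with zlib level 9
    (maximum compression), as configured in the PR. Sketch helper."""
    return zlib.compress(blob, level)
```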
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
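A warmdown schedule holds the learning rate constant and then decays it over the final steps; a sketch with warmdown_steps=3500, where the linear decay shape is the usual convention for this schedule and an assumption here:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Warmdown LR schedule (sketch): multiplier is 1.0 until the last
    `warmdown_steps`, then decays linearly to 0 at `total_steps`."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps
```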
Sequence Length
train_length: 2048
eval_length: 64
Novel Contributions
- GTX 1650 compatibility patches for running the pipeline on constrained hardware
- EMA implementation with decay settings calibrated for both competition-scale and local runs
- Int6 clip-search quantizer with per-row percentile search and zlib-compressed export comparison
- Partial RoPE applied to only the first 16 of 64 head dimensions
- Layerwise LN scaling by 1/sqrt(layer_idx+1)
- Muon with decoupled weight decay, plus AdamW for token-embedding and scalar parameters
- Float-valued MLP multiplier enabling MLP_MULT=3.0
- Local ablation pipeline documenting export size and bpb tradeoffs