PR #1420

open

Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08014 (5-seed mean)

by abaybektursun
val_bpb
1.0801
Architecture
Transformer
Optimizer
Muon
Artifact Size
19,811 bytes

Training Techniques

Architecture
depth recurrence
Added a third loop pass through layers 4-5, raising the virtual depth to 17 layers.
parameters: {"num_loops":3,"virtual_layers":17,"loop_layers":[4,5]}
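The looped schedule can be sketched as below. The base physical depth of 13 is an assumption chosen so the stated loop config expands to the stated 17 virtual layers; `layer_schedule` is a hypothetical helper, not the submission's code.

```python
# Sketch of the looped layer schedule. Base depth of 13 is assumed;
# with layers 4-5 run 3 times it expands to 17 virtual layers.
NUM_LAYERS = 13          # assumed physical depth (not stated in the PR)
LOOP_LAYERS = [4, 5]     # layers revisited on each extra pass
NUM_LOOPS = 3            # total passes through the looped span

def layer_schedule(num_layers, loop_layers, num_loops):
    """Return the order in which physical layers are executed."""
    schedule = []
    for i in range(num_layers):
        if i == loop_layers[0]:
            # Emit the whole looped span num_loops times in a row.
            for _ in range(num_loops):
                schedule.extend(loop_layers)
        elif i in loop_layers:
            continue  # already emitted as part of the looped span
        else:
            schedule.append(i)
    return schedule

sched = layer_schedule(NUM_LAYERS, LOOP_LAYERS, NUM_LOOPS)
print(len(sched))  # virtual depth: 17
```

Each extra pass reuses the weights of layers 4-5, so virtual depth grows without adding parameters.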
parallel residuals
Used GPT-J style parallel attention and MLP residual branches for layers 7-10.
parameters: {"start_layer":7,"end_layer":10}
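The structural difference between a sequential block and a GPT-J-style parallel block can be sketched with scalar stand-ins for the real tensor sublayers:

```python
# Sketch contrasting sequential vs. GPT-J-style parallel residual blocks.
# norm/attn/mlp are scalar stand-ins; the real blocks operate on tensors.
def norm(x):   # stand-in for LayerNorm / RMSNorm
    return x

def attn(x):
    return 0.5 * x

def mlp(x):
    return 0.25 * x

def sequential_block(x):
    x = x + attn(norm(x))   # MLP below sees the post-attention residual
    x = x + mlp(norm(x))
    return x

def parallel_block(x):
    # Both branches read the same input and their outputs are summed,
    # letting the attention and MLP matmuls run concurrently.
    return x + attn(norm(x)) + mlp(norm(x))

print(sequential_block(1.0), parallel_block(1.0))
```

The parallel form trades the MLP's view of the attention output for better kernel overlap, which is why the submission applies it only to layers 7-10.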
LeakyReLU
Used LeakyReLU(0.5) squared activation in the MLP.
parameters: {"negative_slope":0.5}
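Read literally, the activation is a plain square of LeakyReLU with the stated negative slope of 0.5, so negative inputs map to positive outputs; a minimal sketch:

```python
# Squared LeakyReLU with negative_slope=0.5: y = leaky_relu(x, 0.5) ** 2.
# Note the plain square makes the negative branch positive.
def leaky_relu(x, negative_slope=0.5):
    return x if x >= 0 else negative_slope * x

def act(x, negative_slope=0.5):
    y = leaky_relu(x, negative_slope)
    return y * y

print(act(2.0), act(-2.0))  # 4.0 1.0
```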
Quantization
GPTQ
bits: 8
scope: embeddings
mixed int5/int8
bits: null
scope: all
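For the 8-bit path, the arithmetic being stored is a round-to-grid with a per-row scale; this sketch shows only that round trip. GPTQ additionally compensates rounding error across columns, which is omitted here, and the int5/int8 mix is not modeled.

```python
# Minimal symmetric int8 round-trip (per-row scale), a sketch of what
# 8-bit embedding quantization stores. GPTQ's error-compensating column
# updates are intentionally omitted.
def quantize_row(row, bits=8):
    qmax = 2 ** (bits - 1) - 1               # 127 for int8
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [round(v / scale) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.1, -0.4, 0.25, 0.0]
q, s = quantize_row(row)
recon = dequantize_row(q, s)
err = max(abs(a - b) for a, b in zip(row, recon))
print(q, err)
```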
Regularization
logit softcap
parameters: {"value":30}
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Weight Averaging
EMA
parameters: {"decay":0.997}
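The EMA update with the stated decay of 0.997 is the usual per-parameter recurrence, shown here with a scalar stand-in for a weight:

```python
# Exponential moving average of weights, decay = 0.997:
# ema <- decay * ema + (1 - decay) * w, applied each step.
DECAY = 0.997

def ema_update(ema, w, decay=DECAY):
    return decay * ema + (1 - decay) * w

ema = 0.0
for step in range(1000):
    ema = ema_update(ema, 1.0)   # weights pinned at 1.0 for illustration
print(ema)
```

After n steps toward a constant weight the EMA reaches 1 - decay**n of it, so 1000 steps at 0.997 cover roughly 95% of the gap.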
LR Schedule
warmdown
parameters: {"warmdown_frac":0.667,"final_lr":0}
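With warmdown_frac = 0.667 and final_lr = 0, the schedule holds the base LR for the first third of training and then decays linearly to zero. A sketch (total step count and base LR are illustrative):

```python
# Warmdown LR schedule: constant, then linear ramp to final_lr over the
# last warmdown_frac of training. base_lr=1.0 and total=1000 are
# illustrative values, not from the submission.
def lr_at(step, total_steps, base_lr=1.0, warmdown_frac=0.667, final_lr=0.0):
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps   # 0 -> 1 across the warmdown
    return base_lr + frac * (final_lr - base_lr)

total = 1000
print(lr_at(0, total), lr_at(500, total), lr_at(total, total))
```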
Evaluation
sliding window eval
parameters: {"stride":64}
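Strided sliding-window evaluation advances the context window by the stated stride of 64 and scores only the tokens new to each window, so every scored token keeps a long left context. A sketch of the window bookkeeping (the context length of 256 and sequence length of 512 are illustrative, not from the submission):

```python
# Sliding-window eval bookkeeping, stride = 64. Each window covers
# [start, end) and only tokens in [score_start, end) are scored.
def eval_windows(seq_len, ctx_len=256, stride=64):
    """Yield (window_start, window_end, score_start) triples."""
    windows = []
    end = stride
    while end <= seq_len:
        start = max(0, end - ctx_len)
        windows.append((start, end, end - stride))
        end += stride
    return windows

wins = eval_windows(512)
print(wins[0], wins[-1])
```

Every token is scored exactly once, at the cost of re-running the model over overlapping context.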
n-gram tilt
parameters: {"token_orders":[8,16],"within_word_orders":[1,3],"word_start_bigrams":true,"base_beta":2,"within_beta":0.92,"agree_bonus":0.1}
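One plausible reading of the token-level part of this tilt: if the last n tokens also occur earlier in the context, boost the logit of the token that followed the earlier occurrence by base_beta. The sketch below covers a single order only and omits the within-word orders, word-start bigrams, and agreement bonus; it is an illustration of the idea, not the submission's exact rule.

```python
# Minimal causal n-gram tilt sketch (one order, base_beta only).
# Matches of the trailing n-gram earlier in the context add base_beta
# to the logit of the token that followed the earlier match.
def ngram_tilt(logits, context, order=8, base_beta=2.0):
    tilted = dict(logits)
    if len(context) < order:
        return tilted
    suffix = tuple(context[-order:])
    # Causal: scan only earlier positions in the same context.
    for i in range(len(context) - order):
        if tuple(context[i:i + order]) == suffix:
            nxt = context[i + order]
            tilted[nxt] = tilted.get(nxt, 0.0) + base_beta
    return tilted

ctx = [1, 2, 3, 4, 5, 6, 7, 8, 9] + [1, 2, 3, 4, 5, 6, 7, 8]
out = ngram_tilt({9: 0.0, 10: 0.0}, ctx)
print(out)  # token 9 boosted: it followed the earlier [1..8]
```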
Compression
lzma
level: 9
Other
other
Fused MLP kernels using Triton TMA forward and CUTLASS EVT backward to improve throughput and fit more training steps in the time budget.
parameters: {"forward":"Triton TMA","backward":"CUTLASS EVT"}
other
Double-buffered async data prefetch with pinned memory and a separate CUDA stream.
parameters: null
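The double-buffering pattern can be sketched with a background thread and a bounded queue of depth 2; in the real pipeline the producer copies into pinned host memory and issues the host-to-device copy on a separate CUDA stream, which the thread and queue stand in for here.

```python
import threading
import queue

# Double-buffered prefetch sketch: a background thread stages the next
# batch while the main loop consumes the current one. maxsize=2 gives
# the two in-flight buffers of double buffering.
def producer(batches, q):
    for b in batches:
        q.put(b)      # blocks once two batches are already staged
    q.put(None)       # sentinel: no more data

batches = list(range(5))
q = queue.Queue(maxsize=2)
t = threading.Thread(target=producer, args=(batches, q))
t.start()

consumed = []
while (b := q.get()) is not None:
    consumed.append(b)    # "train step" on the current batch
t.join()
print(consumed)
```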

Novel Contributions

  • Triple loop recurrence through layers 4-5 with earlier loop activation
  • Fused MLP kernels using Triton TMA forward and CUTLASS EVT backward
  • Parallel residuals for later layers
  • Eval-time causal n-gram tilt with one-token exponential boosting
  • Double-buffered async data prefetch
  • PyTorch/Inductor-related platform fixes to recover training throughput