PR #1420

open

Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08014 (5-seed mean)

by abaybektursun
val_bpb
1.0801
Architecture
Transformer
Optimizer
Muon
Artifact Size
19,811 bytes

Training Techniques

Architecture
depth recurrence
Added a third loop pass through layers 4-5, raising the virtual depth to 17 layers.
parameters: {"num_loops":3,"virtual_layers":17,"loop_layers":[4,5]}
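The looped schedule can be sketched as below. The base physical depth of 13 is an assumption chosen so the stated loop config expands to the stated 17 virtual layers; `layer_schedule` is a hypothetical helper, not the submission's code.

```python
# Sketch of the looped layer schedule. Base depth of 13 is assumed;
# with layers 4-5 run 3 times it expands to 17 virtual layers.
NUM_LAYERS = 13          # assumed physical depth (not stated in the PR)
LOOP_LAYERS = [4, 5]     # layers revisited on each extra pass
NUM_LOOPS = 3            # total passes through the looped span

def layer_schedule(num_layers, loop_layers, num_loops):
    """Return the order in which physical layers are executed."""
    schedule = []
    for i in range(num_layers):
        if i == loop_layers[0]:
            # Emit the whole looped span num_loops times in a row.
            for _ in range(num_loops):
                schedule.extend(loop_layers)
        elif i in loop_layers:
            continue  # already emitted as part of the looped span
        else:
            schedule.append(i)
    return schedule

sched = layer_schedule(NUM_LAYERS, LOOP_LAYERS, NUM_LOOPS)
print(len(sched))  # virtual depth: 17
```

Each extra pass reuses the weights of layers 4-5, so virtual depth grows without adding parameters.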
parallel residuals
Used GPT-J style parallel attention and MLP residual branches for layers 7-10.
parameters: {"start_layer":7,"end_layer":10}
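The structural difference between a sequential block and a GPT-J-style parallel block can be sketched with scalar stand-ins for the real tensor sublayers:

```python
# Sketch contrasting sequential vs. GPT-J-style parallel residual blocks.
# norm/attn/mlp are scalar stand-ins; the real blocks operate on tensors.
def norm(x):   # stand-in for LayerNorm / RMSNorm
    return x

def attn(x):
    return 0.5 * x

def mlp(x):
    return 0.25 * x

def sequential_block(x):
    x = x + attn(norm(x))   # MLP below sees the post-attention residual
    x = x + mlp(norm(x))
    return x

def parallel_block(x):
    # Both branches read the same input and their outputs are summed,
    # letting the attention and MLP matmuls run concurrently.
    return x + attn(norm(x)) + mlp(norm(x))

print(sequential_block(1.0), parallel_block(1.0))
```

The parallel form trades the MLP's view of the attention output for better kernel overlap, which is why the submission applies it only to layers 7-10.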
LeakyReLU
Used LeakyReLU(0.5) squared activation in the MLP.
parameters: {"negative_slope":0.5}
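Read literally, the activation is a plain square of LeakyReLU with the stated negative slope of 0.5, so negative inputs map to positive outputs; a minimal sketch:

```python
# Squared LeakyReLU with negative_slope=0.5: y = leaky_relu(x, 0.5) ** 2.
# Note the plain square makes the negative branch positive.
def leaky_relu(x, negative_slope=0.5):
    return x if x >= 0 else negative_slope * x

def act(x, negative_slope=0.5):
    y = leaky_relu(x, negative_slope)
    return y * y

print(act(2.0), act(-2.0))  # 4.0 1.0
```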
Quantization
GPTQ
bits: 8
scope: embeddings
mixed int5/int8
bits: null
scope: all
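For the 8-bit path, the arithmetic being stored is a round-to-grid with a per-row scale; this sketch shows only that round trip. GPTQ additionally compensates rounding error across columns, which is omitted here, and the int5/int8 mix is not modeled.

```python
# Minimal symmetric int8 round-trip (per-row scale), a sketch of what
# 8-bit embedding quantization stores. GPTQ's error-compensating column
# updates are intentionally omitted.
def quantize_row(row, bits=8):
    qmax = 2 ** (bits - 1) - 1               # 127 for int8
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [round(v / scale) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.1, -0.4, 0.25, 0.0]
q, s = quantize_row(row)
recon = dequantize_row(q, s)
err = max(abs(a - b) for a, b in zip(row, recon))
print(q, err)
```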
Regularization
logit softcap
parameters: {"value":30}
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Weight Averaging
EMA
parameters: {"decay":0.997}
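The EMA update with the stated decay of 0.997 is the usual per-parameter recurrence, shown here with a scalar stand-in for a weight:

```python
# Exponential moving average of weights, decay = 0.997:
# ema <- decay * ema + (1 - decay) * w, applied each step.
DECAY = 0.997

def ema_update(ema, w, decay=DECAY):
    return decay * ema + (1 - decay) * w

ema = 0.0
for step in range(1000):
    ema = ema_update(ema, 1.0)   # weights pinned at 1.0 for illustration
print(ema)
```

After n steps toward a constant weight the EMA reaches 1 - decay**n of it, so 1000 steps at 0.997 cover roughly 95% of the gap.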
LR Schedule
warmdown
parameters: {"warmdown_frac":0.667,"final_lr":0}
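With warmdown_frac = 0.667 and final_lr = 0, the schedule holds the base LR for the first third of training and then decays linearly to zero. A sketch (total step count and base LR are illustrative):

```python
# Warmdown LR schedule: constant, then linear ramp to final_lr over the
# last warmdown_frac of training. base_lr=1.0 and total=1000 are
# illustrative values, not from the submission.
def lr_at(step, total_steps, base_lr=1.0, warmdown_frac=0.667, final_lr=0.0):
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps   # 0 -> 1 across the warmdown
    return base_lr + frac * (final_lr - base_lr)

total = 1000
print(lr_at(0, total), lr_at(500, total), lr_at(total, total))
```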
Evaluation
sliding window eval
parameters: {"stride":64}
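Strided sliding-window evaluation advances the context window by the stated stride of 64 and scores only the tokens new to each window, so every scored token keeps a long left context. A sketch of the window bookkeeping (the context length of 256 and sequence length of 512 are illustrative, not from the submission):

```python
# Sliding-window eval bookkeeping, stride = 64. Each window covers
# [start, end) and only tokens in [score_start, end) are scored.
def eval_windows(seq_len, ctx_len=256, stride=64):
    """Yield (window_start, window_end, score_start) triples."""
    windows = []
    end = stride
    while end <= seq_len:
        start = max(0, end - ctx_len)
        windows.append((start, end, end - stride))
        end += stride
    return windows

wins = eval_windows(512)
print(wins[0], wins[-1])
```

Every token is scored exactly once, at the cost of re-running the model over overlapping context.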
n-gram tilt
parameters: {"token_orders":[8,16],"within_word_orders":[1,3],"word_start_bigrams":true,"base_beta":2,"within_beta":0.92,"agree_bonus":0.1}
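One plausible reading of the token-level part of this tilt: if the last n tokens also occur earlier in the context, boost the logit of the token that followed the earlier occurrence by base_beta. The sketch below covers a single order only and omits the within-word orders, word-start bigrams, and agreement bonus; it is an illustration of the idea, not the submission's exact rule.

```python
# Minimal causal n-gram tilt sketch (one order, base_beta only).
# Matches of the trailing n-gram earlier in the context add base_beta
# to the logit of the token that followed the earlier match.
def ngram_tilt(logits, context, order=8, base_beta=2.0):
    tilted = dict(logits)
    if len(context) < order:
        return tilted
    suffix = tuple(context[-order:])
    # Causal: scan only earlier positions in the same context.
    for i in range(len(context) - order):
        if tuple(context[i:i + order]) == suffix:
            nxt = context[i + order]
            tilted[nxt] = tilted.get(nxt, 0.0) + base_beta
    return tilted

ctx = [1, 2, 3, 4, 5, 6, 7, 8, 9] + [1, 2, 3, 4, 5, 6, 7, 8]
out = ngram_tilt({9: 0.0, 10: 0.0}, ctx)
print(out)  # token 9 boosted: it followed the earlier [1..8]
```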
Compression
lzma
level: 9
Other
other
Fused MLP kernels using Triton TMA forward and CUTLASS EVT backward to improve throughput and fit more training steps in the time budget.
parameters: {"forward":"Triton TMA","backward":"CUTLASS EVT"}
other
Double-buffered async data prefetch with pinned memory and a separate CUDA stream.
parameters: null
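The double-buffering pattern can be sketched with a background thread and a bounded queue of depth 2; in the real pipeline the producer copies into pinned host memory and issues the host-to-device copy on a separate CUDA stream, which the thread and queue stand in for here.

```python
import threading
import queue

# Double-buffered prefetch sketch: a background thread stages the next
# batch while the main loop consumes the current one. maxsize=2 gives
# the two in-flight buffers of double buffering.
def producer(batches, q):
    for b in batches:
        q.put(b)      # blocks once two batches are already staged
    q.put(None)       # sentinel: no more data

batches = list(range(5))
q = queue.Queue(maxsize=2)
t = threading.Thread(target=producer, args=(batches, q))
t.start()

consumed = []
while (b := q.get()) is not None:
    consumed.append(b)    # "train step" on the current batch
t.join()
print(consumed)
```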

Novel Contributions

  • Triple loop recurrence through layers 4-5 with earlier loop activation
  • Fused MLP kernels using Triton TMA forward and CUTLASS EVT backward
  • Parallel residuals for later layers
  • Eval-time causal n-gram tilt with one-token exponential boosting
  • Double-buffered async data prefetch
  • PyTorch/Inductor-related platform fixes to recover training throughput