PR #59

closed

NTK Eval + Overtone Init (val_bpb=1.2160)

by notapplica
val_bpb
1.2160
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.80MB

Training Techniques

Initialization
spectral init
SVD-based overtone embedding initialization that reshapes tied embedding singular values to follow a power-law decay.
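A minimal sketch of what such a spectral init could look like, assuming NumPy. The function name overtone_init, the exponent alpha, the base std, and the matrix shape are all hypothetical; the PR summary does not state the actual values or the exact reshaping rule. The idea shown is to take a random embedding matrix, keep its singular vectors, and replace its singular values with a power-law sequence.

```python
import numpy as np

def overtone_init(vocab_size, dim, alpha=0.5, std=0.02, seed=0):
    """Random embedding whose singular values follow s_k = s_1 * k**(-alpha).

    alpha, std and seed are illustrative assumptions, not values from the PR.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, std, size=(vocab_size, dim))
    # Thin SVD: u is (vocab_size, k), s is (k,), vt is (k, dim), k = min(vocab_size, dim).
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    # Keep the largest singular value, impose power-law decay on the rest.
    ranks = np.arange(1, s.size + 1, dtype=np.float64)
    target = s[0] * ranks ** (-alpha)
    # Scale each left singular vector by its new singular value, then recombine.
    return (u * target) @ vt

embedding = overtone_init(vocab_size=256, dim=32)
```

Because the singular vectors are left untouched, only the spectrum of the tied embedding changes; the directions of the random init are preserved.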
resid mix
Sigmoid-scheduled residual mixing initialization across layers, blending current hidden state with the initial embedding.
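One way to read this, as a sketch under stated assumptions: each layer gets an initial mixing weight from a sigmoid over its relative depth, and the layer input is a convex blend of the current hidden state and the initial embedding. The names resid_mix_init and mix, the steepness and midpoint values, and the direction of the schedule (early layers lean on the embedding, late layers on the hidden state) are hypothetical and not taken from the PR.

```python
import math

def resid_mix_init(num_layers, steepness=6.0, midpoint=0.5):
    """Per-layer initial weight on the current hidden state, in (0, 1).

    steepness and midpoint are illustrative assumptions.
    """
    weights = []
    for layer in range(num_layers):
        depth = layer / max(1, num_layers - 1)  # relative depth in [0, 1]
        weights.append(1.0 / (1.0 + math.exp(-steepness * (depth - midpoint))))
    return weights

def mix(hidden, embed0, w):
    """Blend the current hidden state with the initial embedding, elementwise."""
    return [w * h + (1.0 - w) * e for h, e in zip(hidden, embed0)]

weights = resid_mix_init(num_layers=9)
blended = mix([1.0, 2.0], [3.0, 4.0], weights[4])
```

In a real model these weights would typically be learnable parameters that only start from the sigmoid schedule.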
Evaluation
NTK-aware RoPE scaling
parameters: {"train_length":1024,"eval_length":2048}
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: {"tied_embedding_lr":0.1}
LR Schedule
warmdown
parameters: {"warmdown_steps":2500}
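A sketch of a warmdown schedule, assuming the common form in which the learning rate stays constant and then decays linearly to zero over the final warmdown_steps. The PR summary only gives the step count; the linear shape, the function name lr_scale, and the total step count in the example are assumptions.

```python
def lr_scale(step, total_steps, warmdown_steps=2500):
    """Multiplier applied to the base learning rate at a given step.

    Constant at 1.0, then linear decay to 0.0 over the last warmdown_steps
    (the linear shape is an assumption).
    """
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)

# Hypothetical run length of 10000 steps.
scales = [lr_scale(s, total_steps=10000) for s in (0, 7500, 8750, 10000)]
```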
Regularization
weight decay
parameters: {"weight_decay":0.01}
Architecture
tied embeddings
Uses tied input/output embeddings with an increased learning rate for the tied embedding.
parameters: null
RoPE
Dynamic NTK-aware rotary positional embedding scaling at evaluation time.
parameters: {"train_length":1024,"eval_length":2048}
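For reference, a sketch of NTK-aware base scaling for the 1024 to 2048 case. It uses the widely used rule base' = base * s^(d / (d - 2)) with s = eval_length / train_length, evaluated once for a fixed evaluation length rather than per sequence. The head dimension of 64 and the RoPE base of 10000 are assumptions; the PR summary gives only the two lengths, and the function names are hypothetical.

```python
def ntk_scaled_base(base, head_dim, train_length, eval_length):
    """Enlarge the RoPE base so the lowest frequency stretches by the length ratio."""
    scale = max(1.0, eval_length / train_length)
    return base * scale ** (head_dim / (head_dim - 2))

def rope_inv_freq(base, head_dim):
    """Inverse frequencies for each rotary pair: base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

train_length, eval_length = 1024, 2048
head_dim, base = 64, 10000.0  # assumed values, not stated in the PR

new_base = ntk_scaled_base(base, head_dim, train_length, eval_length)
inv_freq_train = rope_inv_freq(base, head_dim)
inv_freq_eval = rope_inv_freq(new_base, head_dim)
```

With this rule the highest frequency is unchanged and the lowest frequency is divided by the length ratio (here 2), so nearby-token resolution is kept while the longest wavelength covers the longer context.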

Novel Contributions

  • SVD-based overtone embedding initialization with power-law spectral shaping
  • Sigmoid-scheduled phase-transition residual mixing across layers
  • NTK-aware RoPE scaling to evaluate at 2048 tokens after training at 1024
  • Increased AdamW weight decay and warmdown duration to reduce the quantization gap
  • Higher tied embedding learning rate