PR #460

open

feat: Add non-record dense 2048 sliding-window ablation submission

by abhishekrajdhar
val_bpb: 1.2928
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13039699 bytes

Training Techniques

Architecture
tied embeddings
Uses tied input/output embeddings in the dense transformer.
parameters: null
grouped-query attention
Uses grouped-query attention in the transformer blocks.
parameters: null
residual mixing
Applies residual mixing in the model.
parameters: null
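
The listing gives no parameters for these three choices. Below is a minimal PyTorch sketch of how they typically compose in a model like the one described (SP-1024 tokenizer vocabulary, 10 dense layers); the width, the head counts, and the scalar form of the residual mixing are assumptions, since the PR does not pin them down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQABlock(nn.Module):
    """Transformer block with grouped-query attention and a learnable
    residual-mixing scalar (the PR does not specify the mixing form).
    The MLP sublayer is omitted for brevity."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q = nn.Linear(dim, dim, bias=False)
        # GQA: fewer key/value heads than query heads.
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        self.norm = nn.LayerNorm(dim)
        self.alpha = nn.Parameter(torch.ones(1))  # hypothetical mixing coefficient

    def forward(self, x):
        B, T, C = x.shape
        h = self.norm(x)
        q = self.q(h).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(h).view(B, T, 2 * self.n_kv_heads, self.head_dim).transpose(1, 2)
        k, v = kv.chunk(2, dim=1)
        # Broadcast each K/V head across its group of query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        # Residual mixing: x <- alpha * x + attn(x), one assumed form.
        return self.alpha * x + self.o(y)

class TinyLM(nn.Module):
    def __init__(self, vocab=1024, dim=512, depth=10):  # SP-1024 vocab, 10 layers
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(GQABlock(dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)

    def forward(self, idx):
        x = self.embed(idx)
        for blk in self.blocks:
            x = blk(x)
        # Tied embeddings: reuse the input embedding as the unembedding.
        return self.norm(x) @ self.embed.weight.t()
```

Tying the unembedding to `self.embed.weight` also removes an entire projection matrix from the artifact, which matters under the 16MB cap.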
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"decoupled_weight_decay":true}
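
Neither weight decay nor momentum values are recorded; only the `decoupled_weight_decay: true` flag comes from the submission. The sketch below uses commonly published Muon defaults and the usual description of its core update: momentum accumulation followed by an approximate orthogonalization of the 2D update via a Newton-Schulz iteration.

```python
import torch

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration commonly quoted for Muon (the reference implementation
    runs this in bfloat16)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95, wd=0.01):
    """One Muon-style step for a 2D weight: momentum, orthogonalize,
    decoupled weight decay, apply. The lr/momentum/wd values are
    assumed defaults; the submission records them as null."""
    buf.mul_(momentum).add_(grad)   # momentum accumulation
    update = newton_schulz(buf)     # orthogonalized update direction
    param.mul_(1 - lr * wd)         # decoupled weight decay
    param.add_(update, alpha=-lr)
```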
Initialization
spectral init
Spectral tied-embedding initialization.
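
"Spectral tied-embedding initialization" is named without parameters. One plausible reading, sketched below purely as an assumption, is to rescale the randomly initialized tied matrix so its largest singular value hits a chosen target:

```python
import torch

@torch.no_grad()
def spectral_init_(weight, target_sigma=1.0):
    """Initialize, then rescale so the largest singular value equals
    target_sigma. The PR's exact definition of 'spectral init' is not
    recorded; this is one common interpretation."""
    torch.nn.init.normal_(weight, std=0.02)
    weight.mul_(target_sigma / torch.linalg.svdvals(weight)[0])
```

Because the embeddings are tied, a single `spectral_init_(model.embed.weight)` call conditions both the input embedding and the output projection.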
Quantization
int8
bits: 8
scope: all
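
With `bits: 8` and `scope: all`, a symmetric per-tensor scheme is the simplest fit; whether the PR quantizes per-tensor or per-channel is not stated, so treat this as a sketch:

```python
import torch

def quantize_int8(t):
    """Symmetric per-tensor int8: one fp scale per tensor, values
    rounded into [-127, 127]."""
    scale = t.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale.item()

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale
```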
Compression
zlib
level: null
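
The compression level is null, so zlib's default is assumed. A minimal packing step for the quantized state dict, with the cap checked up front:

```python
import io
import zlib
import torch

def pack_artifact(state_dict, cap_bytes=16 * 1024 * 1024):
    """Serialize the (already int8-quantized) state dict, zlib-compress
    it at the default level, and fail fast if it exceeds the cap."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    blob = zlib.compress(buf.getvalue())
    if len(blob) > cap_bytes:
        raise ValueError(f"artifact too large: {len(blob)} bytes")
    return blob
```

At the reported 13039699 bytes, the artifact clears the cap under either the 16,000,000-byte or 16 MiB reading.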
Evaluation
sliding window eval
parameters: {"stride":64}
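
Stride-64 evaluation at a 2048-token window means each window advances 64 tokens and only those newly exposed targets are scored, so almost every token is predicted with near-full left context. A sketch against a model like `TinyLM` above; converting the summed NLL (nats) to bits-per-byte additionally requires the byte length of the evaluated text:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=2048, stride=64):
    """Strided sliding-window evaluation. Returns (total_nll, n_scored);
    bpb = total_nll / (math.log(2) * n_bytes_of_text)."""
    total_nll, n_scored = 0.0, 0
    prev_end = 1  # index one past the last target scored so far
    for begin in range(0, tokens.size(0), stride):
        end = min(begin + window, tokens.size(0))
        ids = tokens[begin:end].unsqueeze(0)
        logits = model(ids[:, :-1])   # predicts tokens begin+1 .. end-1
        targets = ids[:, 1:]
        # Score only targets not covered by the previous window.
        n_new = end - max(prev_end, begin + 1)
        if n_new > 0:
            total_nll += F.cross_entropy(
                logits[0, -n_new:], targets[0, -n_new:], reduction="sum").item()
            n_scored += n_new
        prev_end = end
        if end == tokens.size(0):
            break
    return total_nll, n_scored
```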
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Test-Time Training
LoRA TTT
parameters: {"rank":0}
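
`rank: 0` matches the "disabled by default" note under Novel Contributions: the adapter code path exists but allocates nothing. A sketch with the usual LoRA shapes and scaling assumed:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with an optional LoRA adapter for test-time training.
    rank=0, the submitted setting, allocates no adapter parameters, so
    the code path exists but is inert."""
    def __init__(self, base: nn.Linear, rank: int = 0, alpha: float = 1.0):
        super().__init__()
        self.base, self.rank = base, rank
        if rank > 0:
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

    def forward(self, x):
        y = self.base(x)
        if self.rank > 0:
            y = y + (x @ self.A.T @ self.B.T) * self.scale
        return y
```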
LR Schedule
warmdown
parameters: {"warmdown_iters":3200}
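
With `warmdown_iters: 3200`, the schedule presumably holds the base LR and then decays linearly over the final 3200 steps; whether it decays to zero or to a floor is not recorded, so zero is assumed:

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=3200):
    """Hold base_lr, then decay linearly to zero over the final
    warmdown_iters steps (decaying to zero is an assumption)."""
    if step < total_iters - warmdown_iters:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```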
Regularization
weight decay
parameters: {"decoupled":true}
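
`decoupled: true` means the decay shrinks the weights directly, scaled by the LR, rather than being folded into the gradient, keeping it independent of the optimizer's preconditioning. The same line already appears inside the Muon sketch above; in isolation:

```python
import torch

@torch.no_grad()
def decoupled_weight_decay(params, lr, wd):
    """Shrink weights directly (p <- p * (1 - lr*wd)) instead of adding
    wd * p to the gradient."""
    for p in params:
        p.mul_(1 - lr * wd)
```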

Novel Contributions

  • Dense 10-layer transformer branch using the provided SP-1024 tokenizer
  • 2048-token train/eval context with sliding-window evaluation at stride 64
  • Structured ablation loop to identify effective vs. regressing ideas
  • Post-training int8 quantization plus zlib compression under the 16MB cap
  • LoRA TTT code path kept but disabled by default (rank 0) after it proved unstable and regressed val_bpb
  • Documentation of negative-result ablations including 4096-token context, lower matrix LR, longer warmdown, and recurrent/shared-depth variants