PR #460
openfeat: Add non-record dense 2048 sliding-window ablation submission
by abhishekrajdhar
val_bpb
1.2928
Architecture
Transformer
Optimizer
Muon
Artifact Size
13,039,699 bytes (≈12.4 MiB)
Training Techniques
Architecture
tied embeddings
Uses tied input/output embeddings in the dense transformer.
parameters: null
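Tying the input and output embeddings means one matrix serves both roles, halving embedding parameters. A minimal NumPy sketch (vocab and width here are illustrative, not the submission's actual sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1024, 64
W = rng.standard_normal((vocab, d_model)) / d_model ** 0.5  # one shared matrix

def embed(ids):
    # Input side: look up rows of the shared matrix.
    return W[ids]

def lm_logits(h):
    # Output side: project hidden states against the same matrix (tied head).
    return h @ W.T
```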
grouped-query attention
Uses grouped-query attention in the transformer blocks.
parameters: null
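Grouped-query attention shares each key/value head across a group of query heads, shrinking the KV cache. A NumPy sketch with illustrative head counts (no causal mask, single sequence):

```python
import numpy as np

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    # q: (T, n_q_heads*dh); k, v: (T, n_kv_heads*dh). Each KV head is shared by
    # n_q_heads // n_kv_heads query heads via repetition along the head axis.
    T, dh = q.shape[0], q.shape[1] // n_q_heads
    rep = n_q_heads // n_kv_heads
    q = q.reshape(T, n_q_heads, dh)
    k = k.reshape(T, n_kv_heads, dh).repeat(rep, axis=1)
    v = v.reshape(T, n_kv_heads, dh).repeat(rep, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / dh ** 0.5
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    att = scores / scores.sum(-1, keepdims=True)   # softmax over keys
    return np.einsum('hqk,khd->qhd', att, v).reshape(T, -1)
```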
residual mixing
Applies residual mixing in the model.
parameters: null
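The exact "residual mixing" scheme isn't specified in this listing; one common form is a learned interpolation between the incoming residual stream and the block output, sketched here with a fixed coefficient standing in for a learned one:

```python
import numpy as np

def mix_residual(x, block_out, lam=0.7):
    # Hypothetical residual mixing: convex combination of the residual stream
    # and the block output (lam would typically be learned per layer).
    return lam * x + (1.0 - lam) * block_out
```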
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"decoupled_weight_decay":true}
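Muon applies momentum to raw gradients and orthogonalizes the resulting update direction with a Newton-Schulz iteration. A NumPy sketch using the quintic coefficients from the public Muon reference implementation, with the decoupled weight decay flag above folded in (hyperparameters are illustrative):

```python
import numpy as np

def ns_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration driving G toward a semi-orthogonal
    # matrix; coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, wd=0.0):
    # Momentum on raw gradients, orthogonalized update, decoupled weight decay.
    buf = momentum * buf + grad
    W = W * (1.0 - lr * wd) - lr * ns_orthogonalize(buf)
    return W, buf
```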
Initialization
spectral init
Spectral tied-embedding initialization.
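The listing doesn't spell out the spectral init recipe; one plausible reading is drawing a Gaussian matrix and rescaling it to a target top singular value, sketched here as an assumption:

```python
import numpy as np

def spectral_init(shape, target=1.0, seed=0):
    # Hypothetical sketch: Gaussian draw rescaled so the largest singular
    # value equals `target`. The submission's exact recipe isn't given here.
    W = np.random.default_rng(seed).standard_normal(shape)
    return W * (target / np.linalg.svd(W, compute_uv=False)[0])
```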
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
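Int8 quantization over all tensors followed by zlib compression can be sketched as below; symmetric per-tensor scaling is assumed here, and the submission's actual scheme (e.g. per-channel scales) may differ:

```python
import numpy as np
import zlib

def quantize_int8(W):
    # Symmetric per-tensor int8: map [-max|W|, max|W|] onto [-127, 127].
    scale = max(np.abs(W).max() / 127.0, 1e-12)
    q = np.clip(np.rint(W / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(tensors, level=9):
    # Quantize every tensor, then zlib-compress the concatenated int8 bytes.
    payload = b"".join(quantize_int8(W)[0].tobytes() for W in tensors)
    return zlib.compress(payload, level)
```

Dequantization is `q * scale`, so the round-trip error per entry is at most half a quantization step.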
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
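Stride-64 sliding-window evaluation at a 2048-token window scores every token exactly once while giving each one near-full left context: each window only scores the tokens the previous window didn't cover. A sketch of the span bookkeeping:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Return (begin, end, n_scored) spans: the model sees tokens [begin, end)
    # but only the last n_scored tokens count toward the loss.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```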
Test-Time Training
LoRA TTT
parameters: {"rank":0}
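LoRA-style test-time training adds a trainable low-rank update to a frozen weight; with `rank=0`, as submitted, the update vanishes, which is how the code path stays disabled by default. A NumPy sketch (shapes illustrative):

```python
import numpy as np

def lora_layer(d_in, d_out, rank, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_in, d_out)) / d_in ** 0.5  # frozen base weight
    A = rng.standard_normal((d_in, rank)) * 0.01          # trained at test time
    B = np.zeros((rank, d_out))                           # standard zero init
    return W, A, B

def lora_forward(x, W, A, B):
    # rank 0 makes (x @ A) @ B an all-zeros term: base output is unchanged.
    return x @ W + (x @ A) @ B
```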
LR Schedule
warmdown
parameters: {"warmdown_iters":3200}
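A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps:

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=3200):
    # Constant LR until the last warmdown_iters steps, then linear decay to 0.
    remaining = total_iters - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * max(remaining, 0) / warmdown_iters
```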
Regularization
weight decay
parameters: {"decoupled":true}
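Decoupled (AdamW-style) weight decay shrinks the weights directly instead of adding an L2 term to the gradient, so the decay is unaffected by the optimizer's gradient preconditioning:

```python
def decoupled_wd_update(w, opt_update, lr, wd):
    # Pull weights toward zero multiplicatively, then apply the optimizer's
    # (already preconditioned) update.
    return w * (1.0 - lr * wd) - lr * opt_update
```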
Novel Contributions
- Dense 10-layer transformer branch using the provided SP-1024 tokenizer
- 2048-token train/eval context with sliding-window evaluation at stride 64
- Structured ablation loop to identify effective vs. regressing ideas
- Post-training int8 quantization plus zlib compression under the 16MB cap
- Disabled-by-default LoRA TTT code path after instability/regression
- Documentation of negative-result ablations including 4096-token context, lower matrix LR, longer warmdown, and recurrent/shared-depth variants