val_bpb: 1.8389
Architecture: Transformer
Optimizer: —
Artifact Size: 13361078 bytes (≈13.4 MB)
Training Techniques
Architecture
tied embeddings
Ties the input embedding and output projection weights, reducing the parameter count (see the sketch after this subsection).
parameters: null
KV head count
Uses fewer key/value heads than query heads (grouped-query attention); see the sketch after this subsection.
parameters: {"num_heads":8,"num_kv_heads":4}
Quantization
QAT
Applies QAT-style fake quantization to the weights during training (see the sketch below).
bits: null
scope: all
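A minimal sketch of QAT-style fake quantization with a straight-through estimator. The 8-bit default and per-tensor symmetric scaling are assumptions (the card records bits: null, scope: all).

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize w in the forward pass; gradients flow through unchanged."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Straight-through estimator: forward sees w_q, backward sees identity.
    return w + (w_q - w).detach()
```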
Evaluation
sliding window eval
Evaluates with an overlapping sliding window so each scored token sees long left context (see the sketch below).
parameters: {"stride":64,"context_length":4096}
Test-Time Training
LoRA TTT
Adapts the model at test time with rank-8 LoRA adapters (see the sketch below).
parameters: {"rank":8}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
Learning-rate warmdown: the LR is decayed over the final warmdown_iters iterations of training (see the sketch below).
parameters: {"warmdown_iters":20000}
Other
other
Selective FP16 passthrough for a few sensitive tensors during training (see the sketch below).
parameters: null
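The card does not say which tensors are passed through or how; this sketch assumes a name-based allowlist (embeddings and norms are a common guess) that skips fake quantization and keeps those tensors in FP16, reusing the fake_quantize helper from the QAT sketch above.

```python
import torch

FP16_PASSTHROUGH = ("tok_emb", "norm")                  # assumed name fragments

def maybe_quantize(name: str, w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    if any(tag in name for tag in FP16_PASSTHROUGH):
        return w.to(torch.float16)                      # sensitive tensor: FP16 passthrough
    return fake_quantize(w, bits)                       # everything else: QAT fake quant
```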
Novel Contributions
- Negative-result submission for the 10-minute, 16MB track
- 10-layer, 4K-context training run
- Overlapping sliding-window evaluation
- Rank-8 LoRA test-time training
- QAT-style fake quantization during training
- Selective FP16 passthrough for sensitive tensors
- Documentation of coverage collapse under the 10-minute budget