PR #237

open

Add 10L 4K long-context negative-result submission

by takoyakisoft
val_bpb: 1.8389
Architecture: Transformer
Optimizer:
Artifact Size: 13361078 bytes (≈12.7 MiB)

Training Techniques

Architecture
tied embeddings
Uses tied embeddings as part of the model configuration.
parameters: null
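Tied embeddings share one weight matrix between the input embedding and the output (unembedding) projection, cutting parameter count at a fixed artifact budget. A minimal pure-Python sketch of the tying (class and sizes are illustrative, not from the submission):

```python
# Minimal sketch of tied embeddings: the LM head reuses the input
# embedding matrix instead of owning a separate weight. TinyLM and its
# sizes are hypothetical; real models tie tensors the same way.

class TinyLM:
    def __init__(self, vocab_size, d_model):
        # One weight matrix; row i is the embedding of token i.
        self.embedding = [[0.01 * (i + j) for j in range(d_model)]
                          for i in range(vocab_size)]
        # Tie: the LM head is the *same object*, applied transposed.
        self.lm_head = self.embedding

    def embed(self, token_id):
        return self.embedding[token_id]

    def logits(self, hidden):
        # logit[v] = <hidden, embedding[v]>
        return [sum(h * w for h, w in zip(hidden, row))
                for row in self.lm_head]

model = TinyLM(vocab_size=4, d_model=2)
assert model.lm_head is model.embedding  # a single shared parameter
```

Any update to the embedding during training is automatically reflected in the output projection, since both names point at one parameter.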
KV head count
Uses fewer KV heads than query heads (grouped-query attention).
parameters: {"num_heads":8,"num_kv_heads":4}
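With `num_heads=8` and `num_kv_heads=4`, each KV head serves a group of two query heads, halving the KV cache. A small sketch of the head mapping implied by that config:

```python
# Grouped-query attention head mapping for the listed config
# (num_heads=8, num_kv_heads=4): each KV head is shared by a group of
# num_heads // num_kv_heads consecutive query heads.

def kv_head_for(q_head: int, num_heads: int = 8, num_kv_heads: int = 4) -> int:
    assert num_heads % num_kv_heads == 0, "query heads must divide evenly"
    group_size = num_heads // num_kv_heads   # 2 query heads per KV head
    return q_head // group_size

# Query heads 0-1 read KV head 0, heads 2-3 read KV head 1, and so on.
assert [kv_head_for(q) for q in range(8)] == [0, 0, 1, 1, 2, 2, 3, 3]
```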
Quantization
QAT
QAT-style fake quantization applied during training.
bits: null
scope: all
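QAT-style fake quantization rounds values to a quantized grid and immediately dequantizes them in the forward pass, so training sees quantization error. The submission leaves `bits` unspecified; the sketch below assumes 8-bit symmetric quantization:

```python
# Hedged sketch of QAT-style fake quantization: quantize then
# immediately dequantize so the training loss reflects rounding error.
# The 8-bit symmetric scheme is an assumption; `bits` is null in the
# submission.

def fake_quantize(values, bits=8):
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    # Round to the integer grid, clamp, and map back to float.
    return [max(-qmax, min(qmax, round(v / scale))) * scale
            for v in values]

w = [0.5, -1.27, 0.003, 1.27]
wq = fake_quantize(w)   # values land on multiples of scale = 1.27 / 127
```

In real QAT the rounding step is paired with a straight-through estimator so gradients flow through it; that part is omitted here.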
Evaluation
sliding window eval
Overlapping sliding-window evaluation over the full context.
parameters: {"stride":64,"context_length":4096}
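With `stride=64` and `context_length=4096`, the evaluation window advances 64 tokens at a time and only the newly exposed tokens are scored, so almost every scored token keeps the full 4096-token left context. A sketch of the window bookkeeping this implies (the exact scheme is an assumption; this is the standard overlapping-stride layout):

```python
# Hedged sketch of overlapping sliding-window evaluation with the
# listed stride=64, context_length=4096. Each triple (start, end,
# score_from) means: condition on tokens [start, score_from), score
# tokens [score_from, end).

def sliding_windows(n_tokens, context_length=4096, stride=64):
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break

# After the first window, every window scores exactly `stride` tokens.
windows = list(sliding_windows(4224))
```

The cost is one forward pass of 4096 tokens per 64 scored tokens, which is why small strides are expensive but give the most faithful long-context perplexity.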
Test-Time Training
LoRA TTT
Rank-8 LoRA adaptation applied during test-time training.
parameters: {"rank":8}
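LoRA test-time training freezes the base weight W and learns a low-rank correction B @ A on the evaluation stream; with `rank=8` the correction adds only (d_out + d_in) x 8 trainable values per adapted matrix. A pure-Python sketch (the `alpha` scaling is a conventional LoRA detail assumed here, not taken from the submission):

```python
# Hedged sketch of rank-8 LoRA: effective weight = W + (alpha/r) * B @ A.
# W stays frozen; only A and B would be updated at test time.
# alpha=16 is an assumed conventional value.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(W, A, B, alpha=16, r=8):
    delta = matmul(B, A)                       # (d_out x r) @ (r x d_in)
    s = alpha / r
    return [[w + s * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

d_out, d_in, r = 16, 16, 8
W = [[0.0] * d_in for _ in range(d_out)]       # frozen base weight
A = [[0.0] * d_in for _ in range(r)]           # trained at test time
B = [[0.0] * r for _ in range(d_out)]          # zero init: no-op at start
W_eff = lora_effective_weight(W, A, B)         # equals W before any step
```

Initializing B to zero makes the adapter a no-op at the start of test-time training, so the adapted model can only move away from the base model as evaluation-stream gradients accumulate.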
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
Learning-rate warmdown over the final 20000 iterations.
parameters: {"warmdown_iters":20000}
Other
other
Selective FP16 passthrough for a few sensitive tensors during training.
parameters: null
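Selective FP16 passthrough can be read as a per-tensor exemption list: most tensors go through fake quantization, while a few sensitive ones skip it and stay in FP16. A sketch of that routing; the tensor names below are hypothetical examples, not taken from the submission:

```python
# Hedged sketch of selective FP16 passthrough during QAT-style
# training: tensors on the exemption list skip fake quantization.
# The names in FP16_PASSTHROUGH are hypothetical placeholders.

FP16_PASSTHROUGH = {"embedding.weight", "final_norm.weight"}

def maybe_quantize(name, tensor, quantize):
    """Apply `quantize` unless the tensor is marked as passthrough."""
    if name in FP16_PASSTHROUGH:
        return tensor              # kept in FP16, untouched
    return quantize(tensor)

out = maybe_quantize("embedding.weight", [1.234],
                     lambda t: [round(v) for v in t])
assert out == [1.234]              # sensitive tensor skipped quantization
```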

Novel Contributions

  • Negative-result submission for the 10-minute, 16MB track
  • 10-layer, 4K-context training run
  • Overlapping sliding-window evaluation
  • Rank-8 LoRA test-time training
  • QAT-style fake quantization during training
  • Selective FP16 passthrough for sensitive tensors
  • Documentation of coverage collapse under the 10-minute budget