PR #1038

open

Add 1.20 BPB submission with Legal TTT and Calibration (9L/448D)

by Vibes-me
val_bpb: 1.2058
Architecture: Transformer
Optimizer:
Artifact Size: 15.87 MB

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
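The tied-embedding setup can be sketched as follows. The vocabulary size here is hypothetical; only the model dimension (512) comes from the submission's config.

```python
import numpy as np

# Weight tying sketch: a single matrix serves as both the input
# embedding table and the output (unembedding) projection, so those
# parameters are stored once in the artifact.
VOCAB, DIM = 1000, 512  # VOCAB is an illustrative placeholder

rng = np.random.default_rng(0)
E = rng.standard_normal((VOCAB, DIM)).astype(np.float32)  # shared table

def embed(token_ids: np.ndarray) -> np.ndarray:
    # Input side: row lookup into the shared table.
    return E[token_ids]            # (seq, DIM)

def logits(hidden: np.ndarray) -> np.ndarray:
    # Output side: project back with the transpose of the same table,
    # so no separate output matrix exists.
    return hidden @ E.T            # (seq, VOCAB)

h = embed(np.array([1, 2, 3]))
out = logits(h)
```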
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Model configuration uses 9 layers, 512 model dimension, 8 attention heads, 4 KV heads, and MLP multiplier 2.
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
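The attention shapes implied by these numbers can be checked with a little arithmetic. The sketch below assumes separate Q/K/V/O projections and an ungated two-matrix MLP, which is an assumption about this submission, not something it states.

```python
# Shapes implied by the stated config: grouped-query attention with
# 8 query heads sharing 4 KV heads, MLP hidden size = 2 x d_model.
layers, d_model, n_heads, n_kv, mlp_mult = 9, 512, 8, 4, 2
head_dim = d_model // n_heads                    # 64

q_params = d_model * (n_heads * head_dim)        # full-width queries
kv_params = 2 * d_model * (n_kv * head_dim)      # K and V are half-width
o_params = (n_heads * head_dim) * d_model
mlp_params = 2 * d_model * (mlp_mult * d_model)  # up + down projections

per_layer = q_params + kv_params + o_params + mlp_params
total_core = layers * per_layer                  # excludes embeddings, norms
```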
Quantization
int8
bits: 8
scope: model weights
Compression
zlib
level: null
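A minimal sketch of the quantize, compress, and restore roundtrip. Per-tensor symmetric quantization is an assumption; the submission states only int8 over model weights with zlib compression.

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization (assumed scheme): map the
    # largest magnitude to 127 and round the rest to int8.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

q, scale = quantize_int8(w)
artifact = zlib.compress(q.tobytes(), level=9)   # bytes written to disk

# Restore: decompress, reinterpret as int8, rescale to float.
q2 = np.frombuffer(zlib.decompress(artifact), dtype=np.int8)
w2 = dequantize(q2, scale)

max_err = float(np.abs(w - w2).max())            # bounded by ~scale/2
```

The roundtrip is lossless from the int8 artifact onward; all error comes from the initial rounding step.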

Novel Contributions

  • Legal test-time training (TTT)
  • Calibration
  • Weight tying
  • Long-context training with sequence length 2048
  • Int8 quantized roundtrip submission
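The "legal" in legal test-time training means the model may update on test tokens it has already scored, since each prediction depends only on the preceding context. The submission's actual procedure (presumably gradient updates to the transformer) is not specified; the idea can be illustrated with a tiny adaptive byte-level model:

```python
import math
from collections import Counter

# Toy illustration of legal TTT: an add-alpha unigram model over bytes
# that "trains" on each test byte only AFTER scoring it, so no future
# information leaks into any prediction.
def bpb_with_ttt(data: bytes, alpha: float = 1.0) -> float:
    counts = Counter()
    total_bits, seen = 0.0, 0
    for b in data:
        # Score byte b using counts over bytes 0..i-1 only.
        p = (counts[b] + alpha) / (seen + 256 * alpha)
        total_bits += -math.log2(p)
        # Update on the byte after it has been scored.
        counts[b] += 1
        seen += 1
    return total_bits / len(data)

uniform = bpb_with_ttt(bytes(range(256)))  # no repetition to exploit
skewed = bpb_with_ttt(b"a" * 1000)         # adapts quickly, bpb well below 8
```

On the repetitive stream the model adapts within a few bytes and the cost per byte drops far below the 8 bits of a uniform prior, which is the effect TTT aims for on real test text.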