PR #976 (open): Add 1.20 BPB submission with Legal TTT and Calibration (9L/448D)
by Vibes-me
val_bpb: 1.2058
Architecture: Transformer
Optimizer: —
Artifact Size: 15.87 MB
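The headline metric is bits per byte. As a hedged aside (the card does not say how val_bpb is computed), the usual convention is mean cross-entropy per byte of raw text, converted from nats to bits by dividing by ln 2; the snippet below is only that conversion, not the submission's evaluation code.

```python
import math

def nats_per_byte_to_bpb(loss_nats_per_byte: float) -> float:
    """Convert mean cross-entropy in nats per byte to bits per byte.

    Assumes the loss is already normalized per byte; if the model is
    token-level, the loss must first be rescaled by tokens/bytes.
    """
    return loss_nats_per_byte / math.log(2)

# A validation loss of ~0.8358 nats/byte corresponds to ~1.2058 bpb.
```

Under this convention, the reported 1.2058 bpb corresponds to a per-byte cross-entropy of about 0.836 nats.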
Training Techniques
Architecture
- weight tying: input and output embeddings are tied (parameters: null)
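Weight tying means one vocabulary matrix serves both the input embedding lookup and the output logit projection, halving the vocabulary-related parameter count. The pure-Python sketch below is a hypothetical stand-in for the framework idiom, not the PR's implementation:

```python
# Minimal weight-tying sketch (illustrative only): a single
# vocab_size x dim matrix is used for both directions.

class TiedEmbeddings:
    def __init__(self, vocab_size: int, dim: int):
        # One shared matrix; there is no separate output projection.
        self.weight = [[0.0] * dim for _ in range(vocab_size)]

    def embed(self, token_id: int) -> list[float]:
        # Input side: look up the embedding row for a token.
        return self.weight[token_id]

    def logits(self, hidden: list[float]) -> list[float]:
        # Output side: dot the hidden state with every embedding row.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.weight]
```

In PyTorch this is typically a single assignment such as `lm_head.weight = tok_emb.weight`, so both modules literally share one tensor.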
Sequence Length
- sequence_length: train_length = 2048, eval_length = null
Other
- other: 9-layer, 512-dimension Transformer with 8 attention heads, 4 KV heads, and MLP multiplier 2
  (parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2})
Quantization
- int8: bits = 8, scope = model weights (parameters: null)
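The card specifies int8 over model weights but not the scheme, so the sketch below shows one common choice, symmetric per-tensor quantization; the scale, rounding, and per-tensor granularity are all assumptions, not details from the PR.

```python
# Hedged sketch of symmetric per-tensor int8 weight quantization.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    # Map the largest-magnitude weight to +/-127; zero maps to zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q: list[int], scale: float) -> list[float]:
    # Lossy inverse: each weight is recovered to within one scale step.
    return [v * scale for v in q]
```

The dequantized weights differ from the originals by at most one quantization step, which is the per-weight error the 8-bit setting trades for a 4x size reduction versus float32.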
Compression
- zlib: level = null
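The contributions list names an "int8 + zlib roundtrip", i.e. the serialized int8 weights are zlib-compressed into the artifact and must decompress back to identical bytes. A minimal sketch, reading the card's null level as zlib's default (an assumption):

```python
import zlib

def pack(q_weights: bytes) -> bytes:
    # Lossless compression of the serialized int8 weight bytes.
    return zlib.compress(q_weights)  # default compression level

def unpack(blob: bytes) -> bytes:
    # Must reproduce the exact input bytes (roundtrip property).
    return zlib.decompress(blob)
```

Because zlib is lossless, the only accuracy cost in the pipeline comes from the int8 quantization step; compression affects artifact size alone.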
Novel Contributions
- Legal TTT
- Calibration
- weight tying
- Long context training at sequence length 2048
- int8 + zlib roundtrip submission