val_bpb: 1.1057
Architecture: Transformer
Optimizer: —
Artifact Size: 15,631,603 B (≈14.9 MiB)
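For context, val_bpb reports validation loss in bits per byte. A minimal sketch of the conversion, assuming the model's mean cross-entropy is measured in nats per token (the token and byte counts below are hypothetical, not from this submission):

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte."""
    total_bits = nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical run: 1.5 nats/token over 1000 tokens covering 2826 bytes.
print(bits_per_byte(1.5, 1000, 2826))
```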
Training Techniques
Architecture
GQA: grouped-query attention with 8 query heads and 4 KV heads (2 query heads per KV head).
parameters: {"query_heads": 8, "kv_heads": 4}
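The 8-query/4-KV configuration can be sketched as follows; the single-matrix projections, toy width, and weight shapes are illustrative assumptions, not the submission's actual code:

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Causal grouped-query attention: n_q_heads query heads share
    n_kv_heads KV heads (here 8 and 4, i.e. 2 queries per KV head)."""
    T, d = x.shape
    hd = d // n_q_heads                          # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)              # expand KV heads to match Q
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('tqd,sqd->qts', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)        # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax over keys
    return np.einsum('qts,sqd->tqd', w, v).reshape(T, d)

rng = np.random.default_rng(0)
d = 32                                           # toy width: 8 heads x 4 dims
x = rng.standard_normal((6, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, d // 2))            # KV projections are half width
wv = rng.standard_normal((d, d // 2))
out = gqa(x, wq, wk, wv)
```

The KV projections produce half as many channels as the query projection, which is the memory saving GQA buys at inference time.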
BigramHash: hashed bigram context features, 2048 dimensions.
parameters: {"dimensions": 2048}
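One plausible reading of BigramHash is an auxiliary feature stream that hashes each (previous, current) token pair into a 2048-bucket table; a sketch under that assumption (the multiplier constant and BOS placeholder are hypothetical, and the submission's actual hashing scheme is not specified in the card):

```python
def bigram_hash_ids(tokens, dims=2048, mult=1_000_003):
    """Map each (prev, cur) token pair to a bucket in [0, dims).

    The bucket id could index an auxiliary embedding table whose
    output is added to the regular token embedding.
    """
    ids, prev = [], 0                    # 0 acts as a BOS placeholder
    for cur in tokens:
        ids.append(((prev * mult) ^ cur) % dims)
        prev = cur
    return ids

print(bigram_hash_ids([5, 17, 17, 3]))
```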
RoPE: rotary position embeddings over 16 dimensions.
parameters: {"dimensions": 16}
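"RoPE with 16 dimensions" most naturally reads as partial RoPE: only the first 16 dimensions of each vector are rotated, and the rest pass through unchanged. A sketch under that assumption (base 10000 is the conventional default, not stated in the card):

```python
import numpy as np

def rope_partial(x, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of each position's vector by
    position-dependent angles; remaining dims are left untouched."""
    T, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)          # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]       # paired channels
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
out = rope_partial(x)
```

Because each channel pair is rotated by a pure rotation, the norm of the rotated slice is preserved, and position 0 (angle 0) is left unchanged.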
XSA: XSA applied to the last 11 layers.
parameters: {"layers": 11}
Quantization
mixed-precision integer: int5/int6 for the core weights, int8 elsewhere
bits: null (set per scope below)
scope: attn=int5, mlp=int6, aux=int6, embed=int8, other=int8
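A minimal sketch of what per-scope symmetric quantization could look like; the round-to-nearest scheme, per-tensor scaling, and int8 storage are assumptions, since the card only specifies bit widths per scope:

```python
import numpy as np

# Bit widths per weight scope, as listed in the card.
SCOPE_BITS = {"attn": 5, "mlp": 6, "aux": 6, "embed": 8, "other": 8}

def quantize(w, bits):
    """Symmetric per-tensor quantize: round w/scale into a signed
    `bits`-bit range, stored in int8 (all widths here fit in int8)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize(w, SCOPE_BITS["attn"])   # int5 for attention weights
w_hat = dequantize(q, s)
```

With round-to-nearest, the per-element reconstruction error is bounded by half a quantization step.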
Compression: Brotli
level: null
Sequence Length
train_length: 2048
eval_length: null
Evaluation: sliding-window evaluation
parameters: null
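The card only says "sliding window eval"; a common version scores each window's fresh tokens while earlier tokens in the window serve as context. A sketch under that assumption (`nll_fn`, the window size, and the stride are hypothetical — the card gives train_length 2048 but leaves eval_length null):

```python
def sliding_window_eval(nll_fn, tokens, window=2048, stride=512):
    """Mean NLL over `tokens`, scoring each token exactly once.

    `nll_fn(chunk)` must return one negative log-likelihood per token
    in `chunk` (hypothetical interface). Each window re-reads up to
    `window - stride` already-scored tokens as context, but only the
    not-yet-scored tokens contribute to the average.
    """
    total, count, scored = 0.0, 0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + window, len(tokens))
        nll = nll_fn(tokens[start:end])
        fresh = end - max(scored, start)     # tokens not yet scored
        if fresh > 0:
            total += sum(nll[-fresh:])
            count += fresh
            scored = end
        if end == len(tokens):
            break
    return total / count
```

With stride < window, every token after the first window keeps at least `window - stride` tokens of left context, at the cost of re-running the model over the overlap.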
Novel Contributions
- 12-layer Rascal II decoder submission
- Added a 12th layer while staying under the 16 MB artifact cap
- Used mixed-integer quantization across attention, MLP, auxiliary, embedding, and other weights
- Shipped a Brotli-compressed mixed-precision checkpoint artifact
- Combined GQA, BigramHash, RoPE-16, and XSA on the last 11 layers