PR #1458

open

Midnight 12L — 1.10567949 val_bpb (seed 444)

by newjordan
val_bpb: 1.1057
Architecture: Transformer
Optimizer:
Artifact Size: 15631603 B

Training Techniques

Architecture
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
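With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A minimal sketch of that sharing pattern (illustrative shapes and names, not the submission's code):

```python
import numpy as np

def gqa_scores(q, k):
    """Grouped-query attention scores.

    q: (n_q_heads, seq, d), k: (n_kv_heads, seq, d),
    with n_q_heads divisible by n_kv_heads.
    """
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv                      # query heads per KV head (here 2)
    k_rep = np.repeat(k, group, axis=0)      # share each KV head across its group
    d = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)   # (n_q_heads, seq, seq)

# 8 query heads, 4 KV heads, as in this submission
scores = gqa_scores(np.random.randn(8, 16, 32), np.random.randn(4, 16, 32))
```

The memory saving comes from storing and caching only 4 KV heads while keeping 8 query heads' worth of attention patterns.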
BigramHash
Bigram-2048 context features.
parameters: {"dimensions":2048}
RoPE
RoPE with 16 dimensions.
parameters: {"dimensions":16}
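"RoPE with 16 dimensions" suggests partial rotary embedding: rotate only the first 16 dims of each head vector and pass the rest through. A sketch assuming the standard 10000^(-2i/d) frequency schedule:

```python
import numpy as np

ROPE_DIMS = 16  # matches the PR's "dimensions": 16

def apply_rope(x, pos):
    """x: (seq, head_dim); rotates x[:, :ROPE_DIMS] by position-dependent angles."""
    half = ROPE_DIMS // 2
    freqs = 10000.0 ** (-np.arange(half) / half)   # (half,) frequency schedule
    angles = np.outer(pos, freqs)                  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROPE_DIMS]
    rot = np.concatenate([x1 * cos - x2 * sin,     # 2-D rotation per dim pair
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, ROPE_DIMS:]], axis=-1)  # tail unrotated

seq, head_dim = 8, 32
x_in = np.random.randn(seq, head_dim)
out = apply_rope(x_in, np.arange(seq))
```

Rotating only a low-dimensional prefix keeps positional information cheap while leaving most of each head content-addressed.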
XSA
XSA applied on the last 11 layers.
parameters: {"layers":11}
Quantization
mixed int5/int6
bits: null
scope: attn=int5, mlp=int6, aux=int6, embed=int8, other=int8
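The exact quantization scheme (per-tensor vs. per-channel, symmetric vs. affine) is not stated in the PR; a minimal sketch assuming symmetric per-tensor quantization at the listed bit widths:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q5, s5 = quantize(w, bits=5)                   # attn-style int5 per the scope line
err = np.abs(dequantize(q5, s5) - w).max()     # bounded by ~scale / 2
```

Using fewer bits for attention weights (int5) than MLP weights (int6), with int8 for embeddings, trades per-block precision against the 16 MB artifact cap.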
Compression
Brotli
level: null
Sequence Length
train_length: 2048
eval_length: null
Evaluation
sliding window eval
parameters: null
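The PR lists no parameters for the sliding-window eval. The usual scheme slides a fixed-size context over the token stream and scores only the tokens not covered by the previous window, so each token is evaluated once with long context. Window and stride values below are illustrative assumptions:

```python
def sliding_windows(n_tokens, window=2048, stride=1024):
    """Yield (start, end, score_from) spans covering positions 0..n_tokens-1.

    In each window [start, end), only tokens [score_from, end) are scored;
    tokens [start, score_from) serve purely as context.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(5000)
```

Each token is scored exactly once, and every token after the first window sees at least `window - stride` tokens of context.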

Novel Contributions

  • 12-layer Rascal II decoder submission
  • Added a 12th layer while staying under the 16 MB artifact cap
  • Used mixed-int quantization across attention, MLP, auxiliary, embedding, and other weights
  • Compressed the quantized checkpoint artifact with Brotli
  • Combined GQA, BigramHash, RoPE-16, and XSA on the last 11 layers