val_bpb: 1.1057
Architecture: Transformer
Optimizer: —
Artifact Size: 15,631,603 B (≈14.9 MiB)
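For context, val_bpb reports validation loss in bits per byte. A minimal sketch of the conversion, assuming the model's mean cross-entropy is measured in nats per token (the token and byte counts below are hypothetical, not from this submission):

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte."""
    total_bits = nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical run: 1.5 nats/token over 1000 tokens covering 2826 bytes.
print(bits_per_byte(1.5, 1000, 2826))
```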
Training Techniques
Architecture
GQA: grouped-query attention with 8 query heads and 4 KV heads (2 query heads per KV head).
parameters: {"query_heads": 8, "kv_heads": 4}
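The 8-query/4-KV configuration can be sketched as follows; the single-matrix projections, toy width, and weight shapes are illustrative assumptions, not the submission's actual code:

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Causal grouped-query attention: n_q_heads query heads share
    n_kv_heads KV heads (here 8 and 4, i.e. 2 queries per KV head)."""
    T, d = x.shape
    hd = d // n_q_heads                          # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)              # expand KV heads to match Q
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('tqd,sqd->qts', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)        # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax over keys
    return np.einsum('qts,sqd->tqd', w, v).reshape(T, d)

rng = np.random.default_rng(0)
d = 32                                           # toy width: 8 heads x 4 dims
x = rng.standard_normal((6, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, d // 2))            # KV projections are half width
wv = rng.standard_normal((d, d // 2))
out = gqa(x, wq, wk, wv)
```

The KV projections produce half as many channels as the query projection, which is the memory saving GQA buys at inference time.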
BigramHash: hashed bigram context features, 2048 dimensions.
parameters: {"dimensions": 2048}
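One plausible reading of BigramHash is an auxiliary feature stream that hashes each (previous, current) token pair into a 2048-bucket table; a sketch under that assumption (the multiplier constant and BOS placeholder are hypothetical, and the submission's actual hashing scheme is not specified in the card):

```python
def bigram_hash_ids(tokens, dims=2048, mult=1_000_003):
    """Map each (prev, cur) token pair to a bucket in [0, dims).

    The bucket id could index an auxiliary embedding table whose
    output is added to the regular token embedding.
    """
    ids, prev = [], 0                    # 0 acts as a BOS placeholder
    for cur in tokens:
        ids.append(((prev * mult) ^ cur) % dims)
        prev = cur
    return ids

print(bigram_hash_ids([5, 17, 17, 3]))
```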
RoPE: rotary position embeddings over 16 dimensions.
parameters: {"dimensions": 16}
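"RoPE with 16 dimensions" most naturally reads as partial RoPE: only the first 16 dimensions of each vector are rotated, and the rest pass through unchanged. A sketch under that assumption (base 10000 is the conventional default, not stated in the card):

```python
import numpy as np

def rope_partial(x, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of each position's vector by
    position-dependent angles; remaining dims are left untouched."""
    T, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)          # (T, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]       # paired channels
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
out = rope_partial(x)
```

Because each channel pair is rotated by a pure rotation, the norm of the rotated slice is preserved, and position 0 (angle 0) is left unchanged.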
XSA: XSA applied to the last 11 layers.
parameters: {"layers": 11}
Quantization
mixed-precision integer: int5/int6 for the core weights, int8 elsewhere
bits: null (set per scope below)
scope: attn=int5, mlp=int6, aux=int6, embed=int8, other=int8
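A minimal sketch of what per-scope symmetric quantization could look like; the round-to-nearest scheme, per-tensor scaling, and int8 storage are assumptions, since the card only specifies bit widths per scope:

```python
import numpy as np

# Bit widths per weight scope, as listed in the card.
SCOPE_BITS = {"attn": 5, "mlp": 6, "aux": 6, "embed": 8, "other": 8}

def quantize(w, bits):
    """Symmetric per-tensor quantize: round w/scale into a signed
    `bits`-bit range, stored in int8 (all widths here fit in int8)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize(w, SCOPE_BITS["attn"])   # int5 for attention weights
w_hat = dequantize(q, s)
```

With round-to-nearest, the per-element reconstruction error is bounded by half a quantization step.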
Compression: Brotli
level: null
Sequence Length
train_length: 2048
eval_length: null
Evaluation: sliding-window evaluation
parameters: null
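The card only says "sliding window eval"; a common version scores each window's fresh tokens while earlier tokens in the window serve as context. A sketch under that assumption (`nll_fn`, the window size, and the stride are hypothetical — the card gives train_length 2048 but leaves eval_length null):

```python
def sliding_window_eval(nll_fn, tokens, window=2048, stride=512):
    """Mean NLL over `tokens`, scoring each token exactly once.

    `nll_fn(chunk)` must return one negative log-likelihood per token
    in `chunk` (hypothetical interface). Each window re-reads up to
    `window - stride` already-scored tokens as context, but only the
    not-yet-scored tokens contribute to the average.
    """
    total, count, scored = 0.0, 0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + window, len(tokens))
        nll = nll_fn(tokens[start:end])
        fresh = end - max(scored, start)     # tokens not yet scored
        if fresh > 0:
            total += sum(nll[-fresh:])
            count += fresh
            scored = end
        if end == len(tokens):
            break
    return total / count
```

With stride < window, every token after the first window keeps at least `window - stride` tokens of left context, at the cost of re-running the model over the overlap.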
Novel Contributions
- 12-layer Rascal II decoder submission
- Added a 12th layer while staying under the 16 MB artifact cap
- Used mixed-integer quantization across attention, MLP, auxiliary, embedding, and other weights
- Shipped a Brotli-compressed mixed-precision checkpoint artifact
- Combined GQA, BigramHash, RoPE-16, and XSA on the last 11 layers