val_bpb: 0.9417
Architecture: Transformer
Optimizer: —
Artifact Size: 15,868,157 bytes
Training Techniques
Architecture
XSA
XSA is active on all layers in the Scylla attention/eval path.
parameters: {"layers":11}
weight tying
The token embedding and output (LM head) projection share one weight matrix.
parameters: null
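A minimal sketch of what weight tying looks like in PyTorch; the module and dimension names are illustrative, not the submission's actual code.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy LM showing tied embeddings (illustrative only)."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the embedding matrix,
        # so only one (vocab_size x d_model) tensor is serialized.
        self.lm_head.weight = self.embed.weight
```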
depth recurrence
Layers 3-5 are reused as virtual layers after 35% of training, adding effective depth without increasing the serialized parameter count.
parameters: {"layers":[3,4,5],"enable_after_training_frac":0.35}
BigramHash
The bigram embedding dimension is reduced to 40 to create artifact-size headroom while retaining the quality gains.
parameters: {"dimensions":40}
Quantization
GPTQ
bits: 6
scope: all
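The submission applies GPTQ at 6 bits across all layers. The snippet below is only a simplified per-group round-to-nearest 6-bit quantizer to illustrate the storage format a 6-bit scheme targets; the actual GPTQ algorithm additionally compensates quantization error column by column using second-order (Hessian) information.

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor, group_size: int = 128):
    """Simplified round-to-nearest 6-bit quantization (not GPTQ itself).

    Assumes w has shape (out_features, in_features) with in_features divisible
    by group_size; returns integer codes in [0, 63] plus per-group scale/zero.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 63.0  # 2**6 - 1 quantization levels
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 63).to(torch.uint8)
    return q, scale, zero
```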
Compression
lzma
level: null
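Final artifact compression uses LZMA; a minimal sketch with Python's standard-library `lzma` module. The file names are placeholders, and `level: null` is taken to mean the library's default preset.

```python
import lzma

# Compress the serialized model artifact with the default LZMA preset.
with open("model_artifact.bin", "rb") as f:
    raw = f.read()
compressed = lzma.compress(raw)  # preset left at the library default
with open("model_artifact.bin.xz", "wb") as f:
    f.write(compressed)
print(f"{len(raw):,} -> {len(compressed):,} bytes")
```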
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
The QK gain is initialized to 5.25.
parameters: {"qk_gain_init":5.25}
other
Training runs on 8x H100 SXM GPUs and stops on a wallclock limit of about 10 minutes (600 seconds).
parameters: {"gpus":8,"hardware":"H100 SXM","time_limit_seconds":600}
Novel Contributions
- Scylla tokenizer and data path with correct HF tokenizer metadata
- QK-Gain 5.25 configuration
- 3-layer depth recurrence over layers 3-5
- Reduced bigram dimension to 40 to fit within the artifact cap
- Submission artifact fully quantized with GPTQ int6 and compressed with LZMA
- No test-time training (TTT) or eval-time adaptation