PR #937 (open)
[Non-Record Submission] CompressedUT CE + EMA Export + Export-Aligned Late QAT (1.4457 BPB)
by mihir-s-05
val_bpb: 1.4457
Architecture: Transformer
Optimizer: —
Artifact Size: 14,707,311 bytes
Training Techniques
Architecture
BigramHash
Uses hashed bigram features in the byte-level compressed_ut model.
parameters: {"dimensions":96}
Partial RoPE
Uses partial rotary position encoding in the transformer backbone.
parameters: {"dimensions":32}
weight tying
Not explicitly stated in the PR body, and no evidence of its use is present.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
late QAT
bits: 6
scope: exported artifact
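Late QAT at 6 bits means the final stretch of training runs the forward pass through the same quantize/dequantize rounding the exported artifact will see. A minimal fake-quantization sketch, assuming symmetric quantization with a ±31 integer range (the PR states only "bits: 6"; in real training gradients would flow through via a straight-through estimator):

```python
def fake_quant_int6(w: list[float], clip: float) -> list[float]:
    """Symmetric 6-bit fake quantization: clip, quantize to integers in
    [-31, 31], then dequantize, so the forward pass sees export rounding."""
    levels = 2 ** 5 - 1          # 31; hypothetical symmetric int6 range
    scale = clip / levels
    out = []
    for x in w:
        q = round(max(-clip, min(clip, x)) / scale)
        out.append(q * scale)
    return out
```

"Export-aligned" here means this fake quantizer matches the packer's quantizer exactly, so there is no train/export mismatch left to pay for at evaluation time.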
Compression
zlib
level: 9
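The artifact byte stream is compressed with zlib at level 9 (maximum compression). A minimal sketch using Python's standard zlib module; the function names are illustrative, not the PR's:

```python
import zlib

def pack_artifact(payload: bytes, level: int = 9) -> bytes:
    """Compress the packed weight bytes at zlib's maximum level."""
    return zlib.compress(payload, level)

def unpack_artifact(blob: bytes) -> bytes:
    """Lossless round-trip back to the original payload."""
    return zlib.decompress(blob)
```

Level 9 trades compression speed for size, which matters here since the artifact must fit a fixed byte budget.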
Sequence Length
sequence_length
train_length: 1536
eval_length: 32768
Evaluation
full validation eval
parameters: {"scope":"full FineWeb validation split"}
Other
other
Export-aligned quantization-aware training to match the quantizer used at artifact export.
parameters: {"threshold":0.05}
Novel Contributions
- EMA export weights for the shipped artifact
- Export-aligned late QAT to reduce quantization gap
- Stronger int6 clip-search during packing
- Larger compressed-UT capacity within the 16MB budget
- CE-only training for the compressed_ut path
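The "stronger int6 clip-search during packing" bullet refers to choosing, per tensor, a clipping threshold for the 6-bit quantizer. A minimal sketch of one common approach, minimizing reconstruction MSE over candidate clips (the PR's actual search criterion and candidate grid are not stated):

```python
def clip_search(w: list[float], candidates: list[float]) -> float:
    """Pick the clip value minimizing MSE of symmetric int6 fake quantization.
    Clipping outliers costs error on those values but shrinks the step size
    for the bulk of the distribution."""
    best_clip, best_err = candidates[0], float("inf")
    for clip in candidates:
        scale = clip / 31  # symmetric int6: integer range [-31, 31] (assumed)
        err = 0.0
        for x in w:
            q = round(max(-clip, min(clip, x)) / scale)
            err += (x - q * scale) ** 2
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```

A finer candidate grid or a better error criterion makes the search "stronger" at the cost of packing time only, since the search runs once at export.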