| Field | Value |
| --- | --- |
| val_bpb | 1.2058 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,538,222 bytes |
Training Techniques
Architecture
**weight tying**: Tied input and output embeddings. No parameters.
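A minimal PyTorch sketch of the tying step; the class and attribute names are illustrative, not taken from the artifact.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    # Hypothetical module showing weight tying; names are not from the run.
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie input and output embeddings: both layers now share a single
        # parameter tensor, so both gradients accumulate into the same weights.
        self.lm_head.weight = self.embed.weight
```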
**GQA**: Grouped-query attention with fewer KV heads than query heads (`num_heads: 8`, `num_kv_heads: 4`).
**attention modification**: Depth-scheduled local/global attention pattern, with local windows followed by full attention (`pattern: "40,80,full"`).
**SwiGLU**: Standard MLP blocks replaced with SwiGLU feedforward blocks (`mlp_mult: 1.625`).
**sequence packing**: Randomized sequence packing with synchronized per-step offsets across DDP ranks. No parameters.
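A sketch of one way to synchronize offsets without communication: derive each step's offset from (seed, step) alone so every rank computes the same value. The run's actual mechanism is not described beyond the summary above, so the names and the sharding rule here are assumptions.

```python
import torch

def synchronized_offset(step, max_offset, seed=1234):
    # Same (seed, step) on every DDP rank -> same offset, no collective needed.
    g = torch.Generator().manual_seed(seed + step)
    return torch.randint(0, max_offset, (1,), generator=g).item()

def get_batch(tokens, step, rank, world_size, batch_size, seq_len):
    # tokens: 1-D packed token stream. All ranks share one random offset,
    # then stride through the stream by rank so shards never overlap.
    off = synchronized_offset(step, max_offset=seq_len)
    span = batch_size * seq_len
    start = off + rank * span
    x = tokens[start : start + span].view(batch_size, seq_len)
    y = tokens[start + 1 : start + 1 + span].view(batch_size, seq_len)
    return x, y
```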
Quantization
**mixed int6/int8**: `attn.proj.weight` quantized to 6 bits; all other weights to 8 bits.
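A round-to-nearest symmetric quantizer matching the reported bit split. Per-tensor scaling, the scale format, and the int6 storage container are all assumptions.

```python
import torch

def quantize_symmetric(w, bits):
    # Symmetric per-tensor quantization (an assumption; the run may use
    # per-channel scales or a different rounding scheme).
    qmax = 2 ** (bits - 1) - 1           # 31 for int6, 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale       # int6 values still fit in an int8 container

def quantize_weights(state_dict):
    # attn.proj.weight at 6 bits, everything else at 8, per the reported scope.
    return {
        name: quantize_symmetric(w, 6 if name.endswith("attn.proj.weight") else 8)
        for name, w in state_dict.items()
    }
```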
Sequence Length
**sequence_length**: train_length 2048; eval_length not specified.
LR Schedule
**warmdown** (`warmdown_frac: 0.75`).
Other
- SP-1536 tokenizer and dataset variant `fineweb10B_sp1536` (`vocab_size: 1536`).
- Periodic validation every 1000 steps on the full validation split (`val_every_steps: 1000`).
Novel Contributions
- Depth-scheduled local/global attention transformer
- SwiGLU feedforward blocks
- Randomized sequence packing with synchronized offsets across DDP ranks
- Selective mixed-bit quantization with int6 attention output projections
- SP-1536 tokenizer and dataset variant