PR #981
Non-record: Sliding Patch Attention + MoE (2-layer compact run)
by BurguerJohn
val_bpb
1.4893
Architecture
Transformer
Optimizer
—
Artifact Size
3938328 bytes (~3.9 MB)
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
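Weight tying reuses one matrix for both the input embedding lookup and the output logit projection. A minimal sketch (vocabulary and model sizes are hypothetical, not taken from this submission):

```python
import numpy as np

vocab, d = 50304, 64                   # hypothetical sizes for illustration
W = np.random.randn(vocab, d) * 0.02   # single shared embedding matrix

def embed(ids):
    return W[ids]                      # input side: row lookup

def unembed(h):
    return h @ W.T                     # output side: logits via the same matrix
```

Tying halves the embedding-related parameter count, which matters for a ~3.9 MB artifact.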
GQA
Uses grouped-query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
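With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A minimal numpy sketch of that grouping (the real attention kernel in the PR may differ):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention sketch: q is (num_heads, T, d),
    k and v are (num_kv_heads, T, d); head counts match the card."""
    group = num_heads // num_kv_heads        # query heads per KV head -> 2
    k = np.repeat(k, group, axis=0)          # broadcast KV heads up to 8
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)            # softmax over key positions
    return w @ v                             # (num_heads, T, d)
```

The KV cache shrinks by the group factor (2x here) at inference time.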
U-Net skip connections
Includes encoder/decoder-style skip connections in the experimental branch.
parameters: null
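The card only notes that encoder/decoder-style skips exist in the experimental branch; one common layout, sketched here as an assumption, saves each encoder activation and adds it back to the mirrored decoder block:

```python
def unet_stack(x, enc_blocks, dec_blocks):
    """Hypothetical U-Net-style skip layout: encoder activations are
    saved and re-added to decoder inputs in reverse order."""
    skips = []
    for f in enc_blocks:            # first half: run and stash activations
        x = f(x)
        skips.append(x)
    for f in dec_blocks:            # second half: add skips back, last-in first-out
        x = f(x + skips.pop())
    return x
```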
attention modifications
Experimental sliding-patch attention and router-path attention variants are present in the codebase.
parameters: null
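The PR does not define "sliding patch attention" precisely; one plausible reading, sketched purely as an assumption, is a causal mask that restricts each token to a local window (patch) of recent positions:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Hypothetical interpretation: token i attends only to the
    `window` most recent positions j with j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)   # boolean (seq_len, seq_len) mask
```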
MoE
Mixture-of-experts routing code paths are included, but the logged run reports moe_layers: 0/2, so MoE was inactive in the measured submission.
parameters: {"moe_layers":0,"total_layers":2}
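Since moe_layers is 0/2, the routing path was never exercised in the measured run; the sketch below shows a generic top-1 router of the kind such code paths implement, not the PR's actual implementation:

```python
import numpy as np

def moe_layer(x, experts_w, router_w):
    """Top-1 MoE routing sketch: a softmax router picks one expert
    per token and scales its output by the routing probability."""
    logits = x @ router_w                     # (T, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    choice = probs.argmax(-1)                 # chosen expert per token
    out = np.empty_like(x)
    for e, w in enumerate(experts_w):
        mask = choice == e                    # tokens routed to expert e
        out[mask] = (x[mask] @ w) * probs[mask, e:e+1]
    return out
```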
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Compression
zlib
level: null
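The compression level is not recorded on the card. A minimal sketch of the round trip, assuming the artifact is a zlib-compressed weight blob (the fp16 dtype and array size here are illustrative, not from the submission):

```python
import zlib
import numpy as np

weights = np.zeros(1000, dtype=np.float16)    # hypothetical quantized weights
raw = weights.tobytes()
packed = zlib.compress(raw)                   # level unspecified in the card
restored = np.frombuffer(zlib.decompress(packed), dtype=np.float16)
```

zlib is lossless, so the artifact decompresses to bit-identical weights.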
Novel Contributions
- Sliding patch attention in the experimental training script
- Mixture-of-experts/router code paths included in the branch
- Compact 2-layer non-record run on a single H100
- Tied-embedding compact baseline submission with post-quantization export