PR #1307 (open)

Add 07c1 strict RunPod base submission

by amrayach
val_bpb: 1.1101
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.73 MB

Training Techniques

Architecture
weight tying
Tied embeddings are used in the model.
parameters: null
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
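The GQA setting above (8 attention heads sharing 4 KV heads, so each KV head serves 2 query heads) can be sketched in NumPy. Shapes, names, and the plain softmax are illustrative, not taken from the PR:

```python
import numpy as np

def gqa(q, k, v, kv_heads):
    # q: (n_q_heads, seq, d); k, v: (kv_heads, seq, d)
    n_q, seq, d = q.shape
    group = n_q // kv_heads
    # Expand each KV head to cover its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))
k = rng.standard_normal((4, 16, 32))
v = rng.standard_normal((4, 16, 32))
out = gqa(q, k, v, kv_heads=4)  # shape (8, 16, 32)
```

The memory saving comes from storing only 4 KV heads in the cache while still computing 8 query projections.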
LeakyReLU
Leaky ReLU squared MLP activation is used.
parameters: {"slope":0.5}
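A minimal sketch of a leaky-ReLU-squared MLP activation with the listed slope of 0.5. Note this reading simply squares the LeakyReLU output, which discards the sign on the negative branch; sign-preserving variants also exist, and the PR's exact form is not specified here:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU followed by squaring (assumed form; see lead-in).
    x = np.asarray(x, dtype=float)
    y = np.where(x > 0, x, slope * x)
    return y * y
```

For example, `leaky_relu_sq(2.0)` gives 4.0 and `leaky_relu_sq(-2.0)` gives 1.0 (since -2 × 0.5 = -1, squared).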
BigramHash
Bigram vocabulary component used in the architecture.
parameters: {"vocab_size":5120}
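One way a bigram vocabulary component of size 5120 can work is to hash each (previous token, current token) pair into a fixed-size bigram table. The multiplicative hash below is purely illustrative; the PR's actual hash function is not given in this listing:

```python
def bigram_hash_id(prev_id, cur_id, vocab_size=5120):
    # Illustrative multiplicative hash into the bigram embedding table
    # (the constant 1000003 is an assumption, not from the PR).
    return (prev_id * 1000003 + cur_id) % vocab_size
```

The resulting index would select an extra embedding that is added to (or concatenated with) the regular token embedding.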
VE128
Value embedding dimension setting used in the model.
parameters: {"dimensions":128}
Gated Attention
Windowed attention layers are used at selected depths.
parameters: {"layers":[2,4,6,8,10],"window_size":512}
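The windowed-attention layers listed above (layers 2, 4, 6, 8, 10 with window 512) restrict each position to attend only to the most recent `window` positions. A minimal mask construction, with the full-size numbers swapped for small ones in the usage example:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal (j <= i) and within
    # the last `window` positions (i - j < window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(8, 4)
```

Layers not in the listed set would presumably use the full causal mask instead.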
Optimizer
Muon
weight_decay: null
momentum: 0.985
other_params: null
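Muon maintains a momentum buffer (0.985 here) and approximately orthogonalizes each 2-D update via a quintic Newton-Schulz iteration. The sketch below uses the coefficients from the public Muon implementation; treat them, and the fixed 5 steps, as assumptions rather than details of this PR:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (push singular values toward 1).
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X
```

In Muon the orthogonalized matrix, not the raw momentum buffer, is what gets scaled by the learning rate and applied to the weights.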
Evaluation
sliding window eval
parameters: {"stride":64}
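Sliding-window evaluation at stride 64 typically scores each token exactly once while giving it up to a full context window of history, advancing the scored region by the stride. A hedged sketch of the chunking logic (the window value is illustrative; only the stride of 64 comes from the listing):

```python
def sliding_eval_chunks(n_tokens, window, stride=64):
    """Return (ctx_start, score_start, end) triples covering every
    token once; only tokens in [score_start, end) are scored, with
    context drawn from [ctx_start, end)."""
    chunks = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        chunks.append((ctx_start, score_start, end))
        score_start = end
    return chunks
```

Summing per-token loss over the scored spans then yields a val_bpb where late tokens in each chunk are not penalized by short context.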
Test-Time Training
TTT
parameters: {"enabled":false}
Sequence Length
sequence_length
train_length: 2048
eval_length: 6144
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
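A warmdown schedule with `warmdown_iters: 4000` usually means the learning rate stays at its base value and then decays linearly to zero over the final 4000 iterations. A minimal sketch of that multiplier (the constant-then-linear shape is an assumption about this PR's exact schedule):

```python
def warmdown_lr_scale(step, total_iters, warmdown_iters=4000):
    # 1.0 until the final warmdown_iters, then linear decay to 0.
    if step < total_iters - warmdown_iters:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

For example, with 10000 total iterations the scale is 1.0 at step 0, 0.5 at step 8000, and 0.0 at the end.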
Quantization
int6
bits: 6
scope: per-row export
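Per-row int6 export typically means each weight row gets its own scale, with values rounded into a signed 6-bit range. A symmetric-quantization sketch (the symmetric scheme and the [-31, 31] clip are assumptions; the PR only states 6 bits, per-row scope):

```python
import numpy as np

def int6_per_row(w):
    qmax = 31  # symmetric signed 6-bit range, assumed [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Rounding error per element is at most half a quantization step, i.e. scale/2 for that row; the quantized tensor plus per-row scales is what would then be Brotli-compressed for the 15.73 MB artifact.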
Compression
Brotli
level: null

Novel Contributions

  • Strict RunPod H100 SXM base proof for the 07c1 line
  • Faithful reproduction of PR #1212 with eval/export hygiene fixes
  • Base-path-only submission with TTT disabled
  • Four-seed strict proof under a 598-second wallclock budget
  • Per-row int6 export with Brotli compression
  • Sliding-window evaluation at stride 64