PR #1971
Record: SP10240 SimCTG + 3-Layer Recurrence — 1.07502 sliding-window (3-seed)
by BharathSShankar
val_bpb
1.0750
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.99 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence where the encoder loops layers 3-5.
parameters: {"layers":3}
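A minimal sketch of the depth recurrence as described: the blocks covering layers 3-5 are reused in a loop rather than adding new parameters. The number of passes (`n_loops`) and the 0-indexed layer numbering are assumptions; the record only states that 3 layers are recurred.

```python
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Depth-recurrence sketch: blocks 3-5 are applied repeatedly.
    `n_loops` is an assumption -- the record does not state how many passes."""

    def __init__(self, blocks: nn.ModuleList, recur_start: int = 3,
                 recur_end: int = 5, n_loops: int = 2):
        super().__init__()
        self.blocks = blocks                    # e.g. the 11 transformer blocks
        self.recur_start, self.recur_end = recur_start, recur_end
        self.n_loops = n_loops

    def forward(self, x):
        for block in self.blocks[:self.recur_start]:        # layers before the loop
            x = block(x)
        for _ in range(self.n_loops):                        # shared 3-layer span
            for block in self.blocks[self.recur_start:self.recur_end + 1]:
                x = block(x)
        for block in self.blocks[self.recur_end + 1:]:       # layers after the loop
            x = block(x)
        return x
```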
Parallel Residuals
Parallel residual connections applied from layer 7 onward.
parameters: {"start_layer":7}
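A sketch of a parallel-residual block in the GPT-J style, assuming a pre-norm layout: attention and MLP read the same normalized input and their outputs are added to the residual stream together. Per the record, blocks from layer 7 onward take this form; earlier blocks stay sequential.

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel-residual transformer block: x + attn(norm(x)) + mlp(norm(x))."""

    def __init__(self, dim, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = attn   # any self-attention module mapping (B, T, D) -> (B, T, D)
        self.mlp = mlp     # any feed-forward module mapping (B, T, D) -> (B, T, D)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)
```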
LeakyReLU
Uses LeakyReLU(0.5)^2 as the MLP activation.
parameters: {"slope":0.5}
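The activation named in the record, written out: LeakyReLU with negative slope 0.5 followed by an elementwise square (a LeakyReLU variant of the squared-ReLU activation).

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """MLP activation from the record: LeakyReLU(0.5) squared."""
    return F.leaky_relu(x, negative_slope=slope).square()
```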
Partial RoPE
Partial rotary positional embedding: RoPE is applied to 16 of the 64 per-head dimensions; the remaining dimensions are left unrotated.
parameters: {"dimensions":16,"base_dimensions":64}
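A sketch of partial RoPE under the reading above: rotate only 16 of the 64 per-head dimensions and pass the rest through unchanged. Rotating the leading slice (rather than, say, an interleaved subset) is an assumption.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rot_dims: int = 16) -> torch.Tensor:
    """Partial RoPE sketch.
    x:        (B, H, T, 64) queries or keys
    cos, sin: standard RoPE tables, broadcastable to (T, rot_dims)
    """
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x_rot = x_rot * cos + rotate_half(x_rot) * sin   # rotate the first 16 dims
    return torch.cat((x_rot, x_pass), dim=-1)        # leave the other 48 dims alone
```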
XSA
XSA attention used on all 11 layers.
parameters: {"layers":11}
weight tying
Input and output embeddings are tied.
parameters: null
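Weight tying in its usual PyTorch form: the LM head shares its weight matrix with the token embedding, so the matrix is stored (and shipped in the artifact) only once.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Sketch of input/output embedding tying."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: same Parameter object
```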
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"iterations":5,"scope":"matrix params"}
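A sketch of the Muon update on a single weight matrix: the momentum buffer is orthogonalized with a quintic Newton-Schulz iteration (5 steps, matching the record) before being applied. The learning rate and momentum values are assumptions; the record leaves them unspecified.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95, steps=5):
    """One Muon step for a matrix parameter (lr/momentum are assumed values)."""
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf, steps=steps)
    param.add_(update, alpha=-lr)
```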
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embed/scalar"}
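A sketch of the parameter split implied by the two optimizer scopes: 2-D weight matrices are routed to Muon, while embeddings and scalar/1-D parameters go to AdamW. The exact grouping rule and the learning rates are assumptions (weight decay and betas are unspecified in the record).

```python
import torch

def split_optim_params(model: torch.nn.Module, adamw_lr: float = 3e-4):
    """Return (matrix_params for Muon, AdamW optimizer for everything else)."""
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:   # hidden weight matrices -> Muon
            matrix_params.append(p)
        else:                                     # embeddings, norms, scalars -> AdamW
            other_params.append(p)
    adamw = torch.optim.AdamW(other_params, lr=adamw_lr)
    return matrix_params, adamw
```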
Quantization
GPTQ
bits: null
scope: attention/MLP matrices and embeddings
Compression
lzma
level: null
brotli
level: null
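A rough sketch of the artifact-compression step, assuming the pipeline is serialize the (quantized) state dict, compress with both codecs, and keep the smaller blob. The compression levels are assumptions; the record lists both codecs with no level.

```python
import io
import lzma
import brotli   # pip install brotli
import torch

def compress_artifact(state_dict, path: str = "artifact.bin"):
    """Serialize and compress the model artifact; keep the smaller of lzma/brotli."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    raw = buf.getvalue()
    candidates = {
        "lzma": lzma.compress(raw, preset=9),        # preset is an assumed level
        "brotli": brotli.compress(raw, quality=11),  # quality is an assumed level
    }
    name, blob = min(candidates.items(), key=lambda kv: len(kv[1]))
    with open(path, "wb") as f:
        f.write(blob)
    return name, len(blob)
```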
Evaluation
sliding window eval
parameters: {"stride":64}
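A sketch of sliding-window evaluation with stride 64: the context window slides in steps of 64 and only the last 64 positions of each window are scored, so every scored token sees a nearly full left context. The window length and the nats-to-bits-per-byte conversion are assumptions; the record specifies only the stride.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 1024,
                       stride: int = 64, bytes_per_token: float = 1.0) -> float:
    """Sliding-window bits-per-byte over a 1-D LongTensor token stream."""
    nll, n_scored = 0.0, 0
    for start in range(0, tokens.numel() - window, stride):
        chunk = tokens[start:start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))            # (1, window, vocab)
        targets = chunk[1:].unsqueeze(0)
        loss = F.cross_entropy(logits[0, -stride:], targets[0, -stride:],
                               reduction="sum")            # score last `stride` only
        nll += loss.item()
        n_scored += stride
    nats_per_token = nll / n_scored
    return nats_per_token / math.log(2) / bytes_per_token  # bits per byte
```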
Weight Averaging
EMA
parameters: null
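A minimal EMA weight-averaging sketch; the decay value is an assumption, since the record gives no EMA parameters.

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights, updated after each step."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)   # s = decay * s + (1 - decay) * p
```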
Regularization
SimCTG
parameters: {"lambda":0.3,"margin":0.4}
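The SimCTG token-level contrastive regularizer (Su et al., 2022) added to the cross-entropy objective, with the record's lambda=0.3 and margin=0.4. The hinge pushes cosine similarities between distinct-token representations below cos(h_i, h_i) − margin.

```python
import torch
import torch.nn.functional as F

def simctg_loss(hidden: torch.Tensor, ce_loss: torch.Tensor,
                lam: float = 0.3, margin: float = 0.4) -> torch.Tensor:
    """CE + lambda * SimCTG contrastive term.
    hidden: (B, T, D) last-layer representations of the training batch."""
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.transpose(1, 2)                     # (B, T, T) cosine similarities
    diag = torch.diagonal(sim, dim1=1, dim2=2)      # cos(h_i, h_i), i.e. 1
    cl = torch.clamp(margin - diag.unsqueeze(-1) + sim, min=0.0)
    T = sim.size(1)
    eye = torch.eye(T, dtype=torch.bool, device=sim.device)
    cl = cl.masked_fill(eye, 0.0)                   # drop the i == j terms
    cl = cl.sum(dim=(1, 2)) / (T * (T - 1))         # average over ordered pairs
    return ce_loss + lam * cl.mean()
```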
Novel Contributions
- SimCTG contrastive regularizer with lambda=0.3 and margin=0.4 added to the CE objective
- 3-seed validation showing reproducible improvement over the unregularized N9 lineage
- Sliding-window baseline submission with no test-time training