PR #1971
Record: SP10240 SimCTG + 3-Layer Recurrence — 1.07502 sliding-window (3-seed)
by BharathSShankar
val_bpb
1.0750
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.99 MB
Training Techniques
Architecture
depth recurrence
3-layer recurrence where the encoder loops layers 3-5.
parameters: {"layers":3}
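A minimal sketch of the depth recurrence as described: the blocks covering layers 3-5 are reused in a loop rather than adding new parameters. The number of passes (`n_loops`) and the 0-indexed layer numbering are assumptions; the record only states that 3 layers are recurred.

```python
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Depth-recurrence sketch: blocks 3-5 are applied repeatedly.
    `n_loops` is an assumption -- the record does not state how many passes."""

    def __init__(self, blocks: nn.ModuleList, recur_start: int = 3,
                 recur_end: int = 5, n_loops: int = 2):
        super().__init__()
        self.blocks = blocks                    # e.g. the 11 transformer blocks
        self.recur_start, self.recur_end = recur_start, recur_end
        self.n_loops = n_loops

    def forward(self, x):
        for block in self.blocks[:self.recur_start]:        # layers before the loop
            x = block(x)
        for _ in range(self.n_loops):                        # shared 3-layer span
            for block in self.blocks[self.recur_start:self.recur_end + 1]:
                x = block(x)
        for block in self.blocks[self.recur_end + 1:]:       # layers after the loop
            x = block(x)
        return x
```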
Parallel Residuals
Parallel residual connections applied from layer 7 onward.
parameters: {"start_layer":7}
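A sketch of a parallel-residual block in the GPT-J style, assuming a pre-norm layout: attention and MLP read the same normalized input and their outputs are added to the residual stream together. Per the record, blocks from layer 7 onward take this form; earlier blocks stay sequential.

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel-residual transformer block: x + attn(norm(x)) + mlp(norm(x))."""

    def __init__(self, dim, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = attn   # any self-attention module mapping (B, T, D) -> (B, T, D)
        self.mlp = mlp     # any feed-forward module mapping (B, T, D) -> (B, T, D)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)
```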
LeakyReLU
Uses LeakyReLU(0.5)^2 as the MLP activation.
parameters: {"slope":0.5}
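The activation named in the record, written out: LeakyReLU with negative slope 0.5 followed by an elementwise square (a LeakyReLU variant of the squared-ReLU activation).

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """MLP activation from the record: LeakyReLU(0.5) squared."""
    return F.leaky_relu(x, negative_slope=slope).square()
```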
Partial RoPE
Partial rotary positional embedding: RoPE is applied to 16 of the 64 per-head dimensions; the remaining dimensions are left unrotated.
parameters: {"dimensions":16,"base_dimensions":64}
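A sketch of partial RoPE under the reading above: rotate only 16 of the 64 per-head dimensions and pass the rest through unchanged. Rotating the leading slice (rather than, say, an interleaved subset) is an assumption.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rot_dims: int = 16) -> torch.Tensor:
    """Partial RoPE sketch.
    x:        (B, H, T, 64) queries or keys
    cos, sin: standard RoPE tables, broadcastable to (T, rot_dims)
    """
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x_rot = x_rot * cos + rotate_half(x_rot) * sin   # rotate the first 16 dims
    return torch.cat((x_rot, x_pass), dim=-1)        # leave the other 48 dims alone
```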
XSA
XSA attention used on all 11 layers.
parameters: {"layers":11}
weight tying
Input and output embeddings are tied.
parameters: null
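Weight tying in its usual PyTorch form: the LM head shares its weight matrix with the token embedding, so the matrix is stored (and shipped in the artifact) only once.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Sketch of input/output embedding tying."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: same Parameter object
```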
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"iterations":5,"scope":"matrix params"}
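A sketch of the Muon update on a single weight matrix: the momentum buffer is orthogonalized with a quintic Newton-Schulz iteration (5 steps, matching the record) before being applied. The learning rate and momentum values are assumptions; the record leaves them unspecified.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95, steps=5):
    """One Muon step for a matrix parameter (lr/momentum are assumed values)."""
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf, steps=steps)
    param.add_(update, alpha=-lr)
```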
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embed/scalar"}
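A sketch of the parameter split implied by the two optimizer scopes: 2-D weight matrices are routed to Muon, while embeddings and scalar/1-D parameters go to AdamW. The exact grouping rule and the learning rates are assumptions (weight decay and betas are unspecified in the record).

```python
import torch

def split_optim_params(model: torch.nn.Module, adamw_lr: float = 3e-4):
    """Return (matrix_params for Muon, AdamW optimizer for everything else)."""
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:   # hidden weight matrices -> Muon
            matrix_params.append(p)
        else:                                     # embeddings, norms, scalars -> AdamW
            other_params.append(p)
    adamw = torch.optim.AdamW(other_params, lr=adamw_lr)
    return matrix_params, adamw
```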
Quantization
GPTQ
bits: null
scope: attention/MLP matrices and embeddings
Compression
lzma
level: null
brotli
level: null
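A rough sketch of the artifact-compression step, assuming the pipeline is serialize the (quantized) state dict, compress with both codecs, and keep the smaller blob. The compression levels are assumptions; the record lists both codecs with no level.

```python
import io
import lzma
import brotli   # pip install brotli
import torch

def compress_artifact(state_dict, path: str = "artifact.bin"):
    """Serialize and compress the model artifact; keep the smaller of lzma/brotli."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    raw = buf.getvalue()
    candidates = {
        "lzma": lzma.compress(raw, preset=9),        # preset is an assumed level
        "brotli": brotli.compress(raw, quality=11),  # quality is an assumed level
    }
    name, blob = min(candidates.items(), key=lambda kv: len(kv[1]))
    with open(path, "wb") as f:
        f.write(blob)
    return name, len(blob)
```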
Evaluation
sliding window eval
parameters: {"stride":64}
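A sketch of sliding-window evaluation with stride 64: the context window slides in steps of 64 and only the last 64 positions of each window are scored, so every scored token sees a nearly full left context. The window length and the nats-to-bits-per-byte conversion are assumptions; the record specifies only the stride.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 1024,
                       stride: int = 64, bytes_per_token: float = 1.0) -> float:
    """Sliding-window bits-per-byte over a 1-D LongTensor token stream."""
    nll, n_scored = 0.0, 0
    for start in range(0, tokens.numel() - window, stride):
        chunk = tokens[start:start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))            # (1, window, vocab)
        targets = chunk[1:].unsqueeze(0)
        loss = F.cross_entropy(logits[0, -stride:], targets[0, -stride:],
                               reduction="sum")            # score last `stride` only
        nll += loss.item()
        n_scored += stride
    nats_per_token = nll / n_scored
    return nats_per_token / math.log(2) / bytes_per_token  # bits per byte
```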
Weight Averaging
EMA
parameters: null
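A minimal EMA weight-averaging sketch; the decay value is an assumption, since the record gives no EMA parameters.

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights, updated after each step."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)   # s = decay * s + (1 - decay) * p
```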
Regularization
SimCTG
parameters: {"lambda":0.3,"margin":0.4}
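The SimCTG token-level contrastive regularizer (Su et al., 2022) added to the cross-entropy objective, with the record's lambda=0.3 and margin=0.4. The hinge pushes cosine similarities between distinct-token representations below cos(h_i, h_i) − margin.

```python
import torch
import torch.nn.functional as F

def simctg_loss(hidden: torch.Tensor, ce_loss: torch.Tensor,
                lam: float = 0.3, margin: float = 0.4) -> torch.Tensor:
    """CE + lambda * SimCTG contrastive term.
    hidden: (B, T, D) last-layer representations of the training batch."""
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.transpose(1, 2)                     # (B, T, T) cosine similarities
    diag = torch.diagonal(sim, dim1=1, dim2=2)      # cos(h_i, h_i), i.e. 1
    cl = torch.clamp(margin - diag.unsqueeze(-1) + sim, min=0.0)
    T = sim.size(1)
    eye = torch.eye(T, dtype=torch.bool, device=sim.device)
    cl = cl.masked_fill(eye, 0.0)                   # drop the i == j terms
    cl = cl.sum(dim=(1, 2)) / (T * (T - 1))         # average over ordered pairs
    return ce_loss + lam * cl.mean()
```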
Novel Contributions
- SimCTG contrastive regularizer with lambda=0.3 and margin=0.4 added to the CE objective
- 3-seed validation showing reproducible improvement over the unregularized N9 lineage
- Sliding-window baseline submission with no test-time training