PR #1440
[Submission] EngramLite + Mousse + Progressive Depth Recurrence + TTT — val_bpb 1.1026 | 15.95MB | 8×H100
Status: open · by Mertyandimata
val_bpb: 1.1026
Architecture: Transformer
Optimizer: Mousse
Artifact Size: 15.95MB
Training Techniques
Architecture
BigramHash
Replaced legacy BigramHash with EngramLite multi-head gated bigram+trigram hashing.
parameters: {"buckets":3072,"heads":2}
TrigramHash
Added trigram hashing as part of the EngramLite multi-order n-gram hash.
parameters: {"buckets":3072,"heads":2}
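A minimal NumPy sketch of how a hashed, gated bigram+trigram lookup of this shape could work. Bucket and head counts come from the PR; the embedding dimension, hash function, and gating form are assumptions:

```python
import numpy as np

BUCKETS, HEADS, DIM = 3072, 2, 64  # buckets/heads per the PR; DIM is an assumption

rng = np.random.default_rng(0)
# one embedding table per n-gram order (bigram, trigram) per head
tables = rng.normal(0, 0.02, size=(2, HEADS, BUCKETS, DIM))
gate_w = rng.normal(0, 0.02, size=(2, HEADS, DIM))  # per-head gate vectors (assumed form)
MASK = (1 << 64) - 1

def ngram_bucket(ids, head):
    # cheap multiplicative hash into BUCKETS, salted per head (hash choice is an assumption)
    h = head + 1
    for t in ids:
        h = (h * 1000003 + t) & MASK
    return h % BUCKETS

def engram_lite(tokens, t):
    """Gated sum of hashed bigram and trigram embeddings at position t."""
    out = np.zeros(DIM)
    for order, n in enumerate((2, 3)):  # n-gram orders: bigram, trigram
        if t + 1 < n:
            continue
        ids = tokens[t + 1 - n : t + 1]
        for head in range(HEADS):
            e = tables[order, head, ngram_bucket(ids, head)]
            g = 1.0 / (1.0 + np.exp(-gate_w[order, head] @ e))  # sigmoid gate
            out += g * e
    return out

toks = [5, 17, 17, 92]
v = engram_lite(toks, 3)  # in the real model this would be added to the residual stream
```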
depth recurrence
Repeated selected layers to increase effective depth via recurrence.
parameters: {"layers":[4,5],"effective_layers":13}
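The resulting layer schedule can be illustrated directly. An 11-layer base stack is assumed (consistent with 13 effective layers when layers 4 and 5 are each run twice); immediate re-application with shared weights is also an assumption:

```python
BASE_LAYERS = 11          # assumed base depth: 11 + 2 repeats = 13 effective layers
REPEAT = {4, 5}           # layers repeated, per the PR

order = []
for i in range(BASE_LAYERS):
    order.append(i)
    if i in REPEAT:
        order.append(i)   # run the same (weight-shared) layer again
```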
U-Net skip connections
Used U-Net-style skip connections with learned gates.
parameters: null
XSA
Applied value-orthogonal projection across all layers.
parameters: {"layers":11}
Partial RoPE
Used rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":16}
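A NumPy sketch of partial RoPE, rotating only 16 dimensions per head as listed; the head dimension of 64 and rotating the *first* dims are assumptions:

```python
import numpy as np

HEAD_DIM, ROPE_DIMS = 64, 16  # 16 rotary dims per the PR; HEAD_DIM is assumed

def partial_rope(x, pos, base=10000.0):
    """Rotate only the first ROPE_DIMS of the head dimension; pass the rest through."""
    rot, keep = x[..., :ROPE_DIMS], x[..., ROPE_DIMS:]
    half = ROPE_DIMS // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, keep], axis=-1)

q = np.random.default_rng(0).normal(size=HEAD_DIM)
q0 = partial_rope(q, pos=0)   # position 0: rotation is the identity
```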
LeakyReLU
Used LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
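A plausible reading of "LeakyReLU squared" with slope 0.5 — LeakyReLU followed by squaring; the PR does not state how signs are handled, so plain squaring is an assumption:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU MLP activation (exact sign convention is not stated in the PR)."""
    y = np.where(np.asarray(x) >= 0, x, slope * np.asarray(x))
    return y * y
```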
weight tying
Tied the input embedding and output head weights.
parameters: null
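Weight tying in a nutshell — the output head is the same array as the token embedding, so one set of parameters serves both (sizes below are placeholders):

```python
import numpy as np

VOCAB, DIM = 50257, 64      # placeholder sizes, not from the PR
embed = np.zeros((VOCAB, DIM))
lm_head = embed             # tied: logits = hidden @ lm_head.T shares embedding storage
embed[0, 0] = 1.0           # an update to the embedding is visible through the head
```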
KV head count
Used grouped key/value heads in the transformer.
parameters: {"attention_heads":8,"kv_heads":4}
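A NumPy sketch of grouped-query attention with the listed 8 query / 4 KV heads: each pair of query heads shares one K/V head (head dim and sequence length below are placeholders):

```python
import numpy as np

N_HEADS, KV_HEADS, HEAD_DIM, T = 8, 4, 16, 5  # 8 query / 4 kv heads per the PR

def grouped_attention(q, k, v):
    """Each group of N_HEADS // KV_HEADS query heads shares one K/V head (GQA)."""
    group = N_HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)  # (KV_HEADS, T, D) -> (N_HEADS, T, D)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)    # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(N_HEADS, T, HEAD_DIM))
k = rng.normal(size=(KV_HEADS, T, HEAD_DIM))
v = rng.normal(size=(KV_HEADS, T, HEAD_DIM))
out = grouped_attention(q, k, v)
```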
Optimizer
Mousse
weight_decay: 0.09
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scalar_lr":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997}
EMA
parameters: {"decay":0.995,"start_step":892}
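The EMA update itself is the standard recurrence; a dict-of-weights sketch that also handles the listed `start_step` (before which the average simply tracks the weights — an assumed convention):

```python
def ema_update(ema, weights, decay=0.997, step=0, start_step=0):
    """One EMA step over a dict of weights; before start_step, track weights exactly."""
    if step < start_step:
        return dict(weights)
    return {k: decay * ema[k] + (1.0 - decay) * w for k, w in weights.items()}

ema = ema_update({"w": 0.0}, {"w": 1.0}, decay=0.9, step=100)
```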
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
Evaluation
sliding window eval
parameters: {"stride":64}
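With stride 64 and the 1024-token eval length listed below, sliding-window eval typically scores only the last `stride` tokens of each window so every token gets near-full left context; the exact windowing scheme is an assumption:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Plan sliding-window eval: the first window scores all its tokens, each later
    window scores only its last `stride` tokens, so every token is scored once."""
    out = []
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else min(start + window - stride, end)
        out.append((start, end, score_from))
        if end == n_tokens:
            break
    return out
```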
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.01,"reset_per_chunk":0}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
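A warmdown schedule is usually a constant LR followed by a linear decay to zero over the final `warmdown_steps`; the constant phase and the total step count below are assumptions:

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Multiplier on the base LR: 1.0 until the final warmdown_steps, then linear to 0."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```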
Regularization
weight decay
parameters: {"muon_embed":0.09,"adam":0.02}
logit softcap
parameters: {"value":30}
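Logit softcapping with value 30 is conventionally a tanh squash that bounds logits to (-30, 30) while leaving small logits almost unchanged:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap); near-identity for |logits| << cap."""
    return cap * np.tanh(np.asarray(logits) / cap)
```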
Novel Contributions
- EngramLite multi-head gated bigram+trigram hash
- Mousse optimizer with diagonal curvature-aware Muon preconditioning
- Progressive Depth Recurrence with phased activation
- Score-first full-weight TTT outperforming LoRA TTT on this architecture
- Auto-QMax artifact packing
- Adaptive Markov curriculum from the previous Raki v5 approach