PR #1221 (open)
Non-record: Oscillatory Recurrence at Layer 0 (1.1915 BPB, 3-seed)
by amabito
val_bpb
1.1915
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.02 MB
Training Techniques
Architecture
TRN
Adds a Temporal Resonance Network / oscillatory recurrence block at layer 0, followed by standard Transformer layers.
parameters: {"layers":1,"oscillators":240}
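The TRN block's internals are not specified in the card; one plausible reading of "oscillatory recurrence" is a bank of damped rotating two-dimensional states driven by the input (the PR uses 240 oscillators; the sketch below uses 2). All names and the update rule here are hypothetical:

```python
import math

def oscillatory_recurrence(xs, freqs, decay=0.9):
    """Run a bank of oscillators over a 1-D input sequence.

    Each oscillator keeps a 2-D state (re, im) that is rotated by its
    frequency each step, damped, and nudged by the current input; the
    output at each step is the sum of the real parts.
    """
    states = [(0.0, 0.0) for _ in freqs]
    outputs = []
    for x in xs:
        new_states = []
        for (re, im), w in zip(states, freqs):
            c, s = math.cos(w), math.sin(w)
            # rotate the state by angle w, damp it, inject the input
            re2 = decay * (re * c - im * s) + x
            im2 = decay * (re * s + im * c)
            new_states.append((re2, im2))
        states = new_states
        outputs.append(sum(re for re, _ in states))
    return outputs
```

In the PR's setting the frequencies (and likely the decay) would be learned parameters, with one such block at layer 0 feeding the Transformer stack.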
weight tying
Tied input and output embeddings.
parameters: null
BigramHash
XOR hash of consecutive token pairs into a learned embedding table added to input embeddings.
parameters: {"buckets":2816,"dimensions":112}
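A minimal sketch of the hashing step, assuming a multiplicative bit-mixer before the XOR (the PR's exact mixing is not shown); the resulting bucket id would index a learned 112-dimensional embedding table whose rows are added to the input embeddings:

```python
def bigram_bucket(prev_tok, tok, buckets=2816):
    """Hash a consecutive token pair into one of `buckets` slots via XOR.

    The multiplier spreads prev_tok's bits before the XOR so that
    (a, b) and (b, a) rarely collide; illustrative choice only.
    """
    return ((prev_tok * 0x9E3779B1) ^ tok) % buckets

def bigram_bucket_ids(tokens, buckets=2816):
    # position 0 has no predecessor; pair it with a sentinel of 0
    return [bigram_bucket(tokens[i - 1] if i else 0, t, buckets)
            for i, t in enumerate(tokens)]
```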
Partial RoPE
Applies rotary position embedding to only part of the head dimensions.
parameters: {"dimensions":16}
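A sketch of rotating only the first 16 dimensions of each head vector, as dimensions:16 suggests; the (0,1), (2,3), ... pairing convention and the base are assumptions:

```python
import math

def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` dims only.

    Dimensions are paired (0,1), (2,3), ...; each pair is rotated by
    pos * base**(-2i/rope_dims). Remaining dimensions pass through.
    """
    out = list(vec)
    for i in range(rope_dims // 2):
        theta = pos * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Rotation preserves the norm of each pair, so the untouched tail of the head keeps position-free content channels while the rotated slice carries relative position.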
XSA
Cross-Scale Attention subtracts the value-direction projection from attention output to reduce head redundancy.
parameters: {"layers":11}
LeakyReLU
Uses leaky_relu(x, 0.5).square() as the MLP activation.
parameters: {"negative_slope":0.5}
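The activation is simple enough to state exactly:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """leaky_relu(x, 0.5) squared, the MLP activation described above."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Note that squaring makes the output nonnegative on both branches while keeping a nonzero gradient for negative inputs.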
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"every":50}
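A sketch of the running checkpoint average implied by every:50; the function name and bookkeeping are hypothetical:

```python
def swa_update(avg, weights, n_models):
    """Fold one more weight snapshot into a running average.

    avg already averages n_models snapshots; returns the average over
    n_models + 1. In training this would be called every 50 steps.
    """
    return [a + (w - a) / (n_models + 1) for a, w in zip(avg, weights)]
```

The incremental form avoids storing all snapshots: only the current average and a counter are kept between checkpoints.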
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1024}
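A sketch of the window bookkeeping implied by stride:64 and context_length:1024, following the common strided-perplexity scheme in which each token is scored exactly once, with up to a full window of preceding context:

```python
def sliding_eval_spans(n_tokens, context_length=1024, stride=64):
    """Return (window_start, score_from, window_end) spans.

    Each window gives the model up to context_length tokens; only the
    tokens from score_from onward (not scored by an earlier window)
    contribute to the reported loss.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With a small stride relative to the context length, most tokens are scored with nearly 1024 tokens of context, which lowers BPB relative to chunked evaluation at the cost of many more forward passes.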
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"backend_steps":5,"lr":0.03}
Adam
weight_decay: null
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"schedule":"cosine decay"}
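A sketch of the schedule, assuming a constant LR before the final cosine warmdown; total_steps is hypothetical (the card only gives warmdown_steps and the decay shape):

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then cosine decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    t = (step - decay_start) / warmdown_steps   # 0 -> 1 over the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```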
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Compression
zstd
level: 22
Quantization
mixed int4/int5
bits: null
scope: all
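The card calls the scheme "dynamic mixed int4/int5" but does not give the per-tensor selection rule; a sketch using symmetric per-tensor quantization with round-trip error as a stand-in criterion:

```python
def quantize(vals, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 15 for int5
    scale = max(abs(v) for v in vals) / qmax
    if scale == 0.0:
        scale = 1.0                     # all-zero tensor
    return [round(v / scale) for v in vals], scale

def roundtrip_error(vals, bits):
    q, scale = quantize(vals, bits)
    return max(abs(v - qi * scale) for v, qi in zip(vals, q))

def pick_bits(vals, tol):
    """Use int4 when its round-trip error is within tol, else int5."""
    return 4 if roundtrip_error(vals, 4) <= tol else 5
```

Spending the fifth bit only on error-sensitive tensors is one way to land under the 15.02 MB artifact budget; the PR's actual criterion may differ.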
Novel Contributions
- Oscillatory recurrence (TRN) inserted at layer 0 of an otherwise standard Transformer
- 3-seed result with mean val_bpb 1.1915 and low variance
- Sliding window evaluation to improve reported BPB
- Dynamic mixed int4/int5 quantization to fit within the artifact budget
- Ablation showing TRN improves per-step convergence but loses on wall-clock due to overhead
- Combination of BigramHash, partial RoPE, XSA, and SWA in a compact hybrid model