PR #1221 (open)
Non-record: Oscillatory Recurrence at Layer 0 (1.1915 BPB, 3-seed)
by amabito
val_bpb
1.1915
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.02 MB
Training Techniques
Architecture
TRN
Adds a Temporal Resonance Network / oscillatory recurrence block at layer 0, followed by standard Transformer layers.
parameters: {"layers":1,"oscillators":240}
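The TRN block's internals are not specified in the card; one plausible reading of "oscillatory recurrence" is a bank of damped rotating two-dimensional states driven by the input (the PR uses 240 oscillators; the sketch below uses 2). All names and the update rule here are hypothetical:

```python
import math

def oscillatory_recurrence(xs, freqs, decay=0.9):
    """Run a bank of oscillators over a 1-D input sequence.

    Each oscillator keeps a 2-D state (re, im) that is rotated by its
    frequency each step, damped, and nudged by the current input; the
    output at each step is the sum of the real parts.
    """
    states = [(0.0, 0.0) for _ in freqs]
    outputs = []
    for x in xs:
        new_states = []
        for (re, im), w in zip(states, freqs):
            c, s = math.cos(w), math.sin(w)
            # rotate the state by angle w, damp it, inject the input
            re2 = decay * (re * c - im * s) + x
            im2 = decay * (re * s + im * c)
            new_states.append((re2, im2))
        states = new_states
        outputs.append(sum(re for re, _ in states))
    return outputs
```

In the PR's setting the frequencies (and likely the decay) would be learned parameters, with one such block at layer 0 feeding the Transformer stack.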
weight tying
Tied input and output embeddings.
parameters: null
BigramHash
XOR hash of consecutive token pairs into a learned embedding table added to input embeddings.
parameters: {"buckets":2816,"dimensions":112}
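A minimal sketch of the hashing step, assuming a multiplicative bit-mixer before the XOR (the PR's exact mixing is not shown); the resulting bucket id would index a learned 112-dimensional embedding table whose rows are added to the input embeddings:

```python
def bigram_bucket(prev_tok, tok, buckets=2816):
    """Hash a consecutive token pair into one of `buckets` slots via XOR.

    The multiplier spreads prev_tok's bits before the XOR so that
    (a, b) and (b, a) rarely collide; illustrative choice only.
    """
    return ((prev_tok * 0x9E3779B1) ^ tok) % buckets

def bigram_bucket_ids(tokens, buckets=2816):
    # position 0 has no predecessor; pair it with a sentinel of 0
    return [bigram_bucket(tokens[i - 1] if i else 0, t, buckets)
            for i, t in enumerate(tokens)]
```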
Partial RoPE
Applies rotary position embedding to only part of the head dimensions.
parameters: {"dimensions":16}
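A sketch of rotating only the first 16 dimensions of each head vector, as dimensions:16 suggests; the (0,1), (2,3), ... pairing convention and the base are assumptions:

```python
import math

def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` dims only.

    Dimensions are paired (0,1), (2,3), ...; each pair is rotated by
    pos * base**(-2i/rope_dims). Remaining dimensions pass through.
    """
    out = list(vec)
    for i in range(rope_dims // 2):
        theta = pos * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Rotation preserves the norm of each pair, so the untouched tail of the head keeps position-free content channels while the rotated slice carries relative position.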
XSA
Cross-Scale Attention subtracts the value-direction projection from attention output to reduce head redundancy.
parameters: {"layers":11}
LeakyReLU
Uses leaky_relu(x, 0.5).square() as the MLP activation.
parameters: {"negative_slope":0.5}
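The activation is simple enough to state exactly:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """leaky_relu(x, 0.5) squared, the MLP activation described above."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Note that squaring makes the output nonnegative on both branches while keeping a nonzero gradient for negative inputs.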
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"every":50}
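A sketch of the running checkpoint average implied by every:50; the function name and bookkeeping are hypothetical:

```python
def swa_update(avg, weights, n_models):
    """Fold one more weight snapshot into a running average.

    avg already averages n_models snapshots; returns the average over
    n_models + 1. In training this would be called every 50 steps.
    """
    return [a + (w - a) / (n_models + 1) for a, w in zip(avg, weights)]
```

The incremental form avoids storing all snapshots: only the current average and a counter are kept between checkpoints.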
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1024}
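A sketch of the window bookkeeping implied by stride:64 and context_length:1024, following the common strided-perplexity scheme in which each token is scored exactly once, with up to a full window of preceding context:

```python
def sliding_eval_spans(n_tokens, context_length=1024, stride=64):
    """Return (window_start, score_from, window_end) spans.

    Each window gives the model up to context_length tokens; only the
    tokens from score_from onward (not scored by an earlier window)
    contribute to the reported loss.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With a small stride relative to the context length, most tokens are scored with nearly 1024 tokens of context, which lowers BPB relative to chunked evaluation at the cost of many more forward passes.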
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"backend_steps":5,"lr":0.03}
Adam
weight_decay: null
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"lr":0.025}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"schedule":"cosine decay"}
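A sketch of the schedule, assuming a constant LR before the final cosine warmdown; total_steps is hypothetical (the card only gives warmdown_steps and the decay shape):

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then cosine decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    t = (step - decay_start) / warmdown_steps   # 0 -> 1 over the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```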
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Compression
zstd
level: 22
Quantization
mixed int4/int5
bits: null
scope: all
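The card calls the scheme "dynamic mixed int4/int5" but does not give the per-tensor selection rule; a sketch using symmetric per-tensor quantization with round-trip error as a stand-in criterion:

```python
def quantize(vals, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 15 for int5
    scale = max(abs(v) for v in vals) / qmax
    if scale == 0.0:
        scale = 1.0                     # all-zero tensor
    return [round(v / scale) for v in vals], scale

def roundtrip_error(vals, bits):
    q, scale = quantize(vals, bits)
    return max(abs(v - qi * scale) for v, qi in zip(vals, q))

def pick_bits(vals, tol):
    """Use int4 when its round-trip error is within tol, else int5."""
    return 4 if roundtrip_error(vals, 4) <= tol else 5
```

Spending the fifth bit only on error-sensitive tensors is one way to land under the 15.02 MB artifact budget; the PR's actual criterion may differ.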
Novel Contributions
- Oscillatory recurrence (TRN) inserted at layer 0 of an otherwise standard Transformer
- 3-seed result with mean val_bpb 1.1915 and low variance
- Sliding window evaluation to improve reported BPB
- Dynamic mixed int4/int5 quantization to fit within the artifact budget
- Ablation showing TRN improves per-step convergence but loses on wall-clock due to overhead
- Combination of BigramHash, partial RoPE, XSA, and SWA in a compact hybrid model