PR #2037

open

Non-record: Linear Combiner on Frozen Base — h' = (1+α)·h + M·T_1 + β

by organic-intelligence-1976
val_bpb: 1.2670
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 14.13 MB

Training Techniques

Architecture
weight tying
Tied input embeddings and output embeddings in the base model.
parameters: null
RoPE
Uses RoPE with a reduced rotary dimension.
parameters: {"dimensions":16}
Gated Attention
Attention stack includes XSA / gated attention style modifications.
parameters: {"last_n":11}
KV head count
Grouped-query-style attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
depth recurrence
Not used: the submission notes explicitly state that the base is built without depth recurrence.
parameters: {"enabled":false}
Quantization
GPTQ
bits: 6
scope: base weights
GPTQ
bits: 8
scope: embeddings
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":64}
chunked last-position eval
parameters: {"chunk_length":2048}
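
The sliding-window entry above can be illustrated with a small window planner. This is only a sketch of the general technique, not the submission's evaluation code: the function name `sliding_windows` and the tuple layout are hypothetical, and the window length of 2048 is assumed from the eval length listed in this summary. Each window re-reads up to `window` tokens of context, but only the tokens not yet scored contribute to the loss, so every token is counted exactly once.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Plan sliding-window evaluation over a sequence of n_tokens tokens.

    Returns a list of (start, end, score_from) tuples: the model reads
    tokens[start:end], and only positions in [score_from, end) are scored.
    Hypothetical helper for illustration.
    """
    if n_tokens <= 0:
        return []
    # The first window has no earlier context, so all of it is scored.
    end = min(window, n_tokens)
    plan = [(0, end, 0)]
    scored = end
    # Each later window advances by `stride` new tokens and keeps as much
    # preceding context as fits in the window.
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        start = max(0, end - window)
        plan.append((start, end, scored))
        scored = end
    return plan


if __name__ == "__main__":
    for start, end, score_from in sliding_windows(2048 + 130):
        print(start, end, score_from)
```

A smaller stride gives each scored token more context on average at the cost of more forward passes; with stride 64 and a 2048-token window, every scored token after the first window sees at least 1984 tokens of context.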
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"lr":0.001}
Regularization
logit softcap
parameters: {"value":30}
weight decay
parameters: {"embed_wd":0.085,"muon_wd":0.085,"adam_wd":0.02}
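
For the logit softcap listed above, a common formulation squashes each logit through a scaled tanh. The sketch below assumes that form; this summary does not say which variant the submission uses, and the function name `softcap` is hypothetical.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap].

    Assumes the tanh form cap * tanh(logit / cap). Small logits pass
    through almost unchanged, large ones saturate near +/- cap.
    """
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for x in (0.0, 1.0, 30.0, 1000.0):
        print(x, softcap(x))
```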
LR Schedule
warmdown
parameters: {"warmdown_frac":0.667}
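
A warmdown fraction of 0.667 is usually read as: hold the learning rate constant, then decay it over the final two thirds of training. The sketch below assumes a linear decay to zero; the decay shape, the function name `lr_scale`, and which training phase the schedule applies to are not stated in this summary.

```python
def lr_scale(step, total_steps, warmdown_frac=0.667):
    """Return a multiplier in [0, 1] for the base learning rate.

    Assumes a constant phase followed by a linear warmdown to zero over
    the last `warmdown_frac` of training. Hypothetical helper.
    """
    warmdown_steps = warmdown_frac * total_steps
    if warmdown_steps <= 0:
        return 1.0  # no warmdown configured
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


if __name__ == "__main__":
    for step in (0, 333, 500, 900, 1000):
        print(step, lr_scale(step, 1000))
```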
Other
other
Frozen-base linear combiner trained on cached hidden states and one-step Coconut-style thinking states: h' = (1+α)·h + M·T₁ + β.
parameters: {"combiner_params":262657,"zero_init_identity":true}
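
The combiner formula and the zero-init identity claim can be checked with a short NumPy sketch. It assumes a scalar α, a d×d matrix M, and a length-d bias β, with hidden width d = 512; none of these shapes are stated in this summary, but they are one decomposition consistent with the listed parameter count, since 512² + 512 + 1 = 262657. The function names are hypothetical.

```python
import numpy as np


def init_combiner(d):
    """Zero-initialize the combiner parameters (assumed shapes)."""
    return {
        "alpha": 0.0,              # scalar gain offset on h
        "M": np.zeros((d, d)),     # mixing matrix applied to T_1
        "beta": np.zeros(d),       # bias vector
    }


def combine(h, t1, params):
    """Compute h' = (1 + alpha) * h + M @ T_1 + beta for a batch of rows.

    h and t1 have shape (n, d): the last hidden states and the one-step
    thinking states, both produced by the frozen base.
    """
    return (1.0 + params["alpha"]) * h + t1 @ params["M"].T + params["beta"]


def param_count(d):
    """Parameters under the assumed shapes: d*d for M, d for beta, 1 for alpha."""
    return d * d + d + 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.standard_normal((4, 512))
    t1 = rng.standard_normal((4, 512))
    out = combine(h, t1, init_combiner(512))
    print(np.array_equal(out, h), param_count(512))
```

With all parameters at zero, the gain is exactly 1 and the other two terms are exactly zero, so for finite inputs the output equals h element for element. The combiner therefore starts from the frozen base's behavior and can only move away from it through training.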

Novel Contributions

  • A frozen-base linear combiner that mixes the last hidden state with the state from a one-step Coconut-style thinking forward pass.
  • Bit-exact identity at zero initialization for the combiner.
  • Training the combiner via linear regression on cached features from a frozen, fully trained base.
  • Compact storage of the combiner alongside a GPTQ-quantized base within the 16 MB artifact limit.
  • Separate chunked last-position evaluation showing a small BPB improvement from the trained combiner.