PR #2093

open

Non-record: ACN Output Accumulator on the Naive Baseline

by chrico-bu-uab
val_bpb
1.2265
Architecture
Transformer
Optimizer
Artifact Size
15,875,628 bytes

Training Techniques

Architecture
ACN output accumulator
Adds the hidden state of each transformer block to the final representation just before final_norm, weighting each by a learnable per-layer scalar output scale. The scales are initialized to zero, and the accumulator is enabled with ACN_OUTPUT=1.
parameters: {"layers":9}
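The accumulator described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the PR's code: the function name `forward_with_accumulator`, the toy blocks, and the choice to include the last block's state in the sum are all assumptions made for the example. It shows why zero-initialized scales leave the baseline output unchanged at the start of training.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def forward_with_accumulator(
    x: Vector,
    blocks: Sequence[Block],
    out_scales: Sequence[float],
    acn_output: bool,
) -> Vector:
    """Run the blocks in order and return the state that would feed final_norm.

    When acn_output is True, each block's hidden state is multiplied by its
    per-layer scalar and summed into an accumulator, which is then added to
    the last hidden state (hypothetical reading of the PR description).
    """
    h = list(x)
    acc = [0.0] * len(x)
    for block, scale in zip(blocks, out_scales):
        h = block(h)
        if acn_output:
            acc = [a + scale * v for a, v in zip(acc, h)]
    if acn_output:
        h = [v + a for v, a in zip(h, acc)]
    return h


# Toy stand-ins for the 9 transformer blocks (real blocks use attention and an MLP).
blocks = [lambda h, k=k: [v + 0.1 * (k + 1) for v in h] for k in range(9)]
x = [1.0, -2.0, 0.5]

zero_scales = [0.0] * 9   # initialization described in the PR
some_scales = [0.5] * 9   # arbitrary non-zero values, as if after training

baseline = forward_with_accumulator(x, blocks, zero_scales, acn_output=False)
acn_at_init = forward_with_accumulator(x, blocks, zero_scales, acn_output=True)
acn_trained = forward_with_accumulator(x, blocks, some_scales, acn_output=True)
```

With all scales at zero the accumulator contributes nothing, so `acn_at_init` matches `baseline`; only once the scales move away from zero does the output change.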
Regularization
weight decay
parameters: null
Compression
custom
level: null
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Uses tied embeddings in the baseline model.
parameters: null
other
Uses grouped query attention with 8 query heads and 4 key/value heads.
parameters: {"query_heads":8,"kv_heads":4}
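As a small illustration of the head counts above, the sketch below maps each of the 8 query heads to one of the 4 key/value heads. It assumes the common convention that consecutive query heads share a key/value head; the PR summary does not state the grouping order, and `kv_head_for_query_head` is a hypothetical helper name.

```python
def kv_head_for_query_head(q_head: int, n_query_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Return the key/value head index shared by the given query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    if not 0 <= q_head < n_query_heads:
        raise ValueError("query head index out of range")
    group_size = n_query_heads // n_kv_heads  # 2 query heads per key/value head here
    return q_head // group_size


mapping = [kv_head_for_query_head(q) for q in range(8)]
```

Under this assumption each key/value head serves two query heads, which halves the key/value projection size relative to full multi-head attention.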
LR Schedule
warmdown
parameters: null

Novel Contributions

  • ACN output accumulator added to the official baseline
  • Per-layer learnable output scales initialized to zero
  • ACN enabled via ACN_OUTPUT=1, making the model a strict superset of the baseline
  • Single-seed idea probe showing a small per-step improvement but no wall-clock gain, due to implementation overhead
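One way the ACN_OUTPUT=1 switch might be read is sketched below. How the PR actually parses the variable is not shown in this summary, so the helper `acn_enabled` and the default of "0" are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional


def acn_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when ACN_OUTPUT is set to "1" (assumed convention)."""
    env = os.environ if env is None else env
    return env.get("ACN_OUTPUT", "0") == "1"
```

Leaving the variable unset keeps the baseline path, which is consistent with the accumulator being an opt-in extension of the baseline.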