PR #2093

open

Non-record: ACN Output Accumulator on the Naive Baseline

by chrico-bu-uab

val_bpb

1.2265

Architecture

Transformer

Optimizer

—

Artifact Size

15,875,628 bytes

Training Techniques

Architecture

ACN output accumulator

Adds each transformer block's hidden state to the final representation just before final_norm, weighted by a learnable per-layer scalar output scale. The scales are initialized to zero, and the feature is enabled by setting ACN_OUTPUT=1.

parameters: {"layers":9}
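A minimal sketch of the accumulation described above, in plain Python on lists so that it runs without a framework. The function name acn_accumulate, the enabled argument, and the choice to include the last block in the sum are assumptions for illustration and are not taken from the PR. In a real model the scales would be learnable parameters rather than plain floats.

```python
import os


def acn_accumulate(block_outputs, scales, enabled=None):
    """Return the vector that would be passed to final_norm.

    block_outputs: one hidden-state vector per transformer block,
                   with the last block's output at the end.
    scales:        one scalar per block (zero at initialization).
    enabled:       hypothetical override; by default read ACN_OUTPUT.
    """
    if enabled is None:
        enabled = os.environ.get("ACN_OUTPUT", "0") == "1"

    # Baseline path: the last block's output goes to final_norm unchanged.
    final = list(block_outputs[-1])
    if not enabled:
        return final

    # ACN path: add every block's hidden state, weighted by its scale.
    for hidden, scale in zip(block_outputs, scales):
        final = [f + scale * h for f, h in zip(final, hidden)]
    return final


# Nine layers, as in the listed parameters; toy 2-dimensional states.
hidden_states = [[float(layer), 1.0] for layer in range(1, 10)]
zero_scales = [0.0] * 9

# With zero-initialized scales the ACN path reproduces the baseline output.
assert acn_accumulate(hidden_states, zero_scales, enabled=True) == \
    acn_accumulate(hidden_states, zero_scales, enabled=False)
```

Because every scale starts at zero, the enabled model begins training with the same output as the baseline and can only depart from it as the scales are learned.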

Regularization

weight decay

parameters: null

Compression

custom

level: null

Sequence Length

sequence_length

train_length: null

eval_length: null

Other

other

Uses tied embeddings in the baseline model.

parameters: null

other

Uses grouped query attention with 8 query heads and 4 key/value heads.

parameters: {"query_heads":8,"kv_heads":4}
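To illustrate the head layout, the sketch below maps each of the 8 query heads to one of the 4 key/value heads. The contiguous grouping (adjacent query heads share a key/value head) is a common convention and is assumed here; the PR summary does not state which grouping the baseline uses, and the function name is hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads=8, n_kv_heads=4):
    """Return the key/value head index shared by a given query head.

    Assumes contiguous grouping: query heads are split into n_kv_heads
    equal groups, and each group reads from one key/value head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    if not 0 <= q_head < n_q_heads:
        raise ValueError("q_head out of range")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Each key/value head serves two query heads in the 8/4 configuration.
mapping = [kv_head_for_query_head(q) for q in range(8)]
assert mapping == [0, 0, 1, 1, 2, 2, 3, 3]
```

With half as many key/value heads as query heads, the key and value projections are half as wide as they would be in standard multi-head attention, which reduces parameter count.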

LR Schedule

warmdown

parameters: null

Novel Contributions

ACN output accumulator added to the official baseline
Per-layer learnable output scales initialized to zero
ACN enabled via ACN_OUTPUT=1, making the model a strict superset of the baseline
Single-seed idea probe showing a small per-step improvement, but no wall-clock gain because of implementation overhead