val_bpb
1.2265
Architecture
Transformer
Optimizer
—
Artifact Size
15,875,628 bytes
Training Techniques
Architecture
ACN output accumulator
Adds each transformer block's hidden state to the final representation just before final_norm, weighted by a learnable per-layer scalar output scale. The scales are initialized to zero, and the accumulator is enabled with ACN_OUTPUT=1.
parameters: {"layers":9}
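A minimal NumPy sketch of the accumulator described above, not the submission's actual code. The function name `forward_with_acn`, the `blocks` list of callables, and the `out_scales` array are hypothetical names chosen for illustration; in a real model the scales would be trainable parameters and `acn_output` would be read from the ACN_OUTPUT setting.

```python
import numpy as np

def forward_with_acn(x, blocks, out_scales, acn_output=True):
    """Run the blocks in order and, if enabled, add the scaled block outputs
    to the final hidden state (the value that would be fed to final_norm).

    x          : (seq, dim) input hidden state
    blocks     : list of callables, one per transformer block (stand-ins here)
    out_scales : (n_layers,) per-layer scalar output scales (assumed learnable)
    """
    acc = np.zeros_like(x)
    for block, scale in zip(blocks, out_scales):
        x = block(x)
        if acn_output:
            acc = acc + scale * x  # accumulate this block's hidden state
    # With all scales at zero, acc is zero and the result equals the baseline.
    return x + acc if acn_output else x

# Usage with 9 layers, matching the "layers" parameter above.
# The blocks are toy stand-ins, not real transformer blocks.
toy_blocks = [lambda h: h + 0.1 for _ in range(9)]
pre_norm = forward_with_acn(np.zeros((4, 8)), toy_blocks, np.zeros(9))
```

Because the scales start at zero, the first forward pass is identical to the baseline, and any deviation comes only from what training does to the scales.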
Regularization
weight decay
parameters: null
Compression
custom
level: null
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Uses tied embeddings in the baseline model.
parameters: null
other
Uses grouped query attention with 8 query heads and 4 key/value heads.
parameters: {"query_heads":8,"kv_heads":4}
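A sketch of grouped query attention with 8 query heads sharing 4 key/value heads, so each key/value head serves 2 query heads. The function name, tensor layout, and the causal mask are assumptions made for illustration; the record does not describe the baseline's attention code beyond the head counts.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where several query heads share one key/value head.

    q    : (n_q_heads, seq, head_dim)
    k, v : (n_kv_heads, seq, head_dim), with n_q_heads divisible by n_kv_heads
    """
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0, "query heads must be a multiple of kv heads"
    group = n_q // n_kv
    # Repeat each kv head so query heads [i*group, (i+1)*group) use kv head i.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage with the head counts from the record; sequence length and head
# dimension are arbitrary illustrative values.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))
k = rng.standard_normal((4, 5, 16))
v = rng.standard_normal((4, 5, 16))
out = grouped_query_attention(q, k, v)  # shape (8, 5, 16)
```

Sharing key/value heads halves the key/value projection parameters and cache relative to 8 full key/value heads, while keeping 8 distinct query heads.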
LR Schedule
warmdown
parameters: null
Novel Contributions
- ACN output accumulator added to the official baseline
- Per-layer learnable output scales initialized to zero
- ACN enabled via ACN_OUTPUT=1 as a strict superset of the baseline: with the output scales at their zero initialization, the accumulator contributes nothing
- Single-seed idea probe: a small per-step improvement, but no wall-clock gain because of implementation overhead