PR #904

open

Non-record: Diffusion-Noised Teacher AR Hybrid (val_bpb=1.2734, 8xH100)

by anthony-maio
val_bpb
1.2734
Architecture
Transformer
Optimizer
Artifact Size
15.8MB

Training Techniques

Other
other
Diffusion-inspired auxiliary denoising loss: during training, input prefix tokens are corrupted, and the autoregressive loss on the clean input is interpolated with the autoregressive loss on the corrupted input.
parameters: {"clean_noisy_loss_mix":0.35,"noise_ratio_start":0.05,"noise_ratio_end":0.35,"random_replace_prob":0.15,"mask_token_id":2}
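A minimal sketch of how the schedule and mixing parameters above might fit together. The linear ramp shape, the function names, and the convention that clean_noisy_loss_mix weights the noisy (auxiliary) term are assumptions for illustration; the PR summary does not state them.

```python
def noise_ratio(step: int, total_steps: int,
                start: float = 0.05, end: float = 0.35) -> float:
    """Corruption ratio at a given step.

    Assumption: the ramp from noise_ratio_start to noise_ratio_end is
    linear in the training step; the summary only says it "ramps".
    """
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so steps past the end stay at `end`.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def mixed_loss(clean_loss: float, noisy_loss: float, mix: float = 0.35) -> float:
    """Interpolate the clean and noisy autoregressive losses.

    Assumption: `mix` (clean_noisy_loss_mix) is the weight on the noisy
    term, so the clean loss keeps weight 1 - mix.
    """
    return (1.0 - mix) * clean_loss + mix * noisy_loss
```

Under these assumptions, the clean loss keeps the larger share (0.65) of the training objective for the whole run, while the corruption ratio grows over time.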
Architecture
weight tying
Tied input embeddings and output embeddings.
parameters: null
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":4,"kv_heads":2}
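To illustrate what heads=4 and kv_heads=2 imply, here is a small sketch of how query heads could share key/value heads. The contiguous grouping (consecutive query heads share one KV head) is the common convention, but it is an assumption here, and the function name is hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_heads: int = 4,
                           num_kv_heads: int = 2) -> int:
    """Return the index of the KV head that a query head attends with.

    Assumption: query heads are split into contiguous groups of size
    num_heads // num_kv_heads, one group per KV head.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    if not 0 <= q_head < num_heads:
        raise ValueError("q_head out of range")
    group_size = num_heads // num_kv_heads
    return q_head // group_size


# With 4 query heads and 2 KV heads, each KV head serves 2 query heads.
mapping = [kv_head_for_query_head(h) for h in range(4)]
```

With this grouping, the key/value projections hold half as many heads as the query projection.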
Sequence Length
sequence_length
train_length: 512
eval_length: null
Compression
zlib
level: null
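A rough sketch of how a zlib-compressed artifact size could be measured. Since the level is listed as null, this assumes zlib's default level (-1); it also assumes megabytes are counted as 10^6 bytes. The function name and the payload are hypothetical stand-ins, not the PR's actual artifact.

```python
import zlib


def compressed_size_mb(payload: bytes, level: int = -1) -> float:
    """Size of the zlib-compressed payload in MB (10**6 bytes).

    level=-1 selects zlib's default compression level; this is an
    assumption, because the summary lists the level as null.
    """
    return len(zlib.compress(payload, level)) / 1_000_000


# Hypothetical stand-in for serialized model weights.
payload = bytes(1_000_000)
size_mb = compressed_size_mb(payload)
```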

Novel Contributions

  • Diffusion-inspired auxiliary denoising loss added to standard autoregressive training
  • Corrupts prefix tokens with masking and random replacement while leaving the standard validation procedure unchanged
  • Noise ratio ramps from 5% to 35% over training
  • Interpolates clean and noisy losses with a fixed auxiliary weight (clean_noisy_loss_mix = 0.35)
  • Demonstrates a portable smoke-test implementation without changing tokenizer, dataset, or evaluation metric
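The corruption step described in the bullets above can be sketched as follows. This is an illustration, not the PR's code. It assumes each input position is selected independently with probability equal to the current noise ratio, and that a selected position is replaced by a random token with probability random_replace_prob (0.15) and by the mask token (id 2) otherwise. The function name and the vocabulary size in the usage lines are hypothetical.

```python
import random

MASK_TOKEN_ID = 2           # mask_token_id from the listed parameters
RANDOM_REPLACE_PROB = 0.15  # random_replace_prob from the listed parameters


def corrupt_prefix(tokens, noise_ratio, vocab_size, rng):
    """Return a corrupted copy of the input tokens.

    Assumptions: positions are selected independently with probability
    `noise_ratio`; a selected position becomes a uniformly random token
    with probability RANDOM_REPLACE_PROB and the mask token otherwise.
    Prediction targets are left untouched by the caller.
    """
    corrupted = list(tokens)  # copy so the clean input is not mutated
    for i in range(len(corrupted)):
        if rng.random() < noise_ratio:
            if rng.random() < RANDOM_REPLACE_PROB:
                # The random draw may coincide with the original token.
                corrupted[i] = rng.randrange(vocab_size)
            else:
                corrupted[i] = MASK_TOKEN_ID
    return corrupted


# Usage with a hypothetical vocabulary of 1024 tokens.
rng = random.Random(0)
clean = [10, 11, 12, 13, 14, 15, 16, 17]
noisy = corrupt_prefix(clean, noise_ratio=0.35, vocab_size=1024, rng=rng)
```

Because only the training inputs are corrupted, validation can run on clean inputs with the same tokenizer, dataset, and metric.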