PR #904

open

Non-record: Diffusion-Noised Teacher AR Hybrid (val_bpb=1.2734, 8xH100)

by anthony-maio

val_bpb

1.2734

Architecture

Transformer

Optimizer

—

Artifact Size

15.8MB

Training Techniques

Other

other

Adds a diffusion-inspired auxiliary denoising loss during training: input prefix tokens are corrupted, and the autoregressive losses on the clean and noisy inputs are interpolated.

parameters: {"clean_noisy_loss_mix":0.35,"noise_ratio_start":0.05,"noise_ratio_end":0.35,"random_replace_prob":0.15,"mask_token_id":2}
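The corruption step behind these parameters might look like the minimal sketch below. The PR summary does not spell out the sampling details, so this assumes each prefix position is selected for corruption independently with probability `noise_ratio`, and that a selected position becomes a random vocabulary token with probability `random_replace_prob` and `mask_token_id` otherwise. The helper name `corrupt_prefix` and the `vocab_size` value are hypothetical.

```python
import random

MASK_TOKEN_ID = 2           # from the PR parameters
RANDOM_REPLACE_PROB = 0.15  # from the PR parameters


def corrupt_prefix(tokens, noise_ratio, vocab_size, rng):
    """Return a corrupted copy of `tokens` (hypothetical helper).

    Assumption: each position is corrupted independently with probability
    `noise_ratio`; a corrupted position becomes a random token with
    probability RANDOM_REPLACE_PROB, otherwise the mask token.
    """
    noisy = list(tokens)  # copy so the clean sequence is left untouched
    for i in range(len(noisy)):
        if rng.random() < noise_ratio:
            if rng.random() < RANDOM_REPLACE_PROB:
                noisy[i] = rng.randrange(vocab_size)  # random replacement
            else:
                noisy[i] = MASK_TOKEN_ID              # masking
    return noisy


rng = random.Random(0)
clean = [10, 11, 12, 13, 14, 15, 16, 17]
# vocab_size=1024 is an illustrative value, not taken from the PR.
noisy = corrupt_prefix(clean, noise_ratio=0.35, vocab_size=1024, rng=rng)
```

Under this reading, the model would see `noisy` as input while the prediction targets stay the clean next tokens; that pairing is also an assumption, not something the summary confirms.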

Architecture

weight tying

Tied input embeddings and output embeddings.

parameters: null

GQA

Uses grouped query attention with fewer KV heads than attention heads.

parameters: {"heads":4,"kv_heads":2}
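As a rough illustration of what `heads: 4` and `kv_heads: 2` imply, the sketch below maps each query head to the key/value head it would share. It assumes the common contiguous grouping (consecutive query heads share one KV head); the PR summary does not state which grouping is used, and `kv_head_for` is a hypothetical name.

```python
HEADS = 4     # query heads, from the PR parameters
KV_HEADS = 2  # key/value heads, from the PR parameters


def kv_head_for(query_head, heads=HEADS, kv_heads=KV_HEADS):
    """Map a query head index to the KV head it shares.

    Assumption: contiguous grouping, so heads 0..group_size-1 share KV head 0.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    group_size = heads // kv_heads
    return query_head // group_size


# With 4 query heads and 2 KV heads, each KV head serves 2 query heads.
mapping = [kv_head_for(h) for h in range(HEADS)]  # [0, 0, 1, 1]
```

With this configuration the key and value projections carry half as many heads as full multi-head attention would, which reduces their parameter count accordingly.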

Sequence Length

sequence_length

train_length: 512

eval_length: null

Compression

zlib

level: null
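For orientation, measuring a zlib-compressed artifact could be sketched as follows. Since the level is listed as null, this assumes zlib's default level (`-1` in Python's `zlib` module) and counts a megabyte as 10^6 bytes; the payload here is placeholder bytes, not the actual model artifact, and `compressed_size_mb` is a hypothetical helper.

```python
import zlib


def compressed_size_mb(payload: bytes, level: int = -1) -> float:
    """Size of the zlib-compressed payload in MB (1 MB = 10**6 bytes, assumed)."""
    return len(zlib.compress(payload, level)) / 1_000_000


# Placeholder payload standing in for serialized model weights.
payload = bytes(range(256)) * 4096
size_mb = compressed_size_mb(payload)

# Round trip: decompression restores the original bytes exactly.
restored = zlib.decompress(zlib.compress(payload))
```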

Novel Contributions

Diffusion-inspired auxiliary denoising loss added to standard autoregressive training
Corrupts prefix tokens with masking and random replacement during training while leaving the standard validation procedure unchanged
Noise ratio ramps from 5% to 35% over training
Interpolates clean and noisy losses with a fixed auxiliary weight
Demonstrates a portable smoke-test implementation without changing the tokenizer, dataset, or evaluation metric
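The noise ramp and loss interpolation listed above can be sketched as below. Two assumptions are made because the summary does not specify them: the ramp from 5% to 35% is taken to be linear in the training step, and `clean_noisy_loss_mix` is taken to be the weight on the noisy loss. The function names are hypothetical.

```python
NOISE_RATIO_START = 0.05     # from the PR parameters
NOISE_RATIO_END = 0.35       # from the PR parameters
CLEAN_NOISY_LOSS_MIX = 0.35  # from the PR parameters


def noise_ratio_at(step, total_steps):
    """Noise ratio at a given step, assuming a linear ramp over training."""
    if total_steps <= 1:
        return NOISE_RATIO_END
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)  # clamp to [0, 1]
    return NOISE_RATIO_START + frac * (NOISE_RATIO_END - NOISE_RATIO_START)


def mixed_loss(clean_loss, noisy_loss, mix=CLEAN_NOISY_LOSS_MIX):
    """Interpolate the two losses; assumes `mix` weights the noisy loss."""
    return (1.0 - mix) * clean_loss + mix * noisy_loss


# Example: halfway through a 101-step run, then a mixed loss value.
mid_ratio = noise_ratio_at(50, 101)
loss = mixed_loss(clean_loss=1.0, noisy_loss=3.0)
```

If the mix instead weights the clean loss, the two coefficients in `mixed_loss` would simply swap.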