PR #1547

closed

Non record: Single H100 10 min 1.24 BPB

by adityasasidhar
val_bpb
1.1928
Architecture
Transformer
Optimizer
Adam
Artifact Size

Training Techniques

Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
Evaluation
stride-based eval
parameters: {"chunk_size":256,"eval_seq_len":1024,"batch_size":64}
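The stride-based evaluation above can be sketched as a window schedule: successive windows of up to `eval_seq_len` tokens advance by `chunk_size`, so each later window scores only its final `chunk_size` tokens while the preceding tokens serve as context. This is a minimal sketch of that schedule (the function name and return shape are illustrative, not from the PR):

```python
def strided_windows(doc_len, eval_seq_len=1024, stride=256):
    """Return (begin, end, n_scored) spans for strided overlapping evaluation.

    Windows cover up to eval_seq_len tokens and advance by `stride`;
    only tokens past the previous window's end are newly scored, so the
    per-window scored counts sum to doc_len with no double counting.
    """
    windows = []
    prev_end = 0
    for begin in range(0, doc_len, stride):
        end = min(begin + eval_seq_len, doc_len)
        n_scored = end - prev_end  # tokens newly scored in this window
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == doc_len:
            break
    return windows
```

With the listed parameters (stride 256, window 1024), a 2048-token document yields one full window followed by four overlapping windows that each score 256 new tokens.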
Sequence Length
sequence_length
train_length: null
eval_length: 1024
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"betas":[0.9,0.95]}
Architecture
weight tying
LoRA adapters target lm_head and attention projections across transformer blocks; base architecture remains a standard Transformer.
parameters: null
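A LoRA adapter on a frozen linear layer (such as lm_head or an attention projection) adds a trainable low-rank update alongside the frozen base weight. This is a minimal NumPy sketch assuming the standard LoRA form y = xWᵀ + (α/r)·x(BA)ᵀ; the rank matches the `rank: 8` parameter listed above, while `alpha` and the init scheme here are generic LoRA conventions, not values taken from the PR:

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable LoRA adapter.

    At test-time training only A and B are updated; W stays frozen.
    B is initialized to zero, so the adapter starts as an identity
    perturbation (output equals the base layer's output).
    """
    def __init__(self, W, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        out_f, in_f = W.shape
        self.W = W                                    # frozen base weight
        self.A = rng.normal(0.0, 0.02, (rank, in_f))  # trainable down-projection
        self.B = np.zeros((out_f, rank))              # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.W.T
        lora = (x @ self.A.T) @ self.B.T * self.scale
        return base + lora
```

Because only the rank-8 factors are trained, a per-document adaptation step touches a tiny fraction of the parameters the full layer would require.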

Novel Contributions

  • Per-document LoRA test-time training at evaluation
  • Document-isolated evaluation with no leakage across validation sequences
  • Strided overlapping chunk evaluation
  • Batching documents by length for faster evaluation
  • Using LoRA adapters instead of full fine-tuning to make test-time training significantly faster
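The length-batching contribution above amounts to sorting documents by length before forming evaluation batches, so each batch of 64 pads to roughly the same length. A minimal sketch (the function name is hypothetical; only the grouping logic is from the PR description):

```python
def batch_by_length(doc_lens, batch_size=64):
    """Group document indices into batches of similar length.

    Sorting by length first means each batch pads only to its own
    longest member, cutting wasted compute versus arbitrary batching.
    """
    order = sorted(range(len(doc_lens)), key=lambda i: doc_lens[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

The last batch may be smaller than `batch_size` when the document count is not a multiple of it.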