PR #454
Non-record: Competitive Stack + Phonetic Tokenization Exploration (val_bpb=1.2055, 4xH100)
by nalediym
val_bpb
1.2055
Architecture
Transformer
Optimizer
—
Artifact Size
19.6MB
Training Techniques
Quantization
STE QAT
bits: 6
scope: all
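
A minimal sketch of 6-bit fake quantization with a straight-through estimator; the PR does not specify the quantizer's granularity, so symmetric per-tensor scaling here is an assumption:

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric fake quantization to 2^bits levels (int6: -32..31).
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward emits the quantized weight,
    # backward treats the rounding step as identity.
    return w + (w_q - w).detach()
```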
Architecture
BigramHash
4096-bucket hash embedding for bigram context
parameters: {"buckets":4096}
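
A minimal sketch of a 4096-bucket bigram hash embedding; the hash function and how its output is combined with the unigram token embedding are assumptions:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram-context embedding, typically summed with the
    ordinary token embedding before the first block (assumption)."""
    def __init__(self, n_embd: int, buckets: int = 4096):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.Embedding(buckets, n_embd)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids. Pair each token with its predecessor
        # (position 0 is paired with a pad id of 0).
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0
        # Cheap multiplicative hash of the (prev, cur) pair into buckets;
        # the PR's actual hash is not given (assumption).
        h = (prev * 1000003 + idx) % self.buckets
        return self.emb(h)
```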
SmearGate
Learned gate blending current and previous token embeddings
parameters: null
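
A minimal sketch of a smear gate, assuming a learned per-channel sigmoid gate that blends each position with its predecessor (the gating granularity and initialization are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Blend each token embedding with its predecessor via a learned gate."""
    def __init__(self, n_embd: int):
        super().__init__()
        # Init near-identity: sigmoid(-2) ~ 0.12 blend weight (assumption).
        self.gate = nn.Parameter(torch.full((n_embd,), -2.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C). Shift the sequence right by one, zero-padding t=0.
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]
        g = torch.sigmoid(self.gate)            # per-channel blend weight
        return (1 - g) * x + g * prev
```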
MLP3x
Feedforward network widened to 3x the base width
parameters: {"hidden_dim":1536}
Initialization
OrthoInit
Orthogonal weight initialization with muP-style 1/sqrt(2L) projection scaling
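
A minimal sketch of orthogonal initialization with 1/sqrt(2L) scaling on the residual output projections; the module-name matching is hypothetical:

```python
import math
import torch.nn as nn

def ortho_init(model: nn.Module, n_layer: int) -> None:
    """Orthogonal init for linear weights; residual output projections are
    scaled by 1/sqrt(2L) so the residual stream stays O(1) across L blocks
    (each block contributes two residual branches)."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
            # Hypothetical names for attention/MLP output projections.
            if name.endswith(("attn.c_proj", "mlp.c_proj")):
                module.weight.data.mul_(1.0 / math.sqrt(2 * n_layer))
```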
Evaluation
sliding window eval
parameters: {"stride":64}
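
A minimal sketch of stride-64 sliding-window bits-per-byte: windows advance by 64 tokens and only previously unscored positions are counted, so each token is scored once with long left context. The model's (1, T) ids to (1, T, V) logits interface is an assumption:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, window=2048, stride=64):
    """tokens: (N,) long tensor; n_bytes: UTF-8 byte length of eval text."""
    N = tokens.numel()
    nll = 0.0                                   # summed NLL in nats
    prev_end = 1                                # next unscored position
    for begin in range(0, N, stride):
        end = min(begin + window, N)
        logits = model(tokens[begin:end].unsqueeze(0))[0]  # (T, V), assumed
        first = max(prev_end, begin + 1)        # skip already-scored tokens
        tgt = tokens[first:end]
        rows = logits[first - 1 - begin : end - 1 - begin].float()
        lp = F.log_softmax(rows, dim=-1)
        nll -= lp.gather(-1, tgt.unsqueeze(-1)).sum().item()
        prev_end = end
        if end == N:
            break
    return nll / (n_bytes * math.log(2))        # nats -> bits, per byte
```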
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
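
A minimal sketch of a warmdown schedule, assuming the common constant-then-linear-decay-to-zero shape over the final 3000 steps (warmup, if any, omitted):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3000) -> float:
    # Hold base_lr, then decay linearly to zero over the last warmdown_steps.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```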
Regularization
grad_clip
parameters: {"norm":0.3}
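
Global-norm gradient clipping at 0.3 is a one-liner in PyTorch; a minimal sketch of where it sits in the training step:

```python
import torch

def clipped_step(model, optimizer, loss, max_norm: float = 0.3):
    # Backprop, clip the global gradient norm at 0.3, then update.
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```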
Other
other
Phonetic tokenization exploration using IPA/G2P conversion and SentencePiece BPE on phonetic output
parameters: {"cmudict_exceptions":4795,"word_coverage":0.846,"tokenizer_vocab_size":1024}
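
A minimal sketch of the phonetic tokenization pipeline, assuming `phonemizer` (espeak backend) as the IPA G2P front end and SentencePiece BPE with a 1024-symbol vocab; file names are hypothetical and the ~4.8k CMUdict exception patches are omitted:

```python
import sentencepiece as spm
from phonemizer import phonemize  # IPA G2P via espeak; library choice assumed

def to_ipa(text: str) -> str:
    # Raw text -> IPA string. The PR additionally patches CMUdict
    # exception words before BPE; that step is omitted here.
    return phonemize(text, language="en-us", backend="espeak")

# Train a 1024-symbol BPE on the phonetized corpus (file names hypothetical).
spm.SentencePieceTrainer.train(
    input="corpus_ipa.txt",
    model_prefix="ipa_bpe",
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)
sp = spm.SentencePieceProcessor(model_file="ipa_bpe.model")
ids = sp.encode(to_ipa("phonetic tokenization"), out_type=int)
```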
Novel Contributions
- Competitive training stack combining int6 STE QAT, BigramHash, SmearGate, OrthoInit, and 3x MLP
- Sliding-window evaluation with stride 64 achieving val_bpb 1.2055
- IPA phonetic tokenization research with a controlled comparison against standard BPE
- Negative result showing phonetic encoding provides only marginal gains in isolation and is largely subsumed by the competitive training stack