Understanding Addition in Transformers

We develop a theoretical model of how transformers learn addition and compare it with the training loss over epochs.

