Convolutional layers now use the same initialization as Keras
Updating CAI Transformer
fixes pointwise softmax with no forward and skip derivative
adding debug code
better normalization methods
fixes backpropagation with branching