Since the introduction of the Transformer architecture [1], which addressed the limitations of earlier sequence-to-sequence (seq2seq) models, attention-based models have become the standard for most Natural Language Processing (NLP) tasks.