Transformers: A Conceptual Understanding

Mar 1, 2024

The Transformer architecture was introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017). What CNNs are to computer vision, Transformers are to NLP. A conversational chatbot is a simple, everyday use case that can be built with Transformers.

This article explains the concepts below, though the discussion is not limited to them.

Encoders in Transformers

1. Attention

Transforming Words into Vector Representation (Word Embedding)

Each input token is first converted into a vector representation, its word embedding. Attention then allows the model to focus on different parts of the input sequence when making predictions: it assigns a weight to every input token, reflecting how relevant that token is in the current context.
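To make the idea of weighting tokens concrete, here is a minimal NumPy sketch of scaled dot-product attention, the form of attention used in the Transformer paper. The toy embeddings are random, and using them directly as queries, keys, and values is a simplifying assumption; a real layer would first apply learned projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d), k: (n_k, d), v: (n_k, d_v). Returns (output, weights)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row is a set of weights summing to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 tokens, embedding size 8 (toy values)
# Assumption: the raw embeddings serve as queries, keys, and values.
out, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape, out.shape)
```

Row i of weights shows how strongly token i attends to every token in the sequence, and the output for token i is the corresponding weighted average of the value vectors.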

2. Positional Encoding

Positional encoding is a vector computed from the position of each token in the input sequence and added to that token’s embedding. A common approach uses sine and cosine functions: the token’s position is the argument of the functions, and the index of the embedding dimension determines their frequency, so every position receives a distinct pattern of values. Because attention by itself does not depend on token order, the Transformer architecture uses this technique to give the model information about the order of elements in the input sequence.
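The sine and cosine scheme can be sketched in a few lines of NumPy. This follows the formulation in the original Transformer paper, where even dimensions use sine, odd dimensions use cosine, and the base constant is 10000. The function name and the toy sizes are illustrative choices, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    # Each pair of dimensions gets its own frequency; the position is the argument.
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)
```

The resulting matrix would typically be added element-wise to the token embeddings before the first attention layer.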

Decoders in Transformers

1. Autoregressive Process

Autoregressive refers to a model or process where the current state or output depends on its previous states or outputs. In the context of language generation, an autoregressive model predicts the next element in a sequence based on the elements that came before it in that sequence. Transformers, especially in the context of language tasks, often use autoregressive decoding in the generation process.

RNN architectures use the same approach.

The Transformer decoder works differently during training and inference. At training time, it is non-autoregressive, meaning that we don’t use the decoder’s predictions as input for the following timestep. Instead, we always use the correct tokens, which is called “teacher forcing”. In addition, the hidden states of all time steps are computed simultaneously in the attention heads, which is different from recurrent units like LSTMs and GRUs, where we need the previous timestep’s hidden state to compute the current timestep’s.
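A small sketch of how teacher forcing arranges the data: the decoder reads the correct sequence shifted right by one position and is asked to predict the next correct token at every position. The token names and the example sentence are hypothetical, chosen only for illustration.

```python
# Hypothetical special tokens and target sentence, for illustration only.
START, END = "<Start>", "<End>"
target = ["I", "am", "fine"]

decoder_input = [START] + target     # what the decoder reads (shifted right)
decoder_labels = target + [END]      # what it should predict at each position

for step, label in enumerate(decoder_labels):
    # At position `step` the decoder may use decoder_input[:step + 1] only.
    print(decoder_input[: step + 1], "->", label)
```

All positions come from the known target sentence, so they can be processed in one pass rather than one step at a time.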

We don’t have access to the correct tokens at inference time, so we use the decoder’s prediction as input for the following timestep. However, even though the process is now autoregressive, efficient implementations of the Transformer model typically cache the hidden states of previous timesteps, so they don’t need to be recomputed.
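The caching idea can be illustrated with a toy single-head example. In practice, implementations usually cache the key and value vectors of each layer. Here, as a simplifying assumption, each token vector serves as its own query, key, and value, and all names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, keys, values):
    """Attention for a single query over all keys and values seen so far."""
    scores = keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))   # toy vectors standing in for 5 generated tokens

# Incremental decoding: store each key/value once, never recompute old ones.
cached_keys, cached_values, outputs = [], [], []
for x in tokens:
    cached_keys.append(x)          # assumption: key = value = query = x
    cached_values.append(x)
    outputs.append(attend(x, np.stack(cached_keys), np.stack(cached_values)))

# Reference: attention for the last token computed from scratch.
reference = attend(tokens[-1], tokens, tokens)
print(np.allclose(outputs[-1], reference))
```

Each step only computes attention for the newest token, yet the result matches a full recomputation over the whole sequence.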

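The decoder performs a function similar to the encoder’s, with one key change in its self-attention: masking. Each position may attend only to itself and to earlier positions, so the model cannot look at future tokens.
Note that the decoder is an autoregressive model, meaning it predicts future outputs based on past outputs.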
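Masking in the decoder is commonly implemented as a causal (lower-triangular) mask applied to the attention scores before the softmax. The sketch below uses all-equal toy scores so the effect of the mask is easy to read; the function names are illustrative.

```python
import numpy as np

def causal_mask(n):
    """True where attention is allowed: position i may look at positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions are set to -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))            # toy attention scores, all equal
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

With equal scores, row i spreads its weight evenly over positions 0 through i and gives zero weight to every later position.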

The decoder takes the list of previous outputs as input, along with the encoder’s output, which contains the attention information for the input (for example, “Hi, how are you”). The decoder stops decoding once it generates the <End> token.
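The generation loop described here can be sketched as follows. The toy_next_token function is a hypothetical stand-in for a trained decoder that simply returns a canned reply; a real model would score the whole vocabulary using the encoder output and the tokens generated so far. The max_len safeguard is an added assumption to keep the loop bounded.

```python
START, END = "<Start>", "<End>"

def toy_next_token(encoder_output, generated):
    """Hypothetical stand-in for a trained decoder: returns a canned reply."""
    reply = ["I", "am", "fine", END]
    return reply[len(generated) - 1]   # generated always begins with START

def greedy_decode(encoder_output, next_token_fn, max_len=10):
    generated = [START]
    while len(generated) < max_len:
        token = next_token_fn(encoder_output, generated)
        generated.append(token)        # the prediction becomes the next input
        if token == END:               # stop once the end token is produced
            break
    return generated[1:]

result = greedy_decode(["Hi", "How", "are", "you"], toy_next_token)
print(result)
```

Each iteration feeds everything generated so far back into the decoder, which is what makes the process autoregressive.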

Business Use Cases of the Transformer Architecture

The Transformer architecture has the potential to change how talent acquisition works in HR departments. Its self-attention mechanism can relate information across an entire candidate profile, which can speed up screening and support faster, better-informed hiring decisions. Because it processes inputs in parallel, it can efficiently identify candidates aligned with business needs, facilitating swift staff augmentation for projects or long-term goals.

The Transformer architecture extends beyond HR and is having an impact across diverse business sectors. In customer service, it improves chatbot interactions; in finance, it is used for fraud detection. Marketing benefits from sentiment analysis, supply chain management from improved demand forecasting, and healthcare from support in diagnosing medical conditions. In legal processes, Transformers streamline document review, speeding up contract analysis and due diligence. This adaptability shows how the architecture can improve efficiency and decision-making across industries.

Final Thoughts

In conclusion, the strength of the Transformer architecture comes from how its components work together. Word embeddings, attention mechanisms, positional encoding, and autoregressive decoding each play a crucial role in shaping the model’s capabilities. Notably, the decoder, with its autoregressive nature and masking, is a pivotal component in language generation tasks. The interplay between encoder and decoder, guided by attention and positional information, enables the Transformer to comprehend and generate human-like text. This architecture has driven a revolution in natural language processing, opening doors to applications like conversational chatbots and beyond.

Authored by: Mano Ranjan