One Transformer — A New Era of Deep Learning

Time to converge RNN, CNN, and other deep-learning models with the Transformer

Luhui Hu
DataDrivenInvestor



Deep learning (DL) has ignited the AI renaissance over the past decade and has become the mainstream of technological innovation and digital transformation. Over time, driven by different algorithms and use cases, two well-known branches have emerged: CNN and RNN.

CNN (Convolutional Neural Network) is a DL model designed to process and analyze data with a grid-like structure, such as images. It uses convolutional layers to extract features from the data and is often used for image classification, object detection, and segmentation.
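
As a minimal sketch of that idea in PyTorch (the layer sizes and the 10-way classifier head are illustrative, not from any specific model):

import torch
import torch.nn as nn

# A tiny CNN: convolutional layers extract local features from the
# pixel grid, pooling shrinks it, and a linear head classifies
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                           # 10 illustrative classes
)

logits = cnn(torch.randn(1, 3, 224, 224))        # one RGB image -> (1, 10)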

RNN (Recurrent Neural Network) is a DL model designed to process sequential data, such as time series or natural language. It uses recurrent (feedback) connections to capture the dependencies between data points in the sequence and is typically used for language translation, speech recognition, and text generation.
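
A minimal sketch of the same idea for sequences (sizes are illustrative): the recurrent hidden state is what carries dependencies forward through time.

import torch
import torch.nn as nn

# A tiny GRU sequence model: the hidden state is updated step by step,
# carrying information from earlier points in the sequence
rnn = nn.GRU(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)

x = torch.randn(4, 20, 8)          # (batch, time steps, features)
output, h_n = rnn(x)               # h_n: final hidden state, (1, 4, 32)
prediction = head(h_n[-1])         # one prediction per sequence: (4, 1)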

However, because each has its own complex, specialized architecture, ML scientists have had to research and develop the two domains independently, which makes it hard for them to share advances and evolve together.

Transformer changed the game

The Transformer is a creative, newer deep-learning model for processing and analyzing large, structured data sets, such as text or graphs. It employs self-attention mechanisms to capture the dependencies between data points in the input and was initially applied to language translation, language modeling, and document classification.
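
The core mechanism can be shown in a few lines. The sketch below (with illustrative sizes) uses PyTorch's built-in multi-head attention with the same tensor as query, key, and value, which is exactly what "self-attention" means:

import torch
import torch.nn as nn

# Self-attention: every position attends to every other position, so
# dependencies are captured regardless of their distance in the input
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.randn(2, 10, 16)         # (batch, sequence length, embedding)
out, weights = attn(x, x, x)       # query = key = value = self-attention
print(weights.shape)               # (2, 10, 10): each position vs. all others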

The Transformer has been shown to outperform RNNs significantly on various NLP tasks, such as language modeling and machine translation. Transformer networks process input sequences more efficiently, allowing them to handle longer inputs and make more accurate predictions.

Its performance and flexibility are thrilling. It has started to expand into various other applications, including image and video analysis, speech recognition, and time-series forecasting. But the journey has just begun.

Stunning generative AI systems, including DALL-E 2, Stable Diffusion, and ChatGPT, are also based on the transformer.

ChatGPT is mind-blowing. It is a chatbot built on GPT-3, which is itself rooted in the Transformer architecture. The Transformer’s breakthrough in language models has triggered the convergence of other models, such as CNN. This is where One Transformer comes in.

Converge RNN and CNN with Transformer

It’s time to merge CNN and RNN using Transformer in PyTorch and other frameworks. It’ll be a new era of deep learning for sequential and spatial data processing.

For instance, a CNN could extract features, and an RNN could process the extracted features over time to capture temporal patterns. The transformer could then encode and decode the extracted features and processed data to make predictions about the image’s content.
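
Here is a minimal sketch of that pipeline for a short video clip; all sizes, the GRU choice, and the classifier head are illustrative assumptions, not a reference design:

import torch
import torch.nn as nn

# Hypothetical hybrid: a CNN encodes each frame, an RNN tracks the frames
# over time, and a transformer encoder refines the combined features
cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())    # frame -> 8 dims
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
encoder_layer = nn.TransformerEncoderLayer(16, 4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(16, 5)                                       # 5 dummy classes

frames = torch.randn(2, 12, 3, 32, 32)                   # (batch, time, C, H, W)
feats = cnn(frames.flatten(0, 1)).unflatten(0, (2, 12))  # (2, 12, 8)
temporal, _ = rnn(feats)                                 # (2, 12, 16)
logits = head(encoder(temporal).mean(dim=1))             # (2, 5)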

Large Language Models (LLMs) are a stunningly successful application of transformers to the sequence tasks once dominated by RNNs. The Vision Transformer (ViT) is an excellent example of transformers taking on CNN territory, and Hugging Face offers many ViT implementations (one is sketched below).
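
For instance, a pretrained ViT can be loaded and run in a few lines with the Hugging Face transformers library; the sketch below uses the public google/vit-base-patch16-224 checkpoint, with a random array standing in for a real image:

import numpy as np
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

# Load a pretrained Vision Transformer from the Hugging Face hub
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# A random uint8 array stands in for a real (H, W, C) image here
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])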

Overall, combining CNN and RNN with (or evolving into) Transformer can be a powerful way to leverage the strengths of each model to perform complex tasks such as image classification or object detection.

Example of implementing Transformer forecasting in PyTorch

Transformer first revolutionized the RNN-dominated field of language models. RNNs have also proven effective for time-series forecasting, a domain traditionally served by ARIMA (AutoRegressive Integrated Moving Average), exponential smoothing, and seasonality models.

However, the Transformer can improve on RNNs for forecasting as well. Here is a Transformer example for forecasting in PyTorch:

import torch
import torch.nn as nn

class TransformerForecaster(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, num_layers, num_heads):
        super(TransformerForecaster, self).__init__()

        # Project inputs and targets into a shared model dimension
        # (hidden_size must be divisible by num_heads)
        self.input_proj = nn.Linear(input_size, hidden_size)
        self.target_proj = nn.Linear(output_size, hidden_size)

        # Define the encoder stack
        encoder_layer = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, norm=nn.LayerNorm(hidden_size))

        # Define the decoder stack
        decoder_layer = nn.TransformerDecoderLayer(hidden_size, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers, norm=nn.LayerNorm(hidden_size))

        # Define the final output layer
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x, y):
        # Encode the input sequence into a memory representation
        memory = self.encoder(self.input_proj(x))

        # Decode the target sequence while attending to the encoder memory
        decoder_output = self.decoder(self.target_proj(y), memory)

        # Use the output layer to make predictions
        return self.output_layer(decoder_output)

We can instantiate the above model with appropriate hyperparameters (such as input_size, output_size, hidden_size, num_layers, and num_heads; note that hidden_size must be divisible by num_heads), then pass input and target sequences to the forward() method to generate predictions.

For example:

# Define the model hyperparameters
input_size = 10
output_size = 5
hidden_size = 20
num_layers = 3
num_heads = 4

# Instantiate the model
forecaster = TransformerForecaster(input_size, output_size, hidden_size, num_layers, num_heads)
# Generate some input and target sequences: (batch, seq_len, features)
x = torch.randn(10, 5, 10)    # 10 sequences, 5 steps, 10 input features
y = torch.randn(10, 5, 5)     # 10 sequences, 5 steps, 5 target features

# Generate predictions: shape (10, 5, 5)
predictions = forecaster(x, y)

This is just one example of implementing a transformer for forecasting in PyTorch; there are many other ways to do it. We may want to experiment with different architectures and hyperparameters to find the best model for a specific forecasting task.
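
To make that experimentation concrete, here is a minimal training-loop sketch for the forecaster above, assuming an MSE objective and random tensors standing in for a real dataset:

import torch
import torch.nn as nn

# Instantiate the forecaster defined earlier
model = TransformerForecaster(input_size=10, output_size=5,
                              hidden_size=20, num_layers=3, num_heads=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 24, 10)     # (batch, input steps, input features)
    y = torch.randn(32, 12, 5)      # (batch, forecast steps, target features)
    optimizer.zero_grad()
    # Teacher forcing: in practice the decoder input would be the target
    # sequence shifted by one step, with a causal mask applied
    loss = loss_fn(model(x, y), y)
    loss.backward()
    optimizer.step()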

Example of implementing a Transformer for CV in PyTorch

The game does not end with models for sequential data. The Transformer has also demonstrated its capability for CV tasks such as image classification.

Here is a Transformer example for CV tasks in PyTorch:

import torch
import torchvision
import torch.nn as nn

class TransformerCV(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, num_layers, num_heads):
        super(TransformerCV, self).__init__()

        # input_size is the number of image channels; the pretrained
        # ResNet-18 backbone below expects 3-channel (RGB) input
        assert input_size == 3, "ResNet-18 expects RGB input"

        # Define the vision model: a ResNet-18 with its pooling and
        # classification head removed, so it outputs a 7x7 grid of
        # 512-dimensional feature vectors for a 224x224 image
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.vision_model = nn.Sequential(*list(backbone.children())[:-2])

        # Project the CNN features into the transformer model dimension
        # (hidden_size must be divisible by num_heads)
        self.feature_proj = nn.Linear(512, hidden_size)

        # Define the encoder stack over the grid of image patches
        encoder_layer = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, norm=nn.LayerNorm(hidden_size))

        # Define the final output layer (class logits)
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Pass the input image through the vision model: (B, 512, 7, 7)
        vision_output = self.vision_model(x)

        # Flatten the spatial grid into a sequence of 49 patch tokens: (B, 49, 512)
        tokens = vision_output.flatten(2).transpose(1, 2)

        # Pass the patch tokens through the encoder layers
        encoder_output = self.encoder(self.feature_proj(tokens))

        # Pool over the patch tokens and make predictions
        predictions = self.output_layer(encoder_output.mean(dim=1))

        return predictions

We can instantiate it with the appropriate hyperparameters (such as input_size, output_size, hidden_size, num_layers, and num_heads), then pass a batch of input images to the forward() method to generate predictions. Note that this version keeps only the encoder: for classification, self-attention over the CNN’s patch features is enough.

For example:

# Define the model hyperparameters
input_size = 3
output_size = 10
hidden_size = 20
num_layers = 3
num_heads = 4

# Instantiate the model
transformer_cv = TransformerCV(input_size, output_size, hidden_size, num_layers, num_heads)
# Generate some input images: (batch, channels, height, width)
x = torch.randn(10, 3, 224, 224)

# Generate predictions: shape (10, 10) class logits
predictions = transformer_cv(x)

Again, this is just an example of implementing a transformer for a simple CV task in PyTorch. We may want to experiment with different architectures and hyperparameters to find the best model for a specific task.

Key points for using Transformer with CNN and RNN

It’s an exciting convergence journey: upgrading RNN with the Transformer and optimizing CNN with the Transformer. There are several key points to pay close attention to; the list below is important but not exhaustive.

  • Input and output data types: In forecasting and RNN, the input and output data are typically floating-point tensors representing the time series data. In CV and CNN tasks, the input and output data are generally image tensors representing the image’s pixel values.
  • Input and output sizes: Forecasting and RNN tasks typically have smaller input and output sizes than CV and CNN tasks, since time series data is usually one-dimensional while images are two-dimensional.
  • Model architecture: The architecture of a transformer for forecasting and RNN tasks often differs from that for CV and CNN tasks. For example, a transformer for RNN-style tasks may have more layers and attention heads to better capture the temporal dependencies in the data. In contrast, a transformer for CV may be paired with more convolutional layers to better extract features from the images.
  • Hyperparameters: The hyperparameters of a transformer for forecasting and RNN tasks will typically differ from those for CV and CNN tasks. For example, a transformer for RNN-style tasks may have a larger hidden size and more layers to better capture long-term dependencies in the data, while a transformer for CNN-style tasks may have more attention heads to better capture the spatial dependencies in the images.
  • Using encoder and decoder layers: For RNN-style tasks, the encoder layers typically encode the input sequence into a hidden representation, and the decoder layers decode that hidden representation into the output sequence; for temporal data, the decoder is usually also given a causal mask so each step attends only to earlier steps (see the sketch after this list). For CNN-style tasks, the encoder layers typically encode the input image into a hidden representation, and the decoder layers decode it into the output image.
  • The use of the output layer: For RNN, the output layer typically makes predictions based on the output sequence from the decoder layers. For CNN, the output layer is generally used to make predictions based on the output image from the decoder layers.
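
As referenced in the list above, here is a minimal sketch of causal masking, using the helper available in recent PyTorch versions:

import torch
import torch.nn as nn

# Causal (subsequent) mask: each decoding step may only attend to earlier
# steps; the upper triangle is -inf, hiding future positions
tgt_len = 12
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len)
print(tgt_mask)

# It would be passed to a decoder as, e.g.:
#   decoder(tgt, memory, tgt_mask=tgt_mask)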

While both forecasting (by RNN) and CV (by CNN) tasks can be solved using transformers in PyTorch, the specific architectures and hyperparameters of the transformer models may differ depending on the particular task and data.

Extensive Transformer-based Models

The transformer is not only reinventing the RNN and CNN domains; more and more successful transformer-based models keep emerging.

Transformer-based Models (Source: TheAiEdge.io)

The above illustrates the evolution of transformer-based models, including but not limited to BERT transformers, multilingual transformers, non-text-application transformers, modified transformers, and text processing/generation transformers.

The number of AI papers per month on arXiv is growing exponentially, doubling roughly every 24 months. The Transformer appears to be an inflection point accelerating that innovation.

ML/AI arXiv papers per month (source: Predicting the Future of AI)

In a Nutshell

With increasing practical applications, the Transformer is gaining the confidence to upgrade RNN and fuse with CNN. It may also extend beyond RNN and CNN. One Transformer, a Transformer-based unification, is the trend. But there are three things to consider before accelerating.

First, the Transformer itself is still evolving, so there are many opportunities to improve and optimize CNN and RNN use cases in new models. Second, existing CNN and RNN models are rich in features and architectures; it may take time to transform them. Finally, CNN and RNN models are widely used in production, and the migration can be daunting.

[AIGC: 35%]
