Text Regression

This tutorial will walk you through building an end-to-end text regression pipeline using the Modlee package and PyTorch Lightning.

We’ll use the Yelp Polarity dataset, which contains customer reviews labeled as positive or negative. By casting these binary labels to continuous targets, we can build a simple regression model that predicts a sentiment value from the text.


First, we will import the necessary libraries and set up the environment.

import os
import torch
import modlee
import lightning.pytorch as pl
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from datasets import load_dataset

Now, we will set up the modlee API key and initialize the modlee package. You can access your modlee API key from the dashboard.

Replace replace-with-your-api-key with your API key.

modlee.init(api_key="replace-with-your-api-key")

Text data needs to be tokenized (converted into numerical format) before it can be used by machine learning models. We use a pre-trained BERT tokenizer for this.

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

We define a function to preprocess raw text data using the tokenizer. Tokenization ensures that the input data has a uniform format and length, making it suitable for training deep learning models.

# Tokenization function: converts text into numerical format
def tokenize_texts(texts, tokenizer, max_length=20):
    encodings = tokenizer(
        texts,
        truncation=True,  # Truncate if too long
        padding="max_length",  # Pad if too short
        max_length=max_length,
        return_tensors="pt",  # Return PyTorch tensors
        add_special_tokens=True,  # Include special tokens like [CLS], [SEP]
    )
    input_ids = encodings['input_ids']
    attention_mask = encodings['attention_mask']

    return input_ids, attention_mask
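
As a quick sanity check, you can run a couple of short example strings through the function and confirm the output shapes are (batch_size, max_length). The reviews below are made up for illustration.

# Illustrative sanity check: two made-up reviews, padded/truncated to max_length=20
sample_ids, sample_mask = tokenize_texts(["Great food!", "Terrible service."], tokenizer)
print(sample_ids.shape, sample_mask.shape)  # torch.Size([2, 20]) torch.Size([2, 20])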

In this step, we load a text dataset using Hugging Face’s datasets library. We are using the Yelp Polarity dataset, which consists of customer reviews of businesses labeled as positive or negative.

def load_real_data(dataset_name):
    # Load the first 80% of the training split of the requested dataset
    # (here, 'yelp_polarity').
    dataset = load_dataset(dataset_name, split='train[:80%]')

    # Extract the 'text' column, which contains the review texts.
    texts = dataset['text']

    # Extract the 'label' column (0 = negative, 1 = positive).
    targets = dataset['label']

    # Cast the binary labels to floats so the task can be framed as regression.
    targets = [float(label) for label in targets]

    # Return the texts and their corresponding regression targets.
    return texts, targets

We tokenize the dataset and split it into training and testing sets. This step ensures that we have separate datasets for training and evaluation.

# Load 'yelp_polarity' dataset
texts, targets = load_real_data(dataset_name="yelp_polarity")

# Use only the first 100 samples for simplicity
texts = texts[:100]
targets = targets[:100]

# Tokenize the text into input IDs and attention masks
input_ids, attention_masks = tokenize_texts(texts, tokenizer)

# Split the data into training and testing sets (80% train, 20% test)
X_train_ids, X_test_ids, X_train_masks, X_test_masks, y_train, y_test = train_test_split(
    input_ids, attention_masks, targets, test_size=0.2, random_state=42
)
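
To confirm the split, print the resulting set sizes; with 100 samples and test_size=0.2, we expect 80 training and 20 test examples.

# Verify the 80/20 split on our 100-sample subset
print(f"Train samples: {len(y_train)}, test samples: {len(y_test)}")  # 80, 20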

We prepare PyTorch DataLoader objects to feed data into the model during training.

# Define a custom PyTorch Dataset
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create training and testing datasets
# Create training and testing datasets.
# Token IDs stay integers (torch.long) for the embedding layer; targets get a
# trailing dimension so they match the model's (batch_size, 1) output.
train_dataset = TextDataset(X_train_ids.long(), torch.tensor(y_train, dtype=torch.float).unsqueeze(1))
test_dataset = TextDataset(X_test_ids.long(), torch.tensor(y_test, dtype=torch.float).unsqueeze(1))

# Create DataLoaders for batch processing
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
# Attach the tokenizer to the dataloader so downstream Modlee components can access it
train_dataloader.initial_tokenizer = tokenizer
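
Before training, it helps to pull a single batch from the loader and confirm its shapes: token IDs of shape (32, 20) and targets of shape (32, 1).

# Pull one batch to verify shapes before training
batch_ids, batch_targets = next(iter(train_dataloader))
print(batch_ids.shape, batch_targets.shape)  # torch.Size([32, 20]) torch.Size([32, 1])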

Now, we create our model. We offer two different approaches for selecting a model:

Option 1: Use a Recommended Modlee Model

If you’d like to start with a benchmark solution, Modlee provides pre-trained and optimized models for specific tasks. You can retrieve a recommended model as follows:

recommender = modlee.recommender.from_modality_task(
    modality='text',
    task='regression',
)
recommender.fit(train_dataloader)
recommended_modlee_model = recommender.model
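
Because the recommended model is a standard PyTorch module, you can print it to inspect the suggested architecture before training.

# Inspect the recommended architecture (any torch.nn.Module prints its layers)
print(recommended_modlee_model)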

Option 2: Define Your Own Modlee Model

If you want to experiment with a custom architecture, you can define your own model. Below, we create a custom text regression model by inheriting from Modlee’s TextRegressionModleeModel.

# Define a simple MLP-based regression model using Modlee
class MLPTextRegressionModel(modlee.model.TextRegressionModleeModel):
    def __init__(self, vocab_size, embed_dim=50, tokenizer=None, max_length=20):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim, padding_idx=tokenizer.pad_token_id if tokenizer else None)
        self.fc1 = torch.nn.Linear(embed_dim * max_length, 256)
        self.fc2 = torch.nn.Linear(256, 64)
        self.fc3 = torch.nn.Linear(64, 1)  # Single output for regression
        self.loss_fn = torch.nn.MSELoss()  # Mean Squared Error loss
        self.max_length = max_length

    def forward(self, input_ids):
        if isinstance(input_ids, list):  # Convert list to tensor if needed
            input_ids = torch.stack([torch.tensor(item, dtype=torch.long) for item in input_ids])
        elif not isinstance(input_ids, torch.Tensor):
            input_ids = torch.tensor(input_ids, dtype=torch.long)

        if input_ids.dim() == 3:  # Ensure correct shape
            input_ids = input_ids.view(-1, self.max_length)

        embedded = self.embedding(input_ids.long()).flatten(start_dim=1)
        x = torch.nn.functional.relu(self.fc1(embedded))
        x = torch.nn.functional.relu(self.fc2(x))
        return self.fc3(x)  # Output a single continuous value

    def training_step(self, batch, batch_idx):
        input_ids, targets = batch
        preds = self.forward(input_ids)
        return self.loss_fn(preds, targets)  # Compute loss

    def validation_step(self, batch, batch_idx):
        input_ids, targets = batch
        preds = self.forward(input_ids)
        return self.loss_fn(preds, targets)  # Compute validation loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Initialize model
modlee_model = MLPTextRegressionModel(vocab_size=tokenizer.vocab_size, tokenizer=tokenizer)
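
A quick forward pass on one batch is a useful smoke test for a custom architecture; the output should have shape (batch_size, 1), one continuous prediction per review.

# Smoke test: run one batch through the custom model without tracking gradients
sample_ids, _ = next(iter(train_dataloader))
with torch.no_grad():
    preds = modlee_model(sample_ids)
print(preds.shape)  # torch.Size([32, 1])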

We now use PyTorch Lightning’s Trainer class to handle training. For this example, we proceed with the recommended model from Option 1.

# Train the model using Modlee and PyTorch Lightning's Trainer
with modlee.start_run() as run:
    trainer = pl.Trainer(max_epochs=1) # Train for one epoch
    trainer.fit(
        model=recommended_modlee_model,
        train_dataloaders=train_dataloader,
        val_dataloaders=test_dataloader
    )
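
Once training completes, you can score unseen text. The sketch below assumes the recommended model, like our custom one above, takes a tensor of token IDs and returns one value per input; the review itself is made up for illustration.

# Illustrative inference: tokenize a new review and predict its sentiment score
new_ids, _ = tokenize_texts(["The pasta was amazing and the staff were friendly."], tokenizer)
recommended_modlee_model.eval()
with torch.no_grad():
    score = recommended_modlee_model(new_ids)
print(score)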

After training, we inspect the artifacts saved by Modlee, including the model graph and various statistics. With Modlee, your training assets are automatically saved, preserving valuable insights for future reference and collaboration.

last_run_path = modlee.last_run_path()
print(f"Run path: {last_run_path}")
artifacts_path = os.path.join(last_run_path, 'artifacts')
artifacts = sorted(os.listdir(artifacts_path))
print(f"Saved artifacts: {artifacts}")