Tabular Regression

In this tutorial, we will walk through the process of building a tabular regression model using the Modlee package. We will utilize the California Housing dataset, which contains information about housing prices in California, to predict house prices based on various features.

First, we import the necessary libraries and set up the environment.

import torch
import os
import modlee
import lightning.pytorch as pl
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

Next, we set the Modlee API key and initialize the modlee package. You can find your API key on the Modlee dashboard.

Replace replace-with-your-api-key with your API key.

modlee.init(api_key="replace-with-your-api-key")
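To keep the key out of source files, one option is to read it from an environment variable. The following is a minimal sketch: the helper get_api_key and the variable name MODLEE_API_KEY are assumptions made for illustration, not part of the Modlee API.

```python
import os

def get_api_key(env_var="MODLEE_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is unset.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Intended usage (not executed here):
# modlee.init(api_key=get_api_key())
```

Failing early with a clear message avoids passing an empty key to modlee.init.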

Now, we will load the California Housing dataset, which we will use for our regression task. This dataset is readily available through the fetch_california_housing function from the sklearn.datasets module. We will convert the features and target values into PyTorch tensors for compatibility with our model.

def load_california_housing_data():
    # Fetch the California housing dataset
    data = fetch_california_housing()
    X, y = data.data, data.target  # Separate features (X) and target (y)

    # Convert features and target to PyTorch tensors for compatibility
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32)

    return X, y

# Load the data
X, y = load_california_housing_data()

Once we have the data, the next step is to split it into training and testing sets, holding out 20% of the samples for testing. This lets us train the model on one subset and measure its performance on data it has not seen. Fixing random_state makes the split reproducible.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
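To see what the split produces, here is a small sketch on synthetic data (100 samples with 8 features, standing in for the housing data). The names X_demo and y_demo are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: row i of X_demo starts with the value 8 * i,
# and y_demo[i] == i, so feature/target alignment is easy to check.
X_demo = np.arange(800, dtype=np.float32).reshape(100, 8)
y_demo = np.arange(100, dtype=np.float32)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (80, 8) (20, 8)

# Rows are shuffled, but each feature row stays paired with its target
rows_aligned = np.array_equal(X_tr[:, 0], y_tr * 8)
```

With test_size=0.2, 80 samples go to training and 20 to testing, and features and targets are shuffled together.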

After splitting the data, we wrap the training and testing sets in TensorDataset objects and create a DataLoader for each. The DataLoaders serve the data in batches of 32 and shuffle only the training set.

# Create TensorDataset objects for training and testing data
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

# Create DataLoader objects for batching and shuffling
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
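A quick way to confirm what a DataLoader yields is to pull one batch and check its shapes. This sketch uses random tensors in place of the housing data; X_demo, y_demo, and demo_loader are illustrative names.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 synthetic samples with 8 features each, plus one target per sample
X_demo = torch.randn(100, 8)
y_demo = torch.randn(100)
demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=False)

# Each batch is a (features, targets) pair
xb, yb = next(iter(demo_loader))
print(xb.shape, yb.shape)  # torch.Size([32, 8]) torch.Size([32])
print(len(demo_loader))    # 4 batches: 32 + 32 + 32 + 4 samples
```

Note that the targets arrive with shape (batch,), not (batch, 1), which matters when computing the loss.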

Now it’s time to define our regression model using Modlee’s framework. We offer two different approaches for selecting a model:

Option 1: Use a Recommended Modlee Model

If you’d like to start with a benchmark solution, Modlee provides pre-trained and optimized models for specific tasks. You can retrieve a recommended model as follows:

recommender = modlee.recommender.from_modality_task(
    modality='tabular',
    task='regression'
    )
recommender.fit(train_dataloader)
recommended_modlee_model = recommender.model

Option 2: Define Your Own Modlee Model

If you want to experiment with a custom architecture, you can define your own model. We will create a class called TabularRegression that inherits from modlee.model.TabularRegressionModleeModel. This class will define our neural network architecture and the training process.

class TabularRegression(modlee.model.TabularRegressionModleeModel):
    def __init__(self, input_dim):
        super().__init__()  # Initialize the parent class
        # Define the neural network architecture
        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_dim, 128),  # First layer with 128 neurons
            torch.nn.ReLU(),  # ReLU activation function
            torch.nn.Linear(128, 64),  # Second layer with 64 neurons
            torch.nn.ReLU(),  # ReLU activation function
            torch.nn.Linear(64, 1)  # Output layer predicting a single value
        )
        self.loss_fn = torch.nn.MSELoss()  # Mean Squared Error loss function

    def forward(self, x):
        return self.model(x)  # Forward pass through the model

    def training_step(self, batch):
        x, y = batch  # Unpack the batch
        preds = self.forward(x).squeeze(-1)  # Predict; (batch, 1) -> (batch,)
        loss = self.loss_fn(preds, y)  # Compute loss
        return loss  # Return the loss

    def validation_step(self, batch):
        x, y = batch  # Unpack the batch
        preds = self.forward(x).squeeze(-1)  # Predict; (batch, 1) -> (batch,)
        loss = self.loss_fn(preds, y)  # Compute loss
        return loss  # Return the loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)  # Optimizer configuration

modlee_model = TabularRegression(input_dim=X_train.shape[1])
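The squeeze in the training and validation steps matters because the output layer produces shape (batch, 1) while the targets have shape (batch,). The sketch below, using placeholder tensors, shows what happens to the shapes with and without it.

```python
import torch

preds = torch.zeros(32, 1)  # shape produced by a final Linear(64, 1) layer
targets = torch.ones(32)    # shape of y in each batch

# Without squeezing, (32, 1) broadcasts against (32,) to give (32, 32),
# so every prediction is compared with every target.
print((preds - targets).shape)              # torch.Size([32, 32])

# Dropping the trailing dimension restores element-wise pairing.
print((preds.squeeze(-1) - targets).shape)  # torch.Size([32])
```

A loss computed on the broadcast (32, 32) result would still be a scalar, so the mistake is easy to miss without a shape check.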

With the model defined, we can proceed to train it. We use pl.Trainer from PyTorch Lightning, which simplifies the training process, and set max_epochs=1 to keep the run short. For this example, we continue with the recommended model; to train the custom model instead, pass modlee_model to trainer.fit.

with modlee.start_run() as run:  # Start a training run
    trainer = pl.Trainer(max_epochs=1)  # Set up the trainer
    trainer.fit(  # Start training the model
        model=recommended_modlee_model,
        train_dataloaders=train_dataloader,
        val_dataloaders=test_dataloader  # Use test data for validation
    )
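Once training finishes, a held-out error figure can be computed with a plain PyTorch loop. This is a minimal sketch, not part of Modlee: evaluate_rmse is a hypothetical helper, and it assumes the model returns predictions of shape (batch, 1) and that the model and batches are on the same device.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def evaluate_rmse(model, dataloader):
    """Root mean squared error of `model` over every sample in `dataloader`."""
    model.eval()  # Disable training-only behavior such as dropout
    squared_error, count = 0.0, 0
    with torch.no_grad():  # No gradients are needed for evaluation
        for x, y in dataloader:
            preds = model(x).squeeze(-1)  # (batch, 1) -> (batch,)
            squared_error += torch.sum((preds - y) ** 2).item()
            count += y.numel()
    return (squared_error / count) ** 0.5

# Intended usage (not executed here):
# rmse = evaluate_rmse(recommended_modlee_model, test_dataloader)
```

Accumulating the squared error per sample, rather than averaging per-batch losses, keeps the result correct even when the last batch is smaller than the others.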

After training, we inspect the artifacts saved by Modlee, including the model graph and various statistics. With Modlee, your training assets are automatically saved, preserving valuable insights for future reference and collaboration.

last_run_path = modlee.last_run_path()
print(f"Run path: {last_run_path}")
artifacts_path = os.path.join(last_run_path, 'artifacts')
artifacts = sorted(os.listdir(artifacts_path))
print(f"Saved artifacts: {artifacts}")
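Because os.listdir only shows the top level, a recursive listing can help when artifacts are stored in subdirectories. The helper below, list_files, is a hypothetical name and uses only the standard library; it makes no assumption about which files Modlee writes.

```python
import os

def list_files(root):
    """Return sorted paths, relative to `root`, of every file beneath it."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full_path, root))
    return sorted(paths)

# Intended usage (not executed here):
# for path in list_files(artifacts_path):
#     print(path)
```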