
Time Series Regression

In this tutorial, we will guide you through the process of implementing a time series regression model using the Modlee framework along with PyTorch.

The goal is to predict power consumption based on various environmental factors, such as temperature, humidity, wind speed, and solar radiation.

Note: Currently, Modlee does not support recurrent LSTM operations. Instead, we will focus on non-recurrent models suited for time series data, such as convolutional neural networks (CNNs) and transformers, which can effectively capture sequential patterns without requiring recurrent layers.


First, we will import the necessary libraries and set up the environment.

import torch
import os
import modlee
import lightning.pytorch as pl
from torch.utils.data import DataLoader, TensorDataset
import pandas as pd
from sklearn.model_selection import train_test_split

Now, we will set up the Modlee API key and initialize the Modlee package. You can access your API key from the Modlee dashboard.

Replace replace-with-your-api-key with your API key.

modlee.init(api_key="replace-with-your-api-key")

The dataset used in this tutorial includes hourly time series data that links environmental conditions to power consumption across three zones. Each record contains a timestamp, temperature, humidity, wind speed, and measures of solar radiation, alongside the power consumption (in watts) for each zone.

This data allows for the exploration of relationships between weather patterns and energy usage, aiding in the development of predictive models.

For this example, we will manually download the dataset from Kaggle and upload it to the environment. Visit the Time Series Regression dataset page on Kaggle and click the Download button to save the dataset to your local machine.

Copy the path to the downloaded file, which will be used later.
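Before building the pipeline, it can help to peek at the raw file. The snippet below is a minimal sanity check; the filename powerconsumption.csv is a placeholder, so point it at wherever you saved the download.

# Quick look at the raw CSV (path is a placeholder for your download location)
raw = pd.read_csv('powerconsumption.csv')
print(raw.columns.tolist())
# Expected columns: Datetime, Temperature, Humidity, WindSpeed,
# GeneralDiffuseFlows, DiffuseFlows, PowerConsumption_Zone1/2/3
print(raw.head())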

Next, we need to load the power consumption dataset. This dataset contains various features related to environmental conditions and their corresponding power consumption values. The load_power_consumption_data function is designed to read the CSV file, process the data, and create time series sequences.

We then select the relevant features from the dataset for our input variables, X, which include temperature, humidity, wind speed, and solar radiation values. The output variable, y, is calculated as the mean power consumption across three different zones.

# Function to load the power consumption dataset and prepare it for training
def load_power_consumption_data(file_path, seq_length):
    # Load the dataset from the specified CSV file
    data = pd.read_csv(file_path)
    # Convert the 'Datetime' column to datetime objects
    data['Datetime'] = pd.to_datetime(data['Datetime'])
    # Set the 'Datetime' column as the index for the DataFrame
    data.set_index('Datetime', inplace=True)

    # Extract relevant features for prediction and target variable
    X = data[['Temperature', 'Humidity', 'WindSpeed', 'GeneralDiffuseFlows', 'DiffuseFlows']].values
    # Calculate the average power consumption across the three zones as the target variable
    y = data[['PowerConsumption_Zone1', 'PowerConsumption_Zone2', 'PowerConsumption_Zone3']].mean(axis=1).values
    # Convert features and target to PyTorch tensors
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.float32)

    # Create sequences of the specified length for input features
    num_samples = X.shape[0] - seq_length + 1
    X_seq = torch.stack([X[i:i + seq_length] for i in range(num_samples)])
    y_seq = y[seq_length - 1:]  # Align target variable with sequences

    return X_seq, y_seq
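To see what this windowing produces, here is a quick shape check on synthetic data (dummy tensors, not the real dataset): with N rows and a window of length seq_length, the function yields N - seq_length + 1 overlapping sequences, each paired with the target at the window's final timestep.

# Shape check with dummy data: 100 timesteps, 5 features, window of 20
dummy_X = torch.randn(100, 5)
dummy_y = torch.randn(100)
seq_length = 20
num_samples = dummy_X.shape[0] - seq_length + 1  # 100 - 20 + 1 = 81 windows
X_seq = torch.stack([dummy_X[i:i + seq_length] for i in range(num_samples)])
y_seq = dummy_y[seq_length - 1:]
print(X_seq.shape, y_seq.shape)  # torch.Size([81, 20, 5]) torch.Size([81])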

Once we have the preprocessed data, we proceed to create PyTorch datasets and DataLoaders.

Here, we load the power consumption data from the specified CSV file. We create a TensorDataset to hold the features and labels. To split the dataset into training and validation sets, we use the train_test_split function from sklearn.

# Define the path to the dataset
file_path = 'path-to-powerconsumption.csv'
# Load the power consumption data with a specified sequence length
X, y = load_power_consumption_data(file_path, 20)

# Create a TensorDataset for the training data
dataset = TensorDataset(X, y)
# Split dataset indices into training and validation sets
train_indices, val_indices = train_test_split(range(len(dataset)), test_size=0.2, random_state=42)

# Create training and validation datasets
train_dataset = TensorDataset(X[train_indices], y[train_indices])
val_dataset = TensorDataset(X[val_indices], y[val_indices])

# Create DataLoader for batch processing during training
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)
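Before defining the model, it is worth confirming the shapes each batch will have, since the model below expects inputs of shape (batch, seq_length, num_features). A quick check, assuming the loaders above were built successfully:

# Inspect one training batch
x_batch, y_batch = next(iter(train_dataloader))
print(x_batch.shape)  # torch.Size([32, 20, 5])
print(y_batch.shape)  # torch.Size([32])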

We will now define a multivariate time series regression model by creating a class that inherits from modlee.model.TimeseriesRegressionModleeModel. This class uses a Transformer-based architecture to predict a continuous value.

We initialize a TransformerEncoder with multi-head attention to process sequential dependencies.

class TransformerTimeSeriesRegressor(modlee.model.TimeseriesRegressionModleeModel):
    def __init__(self, input_dim, seq_length, num_heads=1, hidden_dim=64):
        super().__init__()
        # Initialize a Transformer encoder layer; batch_first=True makes it expect inputs of shape (batch, seq, features)
        self.encoder_layer = torch.nn.TransformerEncoderLayer(
            d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim, batch_first=True
        )
        # Stack encoder layers to form the Transformer encoder
        self.transformer_encoder = torch.nn.TransformerEncoder(self.encoder_layer, num_layers=2)
        # Define a fully connected layer to map encoded features to a single output value
        self.fc = torch.nn.Linear(input_dim * seq_length, 1)
        # Set the loss function to mean squared error for regression tasks
        self.loss_fn = torch.nn.MSELoss()

    def forward(self, x):
        # Pass input through the Transformer encoder
        x = self.transformer_encoder(x)
        # Flatten the output and pass it through the fully connected layer
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

    def training_step(self, batch):
        # Get input and target from batch
        x, y = batch
        # Generate predictions, squeeze to match the target shape (batch,), and compute loss
        preds = self.forward(x).squeeze(-1)
        loss = self.loss_fn(preds, y)
        return loss

    def validation_step(self, batch):
        # Get input and target from batch
        x, y = batch
        # Generate predictions, squeeze to match the target shape (batch,), and compute loss
        preds = self.forward(x).squeeze(-1)
        loss = self.loss_fn(preds, y)
        return loss

    def configure_optimizers(self):
        # Use the Adam optimizer with a learning rate of 1e-3
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = TransformerTimeSeriesRegressor(input_dim=5, seq_length=20)
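As a quick wiring check, you can push a dummy batch through the untrained model and confirm the output shape; this is purely a sanity test, not part of the training pipeline.

# Forward a dummy batch through the untrained model
dummy_batch = torch.randn(4, 20, 5)  # (batch, seq_length, input_dim)
with torch.no_grad():
    out = model(dummy_batch)
print(out.shape)  # torch.Size([4, 1])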

With our model defined, we can now train it using the PyTorch Lightning Trainer. This trainer simplifies the training process by managing the training loops and logging.

# Start a training run with Modlee
with modlee.start_run() as run:
    trainer = pl.Trainer(max_epochs=1)  # Set up the trainer
    trainer.fit(
        model=model,
        train_dataloaders=train_dataloader,  # Load training data
        val_dataloaders=val_dataloader  # Load validation data
    )

After training, we inspect the artifacts saved by Modlee, including the model graph and various statistics. With Modlee, your training assets are automatically saved, preserving valuable insights for future reference and collaboration.

last_run_path = modlee.last_run_path()
print(f"Run path: {last_run_path}")
artifacts_path = os.path.join(last_run_path, 'artifacts')
artifacts = sorted(os.listdir(artifacts_path))
print(f"Saved artifacts: {artifacts}")