|image1|
.. |image1| image:: https://github.com/mansiagr4/gifs/raw/main/new_small_logo.svg
Tabular Classification
======================
This example uses the ``modlee`` package for tabular data
classification. We’ll use a diabetes dataset to show you how to:
1. Prepare the data.
2. Use ``modlee`` for model training.
3. Implement and train a custom model.
4. Evaluate the model.
|Open in Kaggle|
First, we will import the necessary libraries and set up the
environment.
.. code:: python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import os
import modlee
import lightning.pytorch as pl
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.model_selection import train_test_split
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
Now, we will set up the ``modlee`` API key and initialize the ``modlee``
package. You can access your ``modlee`` API key `from the
dashboard `__.
Replace ``replace-with-your-api-key`` with your API key.
.. code:: python
os.environ['MODLEE_API_KEY'] = "replace-with-your-api-key"
modlee.init(api_key=os.environ['MODLEE_API_KEY'])
Now, we will prepare our data. For this example, we will manually
download the diabetes dataset from Kaggle and upload it to the
environment.
Visit the `Diabetes CSV dataset
page `__ on
Kaggle and click the **Download** button to save the dataset
``diabetes.csv`` to your local machine.
Copy the path to that downloaded file, which will be used later.
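If you would like to sanity-check the download before going further, a
quick look at the file with ``pandas`` should show the feature columns
plus the ``Outcome`` target column. The path below is a placeholder;
point it at wherever you saved ``diabetes.csv``.

.. code:: python

   # Optional sanity check of the downloaded CSV (placeholder path)
   preview = pd.read_csv("diabetes.csv")
   print(preview.columns.tolist())  # feature columns plus 'Outcome'
   print(preview.head())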
Define a custom dataset class ``TabularDataset`` for handling our
tabular data.
.. code:: python
class TabularDataset(TensorDataset):
def __init__(self, data, target):
self.data = torch.tensor(data, dtype=torch.float32) # Convert features to tensors
self.target = torch.tensor(target, dtype=torch.long) # Convert labels to long integers for classification
def __len__(self):
return len(self.data) # Return the size of the dataset
def __getitem__(self, idx):
return self.data[idx], self.target[idx] # Return a single sample from the dataset
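As a quick illustration of how this class behaves, you can construct it
from a small array. The values below are made up purely for
demonstration.

.. code:: python

   import numpy as np

   # Made-up example: 3 samples with 2 features each
   example_features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
   example_labels = np.array([0, 1, 0])
   example_dataset = TabularDataset(example_features, example_labels)
   print(len(example_dataset))  # 3
   print(example_dataset[0])    # (tensor([1., 2.]), tensor(0))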
We can now load and preprocess the data and create the dataloaders.
.. code:: python
def get_diabetes_dataloaders(batch_size=32, val_split=0.2, shuffle=True):
dataset_path = "path-to-dataset"
df = pd.read_csv(dataset_path) # Load the CSV file into a DataFrame
X = df.drop('Outcome', axis=1).values # Features (X) - drop the target column
y = df['Outcome'].values # Labels (y) - the target column
scaler = StandardScaler() # Initialize the scaler for feature scaling
X_scaled = scaler.fit_transform(X) # Scale the features
dataset = TabularDataset(X_scaled, y) # Create a TabularDataset instance
# Split the dataset into training and validation sets
dataset_size = len(dataset)
val_size = int(val_split * dataset_size)
train_size = dataset_size - val_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
# Create DataLoader instances for training and validation
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=shuffle)
return train_dataloader, val_dataloader
# Generate the DataLoaders
train_dataloader, val_dataloader = get_diabetes_dataloaders(batch_size=32, val_split=0.2, shuffle=True)
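As an optional check, you can pull a single batch from the training
dataloader and confirm the tensor shapes before moving on to the model.

.. code:: python

   # Inspect one batch to verify shapes and dtypes
   X_batch, y_batch = next(iter(train_dataloader))
   print(X_batch.shape, X_batch.dtype)  # (batch_size, num_features), torch.float32
   print(y_batch.shape, y_batch.dtype)  # (batch_size,), torch.int64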
Next, we will define our model. We offer two different approaches for
selecting a model:
**Option 1: Use a Recommended Modlee Model**
If you’d like to start with a benchmark solution, Modlee provides
pre-trained and optimized models for specific tasks. You can retrieve a
recommended model as follows:
.. code:: python
recommender = modlee.recommender.TabularClassificationRecommender(num_classes=2)
recommender.fit(train_dataloader)
recommended_modlee_model = recommender.model
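If you choose the recommended model, you can print it to inspect the
suggested architecture; it can then be trained with the same
``PyTorch Lightning`` loop shown later in this example.

.. code:: python

   # Inspect the architecture of the recommended model
   print(recommended_modlee_model)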
**Option 2: Define Your Own Modlee Model**
If you want to experiment with a custom architecture, you can define
your own model. Below, we define a simple feedforward neural network
called ``TabularClassifier``, which will be integrated with Modlee’s
framework.
.. code:: python
class TabularClassifier(modlee.model.TabularClassificationModleeModel):
def __init__(self, input_dim, num_classes=2):
super().__init__()
self.fc1 = torch.nn.Linear(input_dim, 128)
self.dropout1 = torch.nn.AlphaDropout(0.1)
self.fc2 = torch.nn.Linear(128, 64)
self.dropout2 = torch.nn.AlphaDropout(0.1)
self.fc3 = torch.nn.Linear(64, 32)
self.dropout3 = torch.nn.AlphaDropout(0.1)
self.fc4 = torch.nn.Linear(32, num_classes)
self.loss_fn = torch.nn.CrossEntropyLoss()
def forward(self, x):
x = torch.selu(self.fc1(x))
x = self.dropout1(x)
x = torch.selu(self.fc2(x))
x = self.dropout2(x)
x = torch.selu(self.fc3(x))
x = self.dropout3(x)
x = self.fc4(x)
return x
def training_step(self, batch, batch_idx):
x, y_target = batch
y_pred = self(x)
loss = self.loss_fn(y_pred, y_target.squeeze())
return {"loss": loss}
def validation_step(self, val_batch, batch_idx):
x, y_target = val_batch
y_pred = self(x)
val_loss = self.loss_fn(y_pred, y_target.squeeze())
return {'val_loss': val_loss}
def configure_optimizers(self):
optimizer = torch.optim.SGD(self.parameters(), lr=0.001, momentum=0.9)
return optimizer
Next, we can train and evaluate our model using ``PyTorch Lightning``
for one epoch. For this example, we continue with the custom
``TabularClassifier`` defined above.
.. code:: python
# Get the input dimension
original_train_dataset = train_dataloader.dataset.dataset
input_dim = len(original_train_dataset[0][0])
num_classes = 2
# Initialize the Modlee model
modlee_model = TabularClassifier(input_dim=input_dim, num_classes=num_classes)
# Train the model using PyTorch Lightning
with modlee.start_run() as run:
trainer = pl.Trainer(max_epochs=1)
trainer.fit(
model=modlee_model,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader
)
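To complete the final step of this example, we can evaluate the trained
model beyond the validation loss logged during training. The sketch
below is plain PyTorch rather than a Modlee API: it computes
classification accuracy over the validation dataloader.

.. code:: python

   # Evaluate the trained model: accuracy over the validation set
   modlee_model.eval()
   correct, total = 0, 0
   with torch.no_grad():
       for X_batch, y_batch in val_dataloader:
           logits = modlee_model(X_batch)
           preds = torch.argmax(logits, dim=1)
           correct += (preds == y_batch).sum().item()
           total += y_batch.size(0)
   print(f"Validation accuracy: {correct / total:.3f}")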
Now, we inspect the artifacts saved by Modlee, including the model graph
and various statistics. With Modlee, your training assets are
automatically saved, preserving valuable insights for future reference
and collaboration.
.. code:: python
import sys
# Get the path to the last run's saved data
last_run_path = modlee.last_run_path()
print(f"Run path: {last_run_path}")
# Get the path to the saved artifacts
artifacts_path = os.path.join(last_run_path, 'artifacts')
artifacts = os.listdir(artifacts_path)
print(f"Saved artifacts: {artifacts}")
# Set the artifacts path as an environment variable
os.environ['ARTIFACTS_PATH'] = artifacts_path
# Add the artifacts directory to the system path
sys.path.insert(0, artifacts_path)
.. code:: python
# Print out the first few lines of the model
print("Model graph:")
.. code:: shell
!sed -n -e 1,15p $ARTIFACTS_PATH/model_graph.py
!echo " ..."
!sed -n -e 58,68p $ARTIFACTS_PATH/model_graph.py
!echo " ..."
.. code:: python
# Print the first lines of the data metafeatures
print("Data metafeatures:")
.. code:: shell
!head -20 $ARTIFACTS_PATH/stats_rep
.. |Open in Kaggle| image:: https://kaggle.com/static/images/open-in-kaggle.svg
:target: https://www.kaggle.com/code/modlee/modlee-tabular-classification