|image1|

.. |image1| image:: https://github.com/mansiagr4/gifs/raw/main/new_small_logo.svg

.. |Open in Kaggle| image:: https://kaggle.com/static/images/open-in-kaggle.svg
   :target: https://www.kaggle.com/code/modlee/modlee-tabular-classification

Tabular Classification
======================

This example uses the ``modlee`` package for tabular data
classification. We'll use a diabetes dataset to show you how to:

1. Prepare the data.
2. Use ``modlee`` for model training.
3. Implement and train a custom model.
4. Evaluate the model.

|Open in Kaggle|

First, we will import the necessary libraries and set up the
environment.

.. code:: python

   import pandas as pd
   from sklearn.preprocessing import StandardScaler
   import torch
   import os
   import modlee
   import lightning.pytorch as pl
   from torch.utils.data import DataLoader, TensorDataset, random_split
   from sklearn.model_selection import train_test_split
   import ssl

   ssl._create_default_https_context = ssl._create_unverified_context

Now, we will set up the ``modlee`` API key and initialize the
``modlee`` package. You can access your ``modlee`` API key from the
dashboard. Replace ``replace-with-your-api-key`` with your API key.

.. code:: python

   os.environ['MODLEE_API_KEY'] = "replace-with-your-api-key"
   modlee.init(api_key=os.environ['MODLEE_API_KEY'])

Now, we will prepare our data. For this example, we will manually
download the diabetes dataset from Kaggle and upload it to the
environment. Visit the Diabetes CSV dataset page on Kaggle and click
the **Download** button to save the dataset ``diabetes.csv`` to your
local machine. Copy the path to the downloaded file, which will be
used later.

Define a custom dataset class ``TabularDataset`` for handling our
tabular data.

.. code:: python

   class TabularDataset(TensorDataset):
       def __init__(self, data, target):
           self.data = torch.tensor(data, dtype=torch.float32)  # Convert features to tensors
           self.target = torch.tensor(target, dtype=torch.long)  # Convert labels to long integers for classification

       def __len__(self):
           return len(self.data)  # Return the size of the dataset

       def __getitem__(self, idx):
           return self.data[idx], self.target[idx]  # Return a single sample from the dataset

We can now load and preprocess the data, and also create the
dataloaders.

.. code:: python

   def get_diabetes_dataloaders(batch_size=32, val_split=0.2, shuffle=True):
       dataset_path = "path-to-dataset"  # Replace with the path to your downloaded diabetes.csv
       df = pd.read_csv(dataset_path)  # Load the CSV file into a DataFrame
       X = df.drop('Outcome', axis=1).values  # Features (X) - drop the target column
       y = df['Outcome'].values  # Labels (y) - the target column

       scaler = StandardScaler()  # Initialize the scaler for feature scaling
       X_scaled = scaler.fit_transform(X)  # Scale the features

       dataset = TabularDataset(X_scaled, y)  # Create a TabularDataset instance

       # Split the dataset into training and validation sets
       dataset_size = len(dataset)
       val_size = int(val_split * dataset_size)
       train_size = dataset_size - val_size
       train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

       # Create DataLoader instances for training and validation
       train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
       val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=shuffle)

       return train_dataloader, val_dataloader

   # Generate the DataLoaders
   train_dataloader, val_dataloader = get_diabetes_dataloaders(batch_size=32, val_split=0.2, shuffle=True)
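Before moving on to model selection, it can help to confirm that the
dataloaders yield batches of the expected shape. The check below is a
minimal sketch that assumes the ``train_dataloader`` created above; the
exact sizes printed depend on your batch size and the number of feature
columns in ``diabetes.csv``.

.. code:: python

   # Optional sanity check (a minimal sketch, assuming the train_dataloader created above)
   X_batch, y_batch = next(iter(train_dataloader))
   print(f"Feature batch shape: {X_batch.shape}")        # (batch_size, num_features)
   print(f"Label batch shape: {y_batch.shape}")          # (batch_size,)
   print(f"Unique labels: {y_batch.unique().tolist()}")  # expected: [0, 1]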
Next, we will define our model. We offer two different approaches for
selecting a model:

**Option 1: Use a Recommended Modlee Model**

If you'd like to start with a benchmark solution, Modlee provides
pre-trained and optimized models for specific tasks. You can retrieve a
recommended model as follows:

.. code:: python

   recommender = modlee.recommender.TabularClassificationRecommender(num_classes=2)
   recommender.fit(train_dataloader)
   recommended_modlee_model = recommender.model

**Option 2: Define Your Own Modlee Model**

If you want to experiment with a custom architecture, you can define
your own model. Below, we create a simple feedforward neural network
called ``TabularClassifier``. This model will be integrated with
Modlee's framework.

.. code:: python

   class TabularClassifier(modlee.model.TabularClassificationModleeModel):
       def __init__(self, input_dim, num_classes=2):
           super().__init__()
           self.fc1 = torch.nn.Linear(input_dim, 128)
           self.dropout1 = torch.nn.AlphaDropout(0.1)
           self.fc2 = torch.nn.Linear(128, 64)
           self.dropout2 = torch.nn.AlphaDropout(0.1)
           self.fc3 = torch.nn.Linear(64, 32)
           self.dropout3 = torch.nn.AlphaDropout(0.1)
           self.fc4 = torch.nn.Linear(32, num_classes)
           self.loss_fn = torch.nn.CrossEntropyLoss()

       def forward(self, x):
           x = torch.selu(self.fc1(x))
           x = self.dropout1(x)
           x = torch.selu(self.fc2(x))
           x = self.dropout2(x)
           x = torch.selu(self.fc3(x))
           x = self.dropout3(x)
           x = self.fc4(x)
           return x

       def training_step(self, batch, batch_idx):
           x, y_target = batch
           y_pred = self(x)
           loss = self.loss_fn(y_pred, y_target.squeeze())
           return {"loss": loss}

       def validation_step(self, val_batch, batch_idx):
           x, y_target = val_batch
           y_pred = self(x)
           val_loss = self.loss_fn(y_pred, y_target.squeeze())
           return {'val_loss': val_loss}

       def configure_optimizers(self):
           optimizer = torch.optim.SGD(self.parameters(), lr=0.001, momentum=0.9)
           return optimizer

Next, we can train and evaluate our model using ``PyTorch Lightning``
for one epoch. For this example, we'll continue as if we chose the
custom model.

.. code:: python

   # Get the input dimension
   original_train_dataset = train_dataloader.dataset.dataset
   input_dim = len(original_train_dataset[0][0])
   num_classes = 2

   # Initialize the Modlee model
   modlee_model = TabularClassifier(input_dim=input_dim, num_classes=num_classes)

   # Train the model using PyTorch Lightning
   with modlee.start_run() as run:
       trainer = pl.Trainer(max_epochs=1)
       trainer.fit(
           model=modlee_model,
           train_dataloaders=train_dataloader,
           val_dataloaders=val_dataloader
       )

Now, we inspect the artifacts saved by Modlee, including the model
graph and various statistics. With Modlee, your training assets are
automatically saved, preserving valuable insights for future reference
and collaboration.

.. code:: python

   import sys

   # Get the path to the last run's saved data
   last_run_path = modlee.last_run_path()
   print(f"Run path: {last_run_path}")

   # Get the path to the saved artifacts
   artifacts_path = os.path.join(last_run_path, 'artifacts')
   artifacts = os.listdir(artifacts_path)
   print(f"Saved artifacts: {artifacts}")

   # Set the artifacts path as an environment variable
   os.environ['ARTIFACTS_PATH'] = artifacts_path

   # Add the artifacts directory to the system path
   sys.path.insert(0, artifacts_path)

.. code:: python

   # Print out the first few lines of the model
   print("Model graph:")

.. code:: shell

   !sed -n -e 1,15p $ARTIFACTS_PATH/model_graph.py
   !echo " ..."
   !sed -n -e 58,68p $ARTIFACTS_PATH/model_graph.py
   !echo " ..."

.. code:: python

   # Print the first lines of the data metafeatures
   print("Data metafeatures:")

.. code:: shell

   !head -20 $ARTIFACTS_PATH/stats_rep
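Finally, to evaluate the trained model (step 4 above), a simple option
is to compute accuracy over the validation set. The snippet below is a
minimal sketch that assumes the ``modlee_model`` and ``val_dataloader``
defined earlier and uses plain PyTorch rather than a Modlee-specific
evaluation utility.

.. code:: python

   # Minimal evaluation sketch: validation accuracy with plain PyTorch.
   # Assumes modlee_model and val_dataloader from the steps above.
   modlee_model.eval()  # disable dropout for evaluation
   correct, total = 0, 0
   with torch.no_grad():
       for X_batch, y_batch in val_dataloader:
           logits = modlee_model(X_batch)       # raw class scores
           preds = torch.argmax(logits, dim=1)  # predicted class per sample
           correct += (preds == y_batch.squeeze()).sum().item()
           total += y_batch.size(0)
   print(f"Validation accuracy: {correct / total:.3f}")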