Dataset guidelines
This page uses pseudocode to show how to build a PyTorch DataLoader from a list of data elements, in a format that is compatible with Modlee Auto Experiment Documentation.
TLDR
Define your dataset in an unnested format: [[x1, x2, x3, …, y], …]
Create a DataLoader, which is then used to train a ModleeModel with a Modlee Trainer
Define example custom dataset objects
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        feature1 = torch.tensor(self.data[idx][0], dtype=torch.float32)
        feature2 = torch.tensor(self.data[idx][1], dtype=torch.float32)
        feature3 = torch.tensor(self.data[idx][2], dtype=torch.float32)
        features = [feature1, feature2, feature3]  # This is a simplification
        target = torch.tensor(self.data[idx][-1], dtype=torch.float32).squeeze()  # Ensure target is a scalar or 1D
        return features, target

def example_text():
    return np.random.rand(10)  # 1D array of 10 random numbers

def example_image():
    return np.random.rand(5, 3)  # 2D array of shape (5, 3) with random numbers

def example_video():
    return np.random.rand(5, 3, 2)  # 3D array of shape (5, 3, 2) with random numbers

def example_target():
    return np.random.rand(1)  # 1D array with one value; squeezed to a scalar in __getitem__
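The dataset above hard-codes three features, which its own comment calls a simplification. As a minimal sketch of one way to generalize it, the class below converts every element except the last into a feature tensor and treats the last element as the target. The name FlatListDataset is hypothetical and is not part of Modlee or PyTorch; it only assumes the flat [x1, x2, ..., y] record layout described on this page.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class FlatListDataset(Dataset):
    """Hypothetical generalization of CustomDataset to any number of features."""

    def __init__(self, data):
        # Each record is assumed to be a flat list: [x1, x2, ..., y]
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Everything but the last element is a feature; the last is the target
        *raw_features, raw_target = self.data[idx]
        features = [torch.as_tensor(x, dtype=torch.float32) for x in raw_features]
        target = torch.as_tensor(raw_target, dtype=torch.float32).squeeze()
        return features, target


# Two features plus a target per record, four records in total
records = [[np.random.rand(10), np.random.rand(5, 3), np.random.rand(1)] for _ in range(4)]
dataset = FlatListDataset(records)
features, target = dataset[0]
```

With this sketch, adding or removing a modality only changes the records, not the dataset class.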
Create dataset and dataloader
MODLEE_GUIDELINE
Define your raw data so that each element is a list of data objects (any combination of images, audio, text, video, etc.). The final element of each list is your target, which must match the output shape of your neural network. For example: [[x1, x2, x3, …, y], …]
Avoid nested data structures such as [[[x1, x2], x3, …, y], …]
Why?
Modlee extracts key meta features from your dataset so that your experiment can be used in aggregate analysis alongside your collaborators' data, which improves Modlee's model recommendation technology for your connected environment. The flat list structure described above lets us extract the information we need easily. Check out exactly how we do this on our public GitHub repo.
data = [[example_text(), example_image(), example_video(), example_target()] for _ in range(4)]
dataset = CustomDataset(data)
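If your raw records are nested, they can be unpacked before being handed to the dataset. The sketch below is one possible approach, not Modlee functionality: is_flat_record and flatten_record are hypothetical helper names, and the strings stand in for real data objects. It unpacks a single level of nesting and assumes the target is already the last element. NumPy arrays and tensors are neither lists nor tuples, so they are treated as single elements and left intact.

```python
def is_flat_record(record):
    """Return True if no element of the record is itself a list or tuple."""
    return not any(isinstance(item, (list, tuple)) for item in record)


def flatten_record(record):
    """Unpack one level of nested lists/tuples, preserving element order."""
    flat = []
    for item in record:
        if isinstance(item, (list, tuple)):
            flat.extend(item)  # nested group: lift its members to the top level
        else:
            flat.append(item)  # already a top-level data object
    return flat


nested = [["x1", "x2"], "x3", "y"]  # placeholder strings for data objects
flat = flatten_record(nested)       # ["x1", "x2", "x3", "y"]
```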
Define a PyTorch DataLoader
MODLEE_GUIDELINE
Pass your dataset to a PyTorch DataLoader, so that Modlee can automatically parse it for meta features, allowing you to share it in a meaningful way with your colleagues.
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate through dataloader
for i, batch in enumerate(dataloader):
    print(f"- batch_{i}")
    features, target = batch
    for j, feature in enumerate(features):
        print(f"feature_{j}.shape = ", feature.shape)
    print("target.shape = ", target.shape)
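To see what shapes the loop above should print, it can help to call the collate step directly. When no collate_fn is passed, a DataLoader batches samples with torch.utils.data.default_collate, which stacks each feature along a new leading batch dimension. The sketch below assumes a PyTorch version that exposes default_collate publicly (1.11 or later); make_sample is a hypothetical stand-in for one (features, target) item with the same shapes as the example dataset.

```python
import torch
from torch.utils.data import default_collate


def make_sample():
    """Hypothetical sample shaped like one item of the example dataset."""
    features = [torch.rand(10), torch.rand(5, 3), torch.rand(5, 3, 2)]
    target = torch.rand(1).squeeze()  # 0-dimensional (scalar) target
    return features, target


# Collate two samples, as a DataLoader with batch_size=2 would
samples = [make_sample() for _ in range(2)]
features, target = default_collate(samples)

shapes = [tuple(f.shape) for f in features]
print(shapes)               # [(2, 10), (2, 5, 3), (2, 5, 3, 2)]
print(tuple(target.shape))  # (2,)
```

Each feature keeps its own shape behind the batch dimension, and the scalar targets are stacked into a 1D tensor of length batch_size.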
Modality & task compatibility
We're working on making Modlee compatible with any data modality and machine learning task, which is what drove us to create the MODLEE_GUIDELINES above.
Check out our GitHub repo to see which modalities and tasks have been tested for auto documentation to date. If you don't see one you need, test it out yourself and contribute!
Reach out on our Discord to let us know what modality & tasks you want to use next, or give us feedback on these guidelines.