|image1|

.. |image1| image:: https://github.com/mansiagr4/gifs/raw/main/new_small_logo.svg

Audio Embeddings With Tabular Classification Model
==================================================

In this example, we will build an audio classification model using
``PyTorch`` and ``Wav2Vec2``, a pretrained model for processing audio
data. This guide walks through each step of the process: setting up the
environment, loading and preprocessing data, defining and training a
model, and evaluating its performance.

|Open in Kaggle|

First, we import the necessary libraries and set up the environment.

.. code:: python

   import os

   import torch
   import torchaudio
   import modlee
   import lightning.pytorch as pl
   from sklearn.model_selection import train_test_split
   from sklearn.preprocessing import LabelEncoder
   from torch.utils.data import TensorDataset, DataLoader
   from transformers import Wav2Vec2Model

   torchaudio.set_audio_backend("sox_io")

Now we will set our Modlee API key and initialize the Modlee package.
Make sure that you have a Modlee account and an API key `from the
dashboard `__. Replace ``replace-with-your-api-key`` with your API key.

.. code:: python

   os.environ['MODLEE_API_KEY'] = "replace-with-your-api-key"
   modlee.init(api_key=os.environ['MODLEE_API_KEY'])

Now, we will prepare our data. For this example, we will manually
download the ``Human Words Audio`` dataset from Kaggle and upload it to
the environment. Visit the `Human Words Audio dataset page `__ on
Kaggle and click the **Download** button to save the ``Animals``
directory to your local machine. Copy the path to the downloaded
directory, which will be used later.

This snippet loads the ``Wav2Vec2`` model. ``Wav2Vec2`` is a model
designed for speech processing; we’ll use it to convert audio into
fixed-size embeddings.

.. code:: python

   # Set device to GPU if available, otherwise use CPU.
   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

   # Load the pre-trained Wav2Vec2 model and move it to the specified device.
   wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").to(device)

This function converts raw audio waveforms into embeddings by
mean-pooling the ``Wav2Vec2`` hidden states over time.

.. code:: python

   def get_wav2vec_embeddings(waveforms):
       with torch.no_grad():
           # as_tensor avoids an extra copy when the batch is already a tensor
           inputs = torch.as_tensor(waveforms).to(device)
           embeddings = wav2vec(inputs).last_hidden_state.mean(dim=1)
       return embeddings

The ``AudioDataset`` class handles loading and preprocessing of audio
files.

.. code:: python

   class AudioDataset(TensorDataset):
       def __init__(self, audio_paths, labels, target_length=16000):
           self.audio_paths = audio_paths
           self.labels = labels
           self.target_length = target_length

       def __len__(self):
           return len(self.audio_paths)

       def __getitem__(self, idx):
           audio_path = self.audio_paths[idx]
           label = self.labels[idx]
           waveform, sample_rate = torchaudio.load(audio_path, normalize=True)
           # Average the channels to collapse stereo audio to mono
           waveform = waveform.mean(dim=0)
           # Pad or truncate the waveform to the target length
           if waveform.size(0) < self.target_length:
               waveform = torch.cat([waveform, torch.zeros(self.target_length - waveform.size(0))])
           else:
               waveform = waveform[:self.target_length]
           return waveform, label

This function loads audio files and their corresponding labels from a
directory structure, using each subdirectory name as the class label
for the ``.wav`` files it contains.

.. code:: python

   def load_dataset(data_dir):
       audio_paths = []
       labels = []
       # Loop through each subdirectory in the data directory
       for label_dir in os.listdir(data_dir):
           label_dir_path = os.path.join(data_dir, label_dir)
           if os.path.isdir(label_dir_path):
               # Loop through each file in the directory
               for file_name in os.listdir(label_dir_path):
                   if file_name.endswith('.wav'):
                       audio_paths.append(os.path.join(label_dir_path, file_name))
                       labels.append(label_dir)
       return audio_paths, labels

We define a simple Multi-Layer Perceptron (MLP) model for
classification. This model takes the embeddings from ``Wav2Vec2`` as
input.
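Note that the classifier’s input size must match the width of the ``Wav2Vec2`` embeddings: ``wav2vec2-base`` produces a 768-dimensional hidden state per frame, and mean-pooling over frames preserves that width, yielding one vector per clip. A quick shape check with dummy activations (no audio or pretrained weights required):

.. code:: python

   import torch

   # Stand-in for wav2vec(inputs).last_hidden_state: [batch, frames, hidden]
   dummy_hidden = torch.randn(4, 49, 768)

   # Mean-pool over the frame axis, as get_wav2vec_embeddings does
   pooled = dummy_hidden.mean(dim=1)
   print(pooled.shape)  # torch.Size([4, 768])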
.. code:: python

   class MLP(modlee.model.TabularClassificationModleeModel):
       def __init__(self, input_size, num_classes):
           super().__init__()
           self.model = torch.nn.Sequential(
               torch.nn.Linear(input_size, 256),
               torch.nn.ReLU(),
               torch.nn.Linear(256, 128),
               torch.nn.ReLU(),
               torch.nn.Linear(128, num_classes)
           )
           self.loss_fn = torch.nn.CrossEntropyLoss()

       def forward(self, x):
           return self.model(x)

       def training_step(self, batch, batch_idx):
           x, y_target = batch
           y_pred = self(x)
           loss = self.loss_fn(y_pred, y_target)
           return {"loss": loss}

       def validation_step(self, val_batch, batch_idx):
           x, y_target = val_batch
           y_pred = self(x)
           val_loss = self.loss_fn(y_pred, y_target)
           return {'val_loss': val_loss}

       def configure_optimizers(self):
           optimizer = torch.optim.SGD(self.parameters(), lr=0.001, momentum=0.9)
           return optimizer

``Wav2Vec2`` transforms raw audio data into numerical embeddings that a
model can interpret. We preprocess the audio by normalizing and padding
it to a fixed length; then ``Wav2Vec2`` generates embeddings for each
audio clip. Precomputing these embeddings once, before training, avoids
running ``Wav2Vec2`` on every batch in every epoch.

.. code:: python

   def precompute_embeddings(dataloader):
       embeddings_list = []
       labels_list = []
       for inputs, labels in dataloader:
           inputs = inputs.to(device)
           embeddings = get_wav2vec_embeddings(inputs)
           embeddings_list.append(embeddings.cpu())
           labels_list.append(labels)
       embeddings_list = torch.cat(embeddings_list, dim=0)
       labels_list = torch.cat(labels_list, dim=0)
       return embeddings_list, labels_list

We create a function that trains and validates the model with a PyTorch
Lightning trainer, inside a Modlee run so that training assets are
logged.

.. code:: python

   def train_model(modlee_model, train_dataloader, val_dataloader, num_epochs=1):
       with modlee.start_run() as run:
           # Create a PyTorch Lightning trainer
           trainer = pl.Trainer(max_epochs=num_epochs)

           # Train the model using the training and validation data loaders
           trainer.fit(
               model=modlee_model,
               train_dataloaders=train_dataloader,
               val_dataloaders=val_dataloader
           )

Finally, we load the dataset, preprocess it, and train the model. Add
your path to the dataset in ``data_dir``.
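The script that follows uses ``LabelEncoder`` to turn the folder names returned by ``load_dataset`` into integer class ids. As a toy illustration (these label strings are made up, not taken from the dataset), the encoder sorts the distinct classes alphabetically before numbering them:

.. code:: python

   from sklearn.preprocessing import LabelEncoder

   labels = ["cat", "dog", "bird", "dog", "cat"]
   encoder = LabelEncoder()
   encoded = encoder.fit_transform(labels)

   # Sorted classes: bird -> 0, cat -> 1, dog -> 2
   print(encoded.tolist())       # [1, 2, 0, 2, 1]
   print(len(encoder.classes_))  # 3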
.. code:: python

   # Path to dataset
   data_dir = 'path-to-dataset'

   # Load dataset
   audio_paths, labels = load_dataset(data_dir)

   # Encode string labels as integer class ids
   label_encoder = LabelEncoder()
   labels = label_encoder.fit_transform(labels)

   # Split dataset into training and validation sets
   train_paths, val_paths, train_labels, val_labels = train_test_split(
       audio_paths, labels, test_size=0.2, random_state=42)

   # Create datasets and dataloaders
   target_length = 16000
   train_dataset = AudioDataset(train_paths, train_labels, target_length=target_length)
   val_dataset = AudioDataset(val_paths, val_labels, target_length=target_length)
   train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
   val_dataloader = DataLoader(val_dataset, batch_size=4, shuffle=False)

   # Precompute embeddings
   print("Precomputing embeddings for training and validation data...")
   train_embeddings, train_labels = precompute_embeddings(train_dataloader)
   val_embeddings, val_labels = precompute_embeddings(val_dataloader)

   # Create TensorDatasets for the precomputed embeddings and labels
   train_embedding_dataset = TensorDataset(train_embeddings, train_labels)
   val_embedding_dataset = TensorDataset(val_embeddings, val_labels)

   # Create DataLoaders for the precomputed embeddings
   train_embedding_loader = DataLoader(train_embedding_dataset, batch_size=4, shuffle=True)
   val_embedding_loader = DataLoader(val_embedding_dataset, batch_size=4, shuffle=False)

   # Define the number of classes; input_size=768 matches the
   # wav2vec2-base embedding width
   num_classes = len(label_encoder.classes_)
   mlp_audio = MLP(input_size=768, num_classes=num_classes).to(device)

   # Train and validate the model
   train_model(mlp_audio, train_embedding_loader, val_embedding_loader)

Finally, we can view the saved assets from training. With Modlee, your
training assets are automatically saved, preserving valuable insights
for future reference and collaboration.
.. code:: python

   last_run_path = modlee.last_run_path()
   print(f"Run path: {last_run_path}")

   artifacts_path = os.path.join(last_run_path, 'artifacts')
   artifacts = sorted(os.listdir(artifacts_path))
   print(f"Saved artifacts: {artifacts}")

.. |Open in Kaggle| image:: https://kaggle.com/static/images/open-in-kaggle.svg
   :target: https://www.kaggle.com/code/modlee/modlee-audio-embeddings
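As a follow-up, classifying a new clip reuses the same path as training: load and pad the waveform, embed it with ``wav2vec``, then feed the embedding to the trained MLP and take the ``argmax`` over the logits. The sketch below only illustrates that final step and the shapes involved; the ``torch.nn.Linear`` layer is an untrained stand-in for the trained model, and the random tensor stands in for a real embedding:

.. code:: python

   import torch

   NUM_CLASSES = 3  # hypothetical; in the tutorial this is len(label_encoder.classes_)
   classifier = torch.nn.Linear(768, NUM_CLASSES)  # stand-in for the trained MLP

   embedding = torch.randn(1, 768)  # stand-in for one Wav2Vec2 clip embedding
   with torch.no_grad():
       logits = classifier(embedding)
   predicted = logits.argmax(dim=1).item()  # integer class id in [0, NUM_CLASSES)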