modlee.data_metafeatures module

class modlee.data_metafeatures.DataMetafeatures(dataloader, num_sample=1000, testing=False)[source]

Bases: object

An object to hold metafeatures for a dataset, as loaded from a dataloader.

get_features()[source]

Get features for batch elements.

Returns:

A dictionary of {‘batch_element’ : features}

get_mfe()[source]

Get PyMFE features for every element in the dataloader.

Returns:

A dictionary of {mfe_feature : mfe_value}

get_mfe_features()[source]

Get features for all batch elements with PyMFE.

Returns:

A list of metafeatures.

get_mfe_on_batch(batch_element)[source]

Get features for a batch element with PyMFE.

Parameters:

batch_element – The batch element to calculate.

Returns:

A dictionary of features for the batch element.

get_mfe_on_element(batch_element)

Get features for a batch element with PyMFE.

Parameters:

batch_element – The batch element to calculate.

Returns:

A dictionary of features for the batch element.

get_properties()[source]

Get properties, i.e. features that are not calculated, such as shapes.

get_raw_batch_elements()[source]

Convert features to a list of dictionaries.

Returns:

A list of {‘raw’: feature}

get_stats()[source]

Get statistical features for batch elements. Includes feature shape, k-means clustering, and time taken to calculate features.

Returns:

A list of statistical features.

get_stats_rep()

Get features for batch elements.

Returns:

A dictionary of {‘batch_element’ : features}

class modlee.data_metafeatures.ImageDataMetafeatures(dataloader, embd_model=None, *args, **kwargs)[source]

Bases: DataMetafeatures

Image-based DataMetafeatures.

get_embedding(index=0, max_len=100)[source]

get_raw_batch_elements()[source]

Get the raw batch elements for an image-based dataset.

Returns:

A list of image-based features.

class modlee.data_metafeatures.TextDataMetafeatures(dataloader, nlp_model=None, *args, **kwargs)[source]

Bases: DataMetafeatures

get_embedding(index=None, max_len=100, *args, **kwargs)[source]

Get embeddings from the dataloader.

Parameters:

index – The index in a batch of the string elements to embed, defaults to None.

Returns:

A dictionary of {embd_i : embd_value}

modlee.data_metafeatures.bench_kmeans_unsupervised(batch, n_clusters=[2, 4, 8, 16, 32], testing=False)[source]

Calculate k-means clusters for a batch of data.

Parameters:
  • batch – The batch of data.

  • n_clusters – Number of clusters to calculate, defaults to [2, 4, 8, 16, 32].

  • testing – Flag for testing and calculating with a smaller batch, defaults to False.

Returns:

A dictionary of {‘kmeans’:calculated_kmeans_clusters}
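The idea of clustering one batch at several cluster counts can be sketched as follows. This is a minimal illustration, not the library's implementation: it uses a plain Lloyd's-iteration k-means written in NumPy, and the helper names `kmeans_inertia` and `bench_kmeans_sketch`, the choice to store the inertia per cluster count, and the flattening of each batch item are all assumptions.

```python
import numpy as np


def kmeans_inertia(data, k, n_iter=20, seed=0):
    """Run plain Lloyd's k-means and return the final inertia (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows (fancy indexing returns a copy).
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            # Keep the previous centroid if a cluster ends up empty.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


def bench_kmeans_sketch(batch, n_clusters=(2, 4, 8, 16, 32)):
    """Flatten each batch item, then cluster once per value in n_clusters."""
    data = np.asarray(batch, dtype=float).reshape(len(batch), -1)
    results = {}
    for k in n_clusters:
        if k <= len(data):  # cannot form more clusters than points
            results[k] = kmeans_inertia(data, k)
    return {"kmeans": results}


batch = np.random.default_rng(1).normal(size=(64, 3, 4))
out = bench_kmeans_sketch(batch)
```

Cluster counts larger than the batch are skipped in this sketch; how the library handles that case is not stated here.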

modlee.data_metafeatures.extract_features_from_model(model, batch)[source]

Extract features for a data batch using a neural network model.

Parameters:
  • model – The model to use for feature extraction.

  • batch – The data batch on which to calculate features.

Returns:

The calculated features.

modlee.data_metafeatures.get_image_features(x, testing=False)[source]

Get features for a batch of image data.

Parameters:
  • x – The batch of image data.

  • testing – Flag to calculate on a smaller test subsample of the data, defaults to False.

Returns:

A dictionary of the features.

modlee.data_metafeatures.get_n_samples(dataloader, n_samples=100)[source]

Get a number of samples from a dataloader.

Parameters:
  • dataloader – The dataloader.

  • n_samples – The number of samples, defaults to 100.

Returns:

An iterable of batch elements, each of length n_samples.
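Collecting a fixed number of samples from a batched source might look like the sketch below. It assumes each batch is a tuple of array-like elements (for example inputs and labels) and uses a plain list of NumPy arrays as a stand-in for a dataloader; `take_n_samples` is a hypothetical name, not the library function.

```python
import numpy as np


def take_n_samples(batches, n_samples=100):
    """Collect batches until n_samples items are gathered (hypothetical helper)."""
    collected = None
    count = 0
    for batch in batches:
        if collected is None:
            # One accumulator per batch element, e.g. inputs and labels.
            collected = [[] for _ in batch]
        for store, element in zip(collected, batch):
            store.append(np.asarray(element))
        count += len(batch[0])
        if count >= n_samples:
            break
    if collected is None:
        return []
    # Concatenate along the batch axis and trim to at most n_samples.
    return [np.concatenate(parts, axis=0)[:n_samples] for parts in collected]


# A stand-in for a dataloader: 10 batches of 32 (inputs, labels) pairs.
fake_loader = [(np.zeros((32, 8)), np.arange(32)) for _ in range(10)]
inputs, labels = take_n_samples(fake_loader, n_samples=100)
```

If the source holds fewer than `n_samples` items, this sketch simply returns everything it found.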

modlee.data_metafeatures.manipulate_x_1(x)[source]

Unsqueeze a 1D tensor.

Parameters:

x – The tensor.

Returns:

The tensor with an extra beginning dimension.
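As a small illustration of adding a leading dimension, the sketch below uses a NumPy array as a stand-in for a tensor. With a PyTorch tensor, the corresponding call would be `x.unsqueeze(0)`.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])     # shape (3,)
x_batched = np.expand_dims(x, 0)  # shape (1, 3): one new leading axis
```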

modlee.data_metafeatures.manipulate_x_2(x)[source]

Subsample a 2D tensor to the first 10000 values.

Parameters:

x – The tensor to subsample.

Returns:

A subsample of the tensor.
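One possible reading of this subsampling is "keep the leading 10000 rows", sketched below with NumPy. Whether the limit applies to rows or to individual values is an assumption here, and `subsample_rows` is a hypothetical name.

```python
import numpy as np

MAX_ROWS = 10000  # limit taken from the description above


def subsample_rows(x, max_rows=MAX_ROWS):
    """Keep at most max_rows leading rows of a 2D array (hypothetical helper)."""
    return x[:max_rows]


big = np.zeros((25000, 4))
small = np.zeros((50, 4))
```

Arrays that are already within the limit pass through unchanged.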

modlee.data_metafeatures.manipulate_x_3(x)[source]

Process a 3-dimensional tensor [batch_size, width, height] by resizing to a fixed size.

Parameters:

x – The tensor.

Returns:

The tensor, resized.
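Resizing the last two axes of a 3D batch to a fixed size can be sketched with nearest-neighbour indexing in NumPy. The target size of 32 x 32, the interpolation method, and the name `resize_nearest` are assumptions; the size and method the library actually uses are not stated here.

```python
import numpy as np


def resize_nearest(x, size=(32, 32)):
    """Nearest-neighbour resize of the last two axes (hypothetical helper)."""
    out_a, out_b = size
    in_a, in_b = x.shape[-2:]
    # Map each output index back to a source index.
    rows = np.arange(out_a) * in_a // out_a
    cols = np.arange(out_b) * in_b // out_b
    # Broadcasting rows against cols selects an (out_a, out_b) grid per item.
    return x[:, rows[:, None], cols[None, :]]


x = np.random.default_rng(0).random((5, 28, 20))
resized = resize_nearest(x)
```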

modlee.data_metafeatures.manipulate_x_4(x)[source]

Process a 4-dimensional tensor, assumed to be image-like [batch_size, channels, width, height], into subchannels.

Parameters:

x – The image to process.

Returns:

Sampled channels from the image.

modlee.data_metafeatures.manipulate_x_5(x)[source]

Process a 5-dimensional tensor, assumed to be video-like [batch_size, frames, channels, width, height], into image-like [batch_size, channels, width, height].

Parameters:

x – The tensor.

Returns:

A subsample of the tensor.
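One way to reduce a video-like array to an image-like one is to select a single frame, as in the NumPy sketch below. Choosing one frame (at random unless an index is given) is an assumption about the approach, and `pick_frame` is a hypothetical name.

```python
import numpy as np


def pick_frame(video, frame_index=None, seed=None):
    """Reduce [batch, frames, C, W, H] to [batch, C, W, H] by choosing one frame."""
    if frame_index is None:
        # Draw a frame index uniformly from the available frames.
        frame_index = int(np.random.default_rng(seed).integers(video.shape[1]))
    return video[:, frame_index]


video = np.zeros((2, 10, 3, 16, 16))
images = pick_frame(video, seed=0)
```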

modlee.data_metafeatures.pad_image_channels(x, desired_channels=3)[source]

Pad an image with extra channels. Uses dimension order [batch, channel, width, height].

Parameters:
  • x – The image tensor to pad.

  • desired_channels – Desired number of channels, defaults to 3.

Returns:

The padded tensor.
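Channel padding in [batch, channel, width, height] order can be sketched with `np.pad`. Filling the new channels with zeros is an assumption (the fill value is not stated here), and `pad_channels_zero` is a hypothetical name.

```python
import numpy as np


def pad_channels_zero(x, desired_channels=3):
    """Zero-pad axis 1 up to desired_channels (hypothetical helper)."""
    missing = desired_channels - x.shape[1]
    if missing <= 0:
        return x  # already has enough channels
    # Pad only the channel axis, and only at its end.
    pad_width = [(0, 0), (0, missing), (0, 0), (0, 0)]
    return np.pad(x, pad_width, mode="constant", constant_values=0)


gray = np.ones((2, 1, 4, 4))
padded = pad_channels_zero(gray)
```

A single-channel batch becomes a three-channel batch whose first channel holds the original data.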

modlee.data_metafeatures.sample_dataloader(train_dataloader, num_sample)[source]

Sample batches from a dataloader.

Parameters:
  • train_dataloader – The dataloader to sample from.

  • num_sample – The number of samples.

Returns:

A tuple of dataset_size, batch_elements, and the original size of the batch.

modlee.data_metafeatures.sample_image_channels(x, num_sample=3)[source]

Sample random channels from an image [batch_size, channel, width, height].

Parameters:
  • x – The image tensor to sample from.

  • num_sample – Number of channels to sample, defaults to 3.

Returns:

A tensor of sampled channels.
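Random channel sampling can be sketched in NumPy as below. Sampling without replacement, keeping the chosen channels in their original order, and capping the count at the number of available channels are all assumptions; `sample_channels` is a hypothetical name.

```python
import numpy as np


def sample_channels(x, num_sample=3, seed=None):
    """Pick random channels from [batch, channel, W, H] (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_channels = x.shape[1]
    # Sampling without replacement cannot return more channels than exist.
    k = min(num_sample, n_channels)
    idx = np.sort(rng.choice(n_channels, size=k, replace=False))
    return x[:, idx]


x = np.random.default_rng(0).random((2, 8, 4, 4))
sampled = sample_channels(x, num_sample=3, seed=0)
```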

modlee.data_metafeatures.sample_image_from_video(x, num_channels=1)[source]

Sample 3-channel images from a video tensor [batch_size, frames, channels, width, height].

Parameters:
  • x – The video tensor.

  • num_channels – The number of channels to sample, defaults to 1.

Returns:

A tensor of images.