Multimodal Handwritten Dataset (MHD)

class multivae.data.datasets.MHD(datapath, split='train', modalities=['label', 'audio', 'trajectory', 'image'], download=False, missing_probabilities={'audio': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'image': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'label': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'trajectory': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, seed=0, keep_incomplete=True)

Dataset class for the MHD dataset introduced in the paper ‘Leveraging hierarchy in multimodal generative models for effective cross-modality inference’ (Vasco et al., 2021).

This version of the dataset class can simulate missing data whose probability depends on the class label (Missing Not At Random). The missing_probabilities parameter gives, for each modality, the probability of missingness for each class. For instance, the code below defines a dataset with missing samples in the trajectory modality, only in classes 0, 1, 2, and 9.

>>> from multivae.data.datasets import MHD
>>> # One missingness probability per class (0 to 9) for each modality
>>> missing_probabilities = {
...     'image': [0.0] * 10,
...     'audio': [0.0] * 10,
...     'trajectory': [0.1, 0.3, 0.4, 0., 0., 0., 0., 0., 0., 0.9],
... }
>>> # data_path: directory that holds (or will receive) the MHD data files
>>> dataset = MHD(data_path,
...     'train',
...     modalities=['image', 'audio', 'trajectory'],
...     download=True,
...     missing_probabilities=missing_probabilities)
Parameters:
  • datapath (str) – Where the data is stored. It must contain the ‘mhd_train.pt’ and ‘mhd_test.pt’ files.

  • split (Literal['train', 'test']) – Split of the data to use. Defaults to ‘train’.

  • modalities (list) – The modalities to use among ‘label’, ‘trajectory’, ‘image’, ‘audio’. By default, all four are used.

  • download (bool) – Whether to download the dataset if it is not present at the given path. Defaults to False.

  • missing_probabilities (dict) – For each modality, a list giving the probability, per class, that the modality is missing from a sample in the created incomplete dataset. By default, no data is missing.

  • seed (int) – Seed used to create the incomplete dataset. Change it to obtain a different incomplete dataset. Defaults to 0.
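
To make the roles of missing_probabilities and seed more concrete, the sketch below shows one way class-dependent missingness could be simulated for a single modality. It is an illustration only and is not taken from the MHD class itself: the helper simulate_missing_mask and the synthetic labels are hypothetical, and the sketch assumes each sample is dropped independently with the probability attached to its class.

```python
import numpy as np


def simulate_missing_mask(labels, class_probabilities, seed=0):
    """Hypothetical helper: return a boolean array, True where the modality is available.

    Each sample is marked missing independently, with the probability
    associated with its class label (class-dependent missingness).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Look up the missingness probability of each sample from its class.
    per_sample_prob = np.asarray(class_probabilities, dtype=float)[labels]
    # A sample stays available when its uniform draw is at least that probability.
    return rng.random(labels.shape[0]) >= per_sample_prob


# Synthetic labels: 1000 samples for each of the 10 classes.
labels = np.repeat(np.arange(10), 1000)
trajectory_probs = [0.1, 0.3, 0.4, 0., 0., 0., 0., 0., 0., 0.9]
available = simulate_missing_mask(labels, trajectory_probs, seed=0)

# Empirical missing rate for a few classes.
for c in (0, 3, 9):
    print(c, 1.0 - available[labels == c].mean())
```

With the same seed the mask is reproduced exactly, and a different seed gives a different incomplete dataset with the same per-class missing rates in expectation.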