.. _radio_datasets:

Intro to radio datasets
=======================

In RIA, radio datasets are iterable datasets designed specifically for machine learning applications in radio signal
processing and analysis. The individual examples are stored alongside the corresponding metadata in high-performance
HDF5 files, referred to as dataset source files.

The Radio Dataset Framework provides a software interface to access and manipulate these source files. This eliminates
the need for users to interface with the source files directly. Instead, users initialize and interact with a Python
object, while the complexities of efficient data retrieval and source file manipulation are managed behind the scenes.

Ria Toolkit OSS includes an abstract class called :py:obj:`ria_toolkit_oss.data.datasets.RadioDataset`, which defines common properties and
behaviors for all radio datasets. :py:obj:`ria_toolkit_oss.data.datasets.RadioDataset` can be considered a blueprint for all
other radio dataset classes. This class is then subclassed to define more specific blueprints for different types
of radio datasets. For example, :py:obj:`ria_toolkit_oss.data.datasets.IQDataset` is tailored for machine learning tasks
involving the processing of signals represented as IQ (in-phase and quadrature) samples.

Then, in the various project backends, there are concrete dataset classes, which inherit from both the corresponding
Ria Toolkit OSS blueprint and the base dataset class of the respective backend. For example, the :py:obj:`TorchIQDataset`
class extends both :py:obj:`ria_toolkit_oss.data.datasets.IQDataset` from Ria Toolkit OSS and
:py:obj:`torch.utils.data.IterableDataset` from PyTorch, providing a concrete dataset class tailored for IQ datasets
and optimized for the PyTorch backend.

Dataset initialization
----------------------

There are three ways to initialize a radio dataset:

1. Use an RIA dataset builder to download and prepare an off-the-shelf dataset.
2. Use the RIA Curator to curate a dataset from a collection of recordings.
3. Initialize a dataset from a source file.

Off-the-shelf datasets
~~~~~~~~~~~~~~~~~~~~~~

Qoherent provides a wide selection of off-the-shelf machine learning datasets for radio. These can be downloaded and
initialized using the corresponding dataset builders. For example, we can initialize a new instance of Qoherent's AWGN
Modulation dataset using :py:obj:`AWGN_Builder`:

>>> from ria.dataset_manager.builders import AWGN_Builder
>>> awgn_builder = AWGN_Builder()
>>> awgn_builder.download_and_prepare()
>>> awgn_ds = awgn_builder.as_dataset(backend="pytorch")

Because we specified ``backend="pytorch"``, we got back a ``TorchIQDataset``, which is compatible with the PyTorch
framework for machine learning.

>>> awgn_ds.__class__.__name__
'TorchIQDataset'

If we specify an alternative backend, we will get a different class. For example:

>>> awgn_dataset_tf = awgn_builder.as_dataset(backend="tensorflow")
>>> awgn_dataset_tf.__class__.__name__
'TensorFlowIQDataset'

However, both datasets are radio datasets. And, in the case of the AWGN Modulation dataset, both are IQ datasets.

>>> from ria_toolkit_oss.data.datasets import RadioDataset, IQDataset
>>> isinstance(awgn_ds, RadioDataset)
True
>>> isinstance(awgn_ds, IQDataset)
True

Dataset curation
~~~~~~~~~~~~~~~~

A second way to initialize a dataset is by curating it from a collection or folder of radio recordings. For example:

>>> from ria.dataset_manager.curator import Curator, SimpleSlicer, RMSQualifier
>>> slicer = SimpleSlicer()
>>> qualifier = RMSQualifier()
>>> curator = Curator(slicer=slicer, qualifier=qualifier)
>>> ds = curator.curate("path/to/folder/of/recording/files", backend="pytorch")

Please refer to the Curator Package for more information regarding dataset curation.

Initializing from source
~~~~~~~~~~~~~~~~~~~~~~~~

The third way to initialize a dataset is directly from the source file. For example, we can initialize a
``TensorFlowIQDataset`` using the source file curated above:

>>> from ria.tensorflow_backend.datasets import TensorFlowIQDataset
>>> ds_tf = TensorFlowIQDataset(source=ds.source)

Notice that ``ds`` and ``ds_tf`` are equal, and any inplace operations performed on one will affect the state of the
other:

>>> ds == ds_tf
True

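For example, an inplace trim through ``ds_tf`` (using the ``trim_examples`` method described below) is immediately
reflected in ``ds``, because both objects are views onto the same source file:

>>> ds_tf.trim_examples(trim_length=512, inplace=True)  # ds is affected too
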
This underscores a key point: There are no backend-specific details in the dataset files themselves. Instead, support
for different backends is provided through the software interface.

Dataset usage
-------------

Datasets from the PyTorch backend are just that, PyTorch datasets. They are substitutable for any other PyTorch
dataset, and used just the same. For example, to initialize a dataloader:

>>> from ria.dataset_manager.builders import AWGN_Builder
>>> from torch.utils.data import DataLoader
>>> builder = AWGN_Builder()
>>> builder.download_and_prepare()
>>> ds = builder.as_dataset(backend="pytorch")
>>> dl = DataLoader(ds, batch_size=32, shuffle=True)

Similarly, datasets from the TensorFlow backend are just that, TensorFlow datasets. They are substitutable for any
other TensorFlow dataset, and used just the same. For example:

>>> from ria.dataset_manager.builders import AWGN_Builder
>>> builder = AWGN_Builder()
>>> builder.download_and_prepare()
>>> ds = builder.as_dataset(backend="tensorflow")
>>> ds_batched = ds.shuffle(buffer_size=1000).batch(batch_size=32)

All datasets contain a table of metadata, which can be accessed through the ``metadata`` property:

>>> md = ds.metadata

At any index in the dataset, the metadata will always correspond to the data at that same index. Metadata labels can
be viewed using the ``labels`` property:

>>> ds.labels
['rec_id', 'snr', 'modulation']

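Because data and metadata share indices, an example can be fetched together with its labels. A small sketch, assuming
the metadata table supports the same column-based access used for filtering below:

>>> example = ds[2]              # example at index 2
>>> snr = ds.metadata['snr'][2]  # SNR label for that same example
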
Dataset processing and manipulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All radio datasets support methods tailored specifically for radio processing. These methods are backend-independent,
inherited from the blueprints in Ria Toolkit OSS like :py:obj:`ria_toolkit_oss.data.datasets.RadioDataset`.

For example, we can trim down the length of the examples from 1,024 to 512 samples, and then augment the dataset:

>>> ds_trimmed = ds.trim_examples(trim_length=512)
>>> ds_augmented = ds_trimmed.augment(level=1.0)

.. note:: Because no augmentations were specified, ``augment()`` applied the default IQ augmentations returned by the
   :py:obj:`IQDataset.default_augmentations()` method.

The dataset state is managed within the source file, rather than in memory. This allows us to process and
train on very large datasets, even if they exceed available memory capacity.

Each operation creates and returns a new dataset object initialized from a new source file. Using the builder, the
AWGN Modulation dataset was downloaded as the source file ``modulation_awgn.hdf5``, whose state is accessed with
``ds``. Then, trimming the examples in the dataset created a new source file called ``modulation_awgn.001.hdf5``,
whose state is accessed with ``ds_trimmed``. Augmentation then generated a third file called
``modulation_awgn.002.hdf5``. This is important to keep in mind, especially when working with large datasets.

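We can confirm which file backs each dataset through the ``source`` property introduced earlier. For example (the
value shown is illustrative; in practice, expect a full path):

>>> ds_trimmed.source
'modulation_awgn.001.hdf5'
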
Optionally, we could have performed these preprocessing operations inplace by specifying ``inplace=True``.
While inplace operations are more memory efficient, they can lead to
`aliasing <https://www.teach.cs.toronto.edu/~csc110y/fall/notes/06-memory-model/05-aliasing.html>`_. Therefore, we
recommend limiting the use of inplace operations to memory-limited applications where you know what you're doing.

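For example, to trim and augment the dataset inplace, modifying ``modulation_awgn.hdf5`` directly instead of creating
new source files:

>>> ds.trim_examples(trim_length=512, inplace=True)
>>> ds.augment(level=1.0, inplace=True)
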
All dataset processing and manipulation operations operate on the metadata too. As a result, at any index in
the dataset, the metadata will always accurately reflect the data at that same index.

Slicing, indexing, and filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Radio datasets support indexing, slicing, and filtering using Python's idiomatic square bracket syntax. For example:

>>> example = ds[2]  # Retrieve the example at index 2.
>>> ds_subset = ds[10:20]  # Retrieve a slice of the dataset from index 10 to 20.
>>> ds_filtered = ds[ds.metadata['snr'] > 3]  # Retrieve all examples where the SNR is greater than 3 dB.

Notice that ``example`` is a NumPy array, while both ``ds_subset`` and ``ds_filtered`` are dataset objects.

To read the whole dataset into memory as a NumPy array, use the ``data`` property:

>>> arr = ds_filtered.data  # Equivalent to arr = ds_filtered[:]

Dataset iterators
~~~~~~~~~~~~~~~~~

You can iterate over radio datasets manually, just like with a list:

>>> for i, example in enumerate(ds):
...     print(f"Example at index {i}:")
...     print(example)

Source files
------------

Dataset source files are high-performance HDF5 files that contain both data and metadata in a single self-descriptive
file.

Source file format
~~~~~~~~~~~~~~~~~~

While average users need not concern themselves with the source file format, those creating their own source files will
need to familiarize themselves with the expected format. As a first step, we recommend reviewing some
`core concepts <https://docs.h5py.org/en/stable/quick.html#core-concepts>`_ pertaining to the HDF5 file format.

.. note::

   If you're having trouble converting your radio dataset into a source file compatible with the Radio Dataset
   Framework, please let us know. We'd be happy to assist.

Here is the source file format:

.. code-block:: text

    root/
    ├── data (Dataset)
    │   ├── [Full dataset license (Attribute)]
    │   └── [dataset examples, to use as input to the model]
    │
    └── metadata (Group)
        ├── metadata (Dataset)
        │   ├── rec_id (Column)
        │   ├── sample_rate (Column)
        │   └── ...
        │
        └── about (Dataset)
            ├── author
            ├── name
            └── ...

Additional datasets can be added at the root level as required. For example, some datasets, such as the MathWorks
Spectrum Sensing dataset, contain a separate dataset at the root level for the pixel masks. Should these extra
datasets exist, they need to be the same shape as the primary dataset.

.. code-block:: text

    root/
    ├── data (Dataset)
    │   ├── [Full dataset license (Attribute)]
    │   └── [spectrogram images, to use as input to the model]
    │
    ├── masks (Dataset)
    │   └── [target masks, to use for training]
    │
    └── metadata (Group)
        ├── metadata (Dataset)
        │   ├── rec_id (Column)
        │   ├── sample_rate (Column)
        │   └── ...
        │
        └── about (Dataset)
            ├── author
            ├── name
            └── ...

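To make the layout concrete, here is a minimal sketch of how a compatible IQ source file might be created with
`h5py <https://docs.h5py.org/en/stable/index.html>`_. This is an illustration, not a definitive recipe: the array
sizes, the metadata values, and the license attribute name are all placeholders to adapt to your own dataset.

.. code-block:: python

    import h5py
    import numpy as np

    # Illustrative sizes: 100 single-channel IQ examples of 1,024 complex samples each.
    examples = np.zeros((100, 1, 1024), dtype=np.complex64)

    # One metadata row per example, stored as a compound (table-like) dataset.
    metadata = np.zeros(100, dtype=[("rec_id", np.int64), ("sample_rate", np.float64)])
    metadata["rec_id"] = np.arange(100)
    metadata["sample_rate"] = 1_000_000.0

    with h5py.File("my_dataset.hdf5", "w") as f:
        data = f.create_dataset("data", data=examples)
        data.attrs["license"] = "..."  # full license text; this attribute name is a placeholder

        meta = f.create_group("metadata")
        meta.create_dataset("metadata", data=metadata)

        # Dataset-level information, stored here as attributes on an empty "about" dataset.
        about = meta.create_dataset("about", data=h5py.Empty("f"))
        about.attrs["author"] = "..."
        about.attrs["name"] = "..."
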
Data format
~~~~~~~~~~~

IQ data
^^^^^^^

IQ data is stored as complex numbers, where each data point is a complex value combining the
in-phase (I) and quadrature (Q) components. The precision depends on the application, but often each component (real
and imaginary) is stored as a 32-bit floating-point number.

IQ data is stored as a 3-dimensional array with the shape ``M x C x N``:

- ``M``: Represents the number of examples in the dataset.
- ``C``: Indicates the number of radio channels.
- ``N``: Denotes the length of each signal, which is the number of data points in each example.

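For instance, a minimal NumPy sketch of an IQ data array in this layout (the sizes are illustrative):

.. code-block:: python

    import numpy as np

    # 1,000 examples (M), 1 radio channel (C), 1,024 complex samples each (N).
    # complex64 stores each component as a 32-bit float.
    iq_data = np.zeros((1000, 1, 1024), dtype=np.complex64)
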
Spectrogram data
^^^^^^^^^^^^^^^^

Spectrogram data is stored as real numbers, the exact format of which depends on the image format.

Spectrogram data is stored as a 4-dimensional array with the shape ``M x C x H x W``:

- ``M``: Represents the number of examples in the dataset.
- ``C``: Indicates the number of image channels.
- ``H x W``: Denotes the height and width of the spectrogram images, respectively.

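Again as an illustrative NumPy sketch:

.. code-block:: python

    import numpy as np

    # 1,000 spectrograms (M), 1 image channel (C), 128 x 128 pixels (H x W).
    spectrograms = np.zeros((1000, 1, 128, 128), dtype=np.float32)
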
Reading source files
~~~~~~~~~~~~~~~~~~~~

Here's an example of how to read these source files in pure Python, using the
`h5py <https://docs.h5py.org/en/stable/index.html>`_ library:

.. code-block:: python

    import h5py

    with h5py.File(dataset_file, "r") as f:
        data = f["data"]
        print(f"Length of the dataset: {len(data)}")

        print("Keys in metadata/about:")
        for attr_name, attr_value in f["metadata/about"].attrs.items():
            print(f"{attr_name}: {attr_value}")

        print("Keys in metadata/metadata:")
        for attr_name, attr_value in f["metadata/metadata"].attrs.items():
            print(f"{attr_name}: {attr_value}")

To load the data in as a NumPy array and the metadata as a pandas DataFrame:

.. code-block:: python

    import h5py
    import pandas as pd

    with h5py.File(dataset_file, "r") as f:
        data = f["data"][:]
        metadata = pd.DataFrame(f["metadata/metadata"][:])

.. note:: It is generally inadvisable to read the entire dataset into memory.

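If you do need to walk a large dataset in pure Python, one option is to read it in slices, so that only one chunk is
in memory at a time. A minimal sketch, again using h5py (the chunk size of 256 examples is arbitrary):

.. code-block:: python

    import h5py

    with h5py.File(dataset_file, "r") as f:
        data = f["data"]
        for start in range(0, len(data), 256):
            chunk = data[start:start + 256]  # reads only this slice from disk
            ...  # process the chunk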