.. _radio_datasets:

Intro to radio datasets
=======================

In RIA, radio datasets are iterable datasets designed specifically for machine learning applications in radio signal
processing and analysis. The individual examples are stored alongside the corresponding metadata in high-performance
HDF5 files, referred to as dataset source files.

The Radio Dataset Framework provides a software interface to access and manipulate these source files. This eliminates
the need for users to interface with the source files directly. Instead, users initialize and interact with a Python
object, while the complexities of efficient data retrieval and source file manipulation are managed behind the scenes.

Ria Toolkit OSS includes an abstract class called :py:obj:`ria_toolkit_oss.data.datasets.RadioDataset`, which defines common properties and
behaviors for all radio datasets. :py:obj:`ria_toolkit_oss.data.datasets.RadioDataset` can be considered a blueprint for all
other radio dataset classes. This class is then subclassed to define more specific blueprints for different types
of radio datasets. For example, :py:obj:`ria_toolkit_oss.data.datasets.IQDataset` is tailored for machine learning tasks
involving the processing of signals represented as IQ (in-phase and quadrature) samples.

Then, in the various project backends, there are concrete dataset classes, which inherit from both the corresponding
Ria Toolkit OSS blueprint and the base dataset class of the respective backend. For example, the :py:obj:`TorchIQDataset`
class extends both :py:obj:`ria_toolkit_oss.data.datasets.IQDataset` from Ria Toolkit OSS and
:py:obj:`torch.utils.data.IterableDataset` from PyTorch, providing a concrete dataset class tailored for IQ datasets
and optimized for the PyTorch backend.

Dataset initialization
----------------------

There are three ways to initialize a radio dataset:

1. Use an RIA dataset builder to download and prepare an off-the-shelf dataset.
2. Use the RIA Curator to curate a dataset from a collection of recordings.
3. Initialize a dataset from a source file.

Off-the-shelf datasets
~~~~~~~~~~~~~~~~~~~~~~

Qoherent provides a wide selection of off-the-shelf machine learning datasets for radio. These can be downloaded and
initialized using the corresponding dataset builders. For example, we can initialize a new instance of Qoherent's AWGN
Modulation dataset using :py:obj:`AWGN_Builder`:

>>> from ria.dataset_manager.builders import AWGN_Builder
>>> awgn_builder = AWGN_Builder()
>>> awgn_builder.download_and_prepare()
>>> awgn_ds = awgn_builder.as_dataset(backend="pytorch")

Because we specified ``backend="pytorch"``, we got back a ``TorchIQDataset``, which is compatible with the PyTorch
framework for machine learning.

>>> awgn_ds.__class__.__name__
'TorchIQDataset'

If we specify an alternative backend, we will get a different class. For example:

>>> awgn_dataset_tf = awgn_builder.as_dataset(backend="tensorflow")
>>> awgn_dataset_tf.__class__.__name__
'TensorFlowIQDataset'

However, both datasets are radio datasets. And, in the case of the AWGN Modulation dataset, both are IQ datasets.

>>> from ria_toolkit_oss.data.datasets import RadioDataset, IQDataset
>>> isinstance(awgn_ds, RadioDataset)
True
>>> isinstance(awgn_ds, IQDataset)
True

Dataset curation
~~~~~~~~~~~~~~~~

A second way to initialize a dataset is by curating it from a collection or folder of radio recordings. For example:

>>> from ria.dataset_manager.curator import Curator, SimpleSlicer, RMSQualifier
>>> slicer = SimpleSlicer()
>>> qualifier = RMSQualifier()
>>> curator = Curator(slicer=slicer, qualifier=qualifier)
>>> ds = curator.curate("path/to/folder/of/recording/files", backend="pytorch")

Please refer to the Curator Package for more information regarding dataset curation.

Initializing from source
~~~~~~~~~~~~~~~~~~~~~~~~

The third way to initialize a dataset is directly from the source file. For example, we can initialize a
``TensorFlowIQDataset`` using the source file curated above:

>>> from ria.tensorflow_backend.datasets import TensorFlowIQDataset
>>> ds_tf = TensorFlowIQDataset(source=ds.source)

Notice that ``ds`` and ``ds_tf`` are equal, and any inplace operations performed on one will affect the state of the
other:

>>> ds == ds_tf
True

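For example, an inplace trim through ``ds_tf`` (using the ``trim_examples`` method described below) is immediately
reflected in ``ds``, because both objects are views onto the same source file:

>>> ds_tf.trim_examples(trim_length=512, inplace=True)  # ds is affected too
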
This underscores a key point: There are no backend-specific details in the dataset files themselves. Instead, support
for different backends is provided through the software interface.

Dataset usage
-------------

Datasets from the PyTorch backend are just that, PyTorch datasets. They are substitutable for any other PyTorch
dataset, and used just the same. For example, to initialize a dataloader:

>>> from ria.dataset_manager.builders import AWGN_Builder
>>> from torch.utils.data import DataLoader
>>> builder = AWGN_Builder()
>>> builder.download_and_prepare()
>>> ds = builder.as_dataset(backend="pytorch")
>>> dl = DataLoader(ds, batch_size=32, shuffle=True)

Similarly, datasets from the TensorFlow backend are just that, TensorFlow datasets. They are substitutable for any
other TensorFlow dataset, and used just the same. For example:

>>> from ria.dataset_manager.builders import AWGN_Builder
>>> builder = AWGN_Builder()
>>> builder.download_and_prepare()
>>> ds = builder.as_dataset(backend="tensorflow")
>>> ds_batched = ds.shuffle(buffer_size=1000).batch(batch_size=32)

All datasets contain a table of metadata, which can be accessed through the ``metadata`` property:

>>> md = ds.metadata

At any index in the dataset, the metadata will always correspond to the data at that same index. Metadata labels can
be viewed using the ``labels`` property:

>>> ds.labels
['rec_id', 'snr', 'modulation']

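Because data and metadata share indices, an example can be fetched together with its labels. A small sketch, assuming
the metadata table supports the same column-based access used for filtering below:

>>> example = ds[2]              # example at index 2
>>> snr = ds.metadata['snr'][2]  # SNR label for that same example
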
Dataset processing and manipulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All radio datasets support methods tailored specifically for radio processing. These methods are backend-independent,
inherited from the blueprints in Ria Toolkit OSS like :py:obj:`ria_toolkit_oss.data.datasets.RadioDataset`.

For example, we can trim down the length of the examples from 1,024 to 512 samples, and then augment the dataset:

>>> ds_trimmed = ds.trim_examples(trim_length=512)
>>> ds_augmented = ds_trimmed.augment(level=1.0)

.. note:: Because no augmentations were specified, ``augment()`` applied the default IQ augmentations returned by the
   :py:obj:`IQDataset.default_augmentations()` method.

The dataset state is managed within the source file, rather than in memory. This allows us to process and
train on very large datasets, even if they exceed available memory capacity.

Each operation creates and returns a new dataset object initialized from a new source file. Using the builder, the
AWGN Modulation dataset was downloaded as the source file ``modulation_awgn.hdf5``, whose state is accessed with
``ds``. Then, trimming the examples in the dataset created a new source file called ``modulation_awgn.001.hdf5``,
whose state is accessed with ``ds_trimmed``. Augmentation then generated a third file called
``modulation_awgn.002.hdf5``. This is important to keep in mind, especially when working with large datasets.

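We can confirm which file backs each dataset through the ``source`` property introduced earlier. For example (the
value shown is illustrative; in practice, expect a full path):

>>> ds_trimmed.source
'modulation_awgn.001.hdf5'
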
Optionally, we could have performed these preprocessing operations inplace by specifying ``inplace=True``.
While inplace operations are more memory efficient, they can lead to
`aliasing <https://www.teach.cs.toronto.edu/~csc110y/fall/notes/06-memory-model/05-aliasing.html>`_. Therefore, we
recommend limiting the use of inplace operations to memory-limited applications where you know what you're doing.

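For example, to trim and augment the dataset inplace, modifying ``modulation_awgn.hdf5`` directly instead of creating
new source files:

>>> ds.trim_examples(trim_length=512, inplace=True)
>>> ds.augment(level=1.0, inplace=True)
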
All dataset processing and manipulation operations operate on the metadata too. As a result, at any index in
the dataset, the metadata will always accurately reflect the data at that same index.

Slicing, indexing, and filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Radio datasets support indexing, slicing, and filtering using Python's idiomatic square bracket syntax. For example:

>>> example = ds[2]  # Retrieve the example at index 2.
>>> ds_subset = ds[10:20]  # Retrieve a slice of the dataset from index 10 to 20.
>>> ds_filtered = ds[ds.metadata['snr'] > 3]  # Retrieve all examples where the SNR is greater than 3 dB.

Notice that ``example`` is a NumPy array, while both ``ds_subset`` and ``ds_filtered`` are dataset objects.

To read the whole dataset into memory as a NumPy array, use the ``data`` property:

>>> arr = ds_filtered.data  # Equivalent to arr = ds_filtered[:]

Dataset iterators
~~~~~~~~~~~~~~~~~

You can iterate over radio datasets manually, just like with a list:

>>> for i, example in enumerate(ds):
...     print(f"Example at index {i}:")
...     print(example)

Source files
------------

Dataset source files are high-performance HDF5 files that contain both data and metadata in a single self-descriptive
file.

Source file format
~~~~~~~~~~~~~~~~~~

While average users need not concern themselves with the source file format, those creating their own source files will
need to familiarize themselves with the expected format. As a first step, we recommend reviewing some
`core concepts <https://docs.h5py.org/en/stable/quick.html#core-concepts>`_ pertaining to the HDF5 file format.

.. note::

   If you're having trouble converting your radio dataset into a source file compatible with the Radio Dataset
   Framework, please let us know. We'd be happy to assist.

Here is the source file format:

.. code-block:: text

    root/
    ├── data (Dataset)
    │   ├── [Full dataset license (Attribute)]
    │   └── [dataset examples, to use as input to the model]
    │
    └── metadata (Group)
        ├── metadata (Dataset)
        │   ├── rec_id (Column)
        │   ├── sample_rate (Column)
        │   └── ...
        │
        └── about (Dataset)
            ├── author
            ├── name
            └── ...

Additional datasets can be added at the root level as required. For example, some datasets, such as the MathWorks
Spectrum Sensing dataset, contain a separate dataset at the root level for the pixel masks. Should these extra
datasets exist, they need to be the same shape as the primary dataset.

.. code-block:: text

    root/
    ├── data (Dataset)
    │   ├── [Full dataset license (Attribute)]
    │   └── [spectrogram images, to use as input to the model]
    │
    ├── masks (Dataset)
    │   └── [target masks, to use for training]
    │
    └── metadata (Group)
        ├── metadata (Dataset)
        │   ├── rec_id (Column)
        │   ├── sample_rate (Column)
        │   └── ...
        │
        └── about (Dataset)
            ├── author
            ├── name
            └── ...

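To make the layout concrete, here is a minimal sketch of how a compatible IQ source file might be created with
`h5py <https://docs.h5py.org/en/stable/index.html>`_. This is an illustration, not a definitive recipe: the array
sizes, the metadata values, and the license attribute name are all placeholders to adapt to your own dataset.

.. code-block:: python

    import h5py
    import numpy as np

    # Illustrative sizes: 100 single-channel IQ examples of 1,024 complex samples each.
    examples = np.zeros((100, 1, 1024), dtype=np.complex64)

    # One metadata row per example, stored as a compound (table-like) dataset.
    metadata = np.zeros(100, dtype=[("rec_id", np.int64), ("sample_rate", np.float64)])
    metadata["rec_id"] = np.arange(100)
    metadata["sample_rate"] = 1_000_000.0

    with h5py.File("my_dataset.hdf5", "w") as f:
        data = f.create_dataset("data", data=examples)
        data.attrs["license"] = "..."  # full license text; this attribute name is a placeholder

        meta = f.create_group("metadata")
        meta.create_dataset("metadata", data=metadata)

        # Dataset-level information, stored here as attributes on an empty "about" dataset.
        about = meta.create_dataset("about", data=h5py.Empty("f"))
        about.attrs["author"] = "..."
        about.attrs["name"] = "..."
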
Data format
~~~~~~~~~~~

IQ data
^^^^^^^

IQ data is stored as complex numbers, where each data point is a complex value combining the
in-phase (I) and quadrature (Q) components. The precision depends on the application, but often each component (real
and imaginary) is stored as a 32-bit floating-point number.

IQ data is stored as a 3-dimensional array with the shape ``M x C x N``:

- ``M``: Represents the number of examples in the dataset.
- ``C``: Indicates the number of radio channels.
- ``N``: Denotes the length of each signal, which is the number of data points in each example.

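For instance, a minimal NumPy sketch of an IQ data array in this layout (the sizes are illustrative):

.. code-block:: python

    import numpy as np

    # 1,000 examples (M), 1 radio channel (C), 1,024 complex samples each (N).
    # complex64 stores each component as a 32-bit float.
    iq_data = np.zeros((1000, 1, 1024), dtype=np.complex64)
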
Spectrogram data
^^^^^^^^^^^^^^^^

Spectrogram data is stored as real numbers, the exact format of which depends on the image format.

Spectrogram data is stored as a 4-dimensional array with the shape ``M x C x H x W``:

- ``M``: Represents the number of examples in the dataset.
- ``C``: Indicates the number of image channels.
- ``H x W``: Denotes the height and width of the spectrogram images, respectively.

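Again as an illustrative NumPy sketch:

.. code-block:: python

    import numpy as np

    # 1,000 spectrograms (M), 1 image channel (C), 128 x 128 pixels (H x W).
    spectrograms = np.zeros((1000, 1, 128, 128), dtype=np.float32)
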
Reading source files
~~~~~~~~~~~~~~~~~~~~

Here's an example of how to read these source files in pure Python, using the
`h5py <https://docs.h5py.org/en/stable/index.html>`_ library:

.. code-block:: python

    import h5py

    with h5py.File(dataset_file, "r") as f:
        data = f["data"]
        print(f"Length of the dataset: {len(data)}")

        print("Keys in metadata/about:")
        for attr_name, attr_value in f["metadata/about"].attrs.items():
            print(f"{attr_name}: {attr_value}")

        print("Keys in metadata/metadata:")
        for attr_name, attr_value in f["metadata/metadata"].attrs.items():
            print(f"{attr_name}: {attr_value}")

To load the data in as a NumPy array and the metadata as a pandas DataFrame:

.. code-block:: python

    import h5py
    import pandas as pd

    with h5py.File(dataset_file, "r") as f:
        data = f["data"][:]
        metadata = pd.DataFrame(f["metadata/metadata"][:])

.. note:: It is generally inadvisable to read the entire dataset into memory.

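If you do need to walk a large dataset in pure Python, one option is to read it in slices, so that only one chunk is
in memory at a time. A minimal sketch, again using h5py (the chunk size of 256 examples is arbitrary):

.. code-block:: python

    import h5py

    with h5py.File(dataset_file, "r") as f:
        data = f["data"]
        for start in range(0, len(data), 256):
            chunk = data[start:start + 256]  # reads only this slice from disk
            ...  # process the chunk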