Synthesizing a Dataset

The Dataset Generator creates labelled collections of simulated RF signals, ready for training modulation-recognition models. Instead of collecting real recordings, you configure the modulations, channel conditions, and sampling strategy you want — and RIA Hub generates the samples entirely in software.

Each example is a fixed-length IQ sample array paired with a class label (the modulation type). You control the signal pipeline end-to-end: modulation family and scheme, pulse-shaping filter, channel model (AWGN, Rayleigh fading, or clean), SNR range, and whether samples are drawn on a grid, randomly, or balanced across classes.

Use the Dataset Generator when you want to:

  • Create a labelled training dataset without needing RF hardware or recordings
  • Sweep a wide range of SNR conditions and modulation types systematically
  • Quickly prototype a modulation-recognition model before collecting real-world data
  • Reproduce a dataset exactly using a fixed random seed

What the generator produces:

  • An HDF5 file containing IQ sample arrays and corresponding class labels
  • A configurable number of samples per (modulation, SNR) combination, or a total sample budget drawn randomly or balanced across classes
  • Optional pulse shaping (RRC/RC filters) and channel impairments (AWGN, Rayleigh fading)

Before you start:

  • You need a RIA Hub repository to store the output. Create one first if you don’t have one.
  • Generation runs as a background job; large datasets (hundreds of thousands of samples) can take several minutes.

Navigate to your repository, click the Dataset Manager tab, then select Generator from the sidebar.

Give your dataset a name and an optional description. The name appears in the repository and is embedded in the file’s attributes, so use something descriptive — for example, psk-qam-awgn-v1.

Check the modulation types you want to include. Each selected modulation becomes one class in the output dataset.

Modulations are grouped by family:

| Family | Options |
| --- | --- |
| PSK (Phase-Shift Keying) | BPSK, QPSK, 8PSK, 16PSK, 32PSK |
| QAM (Quadrature Amplitude Modulation) | QAM16, QAM64, QAM256 |
| PAM (Pulse Amplitude Modulation) | PAM2, PAM4, PAM8 |
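
To make the three families concrete, here is a minimal NumPy sketch of the ideal constellation points behind each one. The function names and the `CONSTELLATIONS` lookup are hypothetical, and the unit-average-power scaling is an assumption; RIA Hub's own symbol mapping and normalization may differ.

```python
import numpy as np

def psk(order: int) -> np.ndarray:
    """M-PSK: `order` points equally spaced on the unit circle."""
    return np.exp(2j * np.pi * np.arange(order) / order)

def pam(order: int) -> np.ndarray:
    """M-PAM: evenly spaced real amplitudes, scaled to unit average power."""
    levels = np.arange(-(order - 1), order, 2, dtype=float)  # e.g. -3, -1, 1, 3
    return (levels / np.sqrt(np.mean(levels ** 2))).astype(complex)

def qam(order: int) -> np.ndarray:
    """Square M-QAM: a PAM grid on both axes, scaled to unit average power."""
    side = int(round(np.sqrt(order)))
    if side * side != order:
        raise ValueError("this sketch only handles square QAM orders")
    levels = np.arange(-(side - 1), side, 2, dtype=float)
    points = (levels[:, None] + 1j * levels[None, :]).ravel()
    return points / np.sqrt(np.mean(np.abs(points) ** 2))

# Hypothetical lookup keyed by the option names used in the generator.
CONSTELLATIONS = {
    "BPSK": psk(2), "QPSK": psk(4), "8PSK": psk(8),
    "QAM16": qam(16), "QAM64": qam(64),
    "PAM4": pam(4),
}
```

Scaling every constellation to the same average power keeps amplitude alone from giving the class away, which is one reason a sketch like this normalizes.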

| Parameter | What it controls |
| --- | --- |
| Signal length | Number of IQ samples per example (1024 is a common starting point) |
| Sample rate | Simulated sample rate in Hz (e.g., 1 000 000 for 1 MHz) |

A longer signal gives the model more context per example but increases file size.
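
As a rough worked example of that trade-off, the sketch below computes how much time one example spans and how many raw bytes it occupies. The 8 bytes per IQ sample is an assumption (complex64: two 32-bit floats); the generator's actual storage type is not stated here.

```python
def example_duration_s(signal_length: int, sample_rate_hz: float) -> float:
    """Time covered by one example, in seconds."""
    return signal_length / sample_rate_hz

def example_bytes(signal_length: int, bytes_per_iq_sample: int = 8) -> int:
    """Raw size of one example. Assumes complex64 (8 bytes per IQ sample)."""
    return signal_length * bytes_per_iq_sample

duration = example_duration_s(1024, 1_000_000)
print(f"{duration * 1e3:.3f} ms per example")       # 1.024 ms per example
print(f"{example_bytes(1024)} bytes per example")   # 8192 bytes per example
```

Doubling the signal length doubles both numbers, before any HDF5 compression or metadata overhead.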

Pulse shaping filters the baseband signal, affecting its spectral shape and inter-symbol interference characteristics.

| Parameter | Description |
| --- | --- |
| Filter type | RRC (Root-Raised Cosine), RC (Raised Cosine), or none |
| Samples per symbol (SPS) | How many samples represent each symbol |
| Rolloff (β) | Controls filter bandwidth: 0 is narrow, 1 is wide |
| Span | Filter length in symbols |
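
To show how SPS, rolloff, and span interact, here is a sketch of root-raised-cosine taps built from the standard textbook impulse response. The function name `rrc_taps` and the unit-energy normalization are assumptions for illustration, not the generator's implementation.

```python
import numpy as np

def rrc_taps(sps: int, rolloff: float, span: int) -> np.ndarray:
    """Root-raised-cosine taps covering `span` symbols at `sps` samples per symbol."""
    half = span * sps // 2
    t = np.arange(-half, half + 1) / sps  # time axis in symbol periods
    taps = np.empty(t.size)
    for i, ti in enumerate(t):
        if np.isclose(ti, 0.0):
            # Limit of the general formula at t = 0.
            taps[i] = 1.0 - rolloff + 4.0 * rolloff / np.pi
        elif rolloff > 0 and np.isclose(abs(ti), 1.0 / (4.0 * rolloff)):
            # Limit at the other singular points, t = +/- 1 / (4 * rolloff).
            taps[i] = (rolloff / np.sqrt(2.0)) * (
                (1.0 + 2.0 / np.pi) * np.sin(np.pi / (4.0 * rolloff))
                + (1.0 - 2.0 / np.pi) * np.cos(np.pi / (4.0 * rolloff))
            )
        else:
            num = (np.sin(np.pi * ti * (1.0 - rolloff))
                   + 4.0 * rolloff * ti * np.cos(np.pi * ti * (1.0 + rolloff)))
            den = np.pi * ti * (1.0 - (4.0 * rolloff * ti) ** 2)
            taps[i] = num / den
    # Normalize to unit energy (an assumption; other conventions exist).
    return taps / np.sqrt(np.sum(taps ** 2))

taps = rrc_taps(sps=8, rolloff=0.35, span=8)
print(taps.size)  # 65 taps: span * sps + 1
```

A larger span gives a closer approximation of the ideal filter at the cost of more taps, and a larger rolloff widens the occupied bandwidth to roughly (1 + β) times the symbol rate.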

The channel model adds realistic impairments to the signals.

| Option | What it adds |
| --- | --- |
| AWGN only | Additive white Gaussian noise at the SNR you specify |
| Rayleigh fading + AWGN | Multipath fading on top of noise; better for training robust models |
| None | Clean signals with no impairment; useful for debugging |

AWGN only is the right starting point for most training datasets.
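
The two impairments can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the generator's code: `add_awgn` scales noise against the measured signal power, and `rayleigh_flat_fading` applies a single complex gain to the whole example (flat block fading), which is only the simplest possible Rayleigh model.

```python
import numpy as np

def add_awgn(iq: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add complex white Gaussian noise so the result has the requested SNR."""
    signal_power = np.mean(np.abs(iq) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    # Complex noise: half of the noise power goes on I, half on Q.
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(iq.shape) + 1j * rng.standard_normal(iq.shape)
    )
    return iq + noise

def rayleigh_flat_fading(iq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Multiply the example by one complex Gaussian gain (assumed flat block fading)."""
    gain = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2.0)
    return gain * iq

rng = np.random.default_rng(0)
clean = np.exp(2j * np.pi * rng.integers(0, 4, size=100_000) / 4)  # QPSK symbols
noisy = add_awgn(clean, snr_db=10.0, rng=rng)
faded_noisy = add_awgn(rayleigh_flat_fading(clean, rng), snr_db=10.0, rng=rng)
```

Note that in this sketch the SNR of the faded signal is measured after fading; a simulator could equally define it before fading, which would make deep fades noisier.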

Signal-to-Noise Ratio (SNR) measures how much signal there is relative to background noise. A higher SNR means a cleaner signal.

Set a minimum and maximum SNR in dB. For example, −10 dB to 20 dB covers a wide range of channel conditions. Set a step to control how many SNR levels are sampled within that range (e.g., a step of 2 dB over a 30 dB range gives 15 intervals and therefore 16 SNR levels, counting both endpoints).
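
The level count can be checked with a small helper. `snr_levels` is a hypothetical name, and including both endpoints is an assumption inferred from the 16-level example above.

```python
import math

def snr_levels(snr_min_db: float, snr_max_db: float, step_db: float) -> list:
    """SNR values from min to max (inclusive) in increments of step_db."""
    if step_db <= 0 or snr_max_db < snr_min_db:
        raise ValueError("need step_db > 0 and snr_max_db >= snr_min_db")
    # The small epsilon guards against float round-off for steps such as 0.1.
    count = int(math.floor((snr_max_db - snr_min_db) / step_db + 1e-9)) + 1
    return [snr_min_db + i * step_db for i in range(count)]

levels = snr_levels(-10, 20, 2)
print(len(levels), levels[0], levels[-1])  # 16 -10 20
```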

The sampling strategy determines how (modulation, SNR) pairs are selected.

| Strategy | What it does |
| --- | --- |
| Grid | Generates examples at every (modulation, SNR) combination. The most systematic approach. |
| Random | Draws N random examples by sampling modulation and SNR uniformly. |
| Balanced | Generates an equal number of examples per modulation class, with SNR drawn randomly. |
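
The difference between the three strategies is easiest to see as code that picks the (modulation, SNR) pair for each example. The sketch below uses hypothetical function names and assumes that random SNR values are drawn from the discrete SNR levels; the generator might instead draw them continuously from the range.

```python
import itertools
import random
from collections import Counter

def grid_pairs(mods, snrs, per_combo):
    """Every (modulation, SNR) combination, repeated per_combo times."""
    return [(m, s) for m, s in itertools.product(mods, snrs) for _ in range(per_combo)]

def random_pairs(mods, snrs, total, rng):
    """`total` pairs with modulation and SNR each drawn uniformly."""
    return [(rng.choice(mods), rng.choice(snrs)) for _ in range(total)]

def balanced_pairs(mods, snrs, per_class, rng):
    """Exactly per_class examples for each modulation, SNR drawn uniformly."""
    return [(m, rng.choice(snrs)) for m in mods for _ in range(per_class)]

mods = ["BPSK", "QPSK", "QAM16"]
snrs = list(range(-10, 21, 2))   # 16 levels
rng = random.Random(42)          # fixed seed so the draw is repeatable

grid = grid_pairs(mods, snrs, per_combo=10)
rand = random_pairs(mods, snrs, total=480, rng=rng)
bal = balanced_pairs(mods, snrs, per_class=160, rng=rng)

print(Counter(m for m, _ in rand))  # class counts vary from draw to draw
print(Counter(m for m, _ in bal))   # exactly 160 per class
```

All three lists have 480 entries here, but only Grid guarantees the same count in every (modulation, SNR) cell, and only Grid and Balanced guarantee equal class sizes.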

The summary panel shows the estimated total sample count based on your configuration. Check this before submitting — it’s easy to accidentally configure a dataset with millions of samples.
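
A back-of-the-envelope version of that estimate for the Grid strategy is sketched below. The helper name is hypothetical, and the size figure assumes 1024 complex64 samples per example with no compression, so treat it as an order-of-magnitude check only.

```python
def grid_total(num_modulations: int, num_snr_levels: int, samples_per_combo: int) -> int:
    """Total examples produced by a grid sweep."""
    return num_modulations * num_snr_levels * samples_per_combo

# All 11 listed modulations, 16 SNR levels, 1000 examples per combination.
total = grid_total(11, 16, 1000)

# Assumed storage: 1024 IQ samples per example, 8 bytes each (complex64).
raw_gb = total * 1024 * 8 / 1e9

print(f"{total:,} examples, about {raw_gb:.2f} GB raw")  # 176,000 examples, about 1.44 GB raw
```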

If you want the dataset to be reproducible, enter a fixed random seed. Using the same seed with the same configuration always produces the same output. Leave it blank for a non-deterministic run.
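
The same principle can be demonstrated with NumPy's random generator, which is used here purely as an illustration; how RIA Hub seeds its own pipeline internally is not documented in this guide.

```python
import numpy as np

def noise_block(seed, n=4):
    """Draw n complex Gaussian samples from a generator created with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n) + 1j * rng.standard_normal(n)

a = noise_block(1234)
b = noise_block(1234)   # same seed: identical values
c = noise_block(None)   # no seed: fresh entropy on every call

print(np.array_equal(a, b))  # True
```

Reproducibility depends on the whole configuration, not just the seed: changing any parameter changes how the random stream is consumed.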

Click Generate. The job is queued and runs in the background. A progress indicator shows status while the generation runs.

Small datasets (a few thousand samples) typically complete in under a minute. Larger grid sweeps take longer.

When generation finishes, you can:

  • Download: Save the HDF5 file to your local machine.
  • Commit to repository: Store the dataset in your RIA Hub repository under version control. This is the recommended workflow because the file is versioned, associated with its generation parameters, and accessible to your team.

Next steps:

  • Inspect the dataset: Run the Dataset Inspector to check class balance and per-class signal statistics before committing to a training run.
  • Train a model: When the dataset looks good, take it to the Model Builder to configure and launch a training workflow.
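
If you download the file, you can also take a quick look at it locally. The sketch below uses the h5py library to list the file-level attributes and every dataset with its shape and dtype. It deliberately does not assume any dataset names, because the internal layout of the generated file is not described in this guide; the file name in the usage comment is hypothetical.

```python
import h5py

def describe_hdf5(path):
    """Return (file attributes, {dataset path: (shape, dtype)}) for an HDF5 file."""
    datasets = {}

    def visit(name, obj):
        # visititems walks groups and datasets; keep only the datasets.
        if isinstance(obj, h5py.Dataset):
            datasets[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        attrs = dict(f.attrs)
        f.visititems(visit)
    return attrs, datasets

# Usage (hypothetical file name):
# attrs, datasets = describe_hdf5("psk-qam-awgn-v1.h5")
# for name, (shape, dtype) in datasets.items():
#     print(name, shape, dtype)
```

Checking shapes this way is a quick sanity test that the sample count and signal length match what you configured.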