Data Preprocessing

This guide covers the preprocessing tools available in pgmuvi for assessing data quality, filtering observations, and subsampling large datasets.

Overview

Raw observational datasets often contain:

  • observations from poorly sampled or non-variable sources,

  • extremely dense sampling that is computationally expensive for GP fitting,

  • bands with insufficient coverage to constrain variability.

pgmuvi provides tools to address each of these issues before fitting.

Checking Variability

The first question to ask is whether your source is actually variable. pgmuvi provides three complementary test statistics:

Statistic      Description
-------------  ----------------------------------------------------------------
Weighted χ²    Tests against a constant-flux (null) model.
Fvar           Fractional excess variance; measures variability amplitude
               relative to noise.
Stetson K      A robust, outlier-resistant index of correlated variability.
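To make the three statistics concrete, here is a minimal NumPy sketch of their usual textbook definitions. It is an illustration only and not pgmuvi's implementation: the function name variability_stats, the returned dictionary keys, and the exact normalisations are assumptions, and the values pgmuvi reports may differ.

```python
import numpy as np

def variability_stats(flux, err):
    """Textbook variability statistics for a single band (illustrative sketch)."""
    flux = np.asarray(flux, dtype=float)
    err = np.asarray(err, dtype=float)
    n = flux.size

    # Inverse-variance weighted mean: the best-fitting constant-flux model.
    w = 1.0 / err**2
    wmean = np.sum(w * flux) / np.sum(w)

    # Weighted chi-square of the data about that constant (n - 1 degrees of freedom).
    chi2 = np.sum(((flux - wmean) / err) ** 2)

    # Fractional excess variance: sample variance minus mean squared error,
    # square-rooted and divided by the mean flux. Set to 0 if noise dominates.
    excess = np.var(flux, ddof=1) - np.mean(err**2)
    fvar = np.sqrt(excess) / np.mean(flux) if excess > 0 else 0.0

    # Stetson K: mean absolute normalised residual over its RMS.
    delta = np.sqrt(n / (n - 1.0)) * (flux - wmean) / err
    stetson_k = np.mean(np.abs(delta)) / np.sqrt(np.mean(delta**2))

    return {"chi2": chi2, "dof": n - 1, "fvar": fvar, "stetson_k": stetson_k}
```

A chi2 much larger than dof argues against a constant source, and fvar is only meaningful when the scatter exceeds the quoted errors. The sketch assumes at least two points that are not all identical.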

For single-band data:

result = lc.check_variability()
print(result)

For multiband data:

results = lc.check_variability_per_band()
for band, r in results.items():
    print(band, r)

To retain only bands that pass a variability criterion:

lc = lc.filter_variable_bands(fvar_min=0.1)

Sampling Quality Metrics

Even if a source is variable, the observations may not resolve the variability timescales of interest. pgmuvi computes several metrics to assess this:

Key                        Description
-------------------------  ----------------------------------------------------
n_points                   Number of finite observations.
baseline                   Total time span (max(t) - min(t)).
max_gap                    Largest gap between consecutive observations.
max_gap_fraction           max_gap / baseline; large values reduce sensitivity
                           to long-period variability.
median_cadence             Median time between consecutive observations.
mean_cadence               Mean time between consecutive observations.
cadence_std                Standard deviation of the cadence distribution.
nyquist_period             2 × effective_cadence; shortest reliably detectable
                           period.
nyquist_frequency          1 / (2 × effective_cadence); corresponding Nyquist
                           frequency.
longest_detectable_period  baseline / 2 (heuristic upper limit on detectable
                           periods).
duty_cycle                 Fraction of the baseline with observations
                           (n × cadence / baseline).
sampling_uniformity        1 - std(cadence) / mean(cadence); 1 = perfectly
                           uniform, 0 = highly irregular.
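The definitions in the table can be reproduced from the time array alone. The sketch below follows the formulas as written above; it is not pgmuvi's code. It assumes that effective_cadence (and the cadence in duty_cycle) is the median spacing, which this guide does not state, and the function name sampling_metrics_sketch is hypothetical.

```python
import numpy as np

def sampling_metrics_sketch(t):
    """Recompute the tabulated sampling metrics from observation times."""
    t = np.asarray(t, dtype=float)
    t = np.sort(t[np.isfinite(t)])   # drop non-finite times, then order them
    dt = np.diff(t)                  # spacing between consecutive observations
    baseline = t[-1] - t[0]
    cadence = np.median(dt)          # ASSUMPTION: effective cadence = median spacing

    return {
        "n_points": t.size,
        "baseline": baseline,
        "max_gap": dt.max(),
        "max_gap_fraction": dt.max() / baseline,
        "median_cadence": np.median(dt),
        "mean_cadence": dt.mean(),
        "cadence_std": dt.std(),
        "nyquist_period": 2.0 * cadence,
        "nyquist_frequency": 1.0 / (2.0 * cadence),
        "longest_detectable_period": baseline / 2.0,
        "duty_cycle": t.size * cadence / baseline,
        "sampling_uniformity": 1.0 - dt.std() / dt.mean(),
    }
```

The sketch needs at least two distinct times. Note that the raw uniformity formula can fall below 0 for very irregular sampling; whether pgmuvi clips the value is not covered here.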

Retrieve numeric metrics:

metrics = lc.compute_sampling_metrics()
print(metrics)

Or get a plain-language assessment with recommendations:

lc.assess_sampling_quality()

For multiband data:

lc.assess_sampling_quality_per_band()

To retain only well-sampled bands:

lc = lc.filter_well_sampled_bands(min_points=20)

Subsampling Dense Datasets

GP inference scales as \(\mathcal{O}(N^3)\) in the number of observations. For very densely sampled light curves, subsampling can make fitting computationally feasible without significantly affecting the results, provided the subsampled dataset still satisfies the Nyquist criterion for the periods of interest.
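As a rough illustration of the trade-off, the sketch below estimates the speed-up implied by cubic scaling and checks whether a set of observation times still resolves the shortest period of interest. Both helper names are hypothetical, and the Nyquist check assumes the median spacing is a fair measure of cadence, which is only approximately true for irregular sampling.

```python
import numpy as np

def cubic_speedup(n_full, n_sub):
    """Approximate cost ratio of an O(N^3) fit before and after subsampling."""
    return (n_full / n_sub) ** 3

def resolves_period(times, min_period):
    """True if the Nyquist period (2 x median spacing) does not exceed min_period."""
    dt = np.diff(np.sort(np.asarray(times, dtype=float)))
    return bool(2.0 * np.median(dt) <= min_period)

# Reducing 5000 points to 500 cuts the cubic cost by a factor of 10^3.
print(cubic_speedup(5000, 500))  # 1000.0
```

Calling resolves_period on the subsampled times before fitting is a cheap way to confirm that subsampling has not discarded the timescales you care about.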

pgmuvi provides a gap-preserving random subsampling function in the preprocessing subpackage:

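import pgmuvi.lightcurve
from pgmuvi.preprocess import subsample_lightcurve

# Move the light-curve tensors to the CPU and convert them to NumPy arrays.
times  = lc.xdata.cpu().numpy()
fluxes = lc.ydata.cpu().numpy()
errors = lc.yerr.cpu().numpy()

# Select at most 500 observations; the result indexes into `times`.
idx = subsample_lightcurve(times, max_samples=500)

# Build a new light curve from the selected observations only.
lc_sub = pgmuvi.lightcurve.Lightcurve(
    times[idx], fluxes[idx], errors[idx]
)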

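subsample_lightcurve operates on the 1-D time array alone (plus max_samples, as above) and returns an array of indices into it, so the same indices must be applied to the flux and error arrays. It preserves the overall time coverage (gaps are retained in proportion) so that long-timescale variability remains detectable after subsampling.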
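For intuition about why coverage-preserving subsampling differs from a plain random draw, here is one simple way such a scheme could work. This is explicitly not the algorithm used by subsample_lightcurve, whose internals are not described in this guide; the function stratified_subsample and its seed parameter are hypothetical. It splits the time-ordered observations into max_samples consecutive chunks and draws one point from each, so every stretch of the light curve stays represented and the gaps between stretches remain in place.

```python
import numpy as np

def stratified_subsample(times, max_samples, seed=None):
    """Pick one random index from each of max_samples consecutive chunks."""
    times = np.asarray(times, dtype=float)
    order = np.argsort(times)             # indices in time order
    if times.size <= max_samples:
        return order                      # nothing to thin out

    rng = np.random.default_rng(seed)
    chunks = np.array_split(order, max_samples)   # disjoint, non-empty chunks
    picked = np.array([rng.choice(chunk) for chunk in chunks])

    # Keep the first and last observation so the baseline is unchanged.
    picked[0], picked[-1] = order[0], order[-1]
    return picked
```

The returned indices are in time order and can be applied to the time, flux, and error arrays alike, in the same way as the idx array above.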

See also

pgmuvi.preprocess — full API reference for the preprocessing subpackage.

Quality Filtering

Note

This section will describe how to apply quality flags or sigma-clipping to remove outliers before fitting. The relevant utilities are located in pgmuvi.preprocess.quality.

Preprocessing Tutorial

A dedicated notebook tutorial covering the full preprocessing workflow — loading data, checking variability, assessing sampling quality, and subsampling — is provided in the User Guide: