Data Preprocessing¶

This guide covers the preprocessing tools available in pgmuvi for assessing data quality, filtering observations, and subsampling large datasets.

Overview ¶

Raw observational datasets often contain:

observations from poorly sampled or non-variable sources,
extremely dense sampling that is computationally expensive for GP fitting,
bands with insufficient coverage to constrain variability.

pgmuvi provides tools to address each of these issues before fitting.

Checking Variability ¶

The first question to ask is whether your source is actually variable. pgmuvi provides three complementary test statistics:

Statistic	Description
Weighted χ²	Tests against a constant-flux (null) model.
F_var	Fractional excess variance; measures variability amplitude relative to noise.
Stetson K	A robust, outlier-resistant index of correlated variability.
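
As a rough illustration of what the first two statistics measure, the sketch below computes a weighted χ² against the weighted-mean flux and the fractional excess variance from plain Python lists. The function name `variability_stats` and the exact normalisations are assumptions made for illustration; `lc.check_variability()` may define them differently. Stetson K is left out.

```python
import math


def variability_stats(flux, err):
    """Illustrative weighted chi-square and F_var (not pgmuvi's implementation)."""
    n = len(flux)
    if n < 2:
        raise ValueError("need at least two observations")

    # Inverse-variance weighted mean: the best-fit constant-flux model.
    weights = [1.0 / e**2 for e in err]
    wmean = sum(w * f for w, f in zip(weights, flux)) / sum(weights)

    # Chi-square of the data against that constant model.
    chi2 = sum(((f - wmean) / e) ** 2 for f, e in zip(flux, err))

    # Fractional excess variance: sample variance minus the mean squared
    # error, normalised by the mean flux. Set to 0 when noise dominates.
    mean = sum(flux) / n
    sample_var = sum((f - mean) ** 2 for f in flux) / (n - 1)
    mean_sq_err = sum(e**2 for e in err) / n
    excess = sample_var - mean_sq_err
    fvar = math.sqrt(excess) / abs(mean) if excess > 0 and mean != 0 else 0.0

    return {"chi2": chi2, "dof": n - 1, "fvar": fvar}


stats = variability_stats([1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0], [0.1] * 7)
print(stats)
```

A χ² much larger than the number of degrees of freedom, together with a clearly non-zero F_var, suggests variability well above the noise level.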

For single-band data:

result = lc.check_variability()
print(result)

For multiband data:

results = lc.check_variability_per_band()
for band, r in results.items():
    print(band, r)

To retain only bands that pass a variability criterion:

lc = lc.filter_variable_bands(fvar_min=0.1)

Sampling Quality Metrics ¶

Even if a source is variable, the observations may not resolve the variability timescales of interest. pgmuvi computes several metrics to assess this:

Key	Description
`n_points`	Number of finite observations.
`baseline`	Total time span (`max(t) − min(t)`).
`max_gap`	Largest gap between consecutive observations.
`max_gap_fraction`	`max_gap / baseline`; large values reduce sensitivity to long-period variability.
`median_cadence`	Median time between consecutive observations.
`mean_cadence`	Mean time between consecutive observations.
`cadence_std`	Standard deviation of the cadence distribution.
`nyquist_period`	`2 × effective_cadence`; shortest reliably detectable period.
`nyquist_frequency`	`1 / (2 × effective_cadence)`; corresponding Nyquist frequency.
`longest_detectable_period`	`baseline / 2` (heuristic upper limit on detectable periods).
`duty_cycle`	Fraction of the baseline with observations (`n × cadence / baseline`).
`sampling_uniformity`	`1 − std(cadence) / mean(cadence)`; 1 = perfectly uniform, 0 = highly irregular.
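
To make the definitions in the table concrete, the following sketch evaluates them for a short list of observation times. It is not pgmuvi's implementation: the function name `sampling_metrics` is hypothetical, the median cadence is assumed to be the "effective cadence", and the population standard deviation is assumed for `cadence_std`.

```python
import math
import statistics


def sampling_metrics(times):
    """Illustrative sampling metrics following the table's formulas."""
    t = sorted(x for x in times if math.isfinite(x))
    if len(t) < 2:
        raise ValueError("need at least two finite observation times")

    gaps = [b - a for a, b in zip(t, t[1:])]
    baseline = t[-1] - t[0]
    median_cadence = statistics.median(gaps)
    mean_cadence = statistics.fmean(gaps)
    cadence_std = statistics.pstdev(gaps)  # ASSUMPTION: population std

    effective = median_cadence  # ASSUMPTION: effective cadence = median
    if baseline <= 0 or effective <= 0:
        raise ValueError("times must span a positive baseline")

    return {
        "n_points": len(t),
        "baseline": baseline,
        "max_gap": max(gaps),
        "max_gap_fraction": max(gaps) / baseline,
        "median_cadence": median_cadence,
        "mean_cadence": mean_cadence,
        "cadence_std": cadence_std,
        "nyquist_period": 2 * effective,
        "nyquist_frequency": 1 / (2 * effective),
        "longest_detectable_period": baseline / 2,
        "duty_cycle": len(t) * effective / baseline,
        "sampling_uniformity": 1 - cadence_std / mean_cadence,
    }


# Five regular observations followed by one after a long gap.
for key, value in sampling_metrics([0, 1, 2, 3, 4, 10]).items():
    print(f"{key}: {value}")
```

In this example a single long gap accounts for 60% of the baseline, which is the kind of pattern `max_gap_fraction` is meant to flag.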

Retrieve numeric metrics:

metrics = lc.compute_sampling_metrics()
print(metrics)

Or get a plain-language assessment with recommendations:

lc.assess_sampling_quality()

For multiband data:

lc.assess_sampling_quality_per_band()

To retain only well-sampled bands:

lc = lc.filter_well_sampled_bands(min_points=20)

Subsampling Dense Datasets ¶

GP inference scales as \(\mathcal{O}(N^3)\) in the number of observations \(N\). For very densely sampled light curves, subsampling can make fitting computationally feasible without significantly affecting the results, provided the subsampled dataset still satisfies the Nyquist criterion for the periods of interest.
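
The trade-off can be estimated before subsampling. The sketch below, with the hypothetical helper names `cubic_speedup` and `resolves_period`, computes the idealised speed-up implied by cubic scaling and checks whether a set of times still resolves a given shortest period. It assumes the median cadence is the relevant sampling interval, which is a simplification for irregular data.

```python
import statistics


def cubic_speedup(n_full, n_sub):
    """Idealised cost ratio for an O(N^3) method when N drops to n_sub."""
    return (n_full / n_sub) ** 3


def resolves_period(times, shortest_period):
    """True if twice the median cadence does not exceed shortest_period.

    ASSUMPTION: the median gap stands in for the sampling interval in the
    Nyquist criterion.
    """
    t = sorted(times)
    gaps = [b - a for a, b in zip(t, t[1:])]
    return 2 * statistics.median(gaps) <= shortest_period


# Going from 5000 to 500 points: cubic scaling implies a 1000x cost ratio.
print(cubic_speedup(5000, 500))

# Daily sampling resolves a 2-day period; 5-day sampling does not.
print(resolves_period([0, 1, 2, 3, 4], shortest_period=2.0))
print(resolves_period([0, 5, 10], shortest_period=2.0))
```

The speed-up is an upper bound on what to expect in practice, since constant factors and any approximate inference scheme change the real cost.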

pgmuvi provides a gap-preserving random subsampling function in the preprocessing subpackage:

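import pgmuvi.lightcurve
from pgmuvi.preprocess import subsample_lightcurve

# Move the tensors to the CPU and convert them to NumPy arrays.
times  = lc.xdata.cpu().numpy()
fluxes = lc.ydata.cpu().numpy()
errors = lc.yerr.cpu().numpy()

# Indices of at most 500 observations to keep.
idx = subsample_lightcurve(times, max_samples=500)

# Build a new light curve from the selected observations.
lc_sub = pgmuvi.lightcurve.Lightcurve(
    times[idx], fluxes[idx], errors[idx]
)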

subsample_lightcurve takes only the 1-D time array and returns an index array. It preserves the overall time coverage (gaps are retained in proportion) so that long-timescale variability remains detectable after subsampling.
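
The index-array contract described above can be illustrated with a simple stand-in. The function below, `random_subsample_indices`, is hypothetical and is not pgmuvi's algorithm: it draws a uniform random subset of observations and always keeps the first and last points so that the baseline is unchanged.

```python
import random


def random_subsample_indices(times, max_samples, seed=0):
    """Return indices of a time-ordered random subsample (illustrative only)."""
    n = len(times)
    if n <= max_samples:
        return list(range(n))  # nothing to drop
    if max_samples < 2:
        raise ValueError("max_samples must be at least 2")

    # Indices of the observations in time order.
    order = sorted(range(n), key=lambda i: times[i])

    # Keep both endpoints; draw the rest at random from the interior.
    rng = random.Random(seed)
    interior = rng.sample(order[1:-1], max_samples - 2)
    chosen = [order[0]] + interior + [order[-1]]

    return sorted(chosen, key=lambda i: times[i])


times = [float(i) for i in range(100)]
idx = random_subsample_indices(times, max_samples=10)
print(idx)
```

Because the function returns indices rather than values, the same selection can be applied to the flux and error arrays, as in the pgmuvi example above.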

Quality Filtering ¶

Note

This section will describe how to apply quality flags or sigma-clipping to remove outliers before fitting. The relevant utilities are located in pgmuvi.preprocess.quality.
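
Until that description is available, the general idea of sigma-clipping can be sketched independently of pgmuvi. The function `sigma_clip_mask` below is a hypothetical illustration and does not reflect the API of pgmuvi.preprocess.quality; it iteratively rejects points that lie more than `sigma` standard deviations from the median of the points still kept.

```python
import statistics


def sigma_clip_mask(values, sigma=3.0, max_iter=5):
    """Return a list of booleans: True for points kept after clipping."""
    keep = [True] * len(values)
    for _ in range(max_iter):
        kept = [v for v, k in zip(values, keep) if k]
        if len(kept) < 2:
            break
        center = statistics.median(kept)
        spread = statistics.pstdev(kept)
        if spread == 0:
            break
        # A point stays only if it was kept before and is still within range.
        new_keep = [
            k and abs(v - center) <= sigma * spread
            for v, k in zip(values, keep)
        ]
        if new_keep == keep:
            break  # converged: no further points rejected
        keep = new_keep
    return keep


fluxes = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 50.0]
mask = sigma_clip_mask(fluxes)
clean = [f for f, k in zip(fluxes, mask) if k]
print(clean)
```

For strongly variable sources, clipping against a single global median can remove genuine signal, so a threshold of this kind should be chosen with the expected variability amplitude in mind.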

Preprocessing Tutorial ¶

A dedicated notebook tutorial covering the full preprocessing workflow — loading data, checking variability, assessing sampling quality, and subsampling — is provided in the User Guide:

Tutorial: Preprocessing and Data Quality Assessment
