Tutorial: Preprocessing and Data Quality Assessment

This notebook walks through the preprocessing tools available in pgmuvi:

  1. Checking whether a source is variable

  2. Assessing sampling quality (Nyquist period, detectable range)

  3. Filtering poorly sampled or non-variable bands

  4. Subsampling dense datasets

Prerequisites: you should be familiar with loading data into a Lightcurve object (see the Loading Data how-to guide and the basic pgmuvi_tutorial notebook).

1. Setup

We begin by generating a synthetic light curve with known properties so that we can verify the outputs of the preprocessing tools.

TODO: Replace the synthetic example below with a real observational dataset once this tutorial is expanded.

[ ]:
import numpy as np
import pgmuvi

# --- Placeholder: generate synthetic data ---
# TODO: expand with pgmuvi.synthetic once the synthetic tutorial is complete
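rng = np.random.default_rng(42)  # fixed seed so the example is reproducible
times = np.sort(rng.uniform(0, 1000, 300))  # 300 irregularly spaced epochs over a baseline of ~1000 time units
period = 100.0  # injected period, in the same units as times
# Sinusoid of amplitude 0.3 about a mean flux of 1.0, plus Gaussian noise (sigma = 0.05)
fluxes = 1.0 + 0.3 * np.sin(2 * np.pi * times / period) + rng.normal(0, 0.05, len(times))
errors = np.full_like(fluxes, 0.05)  # uniform uncertainties matching the injected noise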

lc = pgmuvi.lightcurve.Lightcurve(times, fluxes, errors)
print(f"Number of observations: {len(times)}")

2. Variability Detection

pgmuvi provides three complementary variability statistics:

  • Weighted χ² against a constant-flux null model

  • F_var (fractional excess variance)

  • Stetson K index

See the Concepts page in the documentation for a description of each statistic.

[ ]:
# TODO: expand once API is verified
result = lc.check_variability()
print(result)
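
To make the three statistics concrete, the sketch below computes them directly with NumPy from their common textbook definitions. It is independent of pgmuvi: the function name variability_stats is hypothetical, and the exact definitions and return format used by check_variability may differ, so treat it as an illustration rather than a reference implementation.

```python
import numpy as np


def variability_stats(flux, err):
    """Return (reduced chi-square, F_var, Stetson K) for a single band.

    Hypothetical helper using textbook definitions; not pgmuvi's code.
    """
    flux = np.asarray(flux, dtype=float)
    err = np.asarray(err, dtype=float)
    n = flux.size

    # Inverse-variance weighted mean: the best-fitting constant flux.
    weights = 1.0 / err**2
    mean_w = np.sum(weights * flux) / np.sum(weights)

    # Chi-square against the constant model, per degree of freedom (N - 1).
    chi2_red = np.sum(((flux - mean_w) / err) ** 2) / (n - 1)

    # Fractional excess variance: sample variance minus the mean squared
    # uncertainty, normalised by the mean flux. Negative excess is clipped to 0.
    excess = np.var(flux, ddof=1) - np.mean(err**2)
    f_var = np.sqrt(max(excess, 0.0)) / np.mean(flux)

    # Stetson K: mean absolute scaled residual over the RMS scaled residual.
    # Pure Gaussian noise gives a value near sqrt(2 / pi), roughly 0.80.
    delta = np.sqrt(n / (n - 1)) * (flux - mean_w) / err
    stetson_k = np.mean(np.abs(delta)) / np.sqrt(np.mean(delta**2))

    return chi2_red, f_var, stetson_k


# Same kind of synthetic source as in the Setup section, plus a constant one.
rng = np.random.default_rng(42)
times = np.sort(rng.uniform(0, 1000, 300))
errors = np.full(times.size, 0.05)
variable = 1.0 + 0.3 * np.sin(2 * np.pi * times / 100.0) + rng.normal(0, 0.05, times.size)
constant = 1.0 + rng.normal(0, 0.05, times.size)

for label, flux in [("variable", variable), ("constant", constant)]:
    chi2_red, f_var, k = variability_stats(flux, errors)
    print(f"{label}: reduced chi2 = {chi2_red:.2f}, F_var = {f_var:.3f}, Stetson K = {k:.3f}")
```

For the variable source the reduced χ² should be well above 1 and F_var should be of the order of the injected fractional amplitude, while the constant source should give a reduced χ² close to 1 and an F_var close to 0.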

3. Sampling Quality Assessment

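Before fitting, it is important to know which periods the data can actually constrain. As a rule of thumb, the typical spacing between observations (the cadence) limits the shortest detectable period, and the total time span (the baseline) limits the longest.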

[ ]:
# TODO: expand with interpretation of output
metrics = lc.compute_sampling_metrics()
print(metrics)

# Plain-language assessment
lc.assess_sampling_quality()
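
As a cross-check on the numbers reported above, the detectable period range can be estimated by hand from the time stamps alone. The sketch below assumes a simple rule of thumb: the shortest period is twice the median spacing (a pseudo-Nyquist period, since irregular sampling has no strict Nyquist limit) and the longest is the full baseline. The function name detectable_period_range is hypothetical, and pgmuvi's own metrics may use different definitions or thresholds.

```python
import numpy as np


def detectable_period_range(times):
    """Rough (shortest, longest) detectable period from time stamps only.

    Hypothetical helper based on a rule of thumb; not pgmuvi's definition.
    """
    t = np.sort(np.asarray(times, dtype=float))
    gaps = np.diff(t)

    # Two samples per cycle: twice the typical spacing between observations.
    shortest = 2.0 * np.median(gaps)

    # At least one full cycle must fit inside the observed time span.
    longest = t[-1] - t[0]
    return shortest, longest


# Same time sampling as in the Setup section.
rng = np.random.default_rng(42)
times = np.sort(rng.uniform(0, 1000, 300))

shortest, longest = detectable_period_range(times)
print(f"Pseudo-Nyquist period: {shortest:.2f}")
print(f"Baseline: {longest:.2f}")
print(f"Injected period of 100 within range: {shortest < 100.0 < longest}")
```

A more conservative choice is to require two or more cycles within the baseline, which would halve the upper limit.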

4. Subsampling Dense Datasets

GP inference scales as O(N³) with the number of observations N, so subsampling a dense light curve can reduce computation time substantially while retaining the information needed to detect variability.
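
As a rough illustration of the cubic scaling, the idealised cost ratio between the full and the subsampled dataset can be computed directly. This only accounts for the leading N³ term; real timings also depend on overheads, hardware, and the solver in use, so the measured speed-up will generally differ.

```python
def cubic_speedup(n_full, n_sub):
    """Idealised cost ratio for an O(N^3) algorithm when N drops from n_full to n_sub."""
    return (n_full / n_sub) ** 3


# Going from 300 to 100 observations, as in this tutorial.
print(f"Idealised speed-up: {cubic_speedup(300, 100):.0f}x")
```

For the 300-point light curve used here, keeping 100 points reduces the leading cost term by a factor of 27.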

The subsample_lightcurve function in pgmuvi.preprocess performs gap-preserving random subsampling.

[ ]:
from pgmuvi.preprocess import subsample_lightcurve

# subsample_lightcurve takes only the 1-D time array and returns indices
t = lc.xdata.cpu().numpy()
f = lc.ydata.cpu().numpy()
e = lc.yerr.cpu().numpy()

# TODO: expand with visualisation of before/after subsampling
idx = subsample_lightcurve(t, max_samples=100)
print(f"Subsampled from {len(t)} to {len(idx)} observations")

lc_sub = pgmuvi.lightcurve.Lightcurve(t[idx], f[idx], e[idx])
lc_sub.assess_sampling_quality()
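
To see what subsampling does to the sampling pattern, it helps to compare the baseline and the typical cadence before and after. The sketch below does this with a simpler stand-in for subsample_lightcurve: uniform random selection that always keeps the first and last observation, so the baseline is unchanged. The helper random_subsample_keep_ends is hypothetical and does not reproduce pgmuvi's gap-preserving algorithm.

```python
import numpy as np


def random_subsample_keep_ends(times, max_samples, seed=0):
    """Time-ordered indices of a random subsample that keeps both end points.

    Hypothetical stand-in for illustration; not pgmuvi's algorithm.
    """
    if max_samples < 2:
        raise ValueError("max_samples must be at least 2")
    times = np.asarray(times, dtype=float)
    order = np.argsort(times)
    if times.size <= max_samples:
        return order  # nothing to remove

    rng = np.random.default_rng(seed)
    # Draw interior points without replacement; the end points are always kept.
    interior = rng.choice(order[1:-1], size=max_samples - 2, replace=False)
    idx = np.concatenate(([order[0]], interior, [order[-1]]))
    return idx[np.argsort(times[idx])]


# Same time sampling as in the Setup section.
rng = np.random.default_rng(42)
times = np.sort(rng.uniform(0, 1000, 300))
idx = random_subsample_keep_ends(times, max_samples=100)

for label, t in [("full", times), ("subsampled", times[idx])]:
    print(f"{label}: N = {t.size}, baseline = {t[-1] - t[0]:.1f}, "
          f"median cadence = {np.median(np.diff(t)):.2f}")
```

The baseline is identical in both cases, while the median cadence grows, which raises the shortest period the subsampled data can resolve.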

Next Steps

  • Once you have verified data quality, proceed to the pgmuvi_tutorial notebook for GP fitting.

  • For multiband data, see the pgmuvi_tutorial_2d notebook.

  • For more detail on the preprocessing API, see the pgmuvi.preprocess API reference in the documentation.