Tutorial: Preprocessing and Data Quality Assessment
This notebook walks through the preprocessing tools available in pgmuvi:
Checking whether a source is variable
Assessing sampling quality (Nyquist period, detectable range)
Filtering poorly sampled or non-variable bands
Subsampling dense datasets
Prerequisites: you should be familiar with loading data into a Lightcurve object (see the Loading Data how-to guide and the basic pgmuvi_tutorial notebook).
1. Setup
We begin by generating a synthetic light curve with known properties so that we can verify the outputs of the preprocessing tools.
TODO: Replace the synthetic example below with a real observational dataset once this tutorial is expanded.
[ ]:
import numpy as np
import pgmuvi
# --- Placeholder: generate synthetic data ---
# TODO: expand with pgmuvi.synthetic once the synthetic tutorial is complete
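rng = np.random.default_rng(42)  # fixed seed so the example is reproducible
# 300 irregularly spaced epochs over a baseline of 1000 time units
times = np.sort(rng.uniform(0, 1000, 300))
# Sinusoid (period 100, amplitude 0.3) about a mean flux of 1.0, plus Gaussian noise
period = 100.0
noise_sigma = 0.05
signal = 1.0 + 0.3 * np.sin(2 * np.pi * times / period)
fluxes = signal + rng.normal(0, noise_sigma, len(times))
errors = np.full_like(fluxes, noise_sigma)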
lc = pgmuvi.lightcurve.Lightcurve(times, fluxes, errors)
print(f"Number of observations: {len(times)}")
2. Variability Detection
pgmuvi provides three complementary variability statistics:
Weighted χ² against a constant-flux null model
F_var (fractional excess variance)
Stetson K index
See the Concepts page in the documentation for a description of each statistic.
[ ]:
# TODO: expand once API is verified
result = lc.check_variability()
print(result)
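To make the three statistics listed above concrete, the following is a minimal NumPy sketch using their standard textbook definitions. It is not pgmuvi's implementation: the function name `variability_statistics` is hypothetical, and the normalisation or thresholds used by `check_variability` may differ.

```python
import numpy as np

def variability_statistics(flux, err):
    """Return (chi2, dof, f_var, stetson_k) for a single-band light curve.

    Hypothetical helper for illustration; not part of pgmuvi.
    """
    flux = np.asarray(flux, dtype=float)
    err = np.asarray(err, dtype=float)
    n = flux.size

    # Weighted chi-squared against the best-fitting constant flux,
    # which is the inverse-variance weighted mean.
    weights = 1.0 / err**2
    weighted_mean = np.sum(weights * flux) / np.sum(weights)
    chi2 = np.sum(weights * (flux - weighted_mean) ** 2)
    dof = n - 1

    # Fractional excess variance: sample variance minus the mean squared
    # measurement error, normalised by the mean flux (clipped at zero).
    excess = np.var(flux, ddof=1) - np.mean(err**2)
    f_var = np.sqrt(max(excess, 0.0)) / np.mean(flux)

    # Stetson K: mean absolute scaled residual over its root-mean-square.
    delta = np.sqrt(n / (n - 1)) * (flux - weighted_mean) / err
    stetson_k = np.mean(np.abs(delta)) / np.sqrt(np.mean(delta**2))

    return chi2, dof, f_var, stetson_k

# Compare a sinusoidal source with a constant one at the same noise level.
rng = np.random.default_rng(42)
times = np.sort(rng.uniform(0, 1000, 300))
errors = np.full(times.size, 0.05)
variable = 1.0 + 0.3 * np.sin(2 * np.pi * times / 100.0) + rng.normal(0, 0.05, times.size)
constant = 1.0 + rng.normal(0, 0.05, times.size)

for label, flux in [("variable", variable), ("constant", constant)]:
    chi2, dof, f_var, k = variability_statistics(flux, errors)
    print(f"{label}: chi2/dof = {chi2 / dof:.1f}, F_var = {f_var:.3f}, K = {k:.3f}")
```

For a non-variable source with well-estimated errors, the reduced χ² should be close to 1 and F_var close to 0. Stetson K tends towards about 0.80 for pure Gaussian noise and about 0.90 for a pure sinusoid, so the two synthetic cases should separate on all three statistics.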
3. Sampling Quality Assessment
Before fitting, it is important to know which periods the observations can constrain: the cadence sets the shortest detectable period (approximately the Nyquist period), and the baseline sets the longest.
[ ]:
# TODO: expand with interpretation of output
metrics = lc.compute_sampling_metrics()
print(metrics)
# Plain-language assessment
lc.assess_sampling_quality()
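As a rough cross-check on the metrics reported above, the same quantities can be estimated directly from the observation times. The sketch below is an assumption-laden illustration, not pgmuvi's method: the function name and dictionary keys are hypothetical, the "Nyquist" limit is taken as twice the median cadence (irregular sampling has no strict Nyquist limit), and the upper limit assumes that at least two full cycles must fit within the baseline.

```python
import numpy as np

def simple_sampling_metrics(times):
    """Rough sampling diagnostics for irregularly sampled observation times.

    Hypothetical helper for illustration; not part of pgmuvi.
    """
    t = np.sort(np.asarray(times, dtype=float))
    dt = np.diff(t)
    baseline = t[-1] - t[0]
    median_cadence = np.median(dt)
    return {
        "n_obs": t.size,
        "baseline": baseline,
        "median_cadence": median_cadence,
        "largest_gap": dt.max(),
        # Nyquist-style lower limit: two samples per cycle at the median cadence.
        "min_period": 2.0 * median_cadence,
        # Rule-of-thumb upper limit: two full cycles within the baseline.
        "max_period": baseline / 2.0,
    }

# Same sampling pattern as the synthetic light curve in the Setup section.
rng = np.random.default_rng(42)
times = np.sort(rng.uniform(0, 1000, 300))
for key, value in simple_sampling_metrics(times).items():
    print(f"{key}: {value:.2f}")
```

The injected period of 100 should fall comfortably between `min_period` and `max_period` for this sampling, which is why the signal is recoverable.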
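4. Subsampling Dense Datasets
GP inference scales as O(N³) in the number of observations N, so subsampling can dramatically reduce computation time while retaining the information content needed to detect variability. For example, reducing 300 observations to 100 cuts the cubic cost term by a factor of 27.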
The subsample_lightcurve function in pgmuvi.preprocess performs gap-preserving random subsampling.
[ ]:
from pgmuvi.preprocess import subsample_lightcurve
# subsample_lightcurve takes only the 1-D time array and returns indices
t = lc.xdata.cpu().numpy()
f = lc.ydata.cpu().numpy()
e = lc.yerr.cpu().numpy()
# TODO: expand with visualisation of before/after subsampling
idx = subsample_lightcurve(t, max_samples=100)
print(f"Subsampled from {len(t)} to {len(idx)} observations")
lc_sub = pgmuvi.lightcurve.Lightcurve(t[idx], f[idx], e[idx])
lc_sub.assess_sampling_quality()
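The idea behind gap-preserving random subsampling can be sketched as follows. This is one possible scheme, not the algorithm used by `subsample_lightcurve`: the function `gap_preserving_subsample` and its `gap_factor` and `seed` parameters are hypothetical. It always keeps the first and last points and the points bordering any unusually large gap, then fills the remaining quota at random.

```python
import numpy as np

def gap_preserving_subsample(times, max_samples, gap_factor=5.0, seed=None):
    """Return sorted indices of a random subsample that keeps gap edges.

    Hypothetical helper for illustration; not part of pgmuvi.
    """
    t = np.asarray(times, dtype=float)
    if t.size <= max_samples:
        return np.arange(t.size)

    order = np.argsort(t)
    dt = np.diff(t[order])

    # Protect the first and last points, plus the points on either side of
    # any gap longer than gap_factor times the median spacing.
    big = np.flatnonzero(dt > gap_factor * np.median(dt))
    protected = np.unique(np.concatenate(([0, t.size - 1], big, big + 1)))

    # Fill the remaining quota at random from the unprotected points.
    # If there are many gaps, the protected set alone may exceed max_samples.
    candidates = np.setdiff1d(np.arange(t.size), protected)
    n_random = max(max_samples - protected.size, 0)
    rng = np.random.default_rng(seed)
    chosen = rng.choice(candidates, size=min(n_random, candidates.size), replace=False)

    # Map positions in the time-sorted array back to indices into the input.
    return np.sort(order[np.concatenate((protected, chosen))])

# Two observing seasons separated by a gap of roughly 300 time units.
rng = np.random.default_rng(0)
times = np.sort(np.concatenate((rng.uniform(0, 350, 200), rng.uniform(650, 1000, 200))))
idx = gap_preserving_subsample(times, max_samples=100, seed=1)
print(f"Kept {idx.size} of {times.size} points")
print(f"Largest gap before: {np.diff(times).max():.1f}, after: {np.diff(times[idx]).max():.1f}")
```

Because the points bordering the seasonal gap are protected, the gap keeps its original width in the subsample, so window-function features tied to that gap are not artificially widened by the random selection.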
Next Steps
Once you have verified data quality, proceed to the pgmuvi_tutorial notebook for GP fitting.
For multiband data, see the pgmuvi_tutorial_2d notebook.
For more detail on the preprocessing API, see the pgmuvi.preprocess API reference in the documentation.