Data Preprocessing
This guide covers the preprocessing tools available in pgmuvi for assessing
data quality, filtering observations, and subsampling large datasets.
Overview
Raw observational datasets often contain:
observations from poorly sampled or non-variable sources,
extremely dense sampling that is computationally expensive for GP fitting,
bands with insufficient coverage to constrain variability.
pgmuvi provides tools to address each of these issues before fitting.
Checking Variability
The first question to ask is whether your source is actually variable. pgmuvi
provides three complementary test statistics:
| Statistic | Description |
|---|---|
| Weighted χ² | Tests against a constant-flux (null) model. |
| Fvar | Fractional excess variance; measures variability amplitude relative to noise. |
| Stetson K | A robust, outlier-resistant index of correlated variability. |
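To make the three statistics concrete, here is a minimal sketch using common textbook definitions in plain Python. The function names are hypothetical and this is not necessarily how pgmuvi computes them; in particular, normalisation conventions for Fvar and Stetson K vary between implementations.

```python
import math


def weighted_chi2(flux, err):
    """Chi-square of the data against a constant equal to the weighted mean."""
    weights = [1.0 / e**2 for e in err]
    mean = sum(w * f for w, f in zip(weights, flux)) / sum(weights)
    chi2 = sum(((f - mean) / e) ** 2 for f, e in zip(flux, err))
    return chi2, len(flux) - 1  # statistic and degrees of freedom


def fractional_variability(flux, err):
    """Fvar: square root of the noise-subtracted variance over the mean flux."""
    n = len(flux)
    mean = sum(flux) / n
    sample_var = sum((f - mean) ** 2 for f in flux) / (n - 1)
    mean_sq_err = sum(e**2 for e in err) / n
    excess = sample_var - mean_sq_err
    # A non-positive excess means the scatter is consistent with noise alone.
    return math.sqrt(excess) / abs(mean) if excess > 0 else 0.0


def stetson_k(flux, err):
    """Stetson K: mean absolute residual over RMS residual (assumes some scatter)."""
    n = len(flux)
    weights = [1.0 / e**2 for e in err]
    mean = sum(w * f for w, f in zip(weights, flux)) / sum(weights)
    scale = math.sqrt(n / (n - 1))
    delta = [scale * (f - mean) / e for f, e in zip(flux, err)]
    mean_abs = sum(abs(d) for d in delta) / n
    rms = math.sqrt(sum(d * d for d in delta) / n)
    return mean_abs / rms
```

A χ² much larger than its degrees of freedom, or an Fvar well above zero, indicates scatter beyond what the quoted uncertainties can explain.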
For single-band data:
result = lc.check_variability()
print(result)
For multiband data:
results = lc.check_variability_per_band()
for band, r in results.items():
    print(band, r)
To retain only bands that pass a variability criterion:
lc = lc.filter_variable_bands(fvar_min=0.1)
Sampling Quality Metrics
Even if a source is variable, the observations may not resolve the variability
timescales of interest. pgmuvi computes several metrics to assess this:
| Key | Description |
|---|---|
| | Number of finite observations. |
| | Total time span. |
| | Largest gap between consecutive observations. |
| | Median time between consecutive observations. |
| | Mean time between consecutive observations. |
| | Standard deviation of the cadence distribution. |
| | Fraction of the baseline with observations. |
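As a rough illustration of how cadence metrics of this kind are defined, the sketch below computes several of them in plain Python. The function name and dictionary keys are hypothetical, chosen for this example only; they are not necessarily the keys returned by compute_sampling_metrics(), and pgmuvi may use different conventions (for example, sample rather than population standard deviation).

```python
import math
import statistics


def sampling_metrics_sketch(times):
    """Basic cadence statistics for a 1-D sequence of observation times."""
    # Drop NaN/inf entries and sort so consecutive differences are cadences.
    t = sorted(x for x in times if math.isfinite(x))
    if len(t) < 2:
        raise ValueError("need at least two finite observation times")
    gaps = [b - a for a, b in zip(t, t[1:])]
    return {
        # Hypothetical key names, for illustration only.
        "n_points": len(t),
        "baseline": t[-1] - t[0],
        "max_gap": max(gaps),
        "median_cadence": statistics.median(gaps),
        "mean_cadence": statistics.fmean(gaps),
        "std_cadence": statistics.pstdev(gaps),  # population convention
    }
```

A maximum gap that is a large fraction of the baseline, or a median cadence much shorter than the mean, usually signals clustered observations separated by long seasonal gaps.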
Retrieve numeric metrics:
metrics = lc.compute_sampling_metrics()
print(metrics)
Or get a plain-language assessment with recommendations:
lc.assess_sampling_quality()
For multiband data:
lc.assess_sampling_quality_per_band()
To retain only well-sampled bands:
lc = lc.filter_well_sampled_bands(min_points=20)
Subsampling Dense Datasets
Exact GP inference scales as \(\mathcal{O}(N^3)\) in the number of observations \(N\). For very densely sampled light curves, subsampling can make fitting computationally feasible without significantly affecting the results, provided the subsampled dataset still satisfies the Nyquist criterion for the periods of interest: the typical sampling interval must stay below half the shortest period you want to recover.
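The trade-off can be estimated before subsampling. The sketch below is a back-of-the-envelope check, not part of pgmuvi: both function names are hypothetical, and the median cadence is used as a proxy for the sampling interval, which is only an approximation for irregularly sampled data.

```python
import statistics


def satisfies_nyquist(times, min_period, samples_per_period=2.0):
    """True if the median cadence gives at least `samples_per_period`
    observations per cycle of the shortest period of interest."""
    t = sorted(times)
    gaps = [b - a for a, b in zip(t, t[1:])]
    return statistics.median(gaps) <= min_period / samples_per_period


def cubic_cost_ratio(n_full, n_sub):
    """Approximate speed-up of an O(N^3) fit when going from n_full to n_sub points."""
    return (n_full / n_sub) ** 3


# Daily sampling resolves a 5-day period, but keeping every 4th point does not.
daily = [float(i) for i in range(100)]
print(satisfies_nyquist(daily, min_period=5.0))
print(satisfies_nyquist(daily[::4], min_period=5.0))
print(cubic_cost_ratio(5000, 500))
```

Under the cubic scaling, reducing 5000 observations to 500 cuts the dominant cost by roughly a factor of a thousand, which is why even moderate subsampling pays off.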
pgmuvi provides a gap-preserving random subsampling function in the preprocessing
subpackage:
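import pgmuvi.lightcurve
from pgmuvi.preprocess import subsample_lightcurve

# Move the data to the CPU and convert to NumPy arrays.
times = lc.xdata.cpu().numpy()
fluxes = lc.ydata.cpu().numpy()
errors = lc.yerr.cpu().numpy()

# idx indexes into the original arrays.
idx = subsample_lightcurve(times, max_samples=500)

# Build a new light curve from the selected observations.
lc_sub = pgmuvi.lightcurve.Lightcurve(
    times[idx], fluxes[idx], errors[idx]
)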
subsample_lightcurve operates on the 1-D time array alone and returns an index array,
which you then apply to the times, fluxes, and errors. It preserves the overall time
coverage (gaps are retained in proportion) so that long-timescale variability remains
detectable after subsampling.
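To give a feel for how a random subsample can keep the overall time coverage, here is a sketch of one possible scheme: stratified random sampling over equal-width time bins. This is not pgmuvi's implementation of subsample_lightcurve; the function name, the n_bins and seed parameters, and the binning strategy are all assumptions made for illustration.

```python
import random


def stratified_subsample(times, max_samples, n_bins=50, seed=0):
    """Return sorted indices of a random subsample spread across the time range."""
    n = len(times)
    if n <= max_samples:
        return list(range(n))
    rng = random.Random(seed)
    t_min, t_max = min(times), max(times)
    width = (t_max - t_min) / n_bins or 1.0  # guard against identical times
    # Group indices by time bin; empty bins (gaps) contribute nothing.
    bins = {}
    for i, t in enumerate(times):
        b = min(int((t - t_min) / width), n_bins - 1)
        bins.setdefault(b, []).append(i)
    keep_fraction = max_samples / n
    chosen = []
    for members in bins.values():
        # Keep the same fraction in every occupied bin, but at least one point.
        k = max(1, round(keep_fraction * len(members)))
        chosen.extend(rng.sample(members, min(k, len(members))))
    # Rounding can overshoot the budget; trim the excess at random.
    if len(chosen) > max_samples:
        chosen = rng.sample(chosen, max_samples)
    return sorted(chosen)
```

Because every occupied bin is thinned by roughly the same fraction, observing seasons on either side of a long gap stay represented, which is the property the text above describes.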
See also
pgmuvi.preprocess — full API reference for the preprocessing subpackage.
Quality Filtering
Note
This section will describe how to apply quality flags or sigma-clipping to remove
outliers before fitting. The relevant utilities are located in
pgmuvi.preprocess.quality.
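Until that description is available, the idea behind sigma-clipping can be sketched generically. The function below is a hypothetical stand-alone example, not the pgmuvi.preprocess.quality API: it iteratively flags points lying more than n_sigma standard deviations from the median of the points still kept.

```python
import statistics


def sigma_clip_mask(values, n_sigma=3.0, max_iter=5):
    """Return a list of booleans: True for points kept, False for clipped outliers."""
    keep = [True] * len(values)
    for _ in range(max_iter):
        kept = [v for v, k in zip(values, keep) if k]
        if len(kept) < 2:
            break
        center = statistics.median(kept)
        spread = statistics.stdev(kept)
        if spread == 0:
            break
        # Once a point is clipped it stays clipped in later iterations.
        new_keep = [
            k and abs(v - center) <= n_sigma * spread
            for v, k in zip(values, keep)
        ]
        if new_keep == keep:
            break  # converged: no further points were rejected
        keep = new_keep
    return keep
```

For a variable source, clipping against a single global median can remove genuine signal, so a scheme like this is safer when applied to residuals or with a generous threshold.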
Preprocessing Tutorial
A dedicated notebook tutorial covering the full preprocessing workflow (loading data, checking variability, assessing sampling quality, and subsampling) is provided in the User Guide.