pgmuvi.preprocess¶
pgmuvi.preprocess: preprocessing utilities for lightcurves.
- pgmuvi.preprocess.compute_fvar(y, yerr) float¶
Compute normalized excess variance (F_var).
F_var = np.sqrt(max(s**2 - mean_err**2, 0)) / |ybar|
- where:
s**2 = sample variance of y mean_err**2 = mean(sig_i**2)
- Parameters:
y (array-like or torch.Tensor) – Flux values (1-D).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).
- Returns:
fvar – Scale-free amplitude measure. 0 = no excess variance beyond errors >0 = intrinsic variability detected
- Return type:
float
Notes
This effect size prevents keeping bands that are statistically significant but astrophysically tiny (common with large N).
- Raises:
ValueError – If inputs are not 1-D, have mismatched shapes, have fewer than 2 points, contain non-finite values, or have non-positive yerr.
- pgmuvi.preprocess.compute_stetson_k(y, yerr) float¶
Compute Stetson K index for shape/coherence diagnostics.
K = (1/N) * sum(|delta_i|) / np.sqrt((1/N) * sum(delta_i**2))
where delta_i = sqrt(N/(N-1)) * (y_i - ybar) / sig_i and N is the number of valid points used in the statistic.
- Parameters:
y (array-like or torch.Tensor) – Flux values (1-D).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).
- Returns:
K – Stetson K index. ~0.798 for pure Gaussian noise Larger values indicate more coherent/peaked residual structure. Returns
np.nanwhen the statistic is not well-defined.- Return type:
float
Notes
This implementation keeps the classic Stetson K form while handling pathological inputs robustly.
If finite, positive uncertainties are available,
ybaris a weighted mean using 1/sigma^2.If weighted averaging is not possible,
ybarfalls back to the unweighted mean of finite flux points.delta_iuses the finite-sample factor sqrt(N/(N-1)) for N > 1.For invalid/pathological cases (N < 2, invalid denominators, etc.),
np.nanis returned instead of raising.
References
Stetson, P. B. 1996, PASP, 108, 851
The public API and return type are preserved for backward compatibility.
- pgmuvi.preprocess.is_variable(y, yerr, alpha: float = 0.01, fvar_min: float = 0.05, stetson_k_min: float = 0.95, min_points: int = 6, verbose: bool = False) tuple[bool, dict]¶
Comprehensive variability assessment using required and diagnostic tests.
- Parameters:
y (array-like or torch.Tensor) – Flux measurements (1-D).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).
alpha (float, default=0.01) – Significance level for chi-square test
fvar_min (float, default=0.05) – Minimum fractional excess variance (5%)
stetson_k_min (float, default=0.95) – Diagnostic Stetson-K reference threshold used only to set
tests_passed['stetson_test']and reporting notes. It is not required for the overall VARIABLE/NOT VARIABLE decision.min_points (int, default=6) – Minimum number of data points required
verbose (bool, default=False) – Print diagnostic information
- Returns:
is_var (bool) – True if lightcurve passes the required variability gates (min_points, chi2, and F_var).
diagnostics (dict) –
- {
‘n_points’: int, ‘chi2’: float, ‘dof’: int, ‘p_value’: float, ‘fvar’: float, ‘stetson_k’: float, ‘decision’: str, # ‘VARIABLE’ or reason for rejection ‘tests_passed’: {
’chi2_test’: bool, ‘fvar_test’: bool, ‘stetson_test’: bool, ‘min_points’: bool
}
}
Notes
Decision logic (required gates): 1. N >= min_points 2. p_value < alpha (statistically significant) 3. F_var >= fvar_min (astrophysically significant)
Stetson K is retained as a shape/coherence diagnostic and reported in diagnostics, but does not veto a lightcurve that already passes the required gates.
Both numpy arrays and torch tensors (including CUDA tensors) are accepted.
Examples
>>> from pgmuvi.preprocess.variability import is_variable >>> is_var, diag = is_variable(y, yerr, verbose=True) >>> if is_var: ... print("Proceed with GP fitting") ... else: ... print(f"Skipping: {diag['decision']}")
- pgmuvi.preprocess.subsample_lightcurve(t, max_samples=500, max_gap_fraction=0.3, random_seed=None)¶
Return indices selecting at most max_samples points from t while preserving the temporal baseline and the max_gap_fraction constraint.
The algorithm always keeps the earliest and latest time points so that the full temporal baseline is preserved. It then fills the remaining budget with randomly chosen interior points. If the resulting selection contains any gap larger than
max_gap_fraction * baseline, the original data point closest to the midpoint of each offending gap is added iteratively. To keep the total count at or below max_samples, each added point is paired with the removal of the interior point whose removal creates the smallest new gap that still satisfies the constraint. If no such drop candidate exists, that gap is left unrepaired and repair continues for other gaps. A hard iteration cap prevents infinite loops.- Parameters:
t (array_like) – Observation times (1-D).
max_samples (int, default 500) – Maximum number of points to include in the subsample. Must be an integer >= 2.
max_gap_fraction (float, default 0.3) – Maximum allowed gap between consecutive selected times as a fraction of the total temporal baseline.
random_seed (int or None, optional) – Seed for the random number generator (reproducibility).
- Returns:
indices – Indices into t that form the subsample, ordered by ascending time. If
len(t) <= max_samplesthe full rangenp.arange(len(t))is returned unchanged. The returned array always has length <= max_samples.- Return type:
np.ndarray of int
Notes
When filling a gap would exceed the budget, the densest interior point (i.e. the one whose removal creates the smallest new gap while still satisfying the gap constraint) is swapped out. If no such drop candidate exists, the gap is left unrepaired and repair continues for other gaps. A hard iteration cap (
2 * max_samples + 1) prevents infinite loops in pathological cases.- Raises:
ValueError – If max_samples is not an integer >= 2.
Examples
>>> import numpy as np >>> rng = np.random.default_rng(0) >>> t = rng.uniform(0, 100, 5000) >>> idx = subsample_lightcurve(t, max_samples=500, random_seed=42) >>> len(idx) <= 500 True
- pgmuvi.preprocess.weighted_chi2_test(y, yerr) tuple[float, int, float, float]¶
Weighted chi-square test against a constant mean model.
Null hypothesis H0: y_i = mu + eps_i, where eps_i ~ N(0, sig_i^2)
- Parameters:
y (array-like or torch.Tensor) – Flux values (1-D, finite).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, finite, positive).
- Returns:
chi2 (float) – Test statistic: sum(w_i * (y_i - ybar_w)**2)
dof (int) – Degrees of freedom (N - 1)
ybar_w (float) – Weighted mean: sum(y_i/sig_i**2) / sum(1/sig_i**2)
p_value (float) – Right-tail p-value: P(chi2 >= chi2_stat | dof)
Notes
Uses scipy.special.gammaincc for chi-square tail probability. This is the uniformly most powerful test for non-constant mean under Gaussian heteroscedastic errors with known variances.
- Raises:
ValueError – If inputs are not 1-D, have mismatched shapes, contain non-finite values, contain non-positive yerr, or have fewer than 2 points.
Submodules¶
Sampling quality assessment utilities for lightcurve data.
Provides metrics and validation gates to detect poorly sampled lightcurves
before GP fitting, preventing bad fits due to sparse coverage, large gaps,
or inadequate baseline. Also provides subsample_lightcurve() to
randomly draw a size-limited subset of observations while preserving the
temporal baseline, sampling structure and gap-coverage constraints.
- pgmuvi.preprocess.quality.assess_sampling_quality(t: ndarray, y: ndarray = None, yerr: ndarray = None, min_points: int = 15, max_gap_fraction: float = 0.3, min_baseline_factor: float = 3.0, min_snr: float = 3.0, min_fraction_good_snr: float = 0.5, verbose: bool = False) tuple¶
Assess whether lightcurve sampling is adequate for GP fitting.
Applies multiple quality gates to prevent fitting poorly sampled data.
- Parameters:
t (np.ndarray) – Observation times
y (np.ndarray, optional) – Flux values (used for SNR assessment if provided)
yerr (np.ndarray, optional) – Uncertainties (used for SNR assessment if provided)
min_points (int, default=6) – Minimum number of observations required
max_gap_fraction (float, default=0.3) – Maximum allowed gap as fraction of baseline (e.g., 0.3 = 30%) Large gaps cause extrapolation errors
min_baseline_factor (float, default=3.0) – Minimum baseline / median_cadence ratio Ensures sufficient temporal coverage (e.g., 3 = at least 3 typical cadences)
min_snr (float, default=3.0) – Minimum median SNR required (if y, yerr provided)
min_fraction_good_snr (float, default=0.5) – Minimum fraction of points with SNR > min_snr (if y, yerr provided)
verbose (bool, default=False) – Print detailed assessment report
- Returns:
passes (bool) – True if all quality gates pass
diagnostics (dict) –
- {
‘metrics’: dict (from compute_sampling_metrics), ‘gates’: {
’min_points’: bool, ‘max_gap’: bool, ‘min_baseline’: bool, ‘min_snr’: bool (if applicable)
}, ‘warnings’: list[str], ‘recommendation’: str # ‘PROCEED’ or ‘DO NOT FIT’
}
Examples
>>> passes, diag = assess_sampling_quality(t, y, yerr, verbose=True) >>> if passes: ... print("Safe to fit GP") >>> else: ... print(f"Issues: {diag['warnings']}")
- pgmuvi.preprocess.quality.compute_sampling_metrics(t: ndarray, y: ndarray = None, yerr: ndarray = None) dict¶
Compute comprehensive temporal sampling quality metrics.
- Parameters:
t (np.ndarray) – Observation times (any units)
y (np.ndarray, optional) – Flux values (used for additional diagnostics if provided)
yerr (np.ndarray, optional) – Uncertainties (used for SNR metrics if provided)
- Returns:
- {
‘n_points’: int, ‘baseline’: float,
Total time span (max(t) - min(t))
- ’max_gap’: float,
Largest gap between consecutive observations
- ’max_gap_fraction’: float,
max_gap / baseline
- ’median_cadence’: float,
Median time between observations
- ’mean_cadence’: float,
Mean time between observations
- ’cadence_std’: float,
Standard deviation of cadences
- ’nyquist_period’: float,
2 * effective_cadence (shortest reliably detectable period), where effective_cadence is the median cadence, or the mean cadence if the median is zero (e.g. many duplicate timestamps)
- ’nyquist_frequency’: float,
1 / (2 * effective_cadence), np.inf if effective_cadence == 0
- ’longest_detectable_period’: float,
baseline / 2 (heuristic upper limit)
- ’duty_cycle’: float,
Fraction of baseline with observations (simple estimate)
- ’sampling_uniformity’: float,
1 - (std(cadences) / mean(cadences)), range [0,1] 1 = perfectly uniform, 0 = highly irregular
}
If y and yerr provided, also includes: {
’median_snr’: float, ‘mean_snr’: float, ‘fraction_snr_gt_3’: float, ‘fraction_snr_gt_5’: float
}
- Return type:
dict
Examples
>>> metrics = compute_sampling_metrics(t, y, yerr) >>> print(f"Nyquist period: {metrics['nyquist_period']:.2f} days") >>> print(f"Detectable range: {metrics['nyquist_period']:.1f}" ... f" - {metrics['longest_detectable_period']:.1f} days")
- pgmuvi.preprocess.quality.robust_scale(y: ndarray, c: float = 0.6745) float¶
Compute robust scale estimate using median absolute deviation (MAD).
scale = MAD(y) / c
where c = 0.6745 for equivalence to std dev under Gaussian.
- Parameters:
y (np.ndarray) – Data values (should be finite)
c (float, default=0.6745) – Normalization constant (0.6745 ≈ Φ^(-1)(0.75) for Gaussian)
- Returns:
Robust scale estimate, or 0.0 if all values identical
- Return type:
float
Notes
More robust to outliers than standard deviation. Used to normalize measurements for scale-independent comparisons.
- pgmuvi.preprocess.quality.subsample_lightcurve(t, max_samples=500, max_gap_fraction=0.3, random_seed=None)¶
Return indices selecting at most max_samples points from t while preserving the temporal baseline and the max_gap_fraction constraint.
The algorithm always keeps the earliest and latest time points so that the full temporal baseline is preserved. It then fills the remaining budget with randomly chosen interior points. If the resulting selection contains any gap larger than
max_gap_fraction * baseline, the original data point closest to the midpoint of each offending gap is added iteratively. To keep the total count at or below max_samples, each added point is paired with the removal of the interior point whose removal creates the smallest new gap that still satisfies the constraint. If no such drop candidate exists, that gap is left unrepaired and repair continues for other gaps. A hard iteration cap prevents infinite loops.- Parameters:
t (array_like) – Observation times (1-D).
max_samples (int, default 500) – Maximum number of points to include in the subsample. Must be an integer >= 2.
max_gap_fraction (float, default 0.3) – Maximum allowed gap between consecutive selected times as a fraction of the total temporal baseline.
random_seed (int or None, optional) – Seed for the random number generator (reproducibility).
- Returns:
indices – Indices into t that form the subsample, ordered by ascending time. If
len(t) <= max_samplesthe full rangenp.arange(len(t))is returned unchanged. The returned array always has length <= max_samples.- Return type:
np.ndarray of int
Notes
When filling a gap would exceed the budget, the densest interior point (i.e. the one whose removal creates the smallest new gap while still satisfying the gap constraint) is swapped out. If no such drop candidate exists, the gap is left unrepaired and repair continues for other gaps. A hard iteration cap (
2 * max_samples + 1) prevents infinite loops in pathological cases.- Raises:
ValueError – If max_samples is not an integer >= 2.
Examples
>>> import numpy as np >>> rng = np.random.default_rng(0) >>> t = rng.uniform(0, 100, 5000) >>> idx = subsample_lightcurve(t, max_samples=500, random_seed=42) >>> len(idx) <= 500 True
Variability detection utilities for lightcurves.
Implements a three-tier statistical testing framework to determine if a lightcurve shows significant variability before attempting GP fitting.
References
Stetson, P. B. 1996, PASP, 108, 851 (Stetson K index)
- pgmuvi.preprocess.variability.compute_fvar(y, yerr) float¶
Compute normalized excess variance (F_var).
F_var = np.sqrt(max(s**2 - mean_err**2, 0)) / |ybar|
- where:
s**2 = sample variance of y mean_err**2 = mean(sig_i**2)
- Parameters:
y (array-like or torch.Tensor) – Flux values (1-D).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).
- Returns:
fvar – Scale-free amplitude measure. 0 = no excess variance beyond errors >0 = intrinsic variability detected
- Return type:
float
Notes
This effect size prevents keeping bands that are statistically significant but astrophysically tiny (common with large N).
- Raises:
ValueError – If inputs are not 1-D, have mismatched shapes, have fewer than 2 points, contain non-finite values, or have non-positive yerr.
- pgmuvi.preprocess.variability.compute_stetson_k(y, yerr) float¶
Compute Stetson K index for shape/coherence diagnostics.
K = (1/N) * sum(|delta_i|) / np.sqrt((1/N) * sum(delta_i**2))
where delta_i = sqrt(N/(N-1)) * (y_i - ybar) / sig_i and N is the number of valid points used in the statistic.
- Parameters:
y (array-like or torch.Tensor) – Flux values (1-D).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).
- Returns:
K – Stetson K index. ~0.798 for pure Gaussian noise Larger values indicate more coherent/peaked residual structure. Returns
np.nanwhen the statistic is not well-defined.- Return type:
float
Notes
This implementation keeps the classic Stetson K form while handling pathological inputs robustly.
If finite, positive uncertainties are available,
ybaris a weighted mean using 1/sigma^2.If weighted averaging is not possible,
ybarfalls back to the unweighted mean of finite flux points.delta_iuses the finite-sample factor sqrt(N/(N-1)) for N > 1.For invalid/pathological cases (N < 2, invalid denominators, etc.),
np.nanis returned instead of raising.
References
Stetson, P. B. 1996, PASP, 108, 851
The public API and return type are preserved for backward compatibility.
- pgmuvi.preprocess.variability.is_variable(y, yerr, alpha: float = 0.01, fvar_min: float = 0.05, stetson_k_min: float = 0.95, min_points: int = 6, verbose: bool = False) tuple[bool, dict]¶
Comprehensive variability assessment using required and diagnostic tests.
- Parameters:
y (array-like or torch.Tensor) – Flux measurements (1-D).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).
alpha (float, default=0.01) – Significance level for chi-square test
fvar_min (float, default=0.05) – Minimum fractional excess variance (5%)
stetson_k_min (float, default=0.95) – Diagnostic Stetson-K reference threshold used only to set
tests_passed['stetson_test']and reporting notes. It is not required for the overall VARIABLE/NOT VARIABLE decision.min_points (int, default=6) – Minimum number of data points required
verbose (bool, default=False) – Print diagnostic information
- Returns:
is_var (bool) – True if lightcurve passes the required variability gates (min_points, chi2, and F_var).
diagnostics (dict) –
- {
‘n_points’: int, ‘chi2’: float, ‘dof’: int, ‘p_value’: float, ‘fvar’: float, ‘stetson_k’: float, ‘decision’: str, # ‘VARIABLE’ or reason for rejection ‘tests_passed’: {
’chi2_test’: bool, ‘fvar_test’: bool, ‘stetson_test’: bool, ‘min_points’: bool
}
}
Notes
Decision logic (required gates): 1. N >= min_points 2. p_value < alpha (statistically significant) 3. F_var >= fvar_min (astrophysically significant)
Stetson K is retained as a shape/coherence diagnostic and reported in diagnostics, but does not veto a lightcurve that already passes the required gates.
Both numpy arrays and torch tensors (including CUDA tensors) are accepted.
Examples
>>> from pgmuvi.preprocess.variability import is_variable >>> is_var, diag = is_variable(y, yerr, verbose=True) >>> if is_var: ... print("Proceed with GP fitting") ... else: ... print(f"Skipping: {diag['decision']}")
- pgmuvi.preprocess.variability.weighted_chi2_test(y, yerr) tuple[float, int, float, float]¶
Weighted chi-square test against a constant mean model.
Null hypothesis H0: y_i = mu + eps_i, where eps_i ~ N(0, sig_i^2)
- Parameters:
y (array-like or torch.Tensor) – Flux values (1-D, finite).
yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, finite, positive).
- Returns:
chi2 (float) – Test statistic: sum(w_i * (y_i - ybar_w)**2)
dof (int) – Degrees of freedom (N - 1)
ybar_w (float) – Weighted mean: sum(y_i/sig_i**2) / sum(1/sig_i**2)
p_value (float) – Right-tail p-value: P(chi2 >= chi2_stat | dof)
Notes
Uses scipy.special.gammaincc for chi-square tail probability. This is the uniformly most powerful test for non-constant mean under Gaussian heteroscedastic errors with known variances.
- Raises:
ValueError – If inputs are not 1-D, have mismatched shapes, contain non-finite values, contain non-positive yerr, or have fewer than 2 points.