pgmuvi.preprocess

pgmuvi.preprocess: preprocessing utilities for lightcurves.

pgmuvi.preprocess.compute_fvar(y, yerr) float

Compute normalized excess variance (F_var).

F_var = np.sqrt(max(s**2 - mean_err**2, 0)) / |ybar|

where:

s**2 = sample variance of y mean_err**2 = mean(sig_i**2)

Parameters:
  • y (array-like or torch.Tensor) – Flux values (1-D).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).

Returns:

fvar – Scale-free amplitude measure. 0 = no excess variance beyond errors >0 = intrinsic variability detected

Return type:

float

Notes

This effect size prevents keeping bands that are statistically significant but astrophysically tiny (common with large N).

Raises:

ValueError – If inputs are not 1-D, have mismatched shapes, have fewer than 2 points, contain non-finite values, or have non-positive yerr.

pgmuvi.preprocess.compute_stetson_k(y, yerr) float

Compute Stetson K index for shape/coherence diagnostics.

K = (1/N) * sum(|delta_i|) / np.sqrt((1/N) * sum(delta_i**2))

where delta_i = sqrt(N/(N-1)) * (y_i - ybar) / sig_i and N is the number of valid points used in the statistic.

Parameters:
  • y (array-like or torch.Tensor) – Flux values (1-D).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).

Returns:

K – Stetson K index. ~0.798 for pure Gaussian noise Larger values indicate more coherent/peaked residual structure. Returns np.nan when the statistic is not well-defined.

Return type:

float

Notes

This implementation keeps the classic Stetson K form while handling pathological inputs robustly.

  • If finite, positive uncertainties are available, ybar is a weighted mean using 1/sigma^2.

  • If weighted averaging is not possible, ybar falls back to the unweighted mean of finite flux points.

  • delta_i uses the finite-sample factor sqrt(N/(N-1)) for N > 1.

  • For invalid/pathological cases (N < 2, invalid denominators, etc.), np.nan is returned instead of raising.

References

Stetson, P. B. 1996, PASP, 108, 851

The public API and return type are preserved for backward compatibility.

pgmuvi.preprocess.is_variable(y, yerr, alpha: float = 0.01, fvar_min: float = 0.05, stetson_k_min: float = 0.95, min_points: int = 6, verbose: bool = False) tuple[bool, dict]

Comprehensive variability assessment using required and diagnostic tests.

Parameters:
  • y (array-like or torch.Tensor) – Flux measurements (1-D).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).

  • alpha (float, default=0.01) – Significance level for chi-square test

  • fvar_min (float, default=0.05) – Minimum fractional excess variance (5%)

  • stetson_k_min (float, default=0.95) – Diagnostic Stetson-K reference threshold used only to set tests_passed['stetson_test'] and reporting notes. It is not required for the overall VARIABLE/NOT VARIABLE decision.

  • min_points (int, default=6) – Minimum number of data points required

  • verbose (bool, default=False) – Print diagnostic information

Returns:

  • is_var (bool) – True if lightcurve passes the required variability gates (min_points, chi2, and F_var).

  • diagnostics (dict) –

    {

    ‘n_points’: int, ‘chi2’: float, ‘dof’: int, ‘p_value’: float, ‘fvar’: float, ‘stetson_k’: float, ‘decision’: str, # ‘VARIABLE’ or reason for rejection ‘tests_passed’: {

    ’chi2_test’: bool, ‘fvar_test’: bool, ‘stetson_test’: bool, ‘min_points’: bool

    }

    }

Notes

Decision logic (required gates): 1. N >= min_points 2. p_value < alpha (statistically significant) 3. F_var >= fvar_min (astrophysically significant)

Stetson K is retained as a shape/coherence diagnostic and reported in diagnostics, but does not veto a lightcurve that already passes the required gates.

Both numpy arrays and torch tensors (including CUDA tensors) are accepted.

Examples

>>> from pgmuvi.preprocess.variability import is_variable
>>> is_var, diag = is_variable(y, yerr, verbose=True)
>>> if is_var:
...     print("Proceed with GP fitting")
... else:
...     print(f"Skipping: {diag['decision']}")
pgmuvi.preprocess.subsample_lightcurve(t, max_samples=500, max_gap_fraction=0.3, random_seed=None)

Return indices selecting at most max_samples points from t while preserving the temporal baseline and the max_gap_fraction constraint.

The algorithm always keeps the earliest and latest time points so that the full temporal baseline is preserved. It then fills the remaining budget with randomly chosen interior points. If the resulting selection contains any gap larger than max_gap_fraction * baseline, the original data point closest to the midpoint of each offending gap is added iteratively. To keep the total count at or below max_samples, each added point is paired with the removal of the interior point whose removal creates the smallest new gap that still satisfies the constraint. If no such drop candidate exists, that gap is left unrepaired and repair continues for other gaps. A hard iteration cap prevents infinite loops.

Parameters:
  • t (array_like) – Observation times (1-D).

  • max_samples (int, default 500) – Maximum number of points to include in the subsample. Must be an integer >= 2.

  • max_gap_fraction (float, default 0.3) – Maximum allowed gap between consecutive selected times as a fraction of the total temporal baseline.

  • random_seed (int or None, optional) – Seed for the random number generator (reproducibility).

Returns:

indices – Indices into t that form the subsample, ordered by ascending time. If len(t) <= max_samples the full range np.arange(len(t)) is returned unchanged. The returned array always has length <= max_samples.

Return type:

np.ndarray of int

Notes

When filling a gap would exceed the budget, the densest interior point (i.e. the one whose removal creates the smallest new gap while still satisfying the gap constraint) is swapped out. If no such drop candidate exists, the gap is left unrepaired and repair continues for other gaps. A hard iteration cap (2 * max_samples + 1) prevents infinite loops in pathological cases.

Raises:

ValueError – If max_samples is not an integer >= 2.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> t = rng.uniform(0, 100, 5000)
>>> idx = subsample_lightcurve(t, max_samples=500, random_seed=42)
>>> len(idx) <= 500
True
pgmuvi.preprocess.weighted_chi2_test(y, yerr) tuple[float, int, float, float]

Weighted chi-square test against a constant mean model.

Null hypothesis H0: y_i = mu + eps_i, where eps_i ~ N(0, sig_i^2)

Parameters:
  • y (array-like or torch.Tensor) – Flux values (1-D, finite).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, finite, positive).

Returns:

  • chi2 (float) – Test statistic: sum(w_i * (y_i - ybar_w)**2)

  • dof (int) – Degrees of freedom (N - 1)

  • ybar_w (float) – Weighted mean: sum(y_i/sig_i**2) / sum(1/sig_i**2)

  • p_value (float) – Right-tail p-value: P(chi2 >= chi2_stat | dof)

Notes

Uses scipy.special.gammaincc for chi-square tail probability. This is the uniformly most powerful test for non-constant mean under Gaussian heteroscedastic errors with known variances.

Raises:

ValueError – If inputs are not 1-D, have mismatched shapes, contain non-finite values, contain non-positive yerr, or have fewer than 2 points.

Submodules

Sampling quality assessment utilities for lightcurve data.

Provides metrics and validation gates to detect poorly sampled lightcurves before GP fitting, preventing bad fits due to sparse coverage, large gaps, or inadequate baseline. Also provides subsample_lightcurve() to randomly draw a size-limited subset of observations while preserving the temporal baseline, sampling structure and gap-coverage constraints.

pgmuvi.preprocess.quality.assess_sampling_quality(t: ndarray, y: ndarray = None, yerr: ndarray = None, min_points: int = 15, max_gap_fraction: float = 0.3, min_baseline_factor: float = 3.0, min_snr: float = 3.0, min_fraction_good_snr: float = 0.5, verbose: bool = False) tuple

Assess whether lightcurve sampling is adequate for GP fitting.

Applies multiple quality gates to prevent fitting poorly sampled data.

Parameters:
  • t (np.ndarray) – Observation times

  • y (np.ndarray, optional) – Flux values (used for SNR assessment if provided)

  • yerr (np.ndarray, optional) – Uncertainties (used for SNR assessment if provided)

  • min_points (int, default=6) – Minimum number of observations required

  • max_gap_fraction (float, default=0.3) – Maximum allowed gap as fraction of baseline (e.g., 0.3 = 30%) Large gaps cause extrapolation errors

  • min_baseline_factor (float, default=3.0) – Minimum baseline / median_cadence ratio Ensures sufficient temporal coverage (e.g., 3 = at least 3 typical cadences)

  • min_snr (float, default=3.0) – Minimum median SNR required (if y, yerr provided)

  • min_fraction_good_snr (float, default=0.5) – Minimum fraction of points with SNR > min_snr (if y, yerr provided)

  • verbose (bool, default=False) – Print detailed assessment report

Returns:

  • passes (bool) – True if all quality gates pass

  • diagnostics (dict) –

    {

    ‘metrics’: dict (from compute_sampling_metrics), ‘gates’: {

    ’min_points’: bool, ‘max_gap’: bool, ‘min_baseline’: bool, ‘min_snr’: bool (if applicable)

    }, ‘warnings’: list[str], ‘recommendation’: str # ‘PROCEED’ or ‘DO NOT FIT’

    }

Examples

>>> passes, diag = assess_sampling_quality(t, y, yerr, verbose=True)
>>> if passes:
...     print("Safe to fit GP")
>>> else:
...     print(f"Issues: {diag['warnings']}")
pgmuvi.preprocess.quality.compute_sampling_metrics(t: ndarray, y: ndarray = None, yerr: ndarray = None) dict

Compute comprehensive temporal sampling quality metrics.

Parameters:
  • t (np.ndarray) – Observation times (any units)

  • y (np.ndarray, optional) – Flux values (used for additional diagnostics if provided)

  • yerr (np.ndarray, optional) – Uncertainties (used for SNR metrics if provided)

Returns:

{

‘n_points’: int, ‘baseline’: float,

Total time span (max(t) - min(t))

’max_gap’: float,

Largest gap between consecutive observations

’max_gap_fraction’: float,

max_gap / baseline

’median_cadence’: float,

Median time between observations

’mean_cadence’: float,

Mean time between observations

’cadence_std’: float,

Standard deviation of cadences

’nyquist_period’: float,

2 * effective_cadence (shortest reliably detectable period), where effective_cadence is the median cadence, or the mean cadence if the median is zero (e.g. many duplicate timestamps)

’nyquist_frequency’: float,

1 / (2 * effective_cadence), np.inf if effective_cadence == 0

’longest_detectable_period’: float,

baseline / 2 (heuristic upper limit)

’duty_cycle’: float,

Fraction of baseline with observations (simple estimate)

’sampling_uniformity’: float,

1 - (std(cadences) / mean(cadences)), range [0,1] 1 = perfectly uniform, 0 = highly irregular

}

If y and yerr provided, also includes: {

’median_snr’: float, ‘mean_snr’: float, ‘fraction_snr_gt_3’: float, ‘fraction_snr_gt_5’: float

}

Return type:

dict

Examples

>>> metrics = compute_sampling_metrics(t, y, yerr)
>>> print(f"Nyquist period: {metrics['nyquist_period']:.2f} days")
>>> print(f"Detectable range: {metrics['nyquist_period']:.1f}"
...       f" - {metrics['longest_detectable_period']:.1f} days")
pgmuvi.preprocess.quality.robust_scale(y: ndarray, c: float = 0.6745) float

Compute robust scale estimate using median absolute deviation (MAD).

scale = MAD(y) / c

where c = 0.6745 for equivalence to std dev under Gaussian.

Parameters:
  • y (np.ndarray) – Data values (should be finite)

  • c (float, default=0.6745) – Normalization constant (0.6745 ≈ Φ^(-1)(0.75) for Gaussian)

Returns:

Robust scale estimate, or 0.0 if all values identical

Return type:

float

Notes

More robust to outliers than standard deviation. Used to normalize measurements for scale-independent comparisons.

pgmuvi.preprocess.quality.subsample_lightcurve(t, max_samples=500, max_gap_fraction=0.3, random_seed=None)

Return indices selecting at most max_samples points from t while preserving the temporal baseline and the max_gap_fraction constraint.

The algorithm always keeps the earliest and latest time points so that the full temporal baseline is preserved. It then fills the remaining budget with randomly chosen interior points. If the resulting selection contains any gap larger than max_gap_fraction * baseline, the original data point closest to the midpoint of each offending gap is added iteratively. To keep the total count at or below max_samples, each added point is paired with the removal of the interior point whose removal creates the smallest new gap that still satisfies the constraint. If no such drop candidate exists, that gap is left unrepaired and repair continues for other gaps. A hard iteration cap prevents infinite loops.

Parameters:
  • t (array_like) – Observation times (1-D).

  • max_samples (int, default 500) – Maximum number of points to include in the subsample. Must be an integer >= 2.

  • max_gap_fraction (float, default 0.3) – Maximum allowed gap between consecutive selected times as a fraction of the total temporal baseline.

  • random_seed (int or None, optional) – Seed for the random number generator (reproducibility).

Returns:

indices – Indices into t that form the subsample, ordered by ascending time. If len(t) <= max_samples the full range np.arange(len(t)) is returned unchanged. The returned array always has length <= max_samples.

Return type:

np.ndarray of int

Notes

When filling a gap would exceed the budget, the densest interior point (i.e. the one whose removal creates the smallest new gap while still satisfying the gap constraint) is swapped out. If no such drop candidate exists, the gap is left unrepaired and repair continues for other gaps. A hard iteration cap (2 * max_samples + 1) prevents infinite loops in pathological cases.

Raises:

ValueError – If max_samples is not an integer >= 2.

Examples

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> t = rng.uniform(0, 100, 5000)
>>> idx = subsample_lightcurve(t, max_samples=500, random_seed=42)
>>> len(idx) <= 500
True

Variability detection utilities for lightcurves.

Implements a three-tier statistical testing framework to determine if a lightcurve shows significant variability before attempting GP fitting.

References

Stetson, P. B. 1996, PASP, 108, 851 (Stetson K index)

pgmuvi.preprocess.variability.compute_fvar(y, yerr) float

Compute normalized excess variance (F_var).

F_var = np.sqrt(max(s**2 - mean_err**2, 0)) / |ybar|

where:

s**2 = sample variance of y mean_err**2 = mean(sig_i**2)

Parameters:
  • y (array-like or torch.Tensor) – Flux values (1-D).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).

Returns:

fvar – Scale-free amplitude measure. 0 = no excess variance beyond errors >0 = intrinsic variability detected

Return type:

float

Notes

This effect size prevents keeping bands that are statistically significant but astrophysically tiny (common with large N).

Raises:

ValueError – If inputs are not 1-D, have mismatched shapes, have fewer than 2 points, contain non-finite values, or have non-positive yerr.

pgmuvi.preprocess.variability.compute_stetson_k(y, yerr) float

Compute Stetson K index for shape/coherence diagnostics.

K = (1/N) * sum(|delta_i|) / np.sqrt((1/N) * sum(delta_i**2))

where delta_i = sqrt(N/(N-1)) * (y_i - ybar) / sig_i and N is the number of valid points used in the statistic.

Parameters:
  • y (array-like or torch.Tensor) – Flux values (1-D).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).

Returns:

K – Stetson K index. ~0.798 for pure Gaussian noise Larger values indicate more coherent/peaked residual structure. Returns np.nan when the statistic is not well-defined.

Return type:

float

Notes

This implementation keeps the classic Stetson K form while handling pathological inputs robustly.

  • If finite, positive uncertainties are available, ybar is a weighted mean using 1/sigma^2.

  • If weighted averaging is not possible, ybar falls back to the unweighted mean of finite flux points.

  • delta_i uses the finite-sample factor sqrt(N/(N-1)) for N > 1.

  • For invalid/pathological cases (N < 2, invalid denominators, etc.), np.nan is returned instead of raising.

References

Stetson, P. B. 1996, PASP, 108, 851

The public API and return type are preserved for backward compatibility.

pgmuvi.preprocess.variability.is_variable(y, yerr, alpha: float = 0.01, fvar_min: float = 0.05, stetson_k_min: float = 0.95, min_points: int = 6, verbose: bool = False) tuple[bool, dict]

Comprehensive variability assessment using required and diagnostic tests.

Parameters:
  • y (array-like or torch.Tensor) – Flux measurements (1-D).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, positive).

  • alpha (float, default=0.01) – Significance level for chi-square test

  • fvar_min (float, default=0.05) – Minimum fractional excess variance (5%)

  • stetson_k_min (float, default=0.95) – Diagnostic Stetson-K reference threshold used only to set tests_passed['stetson_test'] and reporting notes. It is not required for the overall VARIABLE/NOT VARIABLE decision.

  • min_points (int, default=6) – Minimum number of data points required

  • verbose (bool, default=False) – Print diagnostic information

Returns:

  • is_var (bool) – True if lightcurve passes the required variability gates (min_points, chi2, and F_var).

  • diagnostics (dict) –

    {

    ‘n_points’: int, ‘chi2’: float, ‘dof’: int, ‘p_value’: float, ‘fvar’: float, ‘stetson_k’: float, ‘decision’: str, # ‘VARIABLE’ or reason for rejection ‘tests_passed’: {

    ’chi2_test’: bool, ‘fvar_test’: bool, ‘stetson_test’: bool, ‘min_points’: bool

    }

    }

Notes

Decision logic (required gates): 1. N >= min_points 2. p_value < alpha (statistically significant) 3. F_var >= fvar_min (astrophysically significant)

Stetson K is retained as a shape/coherence diagnostic and reported in diagnostics, but does not veto a lightcurve that already passes the required gates.

Both numpy arrays and torch tensors (including CUDA tensors) are accepted.

Examples

>>> from pgmuvi.preprocess.variability import is_variable
>>> is_var, diag = is_variable(y, yerr, verbose=True)
>>> if is_var:
...     print("Proceed with GP fitting")
... else:
...     print(f"Skipping: {diag['decision']}")
pgmuvi.preprocess.variability.weighted_chi2_test(y, yerr) tuple[float, int, float, float]

Weighted chi-square test against a constant mean model.

Null hypothesis H0: y_i = mu + eps_i, where eps_i ~ N(0, sig_i^2)

Parameters:
  • y (array-like or torch.Tensor) – Flux values (1-D, finite).

  • yerr (array-like or torch.Tensor) – 1-sigma uncertainties (1-D, finite, positive).

Returns:

  • chi2 (float) – Test statistic: sum(w_i * (y_i - ybar_w)**2)

  • dof (int) – Degrees of freedom (N - 1)

  • ybar_w (float) – Weighted mean: sum(y_i/sig_i**2) / sum(1/sig_i**2)

  • p_value (float) – Right-tail p-value: P(chi2 >= chi2_stat | dof)

Notes

Uses scipy.special.gammaincc for chi-square tail probability. This is the uniformly most powerful test for non-constant mean under Gaussian heteroscedastic errors with known variances.

Raises:

ValueError – If inputs are not 1-D, have mismatched shapes, contain non-finite values, contain non-positive yerr, or have fewer than 2 points.