GMMThresholding

class GMMThresholding(adata, feature, label_obs_save_str, thresholding_events_key='gmm_thresholding_events', layer=None, gmm_kwargs=None, random_state=42)

Bases: GaussianMixtureModelBase

A class to perform Gaussian Mixture Model (GMM) based thresholding on single-feature data.

This class fits a GMM to the distribution of a specified feature (e.g., gene expression) and calculates decision boundaries between the components, either automatically from changes in component probability or from manually specified values. It then assigns categorical labels to observations based on these boundaries and provides plotting methods to visualize the results.
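The idea of placing a boundary where the component probabilities change can be sketched without the class itself. The snippet below is a minimal illustration under stated assumptions, not this library's implementation: it takes hypothetical weights, means, and standard deviations for a 1-D mixture, scans a grid, and records the points where the most probable component switches. The function names and the grid-scan approach are illustrative choices.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def scan_decision_boundaries(weights, means, stds, lo, hi, resolution=1000):
    """Return grid midpoints where the most probable component changes.

    Hypothetical helper: the mixture parameters are passed in directly
    instead of being read from a fitted model.
    """
    boundaries = []
    prev_x, prev_k = None, None
    for i in range(resolution):
        x = lo + (hi - lo) * i / (resolution - 1)
        # Weighted densities are proportional to the posterior probabilities.
        scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        k = max(range(len(scores)), key=scores.__getitem__)
        if prev_k is not None and k != prev_k:
            boundaries.append(0.5 * (prev_x + x))
        prev_x, prev_k = x, k
    return boundaries


# Two equally weighted components with equal spread: the boundary
# falls halfway between the means.
boundaries = scan_decision_boundaries(
    weights=[0.5, 0.5], means=[0.0, 4.0], stds=[1.0, 1.0], lo=-4.0, hi=8.0
)
print(boundaries)
```

With unequal weights or spreads the boundary shifts toward the narrower or less populated component, which is why a fitted boundary can differ from a simple midpoint between means.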

Attributes:
adata (ad.AnnData)

A copy of the input AnnData object, modified during processing.

feature (str)

The name of the feature (an entry in adata.var_names) being thresholded.

gmm_obs_label (str)

The key in adata.obs where the resulting category labels will be stored.

gmm_kwargs (Dict)

Default keyword arguments passed to the sklearn.mixture.GaussianMixture model during fitting.

internal_data_SingleThresholdingEventModel

Pydantic model storing all results for the current feature.

manual_decision_boundaries (bool)

Flag indicating if thresholds were set manually.

Methods

categorize_samples

Categorize samples based on GMM-derived or manual thresholds.

determine_optimal_components

Determine the optimal number of GMM components for this feature.

determine_optimal_number_components

Determine the optimal number of components for the GMM.

fit

Fit a Gaussian Mixture Model (GMM) to the specified gene expression data.

generate_thresholding_report

Generate a human-readable report of all thresholding operations.

plot_bayesian_information_criterion_curve

Plot the BIC curve.

plot_bic_curve

Plot the Bayesian Information Criterion (BIC) curve.

plot_feature_distribution_exploratory

Plot histogram of the feature distribution for exploratory analysis.

plot_feature_strip_plot_exploratory

Plot strip plot + histogram for exploratory analysis.

plot_hist_distribution_with_boundaries

Plot the histogram and GMM components with decision boundaries.

plot_strip_plot_histogram_with_decision_boundaries

Generate a strip plot with a histogram and decision boundaries.

return_adata

Serialize internal Pydantic models and return the modified AnnData object.

return_thresholds

Return the decision boundary thresholds.

categorize_samples(manual_thresholds=None, ordered_labels=None, duplicate_labels=False)

Categorize samples based on GMM-derived or manual thresholds.

This method assigns categorical labels to samples based on either automatically calculated decision boundaries from GMM fitting or manually specified thresholds. Supports collapsing multiple GMM components into fewer categories for cross-dataset robustness.

Parameters:
manual_thresholds (list of float or int, optional)

Explicit threshold values. If None, thresholds are calculated automatically from GMM probabilities. Must have length equal to (number of unique labels - 1).

ordered_labels (list, optional)

Label for each GMM component. Length must equal n_components fitted in GMM. Can contain duplicates if duplicate_labels=True to merge multiple components into single categories. If None, generates default labels.

duplicate_labels (bool, optional)

If True, allows duplicate labels to collapse multiple GMM components into single categories. Useful for cross-dataset robustness where many components are fitted for adaptive boundary placement but fewer final categories are desired. Defaults to False.

Raises:
ValueError

If GMM model has not been fitted first.

ValueError

If number of ordered_labels doesn’t match n_components.

ValueError

If duplicate labels provided but duplicate_labels=False.

TypeError

If manual_thresholds is not a list.

ValueError

If manual_thresholds length doesn’t match unique labels - 1.

TypeError

If threshold values are not numeric.

Examples

Standard usage with 2 components and 2 labels:

gmm.fit(n_components=2)
gmm.categorize_samples(ordered_labels=['Low', 'High'])

Cross-dataset robustness - fit many components, collapse to binary:

gmm.fit(n_components=8)
gmm.categorize_samples(
    ordered_labels=['Low', 'Low', 'Low', 'High', 'High', 'High', 'High', 'High'],
    duplicate_labels=True
)
# Boundary automatically placed based on 8-component GMM fit

Using manual thresholds:

gmm.fit(n_components=2)
gmm.categorize_samples(
    ordered_labels=['Low', 'High'],
    manual_thresholds=[0.05]
)
Return type:

None
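How duplicate labels reduce the number of thresholds can be sketched in plain Python. This is an illustration under assumptions, not the method's actual code: it assumes one boundary between each pair of adjacent components (ordered along the feature axis), keeps only boundaries that separate differently labeled neighbors, and assigns values on a threshold to the upper category. Both helper names are hypothetical.

```python
from bisect import bisect_right


def collapse_boundaries(boundaries, ordered_labels):
    """Keep only boundaries between adjacent components with different labels."""
    if len(boundaries) != len(ordered_labels) - 1:
        raise ValueError("expected one boundary between each pair of adjacent components")
    kept_thresholds = []
    kept_labels = [ordered_labels[0]]
    for boundary, left, right in zip(boundaries, ordered_labels, ordered_labels[1:]):
        if left != right:
            kept_thresholds.append(boundary)
            kept_labels.append(right)
    return kept_thresholds, kept_labels


def assign_labels(values, thresholds, labels):
    """Label each value by the interval it falls into (ties go to the upper label)."""
    if len(thresholds) != len(labels) - 1:
        raise ValueError("need exactly len(labels) - 1 thresholds")
    return [labels[bisect_right(thresholds, value)] for value in values]


# Four components collapsed to two categories: only the boundary between
# the last 'Low' component and the first 'High' component survives.
thresholds, labels = collapse_boundaries(
    boundaries=[1.0, 2.5, 4.0],
    ordered_labels=["Low", "Low", "High", "High"],
)
categories = assign_labels([0.3, 2.4, 2.6, 5.0], thresholds, labels)
print(thresholds, labels, categories)
```

This also shows why manual_thresholds must have one fewer entry than the number of unique labels: each threshold separates two neighboring categories.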

determine_optimal_components(component_range, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)

Determine the optimal number of GMM components for this feature.

This is a convenience wrapper that automatically uses the instance’s adata, feature, layer, and gmm_kwargs attributes.

Parameters:
component_range (int)

Maximum number of components to test.

metric (str, optional)

Metric to use for optimization (currently only ‘bic’ supported). Defaults to ‘bic’.

curve (str, optional)

Type of curve for knee detection (‘convex’ or ‘concave’). Defaults to ‘convex’.

direction (str, optional)

Direction of curve (‘decreasing’ or ‘increasing’). Defaults to ‘decreasing’.

return_bic_list (bool, optional)

If True, returns tuple of (optimal_n, bic_list). Defaults to False.

Returns:
int or tuple

Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.

Examples

Find optimal number of components:

gmm = GMMThresholding(
    adata=adata,
    feature='cycD1 (nuc median)',
    label_obs_save_str='cell_cycle'
)
optimal_n = gmm.determine_optimal_components(
    component_range=5
)
print(f'Optimal components: {optimal_n}')

Return type:

Union[int, tuple]
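The knee-detection step for a convex, decreasing BIC curve can be illustrated with a simple stand-in. The source does not state which knee-detection algorithm is used, so the sketch below is only one possible approach: after normalizing both axes to [0, 1], it picks the point lying farthest below the straight line joining the curve's endpoints. The function name, the candidate component counts, and the BIC values are made up for illustration.

```python
def knee_by_chord_gap(component_counts, bic_values):
    """Return the component count at the knee of a convex, decreasing curve.

    Stand-in heuristic: maximize the vertical gap between the chord from
    the first to the last point and the normalized curve.
    """
    if len(component_counts) != len(bic_values) or len(bic_values) < 3:
        raise ValueError("need at least three (count, BIC) pairs of equal length")
    x_min, x_max = component_counts[0], component_counts[-1]
    y_min, y_max = min(bic_values), max(bic_values)
    if x_max == x_min or y_max == y_min:
        raise ValueError("curve is flat; no knee to detect")

    best_count, best_gap = component_counts[0], float("-inf")
    for count, bic in zip(component_counts, bic_values):
        x_norm = (count - x_min) / (x_max - x_min)
        y_norm = (bic - y_min) / (y_max - y_min)
        # For a decreasing curve the normalized chord is y = 1 - x.
        gap = (1.0 - x_norm) - y_norm
        if gap > best_gap:
            best_count, best_gap = count, gap
    return best_count


# Illustrative numbers only: BIC drops sharply up to 3 components, then flattens.
counts = [1, 2, 3, 4, 5]
bic_list = [1000.0, 600.0, 420.0, 400.0, 395.0]
optimal_n = knee_by_chord_gap(counts, bic_list)
print(optimal_n)
```

Requesting return_bic_list=True from the real method makes it possible to inspect the curve yourself and check whether the reported knee looks reasonable.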

determine_optimal_number_components(adata, feature, component_range, layer=None, gmm_kwargs=None, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)

Determine the optimal number of components for the GMM.

Parameters:
adata (ad.AnnData)

AnnData object containing the data.

feature (str)

Name of the feature to analyze.

component_range (int)

Maximum number of components to test.

layer (str or None, optional)

Optional layer name to use instead of .X. Default is None.

gmm_kwargs (dict or None, optional)

Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.

metric (str, optional)

Metric to use for optimization (currently only ‘bic’ supported). Default is ‘bic’.

curve (str, optional)

Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.

direction (str, optional)

Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.

return_bic_list (bool, optional)

If True, returns tuple of (optimal_n, bic_list). Default is False.

Returns:
int or tuple of (int, list of (int or float))

Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.

Return type:

Union[int, Tuple[int, List[Union[int, float]]]]
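For orientation, the BIC of a fitted model is k·ln(n) − 2·ln(L), where k is the number of free parameters, n the number of observations, and L the maximized likelihood; lower values are better. The sketch below applies this to a 1-D mixture, assuming each component has its own mean and variance, so that k = 3K − 1 for K components. The helper name and the log-likelihood values are hypothetical, and a different covariance setting in gmm_kwargs would change the parameter count.

```python
import math


def gmm_1d_bic(log_likelihood, n_components, n_samples):
    """BIC of a 1-D Gaussian mixture with a free variance per component.

    Parameter count: K means + K variances + (K - 1) independent weights.
    """
    n_params = 3 * n_components - 1
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Made-up log-likelihoods: a third component improves the fit only slightly,
# so the extra-parameter penalty makes its BIC worse (higher).
bic_two = gmm_1d_bic(log_likelihood=-1500.0, n_components=2, n_samples=1000)
bic_three = gmm_1d_bic(log_likelihood=-1498.0, n_components=3, n_samples=1000)
print(round(bic_two, 2), round(bic_three, 2))
```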

fit(n_components)

Fit a Gaussian Mixture Model (GMM) to the specified gene expression data.

Parameters:
n_components (int)

Number of Gaussian components to fit. Must be > 1.

Raises:
TypeError

If n_components is not an integer.

ValueError

If n_components is <= 1.

Return type:

None

generate_thresholding_report(output_format='text')

Generate a human-readable report of all thresholding operations.

Reads thresholding metadata from adata.uns and creates a summary showing:

  • Operation names and order

  • Features used

  • Number of components

  • Thresholds calculated

  • Labels assigned

  • Parent operations (for refinements)

  • Cell counts per category (captured at operation time)

Note: Cell counts reflect the state immediately after each operation was performed, not the current state of the data. This is important because subsequent refinement operations may change labels, but the historical counts are preserved.

Parameters:
output_format (str, default ‘text’)

‘text’ for formatted string, ‘dataframe’ for pandas DataFrame.

Returns:
Union[str, pd.DataFrame]

Formatted report string or DataFrame.

Raises:
KeyError

If thresholding_events_key doesn’t exist in adata.uns.

ValueError

If output_format is not ‘text’ or ‘dataframe’.

TypeError

If adata.uns[thresholding_events_key] is not a dict.

Examples

Generate text report:

>>> gmm = GMMThresholding(adata, feature='gene1', label_obs_save_str='gene1_cat')
>>> gmm.fit(n_components=2)
>>> gmm.categorize_samples(ordered_labels=['Low', 'High'])
>>> report = gmm.generate_thresholding_report()
>>> print(report)
Thresholding Report
==================================================

Return type:

Union[str, DataFrame]
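The note about cell counts being captured at operation time can be made concrete with a small stand-alone sketch. It does not use the class or its stored schema; the dictionary keys 'operation' and 'cell_counts' are hypothetical and chosen only to show why a snapshot taken when an operation runs stays unchanged after later relabeling.

```python
from collections import Counter

# Labels right after a first thresholding operation.
labels = ["Low", "Low", "High", "High", "High"]

# Snapshot the per-category counts at operation time (hypothetical record layout).
event = {"operation": "initial_threshold", "cell_counts": dict(Counter(labels))}

# A later refinement relabels one observation.
labels[2] = "Mid"
current_counts = dict(Counter(labels))

print(event["cell_counts"])  # historical counts, unaffected by the refinement
print(current_counts)        # counts reflecting the current labels
```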

plot_bayesian_information_criterion_curve(adata, feature, component_range, layer=None, gmm_kwargs=None, curve='convex', direction='decreasing', ax=None, save_path=None)

Plot the BIC curve.

Parameters:
adata (ad.AnnData)

AnnData object containing the data.

feature (str)

Name of the feature to analyze.

component_range (int)

Maximum number of components to test.

layer (str or None, optional)

Optional layer name to use instead of .X. Default is None.

gmm_kwargs (dict or None, optional)

Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.

curve (str, optional)

Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.

direction (str, optional)

Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.

ax (matplotlib.pyplot.Axes or None, optional)

Optional matplotlib axes to plot on. If None, creates new figure. Default is None.

save_path (str or Path or None, optional)

Optional path to save the figure. Parent directory must exist. Default is None.

Raises:
FileNotFoundError

If save_path parent directory doesn’t exist.

Return type:

None
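Because the plotting methods raise FileNotFoundError when the parent directory of save_path is missing, it can help to check or create the directory beforehand. The sketch below shows one way such a check might look using only pathlib; validate_save_path is a hypothetical helper, not part of this class.

```python
import tempfile
from pathlib import Path


def validate_save_path(save_path):
    """Return save_path as a Path, or raise if its parent directory is missing."""
    path = Path(save_path)
    if not path.parent.exists():
        raise FileNotFoundError(f"Parent directory does not exist: {path.parent}")
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    # The temporary directory exists, so this path is accepted.
    ok_path = validate_save_path(Path(tmp_dir) / "bic_curve.png")

    # 'missing_dir' was never created, so the check fails.
    try:
        validate_save_path(Path(tmp_dir) / "missing_dir" / "bic_curve.png")
        missing_parent_raised = False
    except FileNotFoundError:
        missing_parent_raised = True

print(ok_path.name, missing_parent_raised)
```

Calling Path(save_path).parent.mkdir(parents=True, exist_ok=True) before plotting avoids the error when the output directory may not exist yet.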

plot_bic_curve(component_range, curve='convex', direction='decreasing', ax=None, save_path=None)

Plot the Bayesian Information Criterion (BIC) curve.

This is a convenience wrapper that automatically uses the instance’s adata, feature, layer, and gmm_kwargs attributes.

Parameters:
component_range (int)

Maximum number of components to test.

curve (str, optional)

Type of curve for knee detection (‘convex’ or ‘concave’). Defaults to ‘convex’.

direction (str, optional)

Direction of curve (‘decreasing’ or ‘increasing’). Defaults to ‘decreasing’.

ax (plt.Axes, optional)

Matplotlib axes to plot on. If None, creates new figure. Defaults to None.

save_path (str or Path, optional)

Path to save the figure. Parent directory must exist. Defaults to None.

Raises:
FileNotFoundError

If save_path parent directory doesn’t exist.

Examples

Plot BIC curve to determine optimal components:

import matplotlib.pyplot as plt

gmm = GMMThresholding(
    adata=adata,
    feature='cycD1 (nuc median)',
    label_obs_save_str='cell_cycle'
)
gmm.plot_bic_curve(component_range=5)
plt.show()
Return type:

None

plot_feature_distribution_exploratory(hist_kwargs=None, ax=None, x_axis_limits=None)

Plot histogram of the feature distribution for exploratory analysis.

This method allows you to visualize the feature distribution WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.

Parameters:
hist_kwargs (dict, optional)

Keyword arguments for plt.hist(). Defaults to {‘bins’: 50, ‘color’: ‘black’, ‘alpha’: 0.7}.

ax (plt.Axes, optional)

Matplotlib axes to plot on. If None, uses current axes.

x_axis_limits (tuple, optional)

(min, max) for x-axis. Use None for data-driven limits.

Returns:
plt.Axes

The matplotlib axes object.

Examples

Explore DNA content distribution before deciding on components:

import matplotlib.pyplot as plt

gmm = GMMThresholding(
    adata=adata,
    feature='DNA_content',
    label_obs_save_str='cell_cycle'
)
gmm.plot_feature_distribution_exploratory(
    hist_kwargs={'bins': 30, 'color': 'steelblue'},
    x_axis_limits=(0, 10)
)
plt.title('DNA Content Distribution - Exploratory')
plt.show()
Return type:

Axes

plot_feature_strip_plot_exploratory(hist_kwargs=None, strip_plot_kwargs=None, scatter_density=True, x_axis_limits=None)

Plot strip plot + histogram for exploratory analysis.

Similar to plot_strip_plot_histogram_with_decision_boundaries() but WITHOUT decision boundaries, for exploring data before running threshold operations.

Parameters:
hist_kwargs (dict, optional)

Keyword arguments for histogram.

strip_plot_kwargs (dict, optional)

Keyword arguments for strip plot.

scatter_density (bool, optional)

If True, uses density-based coloring. Defaults to True.

x_axis_limits (tuple, optional)

(min, max) for x-axis.

Returns:
tuple

(fig, (ax_strip, ax_hist)) - Figure and axes objects.

Examples

Explore DNA content distribution with density visualization:

import matplotlib.pyplot as plt

gmm = GMMThresholding(
    adata=adata,
    feature='DNA_content',
    label_obs_save_str='cell_cycle'
)
fig, (ax_strip, ax_hist) = gmm.plot_feature_strip_plot_exploratory(
    scatter_density=True,
    x_axis_limits=(0, 10)
)
plt.suptitle('DNA Content Distribution - Exploratory')
plt.show()
Return type:

tuple

plot_hist_distribution_with_boundaries(num_std=5, title=None, hist_kwargs=None, cmap=<matplotlib.colors.LinearSegmentedColormap object>, ax=None, x_axis_limits=None, resolution=1000, save_path=None)

Plot the histogram and GMM components with decision boundaries.

Parameters:
num_std (int, optional)

Number of standard deviations for GMM plotting. Defaults to 5.

title (str, optional)

Plot title. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.

hist_kwargs (dict, optional)

Kwargs for histogram. Defaults to None.

cmap (plt.cm.ScalarMappable, optional)

Colormap. Defaults to ‘rainbow’.

ax (plt.Axes, optional)

Axes to plot on. If None, creates new. Defaults to None.

x_axis_limits (tuple, optional)

X-axis limits (min, max). Defaults to None.

resolution (int, optional)

Resolution for plotting. Defaults to 1000.

save_path (str or Path, optional)

Path to save the figure. Parent directory must exist. Defaults to None.

Returns:
plt.Axes

The matplotlib axes object. Call plt.show() to display it.

Raises:
FileNotFoundError

If save_path parent directory doesn’t exist.

Return type:

Axes

plot_strip_plot_histogram_with_decision_boundaries(cmap=<matplotlib.colors.ListedColormap object>, y_axis_limits=None, resolution=1000, scatter_density=True, vmax=None, hist_kwargs=None, strip_plot_kwargs=None, title=None)

Generate a strip plot with a histogram and decision boundaries.

This method wraps the base class implementation, providing a convenient interface for single-feature thresholding visualizations.

Parameters:
cmap (plt.cm.ScalarMappable, optional)

Colormap for density or labels. Defaults to mpl.colormaps[‘plasma’].

y_axis_limits (tuple, optional)

Y-axis limits (min, max). If None, uses data min/max. Defaults to None.

resolution (int, optional)

Resolution for boundary plotting. Defaults to 1000.

scatter_density (bool, optional)

If True, color by density; if False, color by labels. Defaults to True.

vmax (int or float, optional)

Maximum density value for colormap. If None, auto-calculated. Defaults to None.

hist_kwargs (dict, optional)

Kwargs for histogram (bins, color, etc.). Defaults to None.

strip_plot_kwargs (dict, optional)

Kwargs for strip plot scatter (e.g., s, alpha, marker). Only used when scatter_density=False. Defaults to None.

title (str or None, optional)

Title for the plot. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.

Returns:
Figure

The matplotlib figure object. Call plt.show() to display it.

Raises:
ValueError

If decision boundaries have not been calculated yet.

Return type:

Figure

return_adata(overwrite=False)

Serialize internal Pydantic models and return the modified AnnData object.

Parameters:
overwrite (bool, optional)

If True, allows overwriting existing entries. Defaults to False.

Returns:
ad.AnnData

The modified AnnData object with the serialized GMM thresholding events.

Raises:
ValueError

If the feature already exists in adata.uns[thresholding_events_key] and overwrite is False.

Return type:

AnnData
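The overwrite guard described above can be sketched with a plain dictionary standing in for adata.uns. This is an illustration of the documented behavior only: store_event is a hypothetical helper, and the layout of the stored event (a dict with a 'thresholds' entry) is an assumption, not the library's serialized format.

```python
def store_event(uns, events_key, feature, event, overwrite=False):
    """Store an event under uns[events_key][feature], refusing silent overwrites."""
    events = uns.setdefault(events_key, {})
    if feature in events and not overwrite:
        raise ValueError(
            f"'{feature}' already exists in uns['{events_key}']; "
            "pass overwrite=True to replace it."
        )
    events[feature] = event
    return uns


uns = {}
key = "gmm_thresholding_events"
store_event(uns, key, "gene1", {"thresholds": [0.05]})

# A second write for the same feature is rejected by default.
try:
    store_event(uns, key, "gene1", {"thresholds": [0.10]})
    second_write_raised = False
except ValueError:
    second_write_raised = True

# With overwrite=True the existing entry is replaced.
store_event(uns, key, "gene1", {"thresholds": [0.10]}, overwrite=True)
print(second_write_raised, uns[key]["gene1"])
```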

return_thresholds()

Return the decision boundary thresholds.

Returns:
list of float

The decision boundary thresholds.

Raises:
ValueError

If decision boundaries have not been calculated yet.

Return type:

List[float]