GMMThresholding

class GMMThresholding(adata, feature, label_obs_save_str, thresholding_events_key='gmm_thresholding_events', layer=None, gmm_kwargs=None, random_state=42)

Bases: GaussianMixtureModelBase

A class to perform Gaussian Mixture Model (GMM) based thresholding on single-feature data.

This class fits a GMM to the distribution of a specified feature (e.g., gene expression), calculates decision boundaries between the components (either automatically based on probability changes or manually specified), assigns categorical labels to observations based on these boundaries, and provides plotting functionalities to visualize the results.

Attributes:

adata (ad.AnnData): A copy of the input AnnData object, modified during processing.
feature (str): The name of the feature (column in adata.var) being thresholded.
gmm_obs_label (str): The key in adata.obs where the resulting category labels will be stored.
gmm_kwargs (Dict): Default keyword arguments passed to the sklearn.mixture.GaussianMixture model during fitting.
internal_data_ (SingleThresholdingEventModel): Pydantic model storing all results for the current feature.
manual_decision_boundaries (bool): Flag indicating if thresholds were set manually.

Parameters:

adata (AnnData)
feature (str)
label_obs_save_str (str)
thresholding_events_key (str)
layer (Optional[str])
gmm_kwargs (Optional[dict])
random_state (int)

Methods

`categorize_samples`	Categorize samples based on GMM-derived or manual thresholds.
`determine_optimal_components`	Determine the optimal number of GMM components for this feature.
`determine_optimal_number_components`	Determine the optimal number of components for the GMM.
`fit`	Fit a Gaussian Mixture Model (GMM) to the specified gene expression data.
`generate_thresholding_report`	Generate a human-readable report of all thresholding operations.
`plot_bayesian_information_criterion_curve`	Plot the BIC curve.
`plot_bic_curve`	Plot the Bayesian Information Criterion (BIC) curve.
`plot_feature_distribution_exploratory`	Plot histogram of the feature distribution for exploratory analysis.
`plot_feature_strip_plot_exploratory`	Plot strip plot + histogram for exploratory analysis.
`plot_hist_distribution_with_boundaries`	Plot the histogram and GMM components with decision boundaries.
`plot_strip_plot_histogram_with_decision_boundaries`	Generate a strip plot with a histogram and decision boundaries.
`return_adata`	Serialize internal Pydantic models and return the modified AnnData object.
`return_thresholds`	Return the decision boundary thresholds.

categorize_samples(manual_thresholds=None, ordered_labels=None, duplicate_labels=False)

Categorize samples based on GMM-derived or manual thresholds.

This method assigns categorical labels to samples based on either automatically calculated decision boundaries from GMM fitting or manually specified thresholds. It also supports collapsing multiple GMM components into fewer categories for cross-dataset robustness.

Parameters:

manual_thresholds (list of float or int, optional): Explicit threshold values. If None, thresholds are calculated automatically from GMM probabilities. Must have length equal to (number of unique labels - 1).
ordered_labels (list, optional): Label for each GMM component. Length must equal n_components fitted in GMM. Can contain duplicates if duplicate_labels=True to merge multiple components into single categories. If None, generates default labels.
duplicate_labels (bool, optional): If True, allows duplicate labels to collapse multiple GMM components into single categories. Useful for cross-dataset robustness where many components are fitted for adaptive boundary placement but fewer final categories are desired. Defaults to False.
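
One way to read "calculated automatically from GMM probabilities" is that a boundary sits wherever the most probable component changes along the feature axis. The sketch below scans a grid for those change points. The function names, the grid scan, and the component parameters are illustrative assumptions, not the library's implementation.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def most_probable_component(x, weights, means, stds):
    """Index of the component with the highest weighted density at x."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    return scores.index(max(scores))


def scan_boundaries(weights, means, stds, lo, hi, resolution=1000):
    """Return grid midpoints where the most probable component changes."""
    step = (hi - lo) / (resolution - 1)
    boundaries = []
    previous = most_probable_component(lo, weights, means, stds)
    for i in range(1, resolution):
        x = lo + i * step
        current = most_probable_component(x, weights, means, stds)
        if current != previous:
            # Place the boundary halfway between the two grid points.
            boundaries.append(x - step / 2.0)
            previous = current
    return boundaries


# Hypothetical two-component fit: equal weights and equal spreads,
# so the single boundary should fall close to halfway between the means.
thresholds = scan_boundaries([0.5, 0.5], [0.0, 4.0], [1.0, 1.0], lo=-5.0, hi=9.0)
print(thresholds)
```

With unequal weights or spreads the boundary shifts away from the midpoint, which is why a fitted model can place thresholds differently on each dataset.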

Raises:

ValueError: If GMM model has not been fitted first.
ValueError: If number of ordered_labels doesn’t match n_components.
ValueError: If duplicate labels provided but duplicate_labels=False.
TypeError: If manual_thresholds is not a list.
ValueError: If manual_thresholds length doesn’t match unique labels - 1.
TypeError: If threshold values are not numeric.

Examples

Standard usage with 2 components and 2 labels:

gmm.fit(n_components=2)
gmm.categorize_samples(ordered_labels=['Low', 'High'])

Cross-dataset robustness - fit many components, collapse to binary:

gmm.fit(n_components=8)
gmm.categorize_samples(
    ordered_labels=['Low', 'Low', 'Low', 'High', 'High', 'High', 'High', 'High'],
    duplicate_labels=True
)
# Boundary automatically placed based on 8-component GMM fit
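
Collapsing can be pictured as dropping every boundary whose neighbouring components share a label. The following is a minimal sketch with made-up boundary values, assuming components are ordered along the feature axis; `collapse_boundaries` is a hypothetical helper, not a method of this class.

```python
def collapse_boundaries(component_boundaries, ordered_labels):
    """Keep only boundaries that separate components with different labels.

    component_boundaries[i] is the boundary between component i and i + 1,
    so there is one fewer boundary than there are labels.
    """
    if len(component_boundaries) != len(ordered_labels) - 1:
        raise ValueError("expected one boundary between each pair of adjacent components")
    kept_boundaries = []
    kept_labels = [ordered_labels[0]]
    pairs = zip(component_boundaries, ordered_labels, ordered_labels[1:])
    for boundary, left, right in pairs:
        if left != right:
            kept_boundaries.append(boundary)
            kept_labels.append(right)
    return kept_boundaries, kept_labels


# Made-up boundaries for an 8-component fit (7 boundaries between neighbours).
boundaries = [0.1, 0.2, 0.35, 0.5, 0.7, 0.9, 1.2]
labels = ['Low', 'Low', 'Low', 'High', 'High', 'High', 'High', 'High']
kept = collapse_boundaries(boundaries, labels)
print(kept)
```

Only the boundary between the third and fourth components survives, leaving one threshold for two unique labels.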

Using manual thresholds:

gmm.fit(n_components=2)
gmm.categorize_samples(
    ordered_labels=['Low', 'High'],
    manual_thresholds=[0.05]
)
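
The relationship between thresholds and labels (one fewer threshold than unique labels) can be illustrated with the standard library alone. This is a sketch rather than the class's own logic; `label_values` is a hypothetical helper, and assigning a value equal to a threshold to the upper category is an assumption.

```python
from bisect import bisect_right


def label_values(values, thresholds, labels):
    """Assign labels[i] to values in the i-th interval between sorted thresholds."""
    if len(thresholds) != len(labels) - 1:
        raise ValueError("need exactly len(labels) - 1 thresholds")
    cuts = sorted(thresholds)
    # bisect_right counts how many thresholds are <= the value,
    # which is the index of the interval the value falls into.
    return [labels[bisect_right(cuts, v)] for v in values]


assigned = label_values([0.01, 0.05, 0.2], thresholds=[0.05], labels=['Low', 'High'])
print(assigned)
```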

Parameters:

manual_thresholds (Optional[List[Union[int, float]]])
ordered_labels (Optional[list])
duplicate_labels (bool)

Return type:

None

determine_optimal_components(component_range, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)

Determine the optimal number of GMM components for this feature.

This is a convenience wrapper that automatically uses the instance’s adata, feature, layer, and gmm_kwargs attributes.

Parameters:

component_range (int): Maximum number of components to test.
metric (str, optional): Metric to use for optimization (currently only ‘bic’ supported). Defaults to ‘bic’.
curve (str, optional): Type of curve for knee detection (‘convex’ or ‘concave’). Defaults to ‘convex’.
direction (str, optional): Direction of curve (‘decreasing’ or ‘increasing’). Defaults to ‘decreasing’.
return_bic_list (bool, optional): If True, returns tuple of (optimal_n, bic_list). Defaults to False.

Returns:

int or tuple: Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.

Examples

Find optimal number of components:

gmm = GMMThresholding(
    adata=adata,
    feature='cycD1 (nuc median)',
    label_obs_save_str='cell_cycle'
)
optimal_n = gmm.determine_optimal_components(
    component_range=5
)
print(f'Optimal components: {optimal_n}')
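
The curve and direction options describe the shape of the BIC curve handed to knee detection. As a rough illustration of what a knee is on a convex, decreasing curve, the sketch below picks the point lying farthest below the straight line joining the curve's endpoints. This is a simplified stand-in with made-up BIC values, not the detection routine the class uses.

```python
def find_knee(xs, ys):
    """Return the x whose point lies farthest below the chord joining the endpoints.

    Suitable for a convex, decreasing curve such as a typical BIC curve.
    """
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    best_x, best_gap = xs[0], 0.0
    for x, y in zip(xs, ys):
        chord_y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        gap = chord_y - y  # positive when the curve sits below the chord
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x


# Made-up BIC values for 1..5 components: a steep drop, then a plateau.
n_components = [1, 2, 3, 4, 5]
bic_values = [1000.0, 620.0, 600.0, 595.0, 592.0]
knee = find_knee(n_components, bic_values)
print(knee)
```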

Parameters:

component_range (int)
metric (str)
curve (str)
direction (str)
return_bic_list (bool)

Return type:

Union[int, tuple]

determine_optimal_number_components(adata, feature, component_range, layer=None, gmm_kwargs=None, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)

Determine the optimal number of components for the GMM.

Parameters:

adata (ad.AnnData): AnnData object containing the data.
feature (str): Name of the feature to analyze.
component_range (int): Maximum number of components to test.
layer (str or None, optional): Optional layer name to use instead of .X. Default is None.
gmm_kwargs (dict or None, optional): Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
metric (str, optional): Metric to use for optimization (currently only ‘bic’ supported). Default is ‘bic’.
curve (str, optional): Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
direction (str, optional): Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
return_bic_list (bool, optional): If True, returns tuple of (optimal_n, bic_list). Default is False.

Returns:

int or tuple of (int, list of (int or float)): Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.
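
For orientation, the BIC of a one-dimensional Gaussian mixture can be computed from its total log-likelihood as sketched below. The log-likelihood values are made up, `gmm_bic_1d` is a hypothetical helper, and choosing the minimum is the simplest selection rule, shown here in place of the knee detection this method documents.

```python
import math


def gmm_bic_1d(total_log_likelihood, n_components, n_samples):
    """BIC for a one-dimensional Gaussian mixture (lower is better).

    Free parameters: n means + n variances + (n - 1) independent weights.
    """
    n_parameters = 3 * n_components - 1
    return n_parameters * math.log(n_samples) - 2.0 * total_log_likelihood


# Made-up log-likelihoods for fits with 1, 2 and 3 components on 500 samples.
log_likelihoods = {1: -900.0, 2: -700.0, 3: -695.0}
bic_by_n = {n: gmm_bic_1d(ll, n, 500) for n, ll in log_likelihoods.items()}
best_n = min(bic_by_n, key=bic_by_n.get)
print(best_n)
```

The third component improves the likelihood only slightly, so the penalty term outweighs the gain and the two-component fit scores lowest.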

Parameters:

adata (AnnData)
feature (str)
component_range (int)
layer (Optional[str])
gmm_kwargs (Optional[dict])
metric (str)
curve (str)
direction (str)
return_bic_list (bool)

Return type:

Union[int, Tuple[int, List[Union[int, float]]]]

fit(n_components)

Fit a Gaussian Mixture Model (GMM) to the specified gene expression data.

Parameters:

n_components (int): Number of Gaussian components to fit. Must be > 1.

Raises:

TypeError: If n_components is not an integer.
ValueError: If n_components is <= 1.
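
The two documented checks can be reproduced before calling fit, for example when n_components comes from user input. This is a sketch of equivalent validation, not the method's source; `validate_n_components` is a hypothetical helper, and rejecting booleans is an assumption.

```python
def validate_n_components(n_components):
    """Mirror the documented checks: integer type first, then the > 1 bound."""
    # bool is a subclass of int in Python; rejecting it here is an assumption.
    if isinstance(n_components, bool) or not isinstance(n_components, int):
        raise TypeError("n_components must be an integer")
    if n_components <= 1:
        raise ValueError("n_components must be greater than 1")
    return n_components


print(validate_n_components(2))
```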

Parameters:

n_components (int)

Return type:

None

generate_thresholding_report(output_format='text')

Generate a human-readable report of all thresholding operations.

Reads thresholding metadata from adata.uns and creates a summary showing:

- Operation names and order
- Features used
- Number of components
- Thresholds calculated
- Labels assigned
- Parent operations (for refinements)
- Cell counts per category (captured at operation time)

Note: Cell counts reflect the state immediately after each operation was performed, not the current state of the data. This is important because subsequent refinement operations may change labels, but the historical counts are preserved.
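
The difference between snapshot counts and current counts can be shown with plain Python. The dictionary layout below (the 'operation' and 'cell_counts' keys) is hypothetical and only illustrates the idea of storing counts at operation time.

```python
from collections import Counter

# Labels right after a first, hypothetical thresholding operation.
labels = ['Low', 'Low', 'High', 'High', 'High']
event = {'operation': 'initial', 'cell_counts': dict(Counter(labels))}

# A later refinement relabels one cell; the stored snapshot is untouched.
labels[2] = 'Low'
current_counts = dict(Counter(labels))

print(event['cell_counts'])
print(current_counts)
```

The stored counts still describe the data as it was after the first operation, while the current counts reflect the refinement.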

Parameters:

output_format (str, default ‘text’): ‘text’ for formatted string, ‘dataframe’ for pandas DataFrame.

Returns:

Union[str, pd.DataFrame]: Formatted report string or DataFrame.

Raises:

KeyError: If thresholding_events_key doesn’t exist in adata.uns.
ValueError: If output_format is not ‘text’ or ‘dataframe’.
TypeError: If adata.uns[thresholding_events_key] is not a dict.

Examples

Generate text report:

>>> gmm = GMMThresholding(adata, feature='gene1', label_obs_save_str='gene1_cat')
>>> gmm.fit(n_components=2)
>>> gmm.categorize_samples(ordered_labels=['Low', 'High'])
>>> report = gmm.generate_thresholding_report()
>>> print(report)
Thresholding Report
==================================================

Parameters:

output_format (str)

Return type:

Union[str, DataFrame]

plot_bayesian_information_criterion_curve(adata, feature, component_range, layer=None, gmm_kwargs=None, curve='convex', direction='decreasing', ax=None, save_path=None)

Plot the BIC curve.

Parameters:

adata (ad.AnnData): AnnData object containing the data.
feature (str): Name of the feature to analyze.
component_range (int): Maximum number of components to test.
layer (str or None, optional): Optional layer name to use instead of .X. Default is None.
gmm_kwargs (dict or None, optional): Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
curve (str, optional): Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
direction (str, optional): Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
ax (matplotlib.pyplot.Axes or None, optional): Optional matplotlib axes to plot on. If None, creates new figure. Default is None.
save_path (str or Path or None, optional): Optional path to save the figure. Parent directory must exist. Default is None.

Raises:

FileNotFoundError: If save_path parent directory doesn’t exist.
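
Because the parent directory of save_path must already exist, it can help to check or create it before plotting. The helper below is a hypothetical pre-flight check that raises the same exception type as documented; it is not taken from the library.

```python
from pathlib import Path


def check_save_path(save_path):
    """Return save_path as a Path, or raise if its parent directory is missing."""
    path = Path(save_path)
    if not path.parent.exists():
        raise FileNotFoundError(f"Parent directory does not exist: {path.parent}")
    return path


# A bare file name resolves to the current directory, which always exists.
checked = check_save_path('bic_curve.png')
print(checked)
```

To create a missing directory up front instead, `Path(save_path).parent.mkdir(parents=True, exist_ok=True)` does so without failing when it is already there.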

Parameters:

adata (AnnData)
feature (str)
component_range (int)
layer (Optional[str])
gmm_kwargs (Optional[dict])
curve (str)
direction (str)
ax (Axes)
save_path (Union[str, Path, None])

Return type:

None

plot_bic_curve(component_range, curve='convex', direction='decreasing', ax=None, save_path=None)

Plot the Bayesian Information Criterion (BIC) curve.

This is a convenience wrapper that automatically uses the instance’s adata, feature, layer, and gmm_kwargs attributes.

Parameters:

component_range (int): Maximum number of components to test.
curve (str, optional): Type of curve for knee detection (‘convex’ or ‘concave’). Defaults to ‘convex’.
direction (str, optional): Direction of curve (‘decreasing’ or ‘increasing’). Defaults to ‘decreasing’.
ax (plt.Axes, optional): Matplotlib axes to plot on. If None, creates new figure. Defaults to None.
save_path (str or Path, optional): Path to save the figure. Parent directory must exist. Defaults to None.

Raises:

FileNotFoundError: If save_path parent directory doesn’t exist.

Examples

Plot BIC curve to determine optimal components:

import matplotlib.pyplot as plt

gmm = GMMThresholding(
    adata=adata,
    feature='cycD1 (nuc median)',
    label_obs_save_str='cell_cycle'
)
gmm.plot_bic_curve(component_range=5)
plt.show()

Parameters:

component_range (int)
curve (str)
direction (str)
ax (Optional[Axes])
save_path (Union[str, Path, None])

Return type:

None

plot_feature_distribution_exploratory(hist_kwargs=None, ax=None, x_axis_limits=None)

Plot histogram of the feature distribution for exploratory analysis.

This method allows you to visualize the feature distribution WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.

Parameters:

hist_kwargs (dict, optional): Keyword arguments for plt.hist(). Defaults to {'bins': 50, 'color': 'black', 'alpha': 0.7}.
ax (plt.Axes, optional): Matplotlib axes to plot on. If None, uses current axes.
x_axis_limits (tuple, optional): (min, max) for x-axis. Use None for data-driven limits.

Returns:

plt.Axes: The matplotlib axes object.

Examples

Explore DNA content distribution before deciding on components:

import matplotlib.pyplot as plt

gmm = GMMThresholding(
    adata=adata,
    feature='DNA_content',
    label_obs_save_str='cell_cycle'
)
gmm.plot_feature_distribution_exploratory(
    hist_kwargs={'bins': 30, 'color': 'steelblue'},
    x_axis_limits=(0, 10)
)
plt.title('DNA Content Distribution - Exploratory')
plt.show()

Parameters:

hist_kwargs (Optional[Dict])
ax (Optional[Axes])
x_axis_limits (Optional[tuple])

Return type:

Axes

plot_feature_strip_plot_exploratory(hist_kwargs=None, strip_plot_kwargs=None, scatter_density=True, x_axis_limits=None)

Plot strip plot + histogram for exploratory analysis.

Similar to plot_strip_plot_histogram_with_decision_boundaries() but WITHOUT decision boundaries, for exploring data before running threshold operations.

Parameters:

hist_kwargs (dict, optional): Keyword arguments for histogram.
strip_plot_kwargs (dict, optional): Keyword arguments for strip plot.
scatter_density (bool, optional): If True, uses density-based coloring. Defaults to True.
x_axis_limits (tuple, optional): (min, max) for x-axis.

Returns:

tuple: (fig, (ax_strip, ax_hist)) - Figure and axes objects.

Examples

Explore DNA content distribution with density visualization:

import matplotlib.pyplot as plt

gmm = GMMThresholding(
    adata=adata,
    feature='DNA_content',
    label_obs_save_str='cell_cycle'
)
fig, (ax_strip, ax_hist) = gmm.plot_feature_strip_plot_exploratory(
    scatter_density=True,
    x_axis_limits=(0, 10)
)
plt.suptitle('DNA Content Distribution - Exploratory')
plt.show()

Parameters:

hist_kwargs (Optional[Dict])
strip_plot_kwargs (Optional[Dict])
scatter_density (bool)
x_axis_limits (Optional[tuple])

Return type:

tuple

plot_hist_distribution_with_boundaries(num_std=5, title=None, hist_kwargs=None, cmap=<matplotlib.colors.LinearSegmentedColormap object>, ax=None, x_axis_limits=None, resolution=1000, save_path=None)

Plot the histogram and GMM components with decision boundaries.

Parameters:

num_std (int, optional): Number of standard deviations for GMM plotting. Defaults to 5.
title (str, optional): Plot title. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.
hist_kwargs (dict, optional): Kwargs for histogram. Defaults to None.
cmap (plt.cm.ScalarMappable, optional): Colormap. Defaults to ‘rainbow’.
ax (plt.Axes, optional): Axes to plot on. If None, creates new. Defaults to None.
x_axis_limits (tuple, optional): X-axis limits (min, max). Defaults to None.
resolution (int, optional): Resolution for plotting. Defaults to 1000.
save_path (str or Path, optional): Path to save the figure. Parent directory must exist. Defaults to None.

Returns:

plt.Axes: The matplotlib axes object. Call plt.show() to display it.

Raises:

FileNotFoundError: If save_path parent directory doesn’t exist.
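
A plausible reading of num_std and resolution is that each component curve is drawn over a grid of resolution points spanning num_std standard deviations around the component means. The sketch below evaluates a mixture density on such a grid. That interpretation, the helper name `mixture_curve`, and the component parameters are all assumptions.

```python
import math


def mixture_curve(weights, means, stds, num_std=5, resolution=1000):
    """Evaluate the mixture density on an evenly spaced grid.

    The grid spans from the lowest (mean - num_std * std) to the
    highest (mean + num_std * std) across components.
    """
    lo = min(m - num_std * s for m, s in zip(means, stds))
    hi = max(m + num_std * s for m, s in zip(means, stds))
    step = (hi - lo) / (resolution - 1)
    xs = [lo + i * step for i in range(resolution)]
    ys = [
        sum(
            w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, means, stds)
        )
        for x in xs
    ]
    return xs, ys


# Hypothetical two-component mixture.
xs, ys = mixture_curve([0.6, 0.4], [1.0, 5.0], [0.5, 1.0])
print(len(xs), min(xs), max(xs))
```

A larger num_std widens the plotted range into the tails, and a larger resolution gives a smoother curve at the cost of more evaluations.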

Parameters:

num_std (int)
title (Optional[str])
hist_kwargs (Optional[Dict])
cmap (_ScalarMappable)
ax (Axes)
x_axis_limits (Optional[tuple])
resolution (int)
save_path (Union[str, Path, None])

Return type:

Axes

plot_strip_plot_histogram_with_decision_boundaries(cmap=<matplotlib.colors.ListedColormap object>, y_axis_limits=None, resolution=1000, scatter_density=True, vmax=None, hist_kwargs=None, strip_plot_kwargs=None, title=None)

Generate a strip plot with a histogram and decision boundaries.

This method wraps the base class implementation, providing a convenient interface for single-feature thresholding visualizations.

Parameters:

cmap (plt.cm.ScalarMappable, optional): Colormap for density or labels. Defaults to mpl.colormaps['plasma'].
y_axis_limits (tuple, optional): Y-axis limits (min, max). If None, uses data min/max. Defaults to None.
resolution (int, optional): Resolution for boundary plotting. Defaults to 1000.
scatter_density (bool, optional): If True, color by density; if False, color by labels. Defaults to True.
vmax (int or float, optional): Maximum density value for colormap. If None, auto-calculated. Defaults to None.
hist_kwargs (dict, optional): Kwargs for histogram (bins, color, etc.). Defaults to None.
strip_plot_kwargs (dict, optional): Kwargs for strip plot scatter (e.g., s, alpha, marker). Only used when scatter_density=False. Defaults to None.
title (str or None, optional): Title for the plot. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.

Returns:

Figure: The matplotlib figure object. Call plt.show() to display it.

Raises:

ValueError: If decision boundaries have not been calculated yet.

Parameters:

cmap (_ScalarMappable)
y_axis_limits (Optional[tuple])
resolution (int)
scatter_density (bool)
vmax (Union[int, float, None])
hist_kwargs (Optional[dict])
strip_plot_kwargs (Optional[dict])
title (Optional[str])

Return type:

Figure

return_adata(overwrite=False)

Serialize internal Pydantic models and return the modified AnnData object.

Parameters:

overwrite (bool, optional): If True, allows overwriting existing entries. Defaults to False.

Returns:

ad.AnnData: The modified AnnData object with the serialized GMM thresholding events.

Raises:

ValueError: If the feature already exists in adata.uns[thresholding_events_key] and overwrite is False. This prevents overwriting existing entries.
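
The overwrite guard behaves like a keyed store that refuses to replace an existing entry unless told to. The sketch below uses a plain dict in place of adata.uns; `store_event` and the event payload are hypothetical and only mirror the documented behaviour.

```python
def store_event(uns, events_key, feature, event, overwrite=False):
    """Store one feature's event under uns[events_key], guarding existing entries."""
    events = uns.setdefault(events_key, {})
    if feature in events and not overwrite:
        raise ValueError(
            f"{feature!r} already exists in uns[{events_key!r}]; "
            "pass overwrite=True to replace it"
        )
    events[feature] = event
    return uns


uns = {}
store_event(uns, 'gmm_thresholding_events', 'gene1', {'thresholds': [0.05]})
# A second write for the same feature needs overwrite=True.
store_event(uns, 'gmm_thresholding_events', 'gene1', {'thresholds': [0.07]}, overwrite=True)
print(uns)
```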

Parameters:

overwrite (bool)

Return type:

AnnData

return_thresholds()

Return the decision boundary thresholds.

Returns:

list of float: The decision boundary thresholds.

Raises:

ValueError: If decision boundaries have not been calculated yet.

Return type:

List[float]
