GMMThresholding
- class GMMThresholding(adata, feature, label_obs_save_str, thresholding_events_key='gmm_thresholding_events', layer=None, gmm_kwargs=None, random_state=42)
Bases: GaussianMixtureModelBase
A class to perform Gaussian Mixture Model (GMM) based thresholding on single-feature data.
This class fits a GMM to the distribution of a specified feature (e.g., gene expression), calculates decision boundaries between the components (either automatically based on probability changes or manually specified), assigns categorical labels to observations based on these boundaries, and provides plotting functionalities to visualize the results.
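To make "decision boundaries between the components" concrete, here is a minimal standalone sketch of one way such a boundary can be located for a 1-D mixture: scan a grid and record where the most probable component changes. It uses only the Python standard library. The function name `posterior_crossings` and its arguments are hypothetical, and this is not necessarily how GMMThresholding computes its boundaries.

```python
from statistics import NormalDist


def posterior_crossings(weights, means, stds, resolution=1000):
    """Return x positions where the most probable component changes.

    Hypothetical helper: scans an evenly spaced grid between the smallest
    and largest component mean and records the midpoint of every grid step
    at which the argmax of weight * pdf switches to another component.
    """
    components = [NormalDist(m, s) for m, s in zip(means, stds)]
    lo, hi = min(means), max(means)
    step = (hi - lo) / resolution
    boundaries = []
    prev_x, prev_k = None, None
    for i in range(resolution + 1):
        x = lo + i * step
        # Unnormalized posterior of each component at x
        scores = [w * c.pdf(x) for w, c in zip(weights, components)]
        k = scores.index(max(scores))
        if prev_k is not None and k != prev_k:
            boundaries.append((prev_x + x) / 2)
        prev_x, prev_k = x, k
    return boundaries


# Two equally weighted components with equal spread: the boundary
# falls roughly halfway between the two means.
boundaries = posterior_crossings([0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

Giving the first component a larger weight pushes the boundary toward the second mean, which is why a fitted boundary generally differs from the simple midpoint.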
- Attributes:
- adataad.AnnData
A copy of the input AnnData object, modified during processing.
- featurestr
The name of the feature (column in adata.var) being thresholded.
- gmm_obs_labelstr
The key in adata.obs where the resulting category labels will be stored.
- gmm_kwargsDict
Default keyword arguments passed to the sklearn.mixture.GaussianMixture model during fitting.
- internal_data_SingleThresholdingEventModel
Pydantic model storing all results for the current feature.
- manual_decision_boundariesbool
Flag indicating if thresholds were set manually.
- Parameters:
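Methods
- categorize_samples(): Categorize samples based on GMM-derived or manual thresholds.
- determine_optimal_components(): Determine the optimal number of GMM components for this feature.
- determine_optimal_number_components(): Determine the optimal number of components for the GMM.
- fit(): Fit a Gaussian Mixture Model (GMM) to the specified gene expression data.
- generate_thresholding_report(): Generate a human-readable report of all thresholding operations.
- plot_bayesian_information_criterion_curve(): Plot the BIC curve.
- plot_bic_curve(): Plot the Bayesian Information Criterion (BIC) curve.
- plot_feature_distribution_exploratory(): Plot histogram of the feature distribution for exploratory analysis.
- plot_feature_strip_plot_exploratory(): Plot strip plot + histogram for exploratory analysis.
- plot_hist_distribution_with_boundaries(): Plot the histogram and GMM components with decision boundaries.
- plot_strip_plot_histogram_with_decision_boundaries(): Generate a strip plot with a histogram and decision boundaries.
- return_adata(): Serialize internal Pydantic models and return the modified AnnData object.
- Return the decision boundary thresholds.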
- categorize_samples(manual_thresholds=None, ordered_labels=None, duplicate_labels=False)
Categorize samples based on GMM-derived or manual thresholds.
This method assigns categorical labels to samples based on either automatically calculated decision boundaries from GMM fitting or manually specified thresholds. Supports collapsing multiple GMM components into fewer categories for cross-dataset robustness.
- Parameters:
- manual_thresholdslist of float or int, optional
Explicit threshold values. If None, thresholds are calculated automatically from GMM probabilities. Must have length equal to (number of unique labels - 1).
- ordered_labelslist, optional
Label for each GMM component. Length must equal n_components fitted in GMM. Can contain duplicates if duplicate_labels=True to merge multiple components into single categories. If None, generates default labels.
- duplicate_labelsbool, optional
If True, allows duplicate labels to collapse multiple GMM components into single categories. Useful for cross-dataset robustness where many components are fitted for adaptive boundary placement but fewer final categories are desired. Defaults to False.
- Raises:
- ValueError
If GMM model has not been fitted first.
- ValueError
If number of ordered_labels doesn’t match n_components.
- ValueError
If duplicate labels provided but duplicate_labels=False.
- TypeError
If manual_thresholds is not a list.
- ValueError
If manual_thresholds length doesn’t match unique labels - 1.
- TypeError
If threshold values are not numeric.
Examples
Standard usage with 2 components and 2 labels:
    gmm.fit(n_components=2)
    gmm.categorize_samples(ordered_labels=['Low', 'High'])
Cross-dataset robustness - fit many components, collapse to binary:
    gmm.fit(n_components=8)
    gmm.categorize_samples(
        ordered_labels=['Low', 'Low', 'Low', 'High', 'High', 'High', 'High', 'High'],
        duplicate_labels=True
    )
    # Boundary automatically placed based on 8-component GMM fit
Using manual thresholds:
    gmm.fit(n_components=2)
    gmm.categorize_samples(
        ordered_labels=['Low', 'High'],
        manual_thresholds=[0.05]
    )
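The relationship between per-component labels, collapsed categories, and the "unique labels - 1" threshold count can be sketched in plain Python. The helpers `collapse_boundaries` and `assign_labels` below are hypothetical illustrations, not part of the documented API. The sketch assumes components are sorted by mean, that equal labels are adjacent, and that a value exactly on a threshold goes to the upper category.

```python
from bisect import bisect_right


def collapse_boundaries(component_boundaries, ordered_labels):
    """Keep only boundaries that separate components with different labels.

    component_boundaries[i] is assumed to separate component i from
    component i + 1, with components sorted by increasing mean.
    """
    if len(component_boundaries) != len(ordered_labels) - 1:
        raise ValueError("expected one boundary per pair of adjacent components")
    kept = []
    categories = [ordered_labels[0]]
    pairs = zip(component_boundaries, ordered_labels, ordered_labels[1:])
    for boundary, left, right in pairs:
        if left != right:  # the label changes here, so the boundary matters
            kept.append(boundary)
            categories.append(right)
    return kept, categories


def assign_labels(values, thresholds, categories):
    """Map each value to a category using ascending thresholds."""
    if len(thresholds) != len(categories) - 1:
        raise ValueError("thresholds must have length len(categories) - 1")
    # bisect_right counts how many thresholds are <= value
    return [categories[bisect_right(thresholds, v)] for v in values]


# Four components collapsed to two categories: only the boundary
# between the second and third component survives.
thresholds, categories = collapse_boundaries(
    [1.0, 2.0, 3.0], ["Low", "Low", "High", "High"]
)
labels = assign_labels([0.5, 2.0, 5.0], thresholds, categories)
```

With four components and two distinct labels, a single threshold remains, which matches the documented length requirement for manual_thresholds.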
- determine_optimal_components(component_range, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)
Determine the optimal number of GMM components for this feature.
This is a convenience wrapper that automatically uses the instance’s adata, feature, layer, and gmm_kwargs attributes.
- Parameters:
- component_rangeint
Maximum number of components to test.
- metricstr, optional
Metric to use for optimization (currently only ‘bic’ supported). Defaults to ‘bic’.
- curvestr, optional
Type of curve for knee detection (‘convex’ or ‘concave’). Defaults to ‘convex’.
- directionstr, optional
Direction of curve (‘decreasing’ or ‘increasing’). Defaults to ‘decreasing’.
- return_bic_listbool, optional
If True, returns tuple of (optimal_n, bic_list). Defaults to False.
- Returns:
- int or tuple
Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.
Examples
Find optimal number of components:
    gmm = GMMThresholding(
        adata=adata,
        feature='cycD1 (nuc median)',
        label_obs_save_str='cell_cycle'
    )
    optimal_n = gmm.determine_optimal_components(
        component_range=5
    )
    print(f'Optimal components: {optimal_n}')
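The curve and direction parameters suggest that the optimum is picked at the knee of the BIC curve. As a rough standalone sketch of that idea, the code below computes BIC for a 1-D mixture and picks the point that sags farthest below the straight line joining the first and last BIC values. The names `gmm_bic` and `knee_index` are hypothetical, and the chord heuristic is a stand-in; the source does not state which knee-detection method the library uses.

```python
import math


def gmm_bic(log_likelihood, n_components, n_samples):
    """BIC = k * ln(n) - 2 * ln(L) for a 1-D Gaussian mixture.

    A 1-D mixture with n components has n means, n variances and
    n - 1 free mixing weights, so k = 3n - 1.
    """
    n_params = 3 * n_components - 1
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def knee_index(values):
    """Index of the point farthest below the chord from first to last value.

    Intended for a convex, decreasing curve such as a typical BIC curve.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate a knee")
    first, last = values[0], values[-1]
    best_index, best_gap = 0, float("-inf")
    for i, y in enumerate(values):
        chord_y = first + (last - first) * i / (n - 1)
        gap = chord_y - y  # positive when the curve lies below the chord
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


# Illustrative BIC values for 1..5 components (made-up numbers)
component_counts = [1, 2, 3, 4, 5]
bic_values = [1000.0, 600.0, 420.0, 400.0, 390.0]
optimal_n = component_counts[knee_index(bic_values)]
```

Beyond the knee, extra components reduce BIC only marginally, so the knee is a reasonable compromise between fit and model size.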
- determine_optimal_number_components(adata, feature, component_range, layer=None, gmm_kwargs=None, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)
Determine the optimal number of components for the GMM.
- Parameters:
- adataad.AnnData
AnnData object containing the data.
- featurestr
Name of the feature to analyze.
- component_rangeint
Maximum number of components to test.
- layerstr or None, optional
Optional layer name to use instead of .X. Default is None.
- gmm_kwargsdict or None, optional
Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
- metricstr, optional
Metric to use for optimization (currently only ‘bic’ supported). Default is ‘bic’.
- curvestr, optional
Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
- directionstr, optional
Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
- return_bic_listbool, optional
If True, returns tuple of (optimal_n, bic_list). Default is False.
- Returns:
- int or tuple of (int, list of (int or float))
Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.
- fit(n_components)
Fit a Gaussian Mixture Model (GMM) to the specified gene expression data.
- Parameters:
- n_componentsint
Number of Gaussian components to fit. Must be > 1.
- Raises:
- TypeError
If n_components is not an integer.
- ValueError
If n_components is <= 1.
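The two documented failure modes can be mirrored by a small validator. This is only a sketch of the contract described above, with a hypothetical function name. Rejecting bool explicitly is an assumption added here, since bool is a subclass of int in Python.

```python
def validate_n_components(n_components):
    """Raise TypeError for non-integers and ValueError for values <= 1."""
    # Assumption: True/False are treated as invalid even though
    # isinstance(True, int) is True in Python.
    if isinstance(n_components, bool) or not isinstance(n_components, int):
        raise TypeError(
            f"n_components must be an integer, got {type(n_components).__name__}"
        )
    if n_components <= 1:
        raise ValueError(f"n_components must be > 1, got {n_components}")
    return n_components
```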
- generate_thresholding_report(output_format='text')
Generate a human-readable report of all thresholding operations.
Reads thresholding metadata from adata.uns and creates a summary showing:
- Operation names and order
- Features used
- Number of components
- Thresholds calculated
- Labels assigned
- Parent operations (for refinements)
- Cell counts per category (captured at operation time)
Note: Cell counts reflect the state immediately after each operation was performed, not the current state of the data. This is important because subsequent refinement operations may change labels, but the historical counts are preserved.
- Parameters:
- output_formatstr, default ‘text’
‘text’ for formatted string, ‘dataframe’ for pandas DataFrame.
- Returns:
- Union[str, pd.DataFrame]
Formatted report string or DataFrame.
- Raises:
- KeyError
If thresholding_events_key doesn’t exist in adata.uns.
- ValueError
If output_format is not ‘text’ or ‘dataframe’.
- TypeError
If adata.uns[thresholding_events_key] is not a dict.
Examples
Generate text report:
    >>> gmm = GMMThresholding(adata, feature='gene1', label_obs_save_str='gene1_cat')
    >>> gmm.fit(n_components=2)
    >>> gmm.categorize_samples(ordered_labels=['Low', 'High'])
    >>> report = gmm.generate_thresholding_report()
    >>> print(report)
    Thresholding Report
    ==================================================
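The note about counts being captured at operation time can be illustrated with a plain dictionary. The record layout, the helper names `record_event` and `text_report`, and the report line format below are all hypothetical; the real metadata stored in adata.uns may look quite different.

```python
from collections import Counter


def record_event(events, name, feature, labels):
    """Store a hypothetical event record, snapshotting counts right now."""
    events[name] = {
        "feature": feature,
        # Counter copies the tallies, so later label edits do not leak in
        "cell_counts": dict(Counter(labels)),
    }


def text_report(events):
    """Render the hypothetical event records as a plain-text summary."""
    if not isinstance(events, dict):
        raise TypeError("events must be a dict")
    lines = ["Thresholding Report", "=" * 50]
    for order, (name, event) in enumerate(events.items(), start=1):
        counts = ", ".join(
            f"{label}={count}"
            for label, count in sorted(event["cell_counts"].items())
        )
        lines.append(f"{order}. {name} (feature: {event['feature']}): {counts}")
    return "\n".join(lines)


events = {}
labels = ["Low", "High", "Low"]
record_event(events, "gene1_cat", "gene1", labels)
labels[0] = "High"  # a later refinement changes a label
report = text_report(events)  # still reports the counts from operation time
```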
- plot_bayesian_information_criterion_curve(adata, feature, component_range, layer=None, gmm_kwargs=None, curve='convex', direction='decreasing', ax=None, save_path=None)
Plot the BIC curve.
- Parameters:
- adataad.AnnData
AnnData object containing the data.
- featurestr
Name of the feature to analyze.
- component_rangeint
Maximum number of components to test.
- layerstr or None, optional
Optional layer name to use instead of .X. Default is None.
- gmm_kwargsdict or None, optional
Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
- curvestr, optional
Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
- directionstr, optional
Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
- axmatplotlib.pyplot.Axes or None, optional
Optional matplotlib axes to plot on. If None, creates new figure. Default is None.
- save_pathstr or Path or None, optional
Optional path to save the figure. Parent directory must exist. Default is None.
- Raises:
- FileNotFoundError
If save_path parent directory doesn’t exist.
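Because the parent directory of save_path must already exist, it can help to check the path before plotting. The helper below is a hypothetical sketch of such a check using pathlib; it does not create directories and is not taken from the library.

```python
from pathlib import Path


def check_save_path(save_path):
    """Return save_path as a Path, or raise if its parent folder is missing."""
    if save_path is None:
        return None
    path = Path(save_path)
    if not path.parent.is_dir():
        raise FileNotFoundError(f"Parent directory does not exist: {path.parent}")
    return path
```

If the folder may be missing, calling `Path(save_path).parent.mkdir(parents=True, exist_ok=True)` beforehand avoids the error.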
- plot_bic_curve(component_range, curve='convex', direction='decreasing', ax=None, save_path=None)
Plot the Bayesian Information Criterion (BIC) curve.
This is a convenience wrapper that automatically uses the instance’s adata, feature, layer, and gmm_kwargs attributes.
- Parameters:
- component_rangeint
Maximum number of components to test.
- curvestr, optional
Type of curve for knee detection (‘convex’ or ‘concave’). Defaults to ‘convex’.
- directionstr, optional
Direction of curve (‘decreasing’ or ‘increasing’). Defaults to ‘decreasing’.
- axplt.Axes, optional
Matplotlib axes to plot on. If None, creates new figure. Defaults to None.
- save_pathstr or Path, optional
Path to save the figure. Parent directory must exist. Defaults to None.
- Raises:
- FileNotFoundError
If save_path parent directory doesn’t exist.
Examples
Plot BIC curve to determine optimal components:
    import matplotlib.pyplot as plt

    gmm = GMMThresholding(
        adata=adata,
        feature='cycD1 (nuc median)',
        label_obs_save_str='cell_cycle'
    )
    gmm.plot_bic_curve(component_range=5)
    plt.show()
- plot_feature_distribution_exploratory(hist_kwargs=None, ax=None, x_axis_limits=None)
Plot histogram of the feature distribution for exploratory analysis.
This method allows you to visualize the feature distribution WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.
- Parameters:
- hist_kwargsdict, optional
Keyword arguments for plt.hist(). Defaults to {‘bins’: 50, ‘color’: ‘black’, ‘alpha’: 0.7}.
- axplt.Axes, optional
Matplotlib axes to plot on. If None, uses current axes.
- x_axis_limitstuple, optional
(min, max) for x-axis. Use None for data-driven limits.
- Returns:
- plt.Axes
The matplotlib axes object.
Examples
Explore DNA content distribution before deciding on components:
    import matplotlib.pyplot as plt

    gmm = GMMThresholding(
        adata=adata,
        feature='DNA_content',
        label_obs_save_str='cell_cycle'
    )
    gmm.plot_feature_distribution_exploratory(
        hist_kwargs={'bins': 30, 'color': 'steelblue'},
        x_axis_limits=(0, 10)
    )
    plt.title('DNA Content Distribution - Exploratory')
    plt.show()
- plot_feature_strip_plot_exploratory(hist_kwargs=None, strip_plot_kwargs=None, scatter_density=True, x_axis_limits=None)
Plot strip plot + histogram for exploratory analysis.
Similar to plot_strip_plot_histogram_with_decision_boundaries() but WITHOUT decision boundaries, for exploring data before running threshold operations.
- Parameters:
- hist_kwargsdict, optional
Keyword arguments for histogram.
- strip_plot_kwargsdict, optional
Keyword arguments for strip plot.
- scatter_densitybool, optional
If True, uses density-based coloring. Defaults to True.
- x_axis_limitstuple, optional
(min, max) for x-axis.
- Returns:
- tuple
(fig, (ax_strip, ax_hist)) - Figure and axes objects.
Examples
Explore DNA content distribution with density visualization:
    import matplotlib.pyplot as plt

    gmm = GMMThresholding(
        adata=adata,
        feature='DNA_content',
        label_obs_save_str='cell_cycle'
    )
    fig, (ax_strip, ax_hist) = gmm.plot_feature_strip_plot_exploratory(
        scatter_density=True,
        x_axis_limits=(0, 10)
    )
    plt.suptitle('DNA Content Distribution - Exploratory')
    plt.show()
- plot_hist_distribution_with_boundaries(num_std=5, title=None, hist_kwargs=None, cmap=<matplotlib.colors.LinearSegmentedColormap object>, ax=None, x_axis_limits=None, resolution=1000, save_path=None)
Plot the histogram and GMM components with decision boundaries.
- Parameters:
- num_stdint, optional
Number of standard deviations for GMM plotting. Defaults to 5.
- titlestr, optional
Plot title. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.
- hist_kwargsdict, optional
Kwargs for histogram. Defaults to None.
- cmapplt.cm.ScalarMappable, optional
Colormap. Defaults to ‘rainbow’.
- axplt.Axes, optional
Axes to plot on. If None, creates new. Defaults to None.
- x_axis_limitstuple, optional
X-axis limits (min, max). Defaults to None.
- resolutionint, optional
Resolution for plotting. Defaults to 1000.
- save_pathstr or Path, optional
Path to save the figure. Parent directory must exist. Defaults to None.
- Returns:
- plt.Axes
The matplotlib axes object. Call plt.show() to display it.
- Raises:
- FileNotFoundError
If save_path parent directory doesn’t exist.
- plot_strip_plot_histogram_with_decision_boundaries(cmap=<matplotlib.colors.ListedColormap object>, y_axis_limits=None, resolution=1000, scatter_density=True, vmax=None, hist_kwargs=None, strip_plot_kwargs=None, title=None)
Generate a strip plot with a histogram and decision boundaries.
This method wraps the base class implementation, providing a convenient interface for single-feature thresholding visualizations.
- Parameters:
- cmapplt.cm.ScalarMappable, optional
Colormap for density or labels. Defaults to mpl.colormaps[‘plasma’].
- y_axis_limitstuple, optional
Y-axis limits (min, max). If None, uses data min/max. Defaults to None.
- resolutionint, optional
Resolution for boundary plotting. Defaults to 1000.
- scatter_densitybool, optional
If True, color by density; if False, color by labels. Defaults to True.
- vmaxint or float, optional
Maximum density value for colormap. If None, auto-calculated. Defaults to None.
- hist_kwargsdict, optional
Kwargs for histogram (bins, color, etc.). Defaults to None.
- strip_plot_kwargsdict, optional
Kwargs for strip plot scatter (e.g., s, alpha, marker). Only used when scatter_density=False. Defaults to None.
- titlestr or None, optional
Title for the plot. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.
- Returns:
- Figure
The matplotlib figure object. Call plt.show() to display it.
- Raises:
- ValueError
If decision boundaries have not been calculated yet.
- return_adata(overwrite=False)
Serialize internal Pydantic models and return the modified AnnData object.
- Parameters:
- overwritebool, optional
If True, allows overwriting existing entries. Defaults to False.
- Returns:
- ad.AnnData
The modified AnnData object with the serialized GMM thresholding events.
- Raises:
- ValueError
If the feature already exists in adata.uns[thresholding_events_key] and overwrite is False.
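The overwrite guard described above follows a common pattern, sketched here with a plain dict standing in for adata.uns. The function `store_event` and the shape of the stored record are hypothetical; only the raise-unless-overwrite behavior is taken from the documentation.

```python
def store_event(uns, events_key, feature, event, overwrite=False):
    """Store an event under uns[events_key][feature], guarding existing entries."""
    events = uns.setdefault(events_key, {})
    if feature in events and not overwrite:
        raise ValueError(
            f"'{feature}' already exists in uns['{events_key}']; "
            "pass overwrite=True to replace it"
        )
    events[feature] = event
    return uns


uns = {}
store_event(uns, "gmm_thresholding_events", "gene1", {"n_components": 2})
# A second call for the same feature needs overwrite=True
store_event(
    uns, "gmm_thresholding_events", "gene1", {"n_components": 3}, overwrite=True
)
```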