SequentialGMM

class SequentialGMM(adata, thresholding_events_key='sequential_gmm_thresholding_events', gmm_kwargs=None, random_state=42)

Bases: GaussianMixtureModelBase

Sequential GMM thresholding for iterative population refinement.

This class performs multiple sequential GMM thresholding operations on subsets of cells, where each operation refines a specific categorical label produced by a previous thresholding event. This is useful for hierarchical cell type classification or iterative gating strategies.

Unlike GMMThresholding, which thresholds a single feature once, this class allows:

  • Initial thresholding on entire dataset

  • Refinement of specific label values through additional thresholding

  • Tracking operation provenance (parent-child relationships)

  • Multiple operations stored in a single .uns key
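
The capabilities listed above can be illustrated without the library. The following is a minimal sketch in plain Python, not the class's implementation: the cut-off values, the `events` dictionary layout, and the variable names are assumptions chosen for illustration, with lists standing in for adata.obs and a dict standing in for the single .uns entry.

```python
# Toy illustration of sequential thresholding with provenance tracking.
# Plain lists stand in for adata.obs columns; the cut-offs are made up.

dna = [1.0, 1.2, 3.9, 4.1, 0.9]    # hypothetical feature for the initial split
plk1 = [0.2, 2.5, 0.1, 3.0, 2.8]   # hypothetical feature for the refinement

events = {}  # stand-in for the single .uns entry that records every operation

# Step 1: initial thresholding on the entire dataset.
labels = ["Low" if v < 2.5 else "High" for v in dna]
events["DNA_threshold"] = {"feature": "DNA_content", "parent": None}

# Step 2: refine only the cells currently labelled 'Low'; others are untouched.
labels = [
    ("Low_neg" if p < 1.0 else "Low_pos") if lab == "Low" else lab
    for lab, p in zip(labels, plk1)
]
events["Plk1_refinement"] = {
    "feature": "Plk1",
    "parent": "DNA_threshold",  # parent-child link between operations
    "refined_value": "Low",
}

print(labels)  # ['Low_neg', 'Low_pos', 'High', 'High', 'Low_pos']
```

Only the cells that carried the refined label change; the 'High' cells keep the label from the first operation.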

Attributes:
adata : ad.AnnData

A copy of the input AnnData object, modified during processing.

thresholding_events_key : str

Key in adata.uns for storing all operations.

gmm_kwargs : Dict

Default GMM kwargs (can be overridden per operation).

random_state : int

Random state for reproducibility.

Examples

Example workflow:

# Initialize
seq_gmm = SequentialGMM(
    adata=adata,
    thresholding_events_key='sequential_thresholding'
)

# Create initial labels on entire dataset
seq_gmm.threshold_entire_dataset(
    feature='DNA_content',
    label_obs_save_str='cell_cycle',
    n_components=2,
    ordered_labels=['Low', 'High'],
    operation_name='DNA_threshold'
)

# Refine 'Low' cells only
seq_gmm.refine_labels_with_gmm(
    feature='Plk1',
    obs_label='cell_cycle',
    value_to_refine='Low',
    n_components=2,
    ordered_labels=['Low_neg', 'Low_pos'],
    operation_name='Plk1_refinement'
)

# Get modified adata
adata = seq_gmm.return_adata()

Methods

determine_optimal_number_components

Determine the optimal number of components for the GMM.

generate_thresholding_report

Generate a human-readable report of all thresholding operations.

plot_bayesian_information_criterion_curve

Plot the BIC curve.

plot_feature_distribution_exploratory

Plot histogram of a feature distribution for exploratory analysis.

plot_feature_strip_plot_exploratory

Plot strip plot + histogram for exploratory analysis.

plot_hist_distribution_with_boundaries

Plot histogram with boundaries for a specific operation.

plot_strip_plot_histogram_with_decision_boundaries

Plot 1D strip plot with histogram and decision boundaries for a specific operation.

refine_labels_with_gmm

Refine existing categorical labels by thresholding a subset with GMM.

refine_labels_with_manual_thresholds

Refine existing categorical labels using manual thresholds.

return_adata

Return the modified AnnData object.

threshold_entire_dataset

Threshold entire dataset to create initial categorical labels.

determine_optimal_number_components(adata, feature, component_range, layer=None, gmm_kwargs=None, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)

Determine the optimal number of components for the GMM.

Parameters:
adata : ad.AnnData

AnnData object containing the data.

feature : str

Name of the feature to analyze.

component_range : int

Maximum number of components to test.

layer : str or None, optional

Optional layer name to use instead of .X. Default is None.

gmm_kwargs : dict or None, optional

Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.

metric : str, optional

Metric to use for optimization (currently only ‘bic’ supported). Default is ‘bic’.

curve : str, optional

Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.

direction : str, optional

Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.

return_bic_list : bool, optional

If True, returns tuple of (optimal_n, bic_list). Default is False.

Returns:
int or tuple of (int, list of (int or float))

Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.

Return type:

Union[int, Tuple[int, List[Union[int, float]]]]
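
The curve and direction arguments indicate that the optimum is picked at the knee of the BIC curve, but the detection routine itself is not described here. As a stand-in, the sketch below locates the knee of a convex, decreasing curve as the point lying furthest below the chord between the first and last values. The function name and the BIC values are hypothetical, and it assumes the candidates are 1 through component_range.

```python
def knee_of_decreasing_curve(values):
    """Return the 1-based position where a convex, decreasing curve bends most.

    The knee is taken as the point lying furthest below the straight line
    that joins the first and last points.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate a knee")
    first, last = values[0], values[-1]
    best_index, best_gap = 0, float("-inf")
    for i, v in enumerate(values):
        chord = first + (last - first) * i / (n - 1)  # chord height at position i
        gap = chord - v                                # how far the curve dips below it
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index + 1  # component counts start at 1


# Hypothetical BIC values for 1..6 components (lower is better).
bic_list = [1000.0, 600.0, 400.0, 380.0, 370.0, 365.0]
optimal_n = knee_of_decreasing_curve(bic_list)
print(optimal_n)  # 3
```

Past three components the BIC barely improves, which is why the knee lands there rather than at the global minimum.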

generate_thresholding_report(output_format='text')

Generate a human-readable report of all thresholding operations.

Reads thresholding metadata from adata.uns and creates a summary showing:

  • Operation names and order

  • Features used

  • Number of components

  • Thresholds calculated

  • Labels assigned

  • Parent operations (for refinements)

  • Cell counts per category (captured at operation time)

Note: Cell counts reflect the state immediately after each operation was performed, not the current state of the data. This is important because subsequent refinement operations may change labels, but the historical counts are preserved.
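
The difference between historical and current counts can be seen in a small sketch. It uses a plain dict as a stand-in for the stored metadata; the layout is hypothetical, and what exactly is counted for a refinement operation is an assumption here.

```python
from collections import Counter

labels = ["Low", "Low", "High", "High", "Low"]

# Operation 1: record the counts at the moment the operation finishes.
report = {"DNA_threshold": dict(Counter(labels))}

# Operation 2 relabels the 'Low' cells, which changes the current state...
labels = ["Low_neg", "Low_pos", "High", "High", "Low_pos"]
report["Plk1_refinement"] = dict(Counter(labels))

# ...but the snapshot stored for operation 1 is unchanged.
print(report["DNA_threshold"])  # {'Low': 3, 'High': 2}
print(dict(Counter(labels)))    # {'Low_neg': 1, 'Low_pos': 2, 'High': 2}
```

The first snapshot still lists three 'Low' cells even though no cell carries that label any more.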

Parameters:
output_format : str, default ‘text’

‘text’ for formatted string, ‘dataframe’ for pandas DataFrame.

Returns:
Union[str, pd.DataFrame]

Formatted report string or DataFrame.

Raises:
KeyError

If thresholding_events_key doesn’t exist in adata.uns.

ValueError

If output_format is not ‘text’ or ‘dataframe’.

TypeError

If adata.uns[thresholding_events_key] is not a dict.

Examples

Generate text report:

>>> gmm = GMMThresholding(adata, feature='gene1', label_obs_save_str='gene1_cat')
>>> gmm.fit(n_components=2)
>>> gmm.categorize_samples(['Low', 'High'])
>>> report = gmm.generate_thresholding_report()
>>> print(report)
Thresholding Report
==================================================

Return type:

Union[str, DataFrame]

plot_bayesian_information_criterion_curve(adata, feature, component_range, layer=None, gmm_kwargs=None, curve='convex', direction='decreasing', ax=None, save_path=None)

Plot the BIC curve.

Parameters:
adata : ad.AnnData

AnnData object containing the data.

feature : str

Name of the feature to analyze.

component_range : int

Maximum number of components to test.

layer : str or None, optional

Optional layer name to use instead of .X. Default is None.

gmm_kwargs : dict or None, optional

Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.

curve : str, optional

Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.

direction : str, optional

Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.

ax : matplotlib.pyplot.Axes or None, optional

Optional matplotlib axes to plot on. If None, creates new figure. Default is None.

save_path : str or Path or None, optional

Optional path to save the figure. Parent directory must exist. Default is None.

Raises:
FileNotFoundError

If save_path parent directory doesn’t exist.

Return type:

None

plot_feature_distribution_exploratory(feature, obs_label=None, value_to_subset=None, layer=None, hist_kwargs=None, ax=None, x_axis_limits=None)

Plot histogram of a feature distribution for exploratory analysis.

This method allows you to visualize feature distributions WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.

Parameters:
feature : str

Feature name to plot (must exist in adata.var_names).

obs_label : Optional[str], optional

Obs column to use for subsetting. If provided with value_to_subset, only plots cells with that label value. If None, plots all cells. Defaults to None.

value_to_subset : Optional[str], optional

Specific label value to plot. Requires obs_label to be specified. If None, plots all cells (or all cells in obs_label if provided). Defaults to None.

layer : Optional[str], optional

Layer to use for data. If None, uses adata.X. Defaults to None.

hist_kwargs : Optional[Dict], optional

Keyword arguments for plt.hist(). Defaults to {‘bins’: 50, ‘color’: ‘black’, ‘alpha’: 0.7}.

ax : Optional[plt.Axes], optional

Matplotlib axes to plot on. If None, uses current axes. Defaults to None.

x_axis_limits : Optional[tuple], optional

(min, max) for x-axis. Use None for data-driven limits. Defaults to None.

Returns:
Axes

The matplotlib axes object.

Raises:
ValueError

If value_to_subset is provided without obs_label.

KeyError

If obs_label doesn’t exist in adata.obs.

ValueError

If value_to_subset is not present in adata.obs[obs_label].
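
The three error conditions above amount to a small validation routine. The sketch below reproduces them on a plain dict standing in for adata.obs; the function name is hypothetical and the order in which the real method performs these checks is an assumption.

```python
def select_subset(obs, obs_label=None, value_to_subset=None):
    """Return the row indices to plot, mirroring the documented error cases.

    `obs` is a plain dict of column name -> list of labels, standing in
    for adata.obs.
    """
    if value_to_subset is not None and obs_label is None:
        raise ValueError("value_to_subset requires obs_label")
    n_rows = len(next(iter(obs.values()))) if obs else 0
    if obs_label is None:
        return list(range(n_rows))          # no subsetting requested
    if obs_label not in obs:
        raise KeyError(obs_label)
    if value_to_subset is None:
        return list(range(n_rows))          # column given, but no value to filter on
    if value_to_subset not in obs[obs_label]:
        raise ValueError(f"{value_to_subset!r} not found in obs[{obs_label!r}]")
    return [i for i, lab in enumerate(obs[obs_label]) if lab == value_to_subset]


obs = {"cell_cycle_phase": ["G1/S/G2", "M", "G1/S/G2"]}
print(select_subset(obs))                                 # [0, 1, 2]
print(select_subset(obs, "cell_cycle_phase", "G1/S/G2"))  # [0, 2]
```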

Examples

Explore entire dataset:

seq_gmm.plot_feature_distribution_exploratory(
    feature='Int_Intg_DNA_nuc',
    hist_kwargs={'bins': 30, 'color': 'steelblue'},
    x_axis_limits=(5, 15)
)
plt.title('DNA Content Distribution - All Cells')
plt.show()

Explore specific subset:

seq_gmm.plot_feature_distribution_exploratory(
    feature='Int_Intg_DNA_nuc',
    obs_label='cell_cycle_phase',
    value_to_subset='G1/S/G2',
    hist_kwargs={'bins': 30, 'color': 'steelblue'},
    x_axis_limits=(5, 15)
)
plt.title('DNA Content in G1/S/G2 Cells - Exploratory')
plt.show()

Return type:

Axes

plot_feature_strip_plot_exploratory(feature, obs_label=None, value_to_subset=None, layer=None, hist_kwargs=None, strip_plot_kwargs=None, scatter_density=True, x_axis_limits=None, vmax=None)

Plot strip plot + histogram for exploratory analysis.

Similar to plot_strip_plot_histogram_with_decision_boundaries() but WITHOUT decision boundaries, for exploring data before running threshold operations.

This method allows you to visualize feature distributions WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.

Parameters:
feature : str

Feature name to plot (must exist in adata.var_names).

obs_label : Optional[str], optional

Obs column to use for subsetting. If provided with value_to_subset, only plots cells with that label value. If None, plots all cells. Defaults to None.

value_to_subset : Optional[str], optional

Specific label value to plot. Requires obs_label to be specified. If None, plots all cells (or all cells in obs_label if provided). Defaults to None.

layer : Optional[str], optional

Layer to use for data. If None, uses adata.X. Defaults to None.

hist_kwargs : Optional[Dict], optional

Keyword arguments for histogram. Defaults to None.

strip_plot_kwargs : Optional[Dict], optional

Keyword arguments for strip plot. Only used when scatter_density=False. Defaults to None.

scatter_density : bool, optional

If True, uses density-based coloring. If False, uses uniform scatter plot. Defaults to True.

x_axis_limits : Optional[tuple], optional

(min, max) for x-axis. Use None for data-driven limits. Defaults to None.

vmax : Optional[Union[int, float]], optional

Maximum density value for colormap. Only used when scatter_density=True. If None, auto-calculated. Defaults to None.

Returns:
tuple

(fig, (ax_strip, ax_hist)) - Figure and axes objects.

Raises:
ValueError

If value_to_subset is provided without obs_label.

KeyError

If obs_label doesn’t exist in adata.obs.

ValueError

If value_to_subset is not present in adata.obs[obs_label].

Examples

Explore entire dataset:

fig, (ax_strip, ax_hist) = seq_gmm.plot_feature_strip_plot_exploratory(
    feature='Int_Intg_DNA_nuc',
    scatter_density=True,
    x_axis_limits=(5, 15)
)
plt.suptitle('DNA Content Distribution - All Cells')
plt.show()

Explore specific subset:

fig, (ax_strip, ax_hist) = seq_gmm.plot_feature_strip_plot_exploratory(
    feature='Int_Intg_DNA_nuc',
    obs_label='cell_cycle_phase',
    value_to_subset='G1/S/G2',
    scatter_density=True,
    x_axis_limits=(5, 15)
)
plt.suptitle('DNA Content Distribution - G1/S/G2 Cells')
plt.show()

Return type:

tuple

plot_hist_distribution_with_boundaries(operation_name, num_std=5, title=None, hist_kwargs=None, cmap=None, ax=None, x_axis_limits=None, resolution=1000, save_path=None)

Plot histogram with boundaries for a specific operation.

Parameters:
operation_name : str

Name of operation to plot (from .uns keys).

num_std : int, optional

Number of standard deviations for GMM plotting. Defaults to 5.

title : Optional[str], optional

Plot title. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.

hist_kwargs : Optional[Dict], optional

Kwargs for histogram. Defaults to None.

cmap : plt.cm.ScalarMappable, optional

Colormap. Defaults to ‘rainbow’.

ax : plt.Axes, optional

Axes to plot on. If None, creates new. Defaults to None.

x_axis_limits : Optional[tuple], optional

X-axis limits (min, max). Defaults to None.

resolution : int, optional

Resolution for plotting. Defaults to 1000.

save_path : Optional[Union[str, Path]], optional

Path to save the figure. Parent directory must exist. Defaults to None.

Returns:
Axes

The matplotlib axes object. Call plt.show() to display it.

Raises:
KeyError

If operation_name not found in .uns.

ValueError

If operation has no decision boundaries.

ValueError

If resolution <= 0 or resolution <= n_components.

FileNotFoundError

If save_path parent directory doesn’t exist.

Examples

ax = seq_gmm.plot_hist_distribution_with_boundaries('Plk1_refinement')
plt.show()

Return type:

Axes

plot_strip_plot_histogram_with_decision_boundaries(operation_name, cmap=None, y_axis_limits=None, resolution=1000, scatter_density=True, vmax=None, hist_kwargs=None, strip_plot_kwargs=None, title=None)

Plot 1D strip plot with histogram and decision boundaries for a specific operation.

This method wraps the base class implementation to provide visualization for sequential thresholding operations. It creates a density strip plot (or label-colored scatter) alongside a horizontal histogram showing the distribution and decision boundaries for the specified operation.

Parameters:
operation_name : str

Name of operation to plot (from .uns keys).

cmap : plt.cm.ScalarMappable, optional

Colormap for density or labels. Defaults to mpl.colormaps[‘plasma’].

y_axis_limits : Optional[Tuple[float, float]], optional

Y-axis limits (min, max). If None, uses data min/max. Defaults to None.

resolution : int, optional

Resolution for boundary plotting. Defaults to 1000.

scatter_density : bool, optional

If True, color by density; if False, color by labels. Defaults to True.

vmax : Optional[Union[int, float]], optional

Maximum density value for colormap. If None, auto-calculated. Defaults to None.

hist_kwargs : Optional[Dict], optional

Kwargs for histogram (bins, color, etc.). Defaults to None.

strip_plot_kwargs : Optional[Dict], optional

Kwargs for strip plot scatter (e.g., s, alpha, marker). Only used when scatter_density=False. Defaults to None.

title : Optional[str], optional

Title for the plot. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.

Returns:
Figure

The matplotlib figure object. Call plt.show() to display it.

Raises:
KeyError

If operation_name not found in .uns.

ValueError

If operation has no decision boundaries.

Examples

Basic usage with label-colored scatter:

fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
    operation_name='separate_M_phase',
    scatter_density=False
)
plt.show()

Custom title:

fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
    operation_name='separate_M_phase',
    scatter_density=False,
    title='M Phase Separation'
)
plt.show()

Customize strip plot appearance:

fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
    operation_name='separate_M_phase',
    scatter_density=False,
    strip_plot_kwargs={'s': 5, 'alpha': 0.8, 'marker': 'o'}
)
plt.show()

Return type:

Figure

refine_labels_with_gmm(feature, obs_label, value_to_refine, n_components, ordered_labels, duplicate_labels=False, operation_name=None, layer=None, gmm_kwargs=None, overwrite=False)

Refine existing categorical labels by thresholding a subset with GMM.

Modifies adata.obs[obs_label] in-place (within the copy held by this object): cells currently labelled value_to_refine are assigned new labels based on GMM thresholding of the chosen feature.

Parameters:
feature : str

Feature to threshold on (e.g., ‘Plk1’).

obs_label : str

Obs column to modify in-place (e.g., ‘cell_cycle’).

value_to_refine : str

Which label value to refine (e.g., ‘G0’).

n_components : int

Number of GMM components to fit.

ordered_labels : List[str]

New labels to assign (e.g., [‘G0_low’, ‘G0_high’]).

duplicate_labels : bool, optional

Allow duplicate labels for label collapsing. Defaults to False.

operation_name : str

Required name for tracking this operation.

layer : Optional[str], optional

Layer to use for data access. If None, uses adata.X. Defaults to None.

gmm_kwargs : Optional[dict], optional

GMM kwargs for this operation. Overrides default if provided. Defaults to None.

overwrite : bool, optional

If True, allows overwriting an existing operation with the same name. Useful for updating n_components or thresholds. Defaults to False.

Raises:
ValueError

If operation_name is None or empty.

KeyError

If operation_name already exists in .uns and overwrite=False.

KeyError

If obs_label doesn’t exist in adata.obs.

ValueError

If value_to_refine is not present in adata.obs[obs_label].

ValueError

If no cells have the value_to_refine.
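
The duplicate_labels option is described only as allowing label collapsing. One plausible reading, sketched below, is that several fitted components may share a label so that they merge into one category. The helper name is hypothetical, and raising an error when labels repeat without the flag is an assumption, not documented behaviour.

```python
def assign_labels(component_ids, ordered_labels, duplicate_labels=False):
    """Translate ordered component indices (0 = lowest) into labels."""
    has_repeats = len(set(ordered_labels)) != len(ordered_labels)
    if has_repeats and not duplicate_labels:
        # Assumed guard: repeats are rejected unless explicitly allowed.
        raise ValueError("repeated labels require duplicate_labels=True")
    return [ordered_labels[i] for i in component_ids]


# Three fitted components, but the two upper ones are merged into one label.
component_ids = [0, 2, 1, 0, 2]
merged = assign_labels(component_ids, ["neg", "pos", "pos"], duplicate_labels=True)
print(merged)  # ['neg', 'pos', 'pos', 'neg', 'pos']
```

Fitting three components and collapsing two of them can help when the upper population is not well described by a single Gaussian.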

Examples

# Before: adata.obs['cell_cycle'] = ['G0', 'G0', 'G1', 'S', 'G0']
seq_gmm.refine_labels_with_gmm(
    feature='Plk1',
    obs_label='cell_cycle',
    value_to_refine='G0',
    n_components=2,
    ordered_labels=['G0_low', 'G0_high'],
    operation_name='Plk1_G0_refinement'
)
# After: adata.obs['cell_cycle'] = ['G0_low', 'G0_high', 'G1', 'S', 'G0_low']

Return type:

None

refine_labels_with_manual_thresholds(feature, obs_label, value_to_refine, manual_thresholds, ordered_labels, operation_name=None, layer=None, overwrite=False)

Refine existing categorical labels using manual thresholds.

Similar to refine_labels_with_gmm() but uses explicit threshold values instead of fitting a GMM.

Parameters:
feature : str

Feature to threshold on.

obs_label : str

Obs column to modify in-place.

value_to_refine : str

Which label value to refine.

manual_thresholds : List[Union[float, int]]

Threshold values. Length must be len(ordered_labels) - 1.

ordered_labels : List[str]

New labels to assign.

operation_name : str

Required name for tracking this operation.

layer : Optional[str], optional

Layer to use for data access. If None, uses adata.X. Defaults to None.

overwrite : bool, optional

If True, allows overwriting an existing operation with the same name. Useful for updating thresholds. Defaults to False.

Raises:
ValueError

If operation_name is None or empty.

KeyError

If operation_name already exists in .uns and overwrite=False.

KeyError

If obs_label doesn’t exist in adata.obs.

ValueError

If value_to_refine is not present in adata.obs[obs_label].

ValueError

If no cells have the value_to_refine.

ValueError

If len(manual_thresholds) != len(ordered_labels) - 1.
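
The length rule follows from the fact that k thresholds cut the value axis into k + 1 intervals, one per label. A minimal sketch of that mapping is shown below. The helper name and the sample values are hypothetical, and sending a value that equals a threshold to the upper interval is an assumption about the boundary convention.

```python
from bisect import bisect_right


def labels_from_thresholds(values, manual_thresholds, ordered_labels):
    """Map each value to the label of the interval it falls into."""
    if len(manual_thresholds) != len(ordered_labels) - 1:
        raise ValueError(
            "manual_thresholds must have len(ordered_labels) - 1 entries"
        )
    cuts = sorted(manual_thresholds)
    # bisect_right counts how many cut points are <= value, which is the
    # index of the interval (and therefore of the label) for that value.
    return [ordered_labels[bisect_right(cuts, v)] for v in values]


plk1_in_g0 = [0.4, 1.5, 2.7]  # hypothetical Plk1 values for the 'G0' cells
refined = labels_from_thresholds(plk1_in_g0, [1.5], ["G0_low", "G0_high"])
print(refined)  # ['G0_low', 'G0_high', 'G0_high']
```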

Examples

seq_gmm.refine_labels_with_manual_thresholds(
    feature='Plk1',
    obs_label='cell_cycle',
    value_to_refine='G0',
    manual_thresholds=[1.5],
    ordered_labels=['G0_low', 'G0_high'],
    operation_name='Plk1_G0_manual'
)

Return type:

None

return_adata()

Return the modified AnnData object.

Returns:
ad.AnnData

Modified AnnData object with all operations applied.

Return type:

AnnData
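
Because the class works on a copy of the input, the object passed to the constructor is not expected to change; the results have to be fetched through return_adata(). The toy class below illustrates that pattern with a plain dict. All names are hypothetical, and whether the real method hands back its internal object or a further copy is not stated here.

```python
import copy


class LabelRefiner:
    """Tiny stand-in for a class that works on its own copy of the input."""

    def __init__(self, table):
        self.table = copy.deepcopy(table)  # the caller's object is left untouched

    def relabel(self, column, old, new):
        self.table[column] = [new if v == old else v for v in self.table[column]]

    def return_table(self):
        return self.table


original = {"cell_cycle": ["G0", "G1", "G0"]}
refiner = LabelRefiner(original)
refiner.relabel("cell_cycle", "G0", "G0_low")

print(original["cell_cycle"])                # ['G0', 'G1', 'G0']
print(refiner.return_table()["cell_cycle"])  # ['G0_low', 'G1', 'G0_low']
```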

Examples

seq_gmm = SequentialGMM(adata)
seq_gmm.threshold_entire_dataset(...)
seq_gmm.refine_labels_with_gmm(...)
adata_modified = seq_gmm.return_adata()

threshold_entire_dataset(feature, label_obs_save_str, n_components, ordered_labels, manual_thresholds=None, duplicate_labels=False, operation_name=None, layer=None, gmm_kwargs=None, overwrite=False)

Threshold entire dataset to create initial categorical labels.

This method creates a new obs column with categorical labels based on GMM thresholding of a single feature across all cells. It’s a wrapper around GMMThresholding that stores results in the sequential thresholding framework.
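
How thresholds are derived from the fitted mixture is not spelled out on this page. A common choice for a one-dimensional mixture is the point where two adjacent weighted component densities are equal. The sketch below computes that point in closed form under this assumption; the function names are hypothetical and this is not necessarily what GMMThresholding does.

```python
import math


def weighted_pdf(x, weight, mean, std):
    """Weighted normal density of one mixture component at x."""
    z = (x - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def boundary_between(w1, m1, s1, w2, m2, s2):
    """Point between the two means where both weighted densities are equal."""
    # Equating the two log-densities gives a*x^2 + b*x + c = 0.
    a = 1.0 / (2.0 * s2**2) - 1.0 / (2.0 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2.0 * s2**2) - m1**2 / (2.0 * s1**2)
         + math.log(w1 / s1) - math.log(w2 / s2))
    if abs(a) < 1e-12:  # equal variances: the equation is linear
        if b == 0:
            raise ValueError("components share mean and spread; no boundary")
        return -c / b
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("densities never cross")
    lo, hi = sorted((m1, m2))
    for sign in (1.0, -1.0):
        root = (-b + sign * math.sqrt(disc)) / (2.0 * a)
        if lo <= root <= hi:
            return root
    raise ValueError("no crossing between the two means")


# Equal weights and spreads: the boundary sits midway between the means.
print(boundary_between(0.5, 0.0, 1.0, 0.5, 4.0, 1.0))  # 2.0
```

With n_components fitted components sorted by mean, applying this to each adjacent pair yields n_components - 1 thresholds, which matches the length required of manual_thresholds.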

Parameters:
feature : str

Feature name to threshold on (must exist in adata.var_names).

label_obs_save_str : str

New column name in adata.obs for labels.

n_components : int

Number of GMM components to fit.

ordered_labels : List[str]

Labels to assign (length = n_components).

manual_thresholds : Optional[List[Union[float, int]]], optional

Manual threshold values. If None, calculated automatically from GMM. Length must be n_components - 1. Defaults to None.

duplicate_labels : bool, optional

Allow duplicate labels for label collapsing. Defaults to False.

operation_name : str

Required name for tracking this operation.

layer : Optional[str], optional

Layer to use for data access. If None, uses adata.X. Defaults to None.

gmm_kwargs : Optional[dict], optional

GMM kwargs for this operation. Overrides default if provided. Defaults to None.

overwrite : bool, optional

If True, allows overwriting an existing operation with the same name. Useful for updating n_components or thresholds. Defaults to False.

Raises:
ValueError

If operation_name is None or empty.

KeyError

If operation_name already exists in .uns and overwrite=False.
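
The two documented errors describe a simple registration guard around the stored operations. A minimal sketch, using a plain dict as a stand-in for adata.uns[thresholding_events_key] and a hypothetical helper name and record layout:

```python
def register_operation(events, operation_name, record, overwrite=False):
    """Store `record` under `operation_name`, following the documented rules."""
    if not operation_name:  # None or empty string
        raise ValueError("operation_name is required")
    if operation_name in events and not overwrite:
        raise KeyError(f"operation {operation_name!r} already exists")
    events[operation_name] = record


events = {}  # stand-in for adata.uns[thresholding_events_key]
register_operation(events, "DNA_initial_threshold", {"n_components": 3})

# Re-running with a different setting needs overwrite=True.
register_operation(events, "DNA_initial_threshold", {"n_components": 2},
                   overwrite=True)
print(events)  # {'DNA_initial_threshold': {'n_components': 2}}
```

Although operation_name defaults to None in the signature, the ValueError means a name must always be supplied.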

Notes

Other exceptions raised by GMMThresholding may also propagate from this method.

Examples

seq_gmm.threshold_entire_dataset(
    feature='DNA_content',
    label_obs_save_str='cell_cycle',
    n_components=3,
    ordered_labels=['G0', 'G1', 'S'],
    operation_name='DNA_initial_threshold'
)

Return type:

None