SequentialGMM

class SequentialGMM(adata, thresholding_events_key='sequential_gmm_thresholding_events', gmm_kwargs=None, random_state=42)

Bases: GaussianMixtureModelBase

Sequential GMM thresholding for iterative population refinement.

This class performs multiple sequential GMM thresholding operations on subsets of cells, where each operation refines one categorical label value produced by a previous thresholding event. This is useful for hierarchical cell type classification or iterative gating strategies.

Unlike GMMThresholding, which thresholds a single feature once, this class allows:

Initial thresholding on entire dataset
Refinement of specific label values through additional thresholding
Tracking operation provenance (parent-child relationships)
Multiple operations stored in a single .uns key
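
The refinement pattern listed above can be illustrated without the library. The following is a minimal sketch in plain Python, assuming fixed thresholds in place of fitted GMM boundaries; the helper `assign_labels`, the `events` dict, and the sample values are hypothetical and are not part of the SequentialGMM API.

```python
from bisect import bisect_right

def assign_labels(values, thresholds, ordered_labels):
    """Map each value to a label; thresholds must be sorted ascending."""
    return [ordered_labels[bisect_right(thresholds, v)] for v in values]

# Hypothetical per-cell measurements for two features.
dna = [1.0, 1.2, 4.0, 4.3, 0.9]
plk1 = [0.1, 2.5, 0.3, 2.8, 0.2]

# Stand-in for the single .uns entry that records every operation.
events = {}

# Step 1: initial thresholding on the entire dataset.
labels = assign_labels(dna, [2.5], ["Low", "High"])
events["DNA_threshold"] = {"feature": "DNA_content", "parent": None}

# Step 2: refine only the cells currently labelled 'Low'.
low_idx = [i for i, lab in enumerate(labels) if lab == "Low"]
refined = assign_labels([plk1[i] for i in low_idx], [1.0], ["Low_neg", "Low_pos"])
for i, lab in zip(low_idx, refined):
    labels[i] = lab
events["Plk1_refinement"] = {"feature": "Plk1", "parent": "DNA_threshold"}

print(labels)
```

Cells labelled 'High' are untouched by the second step, and the `parent` field records which earlier operation the refinement builds on.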

Attributes:

adata (ad.AnnData): A copy of the input AnnData object, modified during processing.
thresholding_events_key (str): Key in adata.uns for storing all operations.
gmm_kwargs (Dict): Default GMM kwargs (can be overridden per operation).
random_state (int): Random state for reproducibility.

Examples

Example workflow:

# Initialize
seq_gmm = SequentialGMM(
    adata=adata,
    thresholding_events_key='sequential_thresholding'
)

# Create initial labels on entire dataset
seq_gmm.threshold_entire_dataset(
    feature='DNA_content',
    label_obs_save_str='cell_cycle',
    n_components=2,
    ordered_labels=['Low', 'High'],
    operation_name='DNA_threshold'
)

# Refine 'Low' cells only
seq_gmm.refine_labels_with_gmm(
    feature='Plk1',
    obs_label='cell_cycle',
    value_to_refine='Low',
    n_components=2,
    ordered_labels=['Low_neg', 'Low_pos'],
    operation_name='Plk1_refinement'
)

# Get modified adata
adata = seq_gmm.return_adata()

Parameters:

adata (AnnData)
thresholding_events_key (str)
gmm_kwargs (Optional[dict])
random_state (int)

Methods

`determine_optimal_number_components`	Determine the optimal number of components for the GMM.
`generate_thresholding_report`	Generate a human-readable report of all thresholding operations.
`plot_bayesian_information_criterion_curve`	Plot the BIC curve.
`plot_feature_distribution_exploratory`	Plot histogram of a feature distribution for exploratory analysis.
`plot_feature_strip_plot_exploratory`	Plot strip plot + histogram for exploratory analysis.
`plot_hist_distribution_with_boundaries`	Plot histogram with boundaries for a specific operation.
`plot_strip_plot_histogram_with_decision_boundaries`	Plot 1D strip plot with histogram and decision boundaries for a specific operation.
`refine_labels_with_gmm`	Refine existing categorical labels by thresholding a subset with GMM.
`refine_labels_with_manual_thresholds`	Refine existing categorical labels using manual thresholds.
`return_adata`	Return the modified AnnData object.
`threshold_entire_dataset`	Threshold entire dataset to create initial categorical labels.

determine_optimal_number_components(adata, feature, component_range, layer=None, gmm_kwargs=None, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)

Determine the optimal number of components for the GMM.

Parameters:

adata (ad.AnnData): AnnData object containing the data.
feature (str): Name of the feature to analyze.
component_range (int): Maximum number of components to test.
layer (str or None, optional): Optional layer name to use instead of .X. Default is None.
gmm_kwargs (dict or None, optional): Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
metric (str, optional): Metric to use for optimization (currently only ‘bic’ supported). Default is ‘bic’.
curve (str, optional): Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
direction (str, optional): Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
return_bic_list (bool, optional): If True, returns tuple of (optimal_n, bic_list). Default is False.

Returns:

int or tuple of (int, list of (int or float)): Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.

Parameters:

adata (AnnData)
feature (str)
component_range (int)
layer (Optional[str])
gmm_kwargs (Optional[dict])
metric (str)
curve (str)
direction (str)
return_bic_list (bool)

Return type:

Union[int, Tuple[int, List[Union[int, float]]]]
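
To make the idea of knee detection on a convex, decreasing BIC curve concrete, here is a minimal sketch using a simple chord-distance heuristic. This is an illustrative stand-in, not necessarily the detection method this library uses; the function name and the BIC values are hypothetical.

```python
def knee_of_decreasing_convex_curve(xs, ys):
    """Return the x whose point lies farthest below the chord joining the endpoints."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    best_x, best_gap = xs[0], 0.0
    for x, y in zip(xs, ys):
        # Height of the straight line between the first and last points at x.
        chord_y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        gap = chord_y - y  # positive when the curve dips below the chord
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x

# Hypothetical BIC values for 1..6 components (lower is better).
n_components = [1, 2, 3, 4, 5, 6]
bic_list = [1000.0, 620.0, 480.0, 465.0, 460.0, 458.0]

optimal_n = knee_of_decreasing_convex_curve(n_components, bic_list)
print(optimal_n)
```

The knee marks the point after which adding components yields only marginal BIC improvement, which is why it is preferred over the global minimum on curves that keep decreasing slowly.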

generate_thresholding_report(output_format='text')

Generate a human-readable report of all thresholding operations.

Reads thresholding metadata from adata.uns and creates a summary showing:

Operation names and order
Features used
Number of components
Thresholds calculated
Labels assigned
Parent operations (for refinements)
Cell counts per category (captured at operation time)

Note: Cell counts reflect the state immediately after each operation was performed, not the current state of the data. This is important because subsequent refinement operations may change labels, but the historical counts are preserved.
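
The distinction between historical and current counts can be sketched as follows. This is an assumption-laden illustration in plain Python: the `report` dict and the label lists are hypothetical, and the real metadata layout in adata.uns may differ.

```python
from collections import Counter

# Labels right after a hypothetical initial thresholding operation.
labels = ["Low", "Low", "High", "High", "Low"]

# Snapshot the per-category counts at operation time.
report = {"DNA_threshold": dict(Counter(labels))}

# A later refinement relabels the 'Low' cells.
labels = ["Low_neg", "Low_pos", "High", "High", "Low_neg"]

# Counts computed from the data as it stands now.
current = dict(Counter(labels))

print(report["DNA_threshold"])
print(current)
```

The snapshot still reports three 'Low' cells even though no cell carries that label any more, which is the behaviour the note describes.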

Parameters:

output_format (str, default ‘text’): ‘text’ for formatted string, ‘dataframe’ for pandas DataFrame.

Returns:

Union[str, pd.DataFrame]: Formatted report string or DataFrame.

Raises:

KeyError: If thresholding_events_key doesn’t exist in adata.uns.
ValueError: If output_format is not ‘text’ or ‘dataframe’.
TypeError: If adata.uns[thresholding_events_key] is not a dict.

Examples

Generate text report:

>>> gmm = GMMThresholding(adata, feature='gene1', label_obs_save_str='gene1_cat')
>>> gmm.fit(n_components=2)
>>> gmm.categorize_samples(['Low', 'High'])
>>> report = gmm.generate_thresholding_report()
>>> print(report)
Thresholding Report
==================================================

Parameters:

output_format (str)

Return type:

Union[str, DataFrame]

plot_bayesian_information_criterion_curve(adata, feature, component_range, layer=None, gmm_kwargs=None, curve='convex', direction='decreasing', ax=None, save_path=None)

Plot the BIC curve.

Parameters:

adata (ad.AnnData): AnnData object containing the data.
feature (str): Name of the feature to analyze.
component_range (int): Maximum number of components to test.
layer (str or None, optional): Optional layer name to use instead of .X. Default is None.
gmm_kwargs (dict or None, optional): Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
curve (str, optional): Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
direction (str, optional): Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
ax (matplotlib.pyplot.Axes or None, optional): Optional matplotlib axes to plot on. If None, creates new figure. Default is None.
save_path (str or Path or None, optional): Optional path to save the figure. Parent directory must exist. Default is None.

Raises:

FileNotFoundError: If save_path parent directory doesn’t exist.
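
Because the parent directory of save_path must already exist, a caller may want to create it before plotting. A minimal sketch using only the standard library follows; the helper name `ensure_parent_exists` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path

def ensure_parent_exists(save_path):
    """Create the parent directory of save_path if it is missing and return the Path."""
    path = Path(save_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Use a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_exists(Path(tmp) / "figures" / "bic_curve.png")
    parent_ready = target.parent.is_dir()

print(parent_ready)
```

The returned path can then be passed as save_path without triggering the FileNotFoundError described above.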

Parameters:

adata (AnnData)
feature (str)
component_range (int)
layer (Optional[str])
gmm_kwargs (Optional[dict])
curve (str)
direction (str)
ax (Axes)
save_path (Union[str, Path, None])

Return type:

None

plot_feature_distribution_exploratory(feature, obs_label=None, value_to_subset=None, layer=None, hist_kwargs=None, ax=None, x_axis_limits=None)

Plot histogram of a feature distribution for exploratory analysis.

This method allows you to visualize feature distributions WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.

Parameters:

feature (str): Feature name to plot (must exist in adata.var_names).
obs_label (Optional[str], optional): Obs column to use for subsetting. If provided with value_to_subset, only plots cells with that label value. If None, plots all cells. Defaults to None.
value_to_subset (Optional[str], optional): Specific label value to plot. Requires obs_label to be specified. If None, plots all cells (or all cells in obs_label if provided). Defaults to None.
layer (Optional[str], optional): Layer to use for data. If None, uses adata.X. Defaults to None.
hist_kwargs (Optional[Dict], optional): Keyword arguments for plt.hist(). Defaults to {‘bins’: 50, ‘color’: ‘black’, ‘alpha’: 0.7}.
ax (Optional[plt.Axes], optional): Matplotlib axes to plot on. If None, uses current axes. Defaults to None.
x_axis_limits (Optional[tuple], optional): (min, max) for x-axis. Use None for data-driven limits. Defaults to None.

Returns:

Axes: The matplotlib axes object.

Raises:

ValueError: If value_to_subset is provided without obs_label.
KeyError: If obs_label doesn’t exist in adata.obs.
ValueError: If value_to_subset is not present in adata.obs[obs_label].
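
The subsetting rules and error conditions above can be sketched in plain Python. This is an illustration of the documented behaviour under stated assumptions, not the library's implementation: `select_values` is a hypothetical helper and a plain dict stands in for adata.obs.

```python
def select_values(values, obs, obs_label=None, value_to_subset=None):
    """Return the values to plot, optionally restricted to one label value."""
    if value_to_subset is not None and obs_label is None:
        raise ValueError("value_to_subset requires obs_label")
    if obs_label is None or value_to_subset is None:
        return list(values)  # no subsetting requested: use every cell
    if obs_label not in obs:
        raise KeyError(obs_label)
    column = obs[obs_label]
    if value_to_subset not in column:
        raise ValueError(f"{value_to_subset!r} not found in obs[{obs_label!r}]")
    return [v for v, lab in zip(values, column) if lab == value_to_subset]

# Hypothetical feature values and a plain dict standing in for adata.obs.
values = [9.8, 10.1, 12.4, 9.9]
obs = {"cell_cycle_phase": ["G1/S/G2", "G1/S/G2", "M", "G1/S/G2"]}

all_cells = select_values(values, obs)
subset = select_values(values, obs, "cell_cycle_phase", "G1/S/G2")
print(subset)
```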

Examples

Explore entire dataset:

import matplotlib.pyplot as plt

seq_gmm.plot_feature_distribution_exploratory(
    feature='Int_Intg_DNA_nuc',
    hist_kwargs={'bins': 30, 'color': 'steelblue'},
    x_axis_limits=(5, 15)
)
plt.title('DNA Content Distribution - All Cells')
plt.show()

Explore specific subset:

seq_gmm.plot_feature_distribution_exploratory(
    feature='Int_Intg_DNA_nuc',
    obs_label='cell_cycle_phase',
    value_to_subset='G1/S/G2',
    hist_kwargs={'bins': 30, 'color': 'steelblue'},
    x_axis_limits=(5, 15)
)
plt.title('DNA Content in G1/S/G2 Cells - Exploratory')
plt.show()

Parameters:

feature (str)
obs_label (Optional[str])
value_to_subset (Optional[str])
layer (Optional[str])
hist_kwargs (Optional[Dict])
ax (Optional[Axes])
x_axis_limits (Optional[tuple])

Return type:

Axes

plot_feature_strip_plot_exploratory(feature, obs_label=None, value_to_subset=None, layer=None, hist_kwargs=None, strip_plot_kwargs=None, scatter_density=True, x_axis_limits=None, vmax=None)

Plot strip plot + histogram for exploratory analysis.

Similar to plot_strip_plot_histogram_with_decision_boundaries() but WITHOUT decision boundaries, for exploring data before running threshold operations.

This method allows you to visualize feature distributions WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.

Parameters:

feature (str): Feature name to plot (must exist in adata.var_names).
obs_label (Optional[str], optional): Obs column to use for subsetting. If provided with value_to_subset, only plots cells with that label value. If None, plots all cells. Defaults to None.
value_to_subset (Optional[str], optional): Specific label value to plot. Requires obs_label to be specified. If None, plots all cells (or all cells in obs_label if provided). Defaults to None.
layer (Optional[str], optional): Layer to use for data. If None, uses adata.X. Defaults to None.
hist_kwargs (Optional[Dict], optional): Keyword arguments for histogram. Defaults to None.
strip_plot_kwargs (Optional[Dict], optional): Keyword arguments for strip plot. Only used when scatter_density=False. Defaults to None.
scatter_density (bool, optional): If True, uses density-based coloring. If False, uses uniform scatter plot. Defaults to True.
x_axis_limits (Optional[tuple], optional): (min, max) for x-axis. Use None for data-driven limits. Defaults to None.
vmax (Optional[Union[int, float]], optional): Maximum density value for colormap. Only used when scatter_density=True. If None, auto-calculated. Defaults to None.

Returns:

tuple: (fig, (ax_strip, ax_hist)) - Figure and axes objects.

Raises:

ValueError: If value_to_subset is provided without obs_label.
KeyError: If obs_label doesn’t exist in adata.obs.
ValueError: If value_to_subset is not present in adata.obs[obs_label].

Examples

Explore entire dataset:

import matplotlib.pyplot as plt

fig, (ax_strip, ax_hist) = seq_gmm.plot_feature_strip_plot_exploratory(
    feature='Int_Intg_DNA_nuc',
    scatter_density=True,
    x_axis_limits=(5, 15)
)
plt.suptitle('DNA Content Distribution - All Cells')
plt.show()

Explore specific subset:

fig, (ax_strip, ax_hist) = seq_gmm.plot_feature_strip_plot_exploratory(
    feature='Int_Intg_DNA_nuc',
    obs_label='cell_cycle_phase',
    value_to_subset='G1/S/G2',
    scatter_density=True,
    x_axis_limits=(5, 15)
)
plt.suptitle('DNA Content Distribution - G1/S/G2 Cells')
plt.show()

Parameters:

feature (str)
obs_label (Optional[str])
value_to_subset (Optional[str])
layer (Optional[str])
hist_kwargs (Optional[Dict])
strip_plot_kwargs (Optional[Dict])
scatter_density (bool)
x_axis_limits (Optional[tuple])
vmax (Union[int, float, None])

Return type:

tuple

plot_hist_distribution_with_boundaries(operation_name, num_std=5, title=None, hist_kwargs=None, cmap=None, ax=None, x_axis_limits=None, resolution=1000, save_path=None)

Plot histogram with boundaries for a specific operation.

Parameters:

operation_name (str): Name of operation to plot (from .uns keys).
num_std (int, optional): Number of standard deviations for GMM plotting. Defaults to 5.
title (Optional[str], optional): Plot title. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.
hist_kwargs (Optional[Dict], optional): Kwargs for histogram. Defaults to None.
cmap (plt.cm.ScalarMappable, optional): Colormap. If None, ‘rainbow’ is used. Defaults to None.
ax (plt.Axes, optional): Axes to plot on. If None, creates new. Defaults to None.
x_axis_limits (Optional[tuple], optional): X-axis limits (min, max). Defaults to None.
resolution (int, optional): Resolution for plotting. Defaults to 1000.
save_path (Optional[Union[str, Path]], optional): Path to save the figure. Parent directory must exist. Defaults to None.

Returns:

Axes: The matplotlib axes object. Call plt.show() to display it.

Raises:

KeyError: If operation_name not found in .uns.
ValueError: If operation has no decision boundaries.
ValueError: If resolution <= 0 or <= n_components.
FileNotFoundError: If save_path parent directory doesn’t exist.

Examples

import matplotlib.pyplot as plt

ax = seq_gmm.plot_hist_distribution_with_boundaries('Plk1_refinement')
plt.show()

Parameters:

operation_name (str)
num_std (int)
title (Optional[str])
hist_kwargs (Optional[Dict])
cmap (Optional[Colormap])
ax (Optional[Axes])
x_axis_limits (Optional[tuple])
resolution (int)
save_path (Union[str, Path, None])

Return type:

Axes

plot_strip_plot_histogram_with_decision_boundaries(operation_name, cmap=None, y_axis_limits=None, resolution=1000, scatter_density=True, vmax=None, hist_kwargs=None, strip_plot_kwargs=None, title=None)

Plot 1D strip plot with histogram and decision boundaries for a specific operation.

This method wraps the base class implementation to provide visualization for sequential thresholding operations. It creates a density strip plot (or label-colored scatter) alongside a horizontal histogram showing the distribution and decision boundaries for the specified operation.

Parameters:

operation_name (str): Name of operation to plot (from .uns keys).
cmap (plt.cm.ScalarMappable, optional): Colormap for density or labels. If None, mpl.colormaps[‘plasma’] is used. Defaults to None.
y_axis_limits (Optional[Tuple[float, float]], optional): Y-axis limits (min, max). If None, uses data min/max. Defaults to None.
resolution (int, optional): Resolution for boundary plotting. Defaults to 1000.
scatter_density (bool, optional): If True, color by density; if False, color by labels. Defaults to True.
vmax (Optional[Union[int, float]], optional): Maximum density value for colormap. If None, auto-calculated. Defaults to None.
hist_kwargs (Optional[Dict], optional): Kwargs for histogram (bins, color, etc.). Defaults to None.
strip_plot_kwargs (Optional[Dict], optional): Kwargs for strip plot scatter (e.g., s, alpha, marker). Only used when scatter_density=False. Defaults to None.
title (Optional[str], optional): Title for the plot. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.

Returns:

Figure: The matplotlib figure object. Call plt.show() to display it.

Raises:

KeyError: If operation_name not found in .uns.
ValueError: If operation has no decision boundaries.

Examples

Basic usage with label-colored scatter:

import matplotlib.pyplot as plt

fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
    operation_name='separate_M_phase',
    scatter_density=False
)
plt.show()

Custom title:

fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
    operation_name='separate_M_phase',
    scatter_density=False,
    title='M Phase Separation'
)
plt.show()

Customize strip plot appearance:

fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
    operation_name='separate_M_phase',
    scatter_density=False,
    strip_plot_kwargs={'s': 5, 'alpha': 0.8, 'marker': 'o'}
)
plt.show()

Parameters:

operation_name (str)
cmap (Optional[Colormap])
y_axis_limits (Optional[Tuple[float, float]])
resolution (int)
scatter_density (bool)
vmax (Union[int, float, None])
hist_kwargs (Optional[Dict])
strip_plot_kwargs (Optional[Dict])
title (Optional[str])

Return type:

Figure

refine_labels_with_gmm(feature, obs_label, value_to_refine, n_components, ordered_labels, duplicate_labels=False, operation_name=None, layer=None, gmm_kwargs=None, overwrite=False)

Refine existing categorical labels by thresholding a subset with GMM.

Modifies adata.obs[obs_label] in place (within the internal copy): cells currently labeled value_to_refine receive new labels based on GMM thresholding, and all other cells keep their existing labels.

Parameters:

feature (str): Feature to threshold on (e.g., ‘Plk1’).
obs_label (str): Obs column to modify in-place (e.g., ‘cell_cycle’).
value_to_refine (str): Which label value to refine (e.g., ‘G0’).
n_components (int): Number of GMM components to fit.
ordered_labels (List[str]): New labels to assign (e.g., [‘G0_low’, ‘G0_high’]).
duplicate_labels (bool, optional): Allow duplicate labels for label collapsing. Defaults to False.
operation_name (str): Required name for tracking this operation. The signature default is None, but passing None or an empty string raises ValueError.
layer (Optional[str], optional): Layer to use for data access. If None, uses adata.X. Defaults to None.
gmm_kwargs (Optional[dict], optional): GMM kwargs for this operation. Overrides default if provided. Defaults to None.
overwrite (bool, optional): If True, allows overwriting an existing operation with the same name. Useful for updating n_components or thresholds. Defaults to False.

Raises:

ValueError: If operation_name is None or empty.
KeyError: If operation_name already exists in .uns and overwrite=False.
KeyError: If obs_label doesn’t exist in adata.obs.
ValueError: If value_to_refine is not present in adata.obs[obs_label].
ValueError: If no cells have the value_to_refine.

Examples

# Before: adata.obs['cell_cycle'] = ['G0', 'G0', 'G1', 'S', 'G0']
seq_gmm.refine_labels_with_gmm(
    feature='Plk1',
    obs_label='cell_cycle',
    value_to_refine='G0',
    n_components=2,
    ordered_labels=['G0_low', 'G0_high'],
    operation_name='Plk1_G0_refinement'
)
# After: adata.obs['cell_cycle'] = ['G0_low', 'G0_high', 'G1', 'S', 'G0_low']
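
The duplicate_labels option is described only as enabling "label collapsing". One plausible reading, sketched below in plain Python, is that several adjacent components may share a label so that more components are fitted than categories are reported. The helper `assign_labels`, the fixed thresholds, and the values are hypothetical, and the library's actual handling may differ.

```python
from bisect import bisect_right

def assign_labels(values, thresholds, ordered_labels, duplicate_labels=False):
    """Label each value by the interval it falls in; duplicates need explicit opt-in."""
    if not duplicate_labels and len(set(ordered_labels)) != len(ordered_labels):
        raise ValueError("ordered_labels contains duplicates")
    return [ordered_labels[bisect_right(thresholds, v)] for v in values]

# Hypothetical Plk1 values for the cells being refined.
values = [0.2, 1.4, 3.1, 0.5, 2.2]

# Two boundaries imply three components; the upper two share one label.
thresholds = [1.0, 2.0]
collapsed = assign_labels(
    values, thresholds, ["G0_low", "G0_high", "G0_high"], duplicate_labels=True
)
print(collapsed)
```

Collapsing in this way keeps a three-component fit, which may describe the distribution better, while still producing only two categories.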

Parameters:

feature (str)
obs_label (str)
value_to_refine (str)
n_components (int)
ordered_labels (List[str])
duplicate_labels (bool)
operation_name (Optional[str])
layer (Optional[str])
gmm_kwargs (Optional[dict])
overwrite (bool)

Return type:

None

refine_labels_with_manual_thresholds(feature, obs_label, value_to_refine, manual_thresholds, ordered_labels, operation_name=None, layer=None, overwrite=False)

Refine existing categorical labels using manual thresholds.

Similar to refine_labels_with_gmm() but uses explicit threshold values instead of fitting a GMM.

Parameters:

feature (str): Feature to threshold on.
obs_label (str): Obs column to modify in-place.
value_to_refine (str): Which label value to refine.
manual_thresholds (List[Union[float, int]]): Threshold values. Length must be len(ordered_labels) - 1.
ordered_labels (List[str]): New labels to assign.
operation_name (str): Required name for tracking this operation. The signature default is None, but passing None or an empty string raises ValueError.
layer (Optional[str], optional): Layer to use for data access. If None, uses adata.X. Defaults to None.
overwrite (bool, optional): If True, allows overwriting an existing operation with the same name. Useful for updating thresholds. Defaults to False.

Raises:

ValueError: If operation_name is None or empty.
KeyError: If operation_name already exists in .uns and overwrite=False.
KeyError: If obs_label doesn’t exist in adata.obs.
ValueError: If value_to_refine is not present in adata.obs[obs_label].
ValueError: If no cells have the value_to_refine.
ValueError: If len(manual_thresholds) != len(ordered_labels) - 1.

Examples

seq_gmm.refine_labels_with_manual_thresholds(
    feature='Plk1',
    obs_label='cell_cycle',
    value_to_refine='G0',
    manual_thresholds=[1.5],
    ordered_labels=['G0_low', 'G0_high'],
    operation_name='Plk1_G0_manual'
)

Parameters:

feature (str)
obs_label (str)
value_to_refine (str)
manual_thresholds (List[Union[int, float]])
ordered_labels (List[str])
operation_name (Optional[str])
layer (Optional[str])
overwrite (bool)

Return type:

None

return_adata()

Return the modified AnnData object.

Returns:

ad.AnnData: Modified AnnData object with all operations applied.

Return type:

AnnData

Examples

seq_gmm = SequentialGMM(adata)
seq_gmm.threshold_entire_dataset(...)
seq_gmm.refine_labels_with_gmm(...)
adata_modified = seq_gmm.return_adata()

threshold_entire_dataset(feature, label_obs_save_str, n_components, ordered_labels, manual_thresholds=None, duplicate_labels=False, operation_name=None, layer=None, gmm_kwargs=None, overwrite=False)

Threshold entire dataset to create initial categorical labels.

This method creates a new obs column with categorical labels based on GMM thresholding of a single feature across all cells. It’s a wrapper around GMMThresholding that stores results in the sequential thresholding framework.

Parameters:

feature (str): Feature name to threshold on (must exist in adata.var_names).
label_obs_save_str (str): New column name in adata.obs for labels.
n_components (int): Number of GMM components to fit.
ordered_labels (List[str]): Labels to assign (length = n_components).
manual_thresholds (Optional[List[Union[float, int]]], optional): Manual threshold values. If None, calculated automatically from GMM. Length must be n_components - 1. Defaults to None.
duplicate_labels (bool, optional): Allow duplicate labels for label collapsing. Defaults to False.
operation_name (str): Required name for tracking this operation. The signature default is None, but passing None or an empty string raises ValueError.
layer (Optional[str], optional): Layer to use for data access. If None, uses adata.X. Defaults to None.
gmm_kwargs (Optional[dict], optional): GMM kwargs for this operation. Overrides default if provided. Defaults to None.
overwrite (bool, optional): If True, allows overwriting an existing operation with the same name. Useful for updating n_components or thresholds. Defaults to False.

Raises:

ValueError: If operation_name is None or empty.
KeyError: If operation_name already exists in .uns and overwrite=False.

Notes

This method may also propagate any exception raised by GMMThresholding.

Examples

seq_gmm.threshold_entire_dataset(
    feature='DNA_content',
    label_obs_save_str='cell_cycle',
    n_components=3,
    ordered_labels=['G0', 'G1', 'S'],
    operation_name='DNA_initial_threshold'
)
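
When manual_thresholds is None, thresholds are derived from the fitted GMM. How this library computes them is not specified here; a common approach, sketched below under that assumption, places the boundary where the weighted densities of two adjacent components are equal. The function names and component parameters are hypothetical.

```python
import math

def weighted_pdf(x, weight, mean, std):
    """Density of one Gaussian component scaled by its mixture weight."""
    z = (x - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def boundary_between(comp_a, comp_b, iterations=60):
    """Bisect between the two means for the point where the weighted densities match.

    Assumes comp_a has the smaller mean and dominates at its own mean,
    while comp_b dominates at its own mean.
    """
    lo, hi = comp_a[1], comp_b[1]
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if weighted_pdf(mid, *comp_a) > weighted_pdf(mid, *comp_b):
            lo = mid  # still on comp_a's side of the boundary
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical fitted components as (weight, mean, std).
low = (0.5, 2.0, 0.5)
high = (0.5, 4.0, 0.5)

threshold = boundary_between(low, high)
print(round(threshold, 6))
```

With equal weights and standard deviations the boundary falls midway between the means; unequal weights or variances shift it toward the weaker or narrower component.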

Parameters:

feature (str)
label_obs_save_str (str)
n_components (int)
ordered_labels (List[str])
manual_thresholds (Optional[List[Union[int, float]]])
duplicate_labels (bool)
operation_name (Optional[str])
layer (Optional[str])
gmm_kwargs (Optional[dict])
overwrite (bool)

Return type:

None
