SequentialGMM
- class SequentialGMM(adata, thresholding_events_key='sequential_gmm_thresholding_events', gmm_kwargs=None, random_state=42)
Bases: GaussianMixtureModelBase
Sequential GMM thresholding for iterative population refinement.
This class enables performing multiple sequential GMM thresholding operations on subsets of cells, where each operation refines a specific categorical label from a previous thresholding event. This is useful for hierarchical cell type classification or iterative gating strategies.
Unlike GMMThresholding, which thresholds a single feature once, this class allows:
- Initial thresholding on the entire dataset
- Refinement of specific label values through additional thresholding
- Tracking of operation provenance (parent-child relationships)
- Storage of multiple operations in a single .uns key
- Attributes:
- adataad.AnnData
A copy of the input AnnData object, modified during processing.
- thresholding_events_keystr
Key in adata.uns for storing all operations.
- gmm_kwargsDict
Default GMM kwargs (can be overridden per operation).
- random_stateint
Random state for reproducibility.
Examples
Example workflow:
    # Initialize
    seq_gmm = SequentialGMM(
        adata=adata,
        thresholding_events_key='sequential_thresholding'
    )

    # Create initial labels on entire dataset
    seq_gmm.threshold_entire_dataset(
        feature='DNA_content',
        label_obs_save_str='cell_cycle',
        n_components=2,
        ordered_labels=['Low', 'High'],
        operation_name='DNA_threshold'
    )

    # Refine 'Low' cells only
    seq_gmm.refine_labels_with_gmm(
        feature='Plk1',
        obs_label='cell_cycle',
        value_to_refine='Low',
        n_components=2,
        ordered_labels=['Low_neg', 'Low_pos'],
        operation_name='Plk1_refinement'
    )

    # Get modified adata
    adata = seq_gmm.return_adata()
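The parent-child tracking mentioned above can be pictured with a small standalone sketch. The record layout below (an `events` dict whose entries carry a `parent` field) is a hypothetical stand-in, not the structure the class actually writes to adata.uns; it only illustrates how a refinement could be traced back to the initial thresholding.

```python
from typing import Dict, List, Optional

# Hypothetical record layout (an assumption for illustration): operation
# name -> metadata, where "parent" names the operation whose label was
# refined, or None for the initial whole-dataset thresholding.
events: Dict[str, Dict[str, Optional[str]]] = {
    "DNA_threshold": {"feature": "DNA_content", "parent": None},
    "Plk1_refinement": {"feature": "Plk1", "parent": "DNA_threshold"},
}


def lineage(events: Dict[str, Dict[str, Optional[str]]], operation_name: str) -> List[str]:
    """Walk parent links from an operation back to the initial thresholding."""
    chain: List[str] = []
    current: Optional[str] = operation_name
    while current is not None:
        if current not in events:
            raise KeyError(f"Unknown operation: {current!r}")
        chain.append(current)
        current = events[current]["parent"]
    # Reverse so the initial operation comes first.
    return chain[::-1]


print(" -> ".join(lineage(events, "Plk1_refinement")))
# prints: DNA_threshold -> Plk1_refinement
```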
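Methods
- determine_optimal_number_components: Determine the optimal number of components for the GMM.
- generate_thresholding_report: Generate a human-readable report of all thresholding operations.
- plot_bayesian_information_criterion_curve: Plot the BIC curve.
- plot_feature_distribution_exploratory: Plot histogram of a feature distribution for exploratory analysis.
- plot_feature_strip_plot_exploratory: Plot strip plot + histogram for exploratory analysis.
- plot_hist_distribution_with_boundaries: Plot histogram with boundaries for a specific operation.
- plot_strip_plot_histogram_with_decision_boundaries: Plot 1D strip plot with histogram and decision boundaries for a specific operation.
- refine_labels_with_gmm: Refine existing categorical labels by thresholding a subset with GMM.
- refine_labels_with_manual_thresholds: Refine existing categorical labels using manual thresholds.
- return_adata: Return the modified AnnData object.
- threshold_entire_dataset: Threshold entire dataset to create initial categorical labels.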
- determine_optimal_number_components(adata, feature, component_range, layer=None, gmm_kwargs=None, metric='bic', curve='convex', direction='decreasing', return_bic_list=False)
Determine the optimal number of components for the GMM.
- Parameters:
- adataad.AnnData
AnnData object containing the data.
- featurestr
Name of the feature to analyze.
- component_rangeint
Maximum number of components to test.
- layerstr or None, optional
Optional layer name to use instead of .X. Default is None.
- gmm_kwargsdict or None, optional
Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
- metricstr, optional
Metric to use for optimization (currently only ‘bic’ supported). Default is ‘bic’.
- curvestr, optional
Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
- directionstr, optional
Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
- return_bic_listbool, optional
If True, returns tuple of (optimal_n, bic_list). Default is False.
- Returns:
- int or tuple of (int, list of (int or float))
Optimal number of components, or tuple of (optimal_n, bic_list) if return_bic_list=True.
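To make the knee-detection idea concrete, here is a minimal sketch that picks the knee of a convex, decreasing BIC curve as the point lying furthest below the straight line joining the curve's endpoints. This is one common heuristic and is an assumption for illustration; the method's actual knee-detection routine is not shown in this page, and the function name and BIC values below are made up.

```python
from typing import Sequence


def knee_of_convex_decreasing(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the x whose point lies furthest below the endpoint chord.

    Both axes are rescaled to [0, 1] first so the result does not depend
    on the units of the BIC values.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("Need at least three (x, y) pairs of equal length.")
    x_span = xs[-1] - xs[0]
    y_min, y_max = min(ys), max(ys)
    y_span = y_max - y_min
    if x_span == 0 or y_span == 0:
        raise ValueError("Curve is degenerate (no spread in x or y).")

    y_first = (ys[0] - y_min) / y_span
    y_last = (ys[-1] - y_min) / y_span

    best_x, best_gap = xs[0], float("-inf")
    for x, y in zip(xs, ys):
        x_n = (x - xs[0]) / x_span
        y_n = (y - y_min) / y_span
        chord = y_first + (y_last - y_first) * x_n  # straight line between endpoints
        gap = chord - y_n  # how far the curve sits below the chord
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x


# Illustrative (made-up) BIC values for 1..6 components.
component_counts = [1, 2, 3, 4, 5, 6]
bic_list = [1000.0, 600.0, 400.0, 380.0, 370.0, 365.0]
print(knee_of_convex_decreasing(component_counts, bic_list))
# prints: 3
```

In this toy curve the BIC drops sharply up to three components and flattens afterwards, so three is returned.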
- generate_thresholding_report(output_format='text')
Generate a human-readable report of all thresholding operations.
Reads thresholding metadata from adata.uns and creates a summary showing:
- Operation names and order
- Features used
- Number of components
- Thresholds calculated
- Labels assigned
- Parent operations (for refinements)
- Cell counts per category (captured at operation time)
Note: Cell counts reflect the state immediately after each operation was performed, not the current state of the data. This is important because subsequent refinement operations may change labels, but the historical counts are preserved.
- Parameters:
- output_formatstr, default ‘text’
‘text’ for formatted string, ‘dataframe’ for pandas DataFrame.
- Returns:
- Union[str, pd.DataFrame]
Formatted report string or DataFrame.
- Raises:
- KeyError
If thresholding_events_key doesn’t exist in adata.uns.
- ValueError
If output_format is not ‘text’ or ‘dataframe’.
- TypeError
If adata.uns[thresholding_events_key] is not a dict.
Examples
Generate text report:
>>> gmm = GMMThresholding(adata, feature='gene1', label_obs_save_str='gene1_cat')
>>> gmm.fit(n_components=2)
>>> gmm.categorize_samples(['Low', 'High'])
>>> report = gmm.generate_thresholding_report()
>>> print(report)
Thresholding Report
==================================================
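The note about historical cell counts can be illustrated with a standalone sketch. The helper names and the metadata fields used here (`record_operation`, `format_report`, `counts`, `parent`) are hypothetical and do not reflect the real layout of adata.uns; the point is only that counts are snapshotted when an operation runs and are not recomputed when later refinements relabel cells.

```python
from collections import Counter
from typing import Dict, List, Optional


def record_operation(events: Dict[str, dict], name: str, feature: str,
                     thresholds: List[float], labels: List[str],
                     parent: Optional[str] = None) -> None:
    """Store operation metadata plus a snapshot of the current label counts."""
    events[name] = {
        "feature": feature,
        "thresholds": list(thresholds),
        "parent": parent,
        "counts": dict(Counter(labels)),  # frozen at operation time
    }


def format_report(events: Dict[str, dict]) -> str:
    """Render the recorded operations, in insertion order, as plain text."""
    lines = ["Thresholding Report", "=" * 50]
    for index, (name, op) in enumerate(events.items(), start=1):
        lines.append(f"{index}. {name} (feature={op['feature']}, parent={op['parent']})")
        lines.append(f"   thresholds: {op['thresholds']}")
        lines.append(f"   counts at operation time: {op['counts']}")
    return "\n".join(lines)


events: Dict[str, dict] = {}

labels = ["Low", "Low", "High"]
record_operation(events, "DNA_threshold", "DNA_content", [2.0], labels)

# A later refinement relabels the 'Low' cells...
labels = ["Low_neg", "Low_pos", "High"]
record_operation(events, "Plk1_refinement", "Plk1", [0.5], labels, parent="DNA_threshold")

# ...but the first operation still reports the counts it saw.
print(format_report(events))
```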
- plot_bayesian_information_criterion_curve(adata, feature, component_range, layer=None, gmm_kwargs=None, curve='convex', direction='decreasing', ax=None, save_path=None)
Plot the BIC curve.
- Parameters:
- adataad.AnnData
AnnData object containing the data.
- featurestr
Name of the feature to analyze.
- component_rangeint
Maximum number of components to test.
- layerstr or None, optional
Optional layer name to use instead of .X. Default is None.
- gmm_kwargsdict or None, optional
Keyword arguments for GaussianMixture. If None, uses defaults. Default is None.
- curvestr, optional
Type of curve for knee detection (‘convex’ or ‘concave’). Default is ‘convex’.
- directionstr, optional
Direction of curve (‘decreasing’ or ‘increasing’). Default is ‘decreasing’.
- axmatplotlib.pyplot.Axes or None, optional
Optional matplotlib axes to plot on. If None, creates new figure. Default is None.
- save_pathstr or Path or None, optional
Optional path to save the figure. Parent directory must exist. Default is None.
- Raises:
- FileNotFoundError
If save_path parent directory doesn’t exist.
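Because the parent directory of save_path must already exist, it can help to check the path before calling a plotting method. The helper below is a hypothetical convenience written for this page, not part of the class; it mirrors the documented behaviour by raising FileNotFoundError when the parent directory is missing.

```python
from pathlib import Path
from typing import Optional, Union


def validate_save_path(save_path: Optional[Union[str, Path]]) -> Optional[Path]:
    """Return save_path as a Path, or raise if its parent directory is missing."""
    if save_path is None:
        return None
    path = Path(save_path)
    if not path.parent.is_dir():
        raise FileNotFoundError(f"Parent directory does not exist: {path.parent}")
    return path


# A bare file name resolves against the current directory, which exists.
print(validate_save_path("bic_curve.png"))
```

If the directory should be created instead, `Path(save_path).parent.mkdir(parents=True, exist_ok=True)` before plotting is the usual pathlib approach.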
- plot_feature_distribution_exploratory(feature, obs_label=None, value_to_subset=None, layer=None, hist_kwargs=None, ax=None, x_axis_limits=None)
Plot histogram of a feature distribution for exploratory analysis.
This method allows you to visualize feature distributions WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.
- Parameters:
- featurestr
Feature name to plot (must exist in adata.var_names).
- obs_labelOptional[str], optional
Obs column to use for subsetting. If provided with value_to_subset, only plots cells with that label value. If None, plots all cells. Defaults to None.
- value_to_subsetOptional[str], optional
Specific label value to plot. Requires obs_label to be specified. If None, plots all cells (or all cells in obs_label if provided). Defaults to None.
- layerOptional[str], optional
Layer to use for data. If None, uses adata.X. Defaults to None.
- hist_kwargsOptional[Dict], optional
Keyword arguments for plt.hist(). Defaults to {‘bins’: 50, ‘color’: ‘black’, ‘alpha’: 0.7}.
- axOptional[plt.Axes], optional
Matplotlib axes to plot on. If None, uses current axes. Defaults to None.
- x_axis_limitsOptional[tuple], optional
(min, max) for x-axis. Use None for data-driven limits. Defaults to None.
- Returns:
- Axes
The matplotlib axes object.
- Raises:
- ValueError
If value_to_subset is provided without obs_label.
- KeyError
If obs_label doesn’t exist in adata.obs.
- ValueError
If value_to_subset is not present in adata.obs[obs_label].
Examples
Explore entire dataset:
    seq_gmm.plot_feature_distribution_exploratory(
        feature='Int_Intg_DNA_nuc',
        hist_kwargs={'bins': 30, 'color': 'steelblue'},
        x_axis_limits=(5, 15)
    )
    plt.title('DNA Content Distribution - All Cells')
    plt.show()
Explore specific subset:
    seq_gmm.plot_feature_distribution_exploratory(
        feature='Int_Intg_DNA_nuc',
        obs_label='cell_cycle_phase',
        value_to_subset='G1/S/G2',
        hist_kwargs={'bins': 30, 'color': 'steelblue'},
        x_axis_limits=(5, 15)
    )
    plt.title('DNA Content in G1/S/G2 Cells - Exploratory')
    plt.show()
- plot_feature_strip_plot_exploratory(feature, obs_label=None, value_to_subset=None, layer=None, hist_kwargs=None, strip_plot_kwargs=None, scatter_density=True, x_axis_limits=None, vmax=None)
Plot strip plot + histogram for exploratory analysis.
Similar to plot_strip_plot_histogram_with_decision_boundaries() but WITHOUT decision boundaries, for exploring data before running threshold operations.
This method allows you to visualize feature distributions WITHOUT running any thresholding, so you can explore your data and decide on manual thresholds or the number of components to use for GMM.
- Parameters:
- featurestr
Feature name to plot (must exist in adata.var_names).
- obs_labelOptional[str], optional
Obs column to use for subsetting. If provided with value_to_subset, only plots cells with that label value. If None, plots all cells. Defaults to None.
- value_to_subsetOptional[str], optional
Specific label value to plot. Requires obs_label to be specified. If None, plots all cells (or all cells in obs_label if provided). Defaults to None.
- layerOptional[str], optional
Layer to use for data. If None, uses adata.X. Defaults to None.
- hist_kwargsOptional[Dict], optional
Keyword arguments for histogram. Defaults to None.
- strip_plot_kwargsOptional[Dict], optional
Keyword arguments for strip plot. Only used when scatter_density=False. Defaults to None.
- scatter_densitybool, optional
If True, uses density-based coloring. If False, uses uniform scatter plot. Defaults to True.
- x_axis_limitsOptional[tuple], optional
(min, max) for x-axis. Use None for data-driven limits. Defaults to None.
- vmaxOptional[Union[int, float]], optional
Maximum density value for colormap. Only used when scatter_density=True. If None, auto-calculated. Defaults to None.
- Returns:
- tuple
(fig, (ax_strip, ax_hist)) - Figure and axes objects.
- Raises:
- ValueError
If value_to_subset is provided without obs_label.
- KeyError
If obs_label doesn’t exist in adata.obs.
- ValueError
If value_to_subset is not present in adata.obs[obs_label].
Examples
Explore entire dataset:
    fig, (ax_strip, ax_hist) = seq_gmm.plot_feature_strip_plot_exploratory(
        feature='Int_Intg_DNA_nuc',
        scatter_density=True,
        x_axis_limits=(5, 15)
    )
    plt.suptitle('DNA Content Distribution - All Cells')
    plt.show()
Explore specific subset:
    fig, (ax_strip, ax_hist) = seq_gmm.plot_feature_strip_plot_exploratory(
        feature='Int_Intg_DNA_nuc',
        obs_label='cell_cycle_phase',
        value_to_subset='G1/S/G2',
        scatter_density=True,
        x_axis_limits=(5, 15)
    )
    plt.suptitle('DNA Content Distribution - G1/S/G2 Cells')
    plt.show()
- plot_hist_distribution_with_boundaries(operation_name, num_std=5, title=None, hist_kwargs=None, cmap=None, ax=None, x_axis_limits=None, resolution=1000, save_path=None)
Plot histogram with boundaries for a specific operation.
- Parameters:
- operation_namestr
Name of operation to plot (from .uns keys).
- num_stdint, optional
Number of standard deviations for GMM plotting. Defaults to 5.
- titleOptional[str], optional
Plot title. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.
- hist_kwargsOptional[Dict], optional
Kwargs for histogram. Defaults to None.
- cmapplt.cm.ScalarMappable, optional
Colormap. Defaults to ‘rainbow’.
- axplt.Axes, optional
Axes to plot on. If None, creates new. Defaults to None.
- x_axis_limitsOptional[tuple], optional
X-axis limits (min, max). Defaults to None.
- resolutionint, optional
Resolution for plotting. Defaults to 1000.
- save_pathOptional[Union[str, Path]], optional
Path to save the figure. Parent directory must exist. Defaults to None.
- Returns:
- Axes
The matplotlib axes object. Call plt.show() to display it.
- Raises:
- KeyError
If operation_name not found in .uns.
- ValueError
If operation has no decision boundaries.
- ValueError
If resolution <= 0 or resolution <= n_components.
- FileNotFoundError
If save_path parent directory doesn’t exist.
Examples
    ax = seq_gmm.plot_hist_distribution_with_boundaries('Plk1_refinement')
    plt.show()
- plot_strip_plot_histogram_with_decision_boundaries(operation_name, cmap=None, y_axis_limits=None, resolution=1000, scatter_density=True, vmax=None, hist_kwargs=None, strip_plot_kwargs=None, title=None)
Plot 1D strip plot with histogram and decision boundaries for a specific operation.
This method wraps the base class implementation to provide visualization for sequential thresholding operations. It creates a density strip plot (or label-colored scatter) alongside a horizontal histogram showing the distribution and decision boundaries for the specified operation.
- Parameters:
- operation_namestr
Name of operation to plot (from .uns keys).
- cmapplt.cm.ScalarMappable, optional
Colormap for density or labels. Defaults to mpl.colormaps[‘plasma’].
- y_axis_limitsOptional[Tuple[float, float]], optional
Y-axis limits (min, max). If None, uses data min/max. Defaults to None.
- resolutionint, optional
Resolution for boundary plotting. Defaults to 1000.
- scatter_densitybool, optional
If True, color by density; if False, color by labels. Defaults to True.
- vmaxOptional[Union[int, float]], optional
Maximum density value for colormap. If None, auto-calculated. Defaults to None.
- hist_kwargsOptional[Dict], optional
Kwargs for histogram (bins, color, etc.). Defaults to None.
- strip_plot_kwargsOptional[Dict], optional
Kwargs for strip plot scatter (e.g., s, alpha, marker). Only used when scatter_density=False. Defaults to None.
- titleOptional[str], optional
Title for the plot. If not provided, defaults to feature name. Pass empty string ‘’ to suppress title. Defaults to None.
- Returns:
- Figure
The matplotlib figure object. Call plt.show() to display it.
- Raises:
- KeyError
If operation_name not found in .uns.
- ValueError
If operation has no decision boundaries.
Examples
Basic usage with label-colored scatter:
    fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
        operation_name='separate_M_phase',
        scatter_density=False
    )
    plt.show()
Custom title:
    fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
        operation_name='separate_M_phase',
        scatter_density=False,
        title='M Phase Separation'
    )
    plt.show()
Customize strip plot appearance:
    fig = seq_gmm.plot_strip_plot_histogram_with_decision_boundaries(
        operation_name='separate_M_phase',
        scatter_density=False,
        strip_plot_kwargs={'s': 5, 'alpha': 0.8, 'marker': 'o'}
    )
    plt.show()
- refine_labels_with_gmm(feature, obs_label, value_to_refine, n_components, ordered_labels, duplicate_labels=False, operation_name=None, layer=None, gmm_kwargs=None, overwrite=False)
Refine existing categorical labels by thresholding a subset with GMM.
Modifies adata.obs[obs_label] in-place (within the copy), replacing cells with value_to_refine with new labels based on GMM thresholding.
- Parameters:
- featurestr
Feature to threshold on (e.g., ‘Plk1’).
- obs_labelstr
Obs column to modify in-place (e.g., ‘cell_cycle’).
- value_to_refinestr
Which label value to refine (e.g., ‘G0’).
- n_componentsint
Number of GMM components to fit.
- ordered_labelsList[str]
New labels to assign (e.g., [‘G0_low’, ‘G0_high’]).
- duplicate_labelsbool, optional
Allow duplicate labels for label collapsing. Defaults to False.
- operation_namestr
Required name for tracking this operation.
- layerOptional[str], optional
Layer to use for data access. If None, uses adata.X. Defaults to None.
- gmm_kwargsOptional[dict], optional
GMM kwargs for this operation. Overrides default if provided. Defaults to None.
- overwritebool, optional
If True, allows overwriting an existing operation with the same name. Useful for updating n_components or thresholds. Defaults to False.
- Raises:
- ValueError
If operation_name is None or empty.
- KeyError
If operation_name already exists in .uns and overwrite=False.
- KeyError
If obs_label doesn’t exist in adata.obs.
- ValueError
If value_to_refine is not present in adata.obs[obs_label].
- ValueError
If no cells have the value_to_refine.
Examples
    # Before: adata.obs['cell_cycle'] = ['G0', 'G0', 'G1', 'S', 'G0']
    seq_gmm.refine_labels_with_gmm(
        feature='Plk1',
        obs_label='cell_cycle',
        value_to_refine='G0',
        n_components=2,
        ordered_labels=['G0_low', 'G0_high'],
        operation_name='Plk1_G0_refinement'
    )
    # After: adata.obs['cell_cycle'] = ['G0_low', 'G0_high', 'G1', 'S', 'G0_low']
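The interplay between ordered_labels and duplicate_labels can be sketched as follows. The sketch assumes that labels are assigned to fitted components in ascending order of component mean, which this page does not state explicitly, and the function name is hypothetical. Under that assumption, repeating a label collapses several components into one category.

```python
from typing import Dict, Sequence


def labels_for_components(means: Sequence[float], ordered_labels: Sequence[str],
                          duplicate_labels: bool = False) -> Dict[int, str]:
    """Map component index -> label, ranking components by ascending mean.

    Assumption: ordered_labels[0] goes to the lowest-mean component,
    ordered_labels[-1] to the highest-mean component.
    """
    if len(ordered_labels) != len(means):
        raise ValueError("Need exactly one label per component.")
    if not duplicate_labels and len(set(ordered_labels)) != len(ordered_labels):
        raise ValueError("Duplicate labels require duplicate_labels=True.")
    order = sorted(range(len(means)), key=lambda i: means[i])
    return {component: ordered_labels[rank] for rank, component in enumerate(order)}


# Three fitted components, but the two upper ones share a label.
mapping = labels_for_components(
    means=[5.0, 1.0, 3.0],
    ordered_labels=["G0_low", "G0_high", "G0_high"],
    duplicate_labels=True,
)
print(mapping)
# prints: {1: 'G0_low', 2: 'G0_high', 0: 'G0_high'}
```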
- refine_labels_with_manual_thresholds(feature, obs_label, value_to_refine, manual_thresholds, ordered_labels, operation_name=None, layer=None, overwrite=False)
Refine existing categorical labels using manual thresholds.
Similar to refine_labels_with_gmm() but uses explicit threshold values instead of fitting a GMM.
- Parameters:
- featurestr
Feature to threshold on.
- obs_labelstr
Obs column to modify in-place.
- value_to_refinestr
Which label value to refine.
- manual_thresholdsList[Union[float, int]]
Threshold values. Length must be len(ordered_labels) - 1.
- ordered_labelsList[str]
New labels to assign.
- operation_namestr
Required name for tracking this operation.
- layerOptional[str], optional
Layer to use for data access. If None, uses adata.X. Defaults to None.
- overwritebool, optional
If True, allows overwriting an existing operation with the same name. Useful for updating thresholds. Defaults to False.
- Raises:
- ValueError
If operation_name is None or empty.
- KeyError
If operation_name already exists in .uns and overwrite=False.
- KeyError
If obs_label doesn’t exist in adata.obs.
- ValueError
If value_to_refine is not present in adata.obs[obs_label].
- ValueError
If no cells have the value_to_refine.
- ValueError
If len(manual_thresholds) != len(ordered_labels) - 1.
Examples
    seq_gmm.refine_labels_with_manual_thresholds(
        feature='Plk1',
        obs_label='cell_cycle',
        value_to_refine='G0',
        manual_thresholds=[1.5],
        ordered_labels=['G0_low', 'G0_high'],
        operation_name='Plk1_G0_manual'
    )
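What manual-threshold refinement does to the label column can be sketched without AnnData. The function below is a simplified stand-in that works on plain lists; the Plk1 values are made up, and sending a value exactly equal to a threshold to the upper bin is an assumption rather than documented behaviour.

```python
from bisect import bisect_right
from typing import List, Sequence


def refine_with_thresholds(labels: Sequence[str], values: Sequence[float],
                           value_to_refine: str,
                           manual_thresholds: Sequence[float],
                           ordered_labels: Sequence[str]) -> List[str]:
    """Relabel only the cells whose current label equals value_to_refine."""
    if len(manual_thresholds) != len(ordered_labels) - 1:
        raise ValueError("len(manual_thresholds) must be len(ordered_labels) - 1.")
    if value_to_refine not in labels:
        raise ValueError(f"{value_to_refine!r} is not present in labels.")
    cuts = sorted(manual_thresholds)
    return [
        # bisect_right gives the index of the bin the value falls into.
        ordered_labels[bisect_right(cuts, value)] if label == value_to_refine else label
        for label, value in zip(labels, values)
    ]


cell_cycle = ["G0", "G0", "G1", "S", "G0"]
plk1 = [0.4, 2.1, 3.0, 0.2, 1.0]  # illustrative feature values
refined = refine_with_thresholds(cell_cycle, plk1, "G0", [1.5], ["G0_low", "G0_high"])
print(refined)
# prints: ['G0_low', 'G0_high', 'G1', 'S', 'G0_low']
```

Cells labelled 'G1' and 'S' are left untouched regardless of their feature values, which is the defining property of a refinement step.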
- return_adata()
Return the modified AnnData object.
- Returns:
- ad.AnnData
Modified AnnData object with all operations applied.
Examples
    seq_gmm = SequentialGMM(adata)
    seq_gmm.threshold_entire_dataset(...)
    seq_gmm.refine_labels_with_gmm(...)
    adata_modified = seq_gmm.return_adata()
- threshold_entire_dataset(feature, label_obs_save_str, n_components, ordered_labels, manual_thresholds=None, duplicate_labels=False, operation_name=None, layer=None, gmm_kwargs=None, overwrite=False)
Threshold entire dataset to create initial categorical labels.
This method creates a new obs column with categorical labels based on GMM thresholding of a single feature across all cells. It’s a wrapper around GMMThresholding that stores results in the sequential thresholding framework.
- Parameters:
- featurestr
Feature name to threshold on (must exist in adata.var_names).
- label_obs_save_strstr
New column name in adata.obs for labels.
- n_componentsint
Number of GMM components to fit.
- ordered_labelsList[str]
Labels to assign (length = n_components).
- manual_thresholdsOptional[List[Union[float, int]]], optional
Manual threshold values. If None, calculated automatically from GMM. Length must be n_components - 1. Defaults to None.
- duplicate_labelsbool, optional
Allow duplicate labels for label collapsing. Defaults to False.
- operation_namestr
Required name for tracking this operation.
- layerOptional[str], optional
Layer to use for data access. If None, uses adata.X. Defaults to None.
- gmm_kwargsOptional[dict], optional
GMM kwargs for this operation. Overrides default if provided. Defaults to None.
- overwritebool, optional
If True, allows overwriting an existing operation with the same name. Useful for updating n_components or thresholds. Defaults to False.
- Raises:
- ValueError
If operation_name is None or empty.
- KeyError
If operation_name already exists in .uns and overwrite=False.
Notes
This method may also raise other exceptions originating in GMMThresholding.
Examples
    seq_gmm.threshold_entire_dataset(
        feature='DNA_content',
        label_obs_save_str='cell_cycle',
        n_components=3,
        ordered_labels=['G0', 'G1', 'S'],
        operation_name='DNA_initial_threshold'
    )
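When manual_thresholds is None, the thresholds come from the fitted GMM, which is why n_components components yield n_components - 1 thresholds. One way to picture such a threshold is as the point between two adjacent component means where their weighted densities are equal. The sketch below finds that point by bisection for a pair of 1D Gaussians; it is an illustrative assumption, and the library may derive its boundaries differently (for example by evaluating predictions over a grid).

```python
import math
from typing import Tuple

Component = Tuple[float, float, float]  # (weight, mean, std)


def weighted_pdf(x: float, weight: float, mean: float, std: float) -> float:
    """Mixture weight times the Gaussian density at x."""
    z = (x - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def boundary_between(comp_a: Component, comp_b: Component, tol: float = 1e-9) -> float:
    """Bisect for the crossing of two weighted densities between their means."""
    lo, hi = comp_a[1], comp_b[1]
    if not lo < hi:
        raise ValueError("comp_a must have the smaller mean.")

    def diff(x: float) -> float:
        return weighted_pdf(x, *comp_a) - weighted_pdf(x, *comp_b)

    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("Densities do not cross between the two means.")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) > 0:
            lo = mid  # comp_a still dominates, boundary is to the right
        else:
            hi = mid
    return (lo + hi) / 2.0


# Equal weights and widths: the boundary is the midpoint of the means.
print(round(boundary_between((0.5, 0.0, 1.0), (0.5, 4.0, 1.0)), 6))
# prints: 2.0
```

With unequal weights the boundary shifts toward the lighter component; for weights 0.8 and 0.2 with the same means and unit widths it moves to 2 + ln(4)/4.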