Gaussian Mixture Model Thresholding#

This notebook demonstrates the GMMThresholding class from the cc_mapping package, which provides a flexible, semi-supervised approach to categorizing samples based on continuous features using Gaussian Mixture Models (GMMs).

Key Features#

Automatic threshold determination: Fit GMMs and automatically calculate decision boundaries
Manual threshold override: Specify custom thresholds while leveraging GMM visualization
Label collapsing: Fit many GMM components for adaptive boundaries, then collapse to fewer categories
Exploratory visualization: Explore data distributions before committing to thresholding
Comprehensive visualization: Multiple plotting methods to understand your data and thresholds

[ ]:

# Import all required libraries
from pathlib import Path
from urllib.request import urlretrieve

import anndata as ad
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from cc_mapping.thresholding import GMMThresholding
from cc_mapping.utils import create_boolean_label_combination

%matplotlib inline

Setup and Data Loading#

[2]:

url = "https://zenodo.org/records/4525425/files/control_manifold_allfeatures.csv?download=1"

cwd = Path.cwd()

download_path = cwd / "data" / "control_manifold_allfeatures.csv"

if not download_path.parent.exists():
    download_path.parent.mkdir(parents=True)
    urlretrieve(url, download_path)

[3]:

# Create results directory structure for saving figures
results_dir = cwd / "results"
single_results_dir = results_dir / "single"

# Create directories if they don't exist
results_dir.mkdir(exist_ok=True)
single_results_dir.mkdir(exist_ok=True)

print(f"Results directory created at: {results_dir}")
print(f"Single threshold results directory: {single_results_dir}")

Results directory created at: c:\Users\dap182\Documents\git\cc_mapping\notebooks\results
Single threshold results directory: c:\Users\dap182\Documents\git\cc_mapping\notebooks\results\single

[4]:

csv = pd.read_csv(download_path, index_col=0, low_memory=False)
csv.index = csv.index.astype(str)
csv.head()

[4]:

	E2F1 (nuc median)	cycA (nuc median)	cycD1 (nuc median)	p21 (nuc median)	Int_Intg_DNA_nuc	Skp2 (nuc median)	Cdt1 (nuc median)	Nuc area	Cdh1 (nuc median)	cycE (nuc median)	...	STAT3 (phospho/total nuc)	age	phase	PHATE_1	PHATE_2	PCNA foci	DNA content	Local cell density	annotated age	annotated phase
Unnamed: 0.1.1
0	0.006851	0.005875	0.019471	0.026032	5.746899	0.006493	0.005158	553.0	0.014336	0.009262	...	1.770186	10.263173	G1	-0.020654	0.007907	0.005151	2.298759	7.0	NaN	NaN
1	0.012360	0.028153	0.007462	0.004318	8.852262	0.022797	0.005951	490.0	0.015229	0.008392	...	1.455814	12.109644	S	0.025945	-0.000735	0.006861	3.540905	1.0	NaN	NaN
2	0.007279	0.005707	0.006592	0.003632	4.951003	0.015366	0.004929	363.0	0.005730	0.009117	...	1.290323	4.954907	G1	0.002961	-0.004575	0.004148	1.980401	6.0	NaN	NaN
3	0.006531	0.016602	0.009369	0.006264	10.466743	0.019196	0.005234	579.0	0.009361	0.007118	...	1.435065	13.424587	G2	0.034884	0.002944	0.002457	4.186697	9.0	NaN	NaN
5	0.006744	0.006027	0.009033	0.005295	5.249119	0.010422	0.007797	296.0	0.008896	0.007767	...	1.266304	1.828240	G1	-0.019039	0.000584	0.003277	2.099648	3.0	NaN	NaN

5 rows × 299 columns

[5]:

# convert the dataframe into an anndata object
# ensure the data is not of object type (e.g. contains strings)
adata = ad.AnnData(
    X=csv.values[:,:-10].astype(np.float32),
)
adata.var_names = csv.columns.values[:-10]

[6]:

list(adata.var_names)[:3]

[6]:

['E2F1 (nuc median)', 'cycA (nuc median)', 'cycD1 (nuc median)']

Exploratory Data Analysis#

Before performing GMM thresholding, explore your data distribution to:

Understand the shape of your data distribution
Decide how many GMM components to use
Determine appropriate manual threshold values
Identify potential data quality issues

Initialize for Exploration#

[7]:

# Initialize the GMM thresholding object for exploration
# Note: We don't need to fit() or categorize_samples() yet for exploratory plots
gmm_explore = GMMThresholding(
    adata=adata,
    feature='cycD1 (nuc median)',
    label_obs_save_str='cell_cycle_explore'
)

Histogram Distribution#

[8]:

# Plot histogram without any thresholding
fig, ax = plt.subplots(figsize=(8, 5))

gmm_explore.plot_feature_distribution_exploratory(
    hist_kwargs={'bins': 100, 'color': 'steelblue', 'alpha': 0.7},
    x_axis_limits=(0, 0.08),
    ax=ax
)

ax.set_title('CycD1 Distribution - Exploratory View')
ax.set_xlabel('CycD1 (nuc median)')
ax.set_ylabel('Density')
plt.tight_layout()
plt.show()

../_images/tutorials_Single_Thresholding_Workflow_12_0.png

Strip Plot with Density#

[9]:

# Plot strip plot with density coloring (no thresholding yet)
fig, (ax_strip, ax_hist) = gmm_explore.plot_feature_strip_plot_exploratory(
    scatter_density=True,
    x_axis_limits=(0, 0.08),
    hist_kwargs={'bins': 100, 'color': 'black'},
)

plt.suptitle('CycD1 Distribution - Strip Plot with Density', y=1.02)
plt.tight_layout()
plt.show()

C:\Users\dap182\AppData\Local\Temp\ipykernel_49784\869146395.py:9: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  plt.tight_layout()

../_images/tutorials_Single_Thresholding_Workflow_14_1.png

[10]:

# initialize the gmm thresholding object with the adata object and the feature of interest
gmm_kwargs = {'init_params': 'k-means++', 'n_init':10, 'max_iter':1000}

gmm_thresholding = GMMThresholding(adata= adata,
                                    feature = 'cycD1 (nuc median)',
                                    label_obs_save_str= 'Low/High DNA Content',
                                    gmm_kwargs = gmm_kwargs)

# plot the bayesian information criterion curve to determine the optimal number of components
component_range = 5
gmm_thresholding.plot_bic_curve(
    component_range=component_range,
)

# the optimal number of components can be calculated without plotting as well
optimal_component_number = gmm_thresholding.determine_optimal_components(component_range=component_range)
print(f'The optimal number of components is {optimal_component_number}')

#fit the GMM with the optimal number of components
gmm_thresholding.fit(n_components=optimal_component_number)

The optimal number of components is 2

../_images/tutorials_Single_Thresholding_Workflow_15_1.png

Automatic Thresholding#

The standard workflow:

Determine optimal number of GMM components using BIC
Fit the GMM with the optimal number of components
Categorize samples using automatically calculated decision boundaries

[11]:

# Categorize samples using automatically calculated decision boundaries from the fitted GMM
# The ordered_labels parameter specifies the labels from low to high feature values
gmm_thresholding.categorize_samples(ordered_labels=['Low', 'High'])

[12]:

gmm_thresholding.plot_hist_distribution_with_boundaries(
    save_path=single_results_dir / 'basic_threshold_histogram.png'  # Optional: save the figure
)

Figure saved to: c:\Users\dap182\Documents\git\cc_mapping\notebooks\results\single\basic_threshold_histogram.png

[12]:

<Axes: title={'center': 'cycD1 (nuc median)'}, xlabel='cycD1 (nuc median)', ylabel='Density'>

../_images/tutorials_Single_Thresholding_Workflow_18_2.png

Visualize Results#

[13]:

#plots a strip plot with the labeled histogram and the decision boundaries, coloring by sample density
hist_kwargs = {"bins": 100, 'color': 'black'}
# Note: vmax is set to match the exploratory plot's auto-calculated range for consistent colorbars
gmm_thresholding.plot_strip_plot_histogram_with_decision_boundaries(hist_kwargs=hist_kwargs, vmax=4, y_axis_limits=(0,0.06));

../_images/tutorials_Single_Thresholding_Workflow_20_0.png

Strip plot colored by sample density:

[14]:

#plots a strip plot with the labeled histogram and the decision boundaries, coloring by sample labels
hist_kwargs = {"bins": 100, 'color': "black"}
gmm_thresholding.plot_strip_plot_histogram_with_decision_boundaries(y_axis_limits=(0,0.06),
                                                                    hist_kwargs=hist_kwargs,
                                                                    scatter_density=False,
                                                                    );

../_images/tutorials_Single_Thresholding_Workflow_22_0.png

Or colored by assigned labels:

Manual Thresholding#

Override automatically calculated thresholds by specifying manual_thresholds. This is useful for applying consistent thresholds across datasets or using domain-specific cutoff values.

[15]:

# Reinitialize to demonstrate manual thresholds
gmm_thresholding_manual = GMMThresholding(
    adata=adata,
    feature='cycD1 (nuc median)',
    label_obs_save_str='Manual Low/High DNA Content',
    gmm_kwargs=gmm_kwargs
)

# Fit the GMM (still useful for visualization even when using manual thresholds)
gmm_thresholding_manual.fit(n_components=optimal_component_number)

# Categorize with manual threshold at 0.05
# The manual_thresholds parameter overrides automatic boundary calculation
gmm_thresholding_manual.categorize_samples(
    ordered_labels=['Low', 'High'],
    manual_thresholds=[0.05]
)

# Visualize to see the manually set threshold
gmm_thresholding_manual.plot_hist_distribution_with_boundaries(resolution=100)

[15]:

<Axes: title={'center': 'cycD1 (nuc median)'}, xlabel='cycD1 (nuc median)', ylabel='Density'>

../_images/tutorials_Single_Thresholding_Workflow_25_1.png

Results and Utilities#

[16]:

adata = gmm_thresholding.return_adata()
adata

[16]:

AnnData object with n_obs × n_vars = 6797 × 289
    obs: 'Low/High DNA Content'
    uns: 'gmm_thresholding_events'

Thresholding metadata is stored in adata.uns['gmm_thresholding_events'].

### Utility Functions#

[17]:

# Generate a text report using the method from the base class
report = gmm_thresholding.generate_thresholding_report(
    output_format='text'
)
print(report)

Thresholding Report
==================================================

1. cycD1 (nuc median) (Standard Thresholding)
---------------------------------------------
   Feature: cycD1 (nuc median)
   Layer: None
   Obs column: Low/High DNA Content
   Components: 2
   Thresholds: [0.0191]
   Labels: ['Low', 'High']
   Cell counts: Low=5744, High=1053

==================================================
Total operations: 1
Operation types:
  - standard: 1

Or as a DataFrame:

[18]:

# Generate a DataFrame report
report_df = gmm_thresholding.generate_thresholding_report(
    output_format='dataframe'
)
display(report_df)

	Operation	Type	Feature	Layer	Obs Label	Components	Thresholds	Labels	Parent	Refined From	Total Cells
0	1. cycD1 (nuc median)	standard	cycD1 (nuc median)	None	Low/High DNA Content	2	0.0191	Low, High	None	N/A	6797

Combine categorical labels with boolean operators (AND, OR, XOR) for advanced filtering.

[19]:

# Create a synthetic treatment label for demonstration
np.random.seed(42)
n_cells = len(adata)
adata.obs['treatment'] = pd.Categorical(
    np.random.choice(['control', 'drug_A', 'drug_B'], size=n_cells)
)

# Combine two categorical labels using boolean operators
# operator options: 'AND', 'OR', 'XOR'
# Note: Parameter names have changed - use obs_key_1, match_values_1, etc.
adata = create_boolean_label_combination(
    adata,
    obs_key_1='treatment',
    match_values_1=['control'],
    obs_key_2='Low/High DNA Content',
    match_values_2=['High'],
    operator='AND',
    output_obs_key='control_and_high_DNA',
    true_label='control_with_high_DNA',
    false_label='other',
    overwrite=False  # Default: raises error if output_obs_key already exists
)

print("Combined label distribution:")
print(adata.obs['control_and_high_DNA'].value_counts())

Combined label distribution:
control_and_high_DNA
other                    6455
control_with_high_DNA     342
Name: count, dtype: int64