Python Example: Running the Hypergeometric Test

FOX requires two inputs: The AUC matrix provided by AUCell and the RSS matrix calculated by SCENIC. The functions returnThreshold_dictionary and rundifferential_test_AUC explain how we test for the expression of regulons.

Hypergeometric testing by FOX (internals)

First, we define the essential helper functions needed to conduct differential testing.

def returnThreshold_dictionary(threshold_file, regulonC = None):
    """
    Returns a dictionary of thresholds based on the input threshold file.
    The file is processed depending on whether it's a 3.5 format or not.

    :param threshold_file: Path to the file containing threshold data.
    :param regulonC: The column/row of the regulon to use.
    :return: A dictionary mapping regulon names to threshold values.
    """
    thresholds, regulonNames, regulonThresholds = None, [], []

    # Handle case where threshold_file is in 3.5 format
    if "3.5_" in threshold_file:
        thresholds = pd.read_csv(threshold_file, sep="\t").T
        regulonNames = list(map(lambda x: x.split(" ")[0], thresholds.loc["regulon"].tolist()))
        regulonThresholds = thresholds.loc[regulonC]  # Row-based threshold extraction
    else:
        thresholds = pd.read_csv(threshold_file)
        regulonNames = list(map(lambda x: x.split(" ")[0], thresholds[regulonC]))
        regulonThreshold = thresholds[regulonC]  # Column-based threshold extraction

    threshold_dict = {}

    for i in range(len(regulonNames)):
        threshold_dict[regulonNames[i]] = regulonThresholds[i]
        i += 1

    return threshold_dict

def rundifferential_test_AUC(df, metaClusterLabel, control, treatment, regulon_name, threshold_file, alpha = .05):
    """
    This function performs a hypergeometric test on the differential expression of regulons
    between control and treatment groups.

    :param df: DataFrame containing the AUC values for all cells.
    :param metaClusterLabel: Label for the meta-cluster (e.g., cell type or sample group).
    :param control: Label for the control group.
    :param treatment: Label for the treatment group.
    :param regulon_name: The name of the regulon being tested.
    :param threshold_file: Path to the file containing thresholds for regulons.
    :param alpha: The significance level for testing.

    :return: Tuple containing the regulon name and p-values for both control and treatment groups.
    """
    df.columns = list(map(lambda x: x.split(" ")[0], df.columns))  # Clean up column names

    if threshold_file is None:
        calculate_mean = df[regulon_name].mean()
        df['is_successful_{}'.format(regulon_name)] = df[regulon_name] > df[regulon_name].mean()
    else:
        threshold = returnThreshold_dictionary(threshold_file, 'threshold')
        df['is_successful_{}'.format(regulon_name)] = df[regulon_name] > threshold[regulon_name]

    # Count successful control cells
    control_successful_cells = df[(df[metaClusterLabel] == control) & (df['is_successful_{}'.format(regulon_name)])].shape[0]

    # Count successful treatment cells
    treatment_successful_cells = df[(df[metaClusterLabel] == treatment) & (df['is_successful_{}'.format(regulon_name)])].shape[0]

    total_successful_cells = df[df['is_successful_{}'.format(regulon_name)]].shape[0]

    # Total population size
    total_population = df.shape[0]
    n_1 = df[df[metaClusterLabel] == control].shape[0]
    n_2 = df[df[metaClusterLabel] == treatment].shape[0]

    # Calculate p-values using the hypergeometric distribution
    p_value_1 = hypergeom.sf(control_successful_cells - 1, total_population, total_successful_cells, n_1)
    p_value_2 = hypergeom.sf(treatment_successful_cells - 1, total_population, total_successful_cells, n_2)

    return (regulon_name, round(p_value_1, 2), round(p_value_2, 2))

Preparation of data for FOX

To run FOX, we’re assuming that you are using Seurat. We are using SCENIC, after you run it you should see a file named 3.4_regulonAUC.Rds with a file named 3.5_AUCellThresholds_Info.tsv. You want to save both of those!

library(SCENIC)
library(AUCell)

regulonAUC = readRDS("3.4_regulonAUC.Rds")

## optional step... remove the extended "low-confidence"
cells_AUC = regulonAUC[!grepl("extended", rownames(regulonAUC)), ]

cells_AUC = getAUC(cells_AUC)
seurat_obj@meta.data <- cbind(seurat_obj@meta.data, t(cells_AUC))

write.csv(seurat_obj@meta.data, file = "RAS_AUC_values.csv")
## calculate then the RSS file!

cellInfo <- data.frame(seuratCluster=Idents(seurat_obj))
rss <- calcRSS(AUC=getAUC(regulonAUC), cellAnnotation=cellInfo[colnames(regulonAUC), "seuratCluster"])

# if you removed the extended earlier, remove them down here
rss = rss[!grepl("extended", rownames(rss)), ]

write.csv(rss, "rss_values_.csv")

Usage Example

To run FOX, you’ll need to prepare your data (such as RSS matrices and metadata) and pass it to the class. Here’s an example of how to initialize and use FOX:

from FOXREG import ComparisonTree
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Read in the data
data = pd.read_csv("rss_values_.csv")  # RSS values
df_RAS = pd.read_csv("RAS_AUC_values.csv")  # AUC metadata

# Define labels for your comparison
labels = data.columns[1:].tolist()

# Initialize the ComparisonTree with your data
comparison = ComparisonTree(
    "<Baseline >", df_RAS, "newLabels", data, labels, "Unnamed: 0",
    "3.5_AUCellThresholds_Info.tsv"
)

# Build the tree and run analyses
comparison.construct_tree()
p_vals = comparison.plotRSS_NMF("B", drawQuadrants=True, include_pvals=True)
comparison.plot_3dEmbedding(rawRSS=False)
comparison.analyze_factors("B", percentages=True)
comparison.compareLayers("B", "Naive CD4 T", 0.055)
tr = comparison.create_global_tree()
tree, dict_save = tr

Explanation of Functions

returnThreshold_dictionary: This function parses the threshold file and returns a dictionary that maps regulon names to threshold values. It handles different file formats.
rundifferential_test_AUC: This function runs a differential test on regulon activity between control and treatment groups, using the hypergeometric distribution to compute the statistical significance of observed differences.

—