Fig. 2 | BMC Methods

Fig. 2

From: Discovery of optimal cell type classification marker genes from single cell RNA sequencing data

NS-Forest version 4.0 workflow. The algorithm takes as input an anndata object in .h5ad format containing the cell-by-gene expression matrix and the cluster label for each cell (step 1). The median expression of each gene in each cluster (i.e., a cluster-by-gene median matrix) is calculated, and genes that have positive median expression in at least one cluster are pre-selected (not shown). The Binary Expression Score (see Methods for an explanation of the notation) is then calculated for each cluster-gene pair (step 2), producing a cluster-by-gene Binary Score matrix (note that a gene may have different Binary Score values in different clusters). A dataset-specific threshold is calculated from the Binary Score distribution and the user-selected criterion (mild, moderate, or high). For each cluster, genes with a Binary Expression Score greater than or equal to this threshold are selected as candidate genes (step 3). These candidate genes are used to build a binary classification model for each cluster with the random forest (RF) machine learning method. Features (genes) are extracted from the RF model and ranked by the Gini Impurity index, and the top RF features are then reranked by their pre-calculated Binary Scores (step 4). A short list of the top-ranked candidate genes, which rank high in the RF classification models and also have high Binary Scores, is passed to decision tree feature evaluation to determine the best marker gene combination. A single-split decision tree is built for each evaluated gene to determine its optimal expression threshold for classification. All combinations of any length of these genes are considered, using 'AND' logic to combine the decision trees, and the best combination is the one with the highest F-beta score, which serves as the objective function for optimizing overall classification performance (step 5). The F-beta score, Positive Predictive Value (PPV, a.k.a. precision), recall, On-Target Fraction, and the true/false positive/negative classification counts are reported for each cluster, serving as metrics for evaluating the performance of the final marker gene combinations (step 6).
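The combination search in step 5 can be illustrated with a minimal sketch. This is not the NS-Forest implementation: the function names `fbeta` and `best_marker_combination`, the toy data, the gene names, and the choice of beta = 0.5 are assumptions made for illustration. The per-gene expression thresholds, which the workflow derives from single-split decision trees, are simply supplied as inputs here.

```python
from itertools import combinations


def fbeta(tp, fp, fn, beta=0.5):
    """F-beta score from confusion counts; returns 0.0 when undefined.

    beta=0.5 is an assumed default for this sketch (it weights precision
    more heavily than recall).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_marker_combination(expr, is_target, thresholds, beta=0.5):
    """Exhaustively score every non-empty gene combination with AND logic.

    expr:       dict mapping gene -> list of expression values, one per cell
    is_target:  list of bools, True if the cell belongs to the target cluster
    thresholds: dict mapping gene -> expression cutoff (hypothetical inputs;
                in the described workflow these come from single-split
                decision trees)
    Returns (best_score, best_combo); ties keep the first combination
    encountered, which favors shorter combinations.
    """
    genes = list(thresholds)
    n_cells = len(is_target)
    best_score, best_combo = -1.0, ()
    for k in range(1, len(genes) + 1):
        for combo in combinations(genes, k):
            # A cell is called positive only if every gene exceeds its cutoff.
            pred = [all(expr[g][i] > thresholds[g] for g in combo)
                    for i in range(n_cells)]
            tp = sum(p and t for p, t in zip(pred, is_target))
            fp = sum(p and not t for p, t in zip(pred, is_target))
            fn = sum((not p) and t for p, t in zip(pred, is_target))
            score = fbeta(tp, fp, fn, beta)
            if score > best_score:
                best_score, best_combo = score, combo
    return best_score, best_combo


# Toy data: six cells, the first three belong to the target cluster.
toy_expr = {
    "geneA": [5, 4, 0, 3, 0, 0],
    "geneB": [3, 2, 4, 0, 0, 2],
}
toy_target = [True, True, True, False, False, False]
toy_thresholds = {"geneA": 1, "geneB": 1}

score, combo = best_marker_combination(toy_expr, toy_target, toy_thresholds)
print(combo, round(score, 3))
```

In this toy case each gene alone produces one false positive, while requiring both genes removes the false positives at the cost of one missed cell, so the two-gene combination scores highest under a precision-weighted F-beta.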
