Fig. 1
From: scAI-SNP: a method for inferring ancestry from single-cell data

Schematic of model training and inference. a The geographic locations of the 26 populations in the 1KGP dataset are shown on the truncated world map, colored by one of five geographic regions. The right side lists each population's description and the three-letter code used throughout the paper. b The model training schematic is shown on top. The training data is shown as a matrix of 3201 individuals by 70 million genotypes at SNP sites. We first subset the SNPs to keep 4.5 million ancestry-informative SNP sites, as described in Sect. 2.2 of Methods. We then conducted PCA to reduce the dimensionality to 600 principal components (PCs). Model prediction is shown on the bottom. An example of a user input of one sample is shown, although users can input an arbitrary number of samples. The input data is first centered using the mean vector from the training data, imputed at sites with missing genotypes, and scaled appropriately, as described in Sect. 2.5 of Methods. Convex optimization is then used to compute the contributions of the 26 population groups to the ancestry of the input data, with the constraint that the contributions are non-negative and sum to 1. c The confusion matrix of classification results after splitting the 1KGP data into training and test sets (80% and 20%) is shown. Each test input is assigned to the population group with the maximum contribution to that individual. Missing data was simulated by randomly removing 99% of the SNP sites from the data. Classification accuracy with 99% missing data is 86% (with no missing data, the accuracy is 89%), and all but three of the misclassifications still fall within the same geographic region. Supplementary Fig. 1 shows the confusion matrices generated with different degrees of missing genotypes.
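
The prediction step in panel b (contributions constrained to be non-negative and to sum to 1, followed by the maximum-contribution classification used in panel c) can be illustrated with a minimal sketch. The caption does not state the objective function or the solver, so the following are assumptions made only for illustration: a least-squares fit of the sample to per-population mean vectors in PC space, solved by projected gradient descent. The function names, the toy population labels, and all numbers are hypothetical and are not taken from the scAI-SNP implementation.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t  # the last j that passes the test gives the shift
    return [max(x - theta, 0.0) for x in v]


def estimate_contributions(pop_means, x, n_iter=2000):
    """Minimize ||sum_k w_k * pop_means[k] - x||^2 over the simplex.

    ASSUMPTION: least-squares objective and projected gradient descent;
    the paper only says "convex optimization" with simplex constraints.
    """
    k, d = len(pop_means), len(x)
    # Conservative step size: 2 * squared Frobenius norm bounds the
    # Lipschitz constant of the gradient.
    lipschitz = 2.0 * sum(m * m for row in pop_means for m in row)
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0
    w = [1.0 / k] * k  # start from equal contributions
    for _ in range(n_iter):
        fitted = [sum(w[i] * pop_means[i][j] for i in range(k)) for j in range(d)]
        resid = [fitted[j] - x[j] for j in range(d)]
        grad = [2.0 * sum(pop_means[i][j] * resid[j] for j in range(d))
                for i in range(k)]
        w = project_to_simplex([w[i] - step * grad[i] for i in range(k)])
    return w


# Toy example (made-up numbers): three hypothetical populations in a 2-PC space.
labels = ["POP_A", "POP_B", "POP_C"]
pop_means = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [0.7, 0.3]  # constructed as 0.7 * POP_A + 0.3 * POP_B

weights = estimate_contributions(pop_means, x)
# Classification rule from panel c: the population with the maximum contribution.
predicted = labels[max(range(len(weights)), key=weights.__getitem__)]
print([round(w, 3) for w in weights], predicted)
```

In this toy case the recovered weights are close to 0.7, 0.3, and 0, and the sample is assigned to the hypothetical POP_A. With 26 populations and 600 PCs, a dedicated convex solver would likely be a better choice than this hand-written loop.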