@rnmitchell
Contributor

Add PCA-based ancestry prediction to MixDeR, using 1000 Genomes data as the PCA reference.

The method first performs an initial deconvolution, using the same approach as the existing mixture deconvolution step with the 1000 Genomes global population allele frequency data. The 54 ancestry SNPs are then extracted from each inferred single-source SNP profile and added to the 1000 Genomes reference dataset, and PCA is performed for each contributor. The PCA plots are saved to the specified output directory. The user can then examine the plots and decide whether the contributor's ancestry can be determined. For the mixture deconvolution step, the user can choose either the allele frequency data for the predicted superpopulation (which will also be provided in MixDeR) or the global allele frequency data.

  • Update the Shiny app to let the user select the ancestry prediction step and specify settings for the initial mixture deconvolution and the subsequent inferred allele filtering.
  • Run mixture deconvolution using the user-specified settings and the global 1000 Genomes allele frequency data, apply the allele 1 and allele 2 probability thresholds, and create the final inferred genotypes for each contributor.
  • For each contributor separately: extract the ancestry SNPs, merge with the known 1000 Genomes data, run PCA and create the PC1 vs. PC2 plot.
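
The per-contributor step can be sketched roughly as follows. This is a minimal illustration in base R, not the code in this PR: the reference genotypes are simulated rather than read from 1000 Genomes, the 0/1/2 genotype coding, the three superpopulation labels, the colors, and all variable names are assumptions, and `prcomp` stands in for whatever PCA routine MixDeR actually calls.

```r
# Hypothetical sketch: merge one inferred profile with reference genotypes,
# run PCA on the merged matrix, and save a PC1 vs. PC2 plot.
set.seed(1)

n_snps <- 54                        # number of ancestry SNPs named in the PR
n_ref <- 60                         # simulated reference samples (NOT real 1000 Genomes data)
pops <- c("AFR", "EUR", "EAS")      # assumed superpopulation labels
superpops <- rep(pops, each = n_ref / length(pops))

# Simulate genotypes as 0/1/2 allele counts with population-specific
# allele frequencies so that the groups separate in PCA space.
freqs <- sapply(pops, function(p) runif(n_snps, 0.05, 0.95))
ref_geno <- t(sapply(superpops, function(p) rbinom(n_snps, 2, freqs[, p])))
rownames(ref_geno) <- paste0("ref", seq_len(n_ref))
colnames(ref_geno) <- paste0("rs", seq_len(n_snps))

# One inferred single-source profile for a contributor (simulated here).
unk <- rbinom(n_snps, 2, freqs[, "EUR"])

# Merge the unknown profile with the reference data, then run PCA.
merged <- rbind(ref_geno, Unk = unk)
keep <- apply(merged, 2, var) > 0   # drop constant SNPs; scaling would fail on them
pca <- prcomp(merged[, keep, drop = FALSE], center = TRUE, scale. = TRUE)

scores <- as.data.frame(pca$x[, 1:2])
scores$group <- c(superpops, "Unk")

# Plot PC1 vs. PC2, highlighting the unknown contributor in red.
cols <- c(AFR = "orange", EUR = "blue", EAS = "darkgreen", Unk = "red")
out_file <- file.path(tempdir(), "Unk_PC1_vs_PC2.pdf")
pdf(out_file)
plot(scores$PC1, scores$PC2, col = cols[scores$group], pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Unknown contributor vs. reference")
legend("topright", legend = names(cols), col = cols, pch = 19)
dev.off()
```

Because the unknown profile is merged before PCA is run, it contributes to the principal axes as well as being placed on them; with a single added sample the effect should be small, but projecting onto a reference-only PCA would be an alternative design.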

@rnmitchell rnmitchell marked this pull request as ready for review October 3, 2025 19:48
@rnmitchell
Contributor Author

This is ready for review @standage. Let's talk next week about it!

@rnmitchell rnmitchell requested a review from standage October 3, 2025 19:48
centroids = function(groups, pca, inpath, ID) {
dir.create(file.path(inpath, "Centroids_Plots"), showWarnings = FALSE, recursive=TRUE)

ancestry_colors = read.table("/Users/rebecca.mitchell/Desktop/ancestry_colors.txt", header=T, sep="\t") %>%
Member

To fix, if not already done: this path is hard-coded.
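
One possible shape for the fix, as a sketch only: pass the file location in as an argument instead of embedding a user-specific path. The helper name `load_ancestry_colors` is hypothetical, the column names are taken from the `add_row()` calls in this file, and base `rbind` is used here in place of `add_row` so the example has no package dependencies.

```r
# Hypothetical helper: read the ancestry color table from a caller-supplied
# path rather than a hard-coded location on one machine.
load_ancestry_colors <- function(path) {
  if (!file.exists(path)) {
    stop("ancestry color file not found: ", path)
  }
  # comment.char = "" keeps hex colors such as "#1f77b4" intact; the default
  # comment character "#" would cut the line off at the color value.
  colors <- read.table(path, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE, comment.char = "")
  # Extra rows for the unknown sample and the centroid, mirroring the
  # add_row() calls in the PR (assumes the file has exactly these columns).
  extra <- data.frame(
    id = c("Unk", "Centroid"),
    reg = c("Unk", "Centroid"),
    population = c("Unk", "Centroid"),
    color = c("red", "black"),
    superpop_color = c("red", "black"),
    stringsAsFactors = FALSE
  )
  rbind(colors, extra)
}

# Usage with a throwaway file containing one made-up row.
demo_path <- tempfile(fileext = ".txt")
write.table(
  data.frame(id = "GBR", reg = "EUR", population = "British",
             color = "#1f77b4", superpop_color = "blue",
             stringsAsFactors = FALSE),
  demo_path, sep = "\t", quote = FALSE, row.names = FALSE
)
ancestry_colors <- load_ancestry_colors(demo_path)
```

If the color table ships with the package, the path could instead be resolved with `system.file()`, assuming the file is placed under the package's `inst/` directory.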

Comment on lines +34 to +37
ncols=ncol(geno)
geno_filt=geno[,c(7:ncols)]
snps = data.frame("snp_id"=colnames(geno_filt))
snps = snps %>%
Member

Code autoformatting could give a more consistent style in these files. Something to consider.


ancestry_colors = read.table("/Users/rebecca.mitchell/Desktop/ancestry_colors.txt", header=T, sep="\t") %>%
add_row(id = "Unk", reg = "Unk", population = "Unk", color="red", superpop_color="red") %>%
add_row(id= "Centroid", reg = "Centroid", population = "Centroid", color = "black", superpop_color="black")
Member

Showing up as white for some reason?

3 participants