# Variation in PCA weights

I have weights of SNP variation (output through Eigenstrat program) for each SNP for the three main PCs. I wish to reduce my list of SNPs to those that show maximum differentiation between the three PCs. Can anyone help me with which statistical method to use to do this.

say, if each PC describes the magnitude of variation, how can I find mutually exclusive rows or rows that define the strongest differenciator of each PC. e.g:

SNPNam PC1-wt PC 2-wt PC3-wt
SNP_1 -1.489 -0.029 -0.507
SNP_2 -1.446 -0.816 0.661
SNP_3 -1.416 0.338 1.631
SNP_4 -1.392 -1.452 0.062
SNP_5 -1.278 0.362 -1.006
SNP_6 -1.21 0.514 0.144
SNP_7 -1.119 -0.633 0.163
SNP_8 -1.112 -0.193 -0.256
SNP_9 -1.054 1.081 -1.519
SNP_10 -0.936 -1.052 -0.419
SNP_11 -0.861 0.381 -0.207
SNP_12 -0.662 0.852 -0.211
SNP_13 -0.503 -1.602 0.585
SNP_14 -0.417 0.529 -1.003
SNP_15 0.101 -0.85 -0.258
SNP_16 0.198 -0.435 -1.599
SNP_17 0.588 -1.292 -1.257
SNP_18 1.167 0.891 1.106
SNP_19 1.35 0.036 0.729
SNP_20 1.532 1.599 0.499

Any help regarding which test to perform and how to (which package) is very much appreciated.

It looks like you are referring to eigenanalysis for SNPs data and the article from Nick Patterson, Population Structure and Eigenanalysis (PLoS Genetics 2006), where the first component explains the largest variance on allele frequency wrt. potential stratification in the sample (due to ethnicity or, more generally, ancestry). So I wonder why you want to consider all three first components, unless they appear to be significant from their expected distribution according to TW distribution. Anyway, in R you can isolate the most informative SNPs (i.e. those that are at the extreme of the successive principal axes) with the apply() function, working on row, e.g.

apply(snp.df, 1, function(x) any(abs(x)>threshold))


where snp.df stands for the data you show and which is stored either as a data.frame or matrix under R, and threshold is the value you want to consider (this can be Mean $\pm$ 6 SD, as in Price et al. Nature Genetics 2007 38(8): 904, or whatever value you want). You may also implement the iterative PCA yourself.

Finally, the TW test can be implemented as follows:

##' Test for the largest eigenvalue in a gaussian covariance matrix
##'
##' This function computes the test statistic and associated p-value
##' for a Wishart matrix focused on individuals (Tracy-Widom distribution).
##'
##' @param C a rectangular matrix of bi-allelic markers (columns) indexed
##'        on m individuals. Caution: genotype should be in {0,1,2}.
##' @return test statistic and tabulated p-value
##' @reference \cite{Johnstone:2001}
##' @seealso The RMTstat package provides other interesting functions to
##'          deal with Wishart matrices.
##' @example
##' X <- replicate(100,sample(0:2,20,rep=T))
tw.test <- function(C) {
m <- nrow(C)  # individuals
n <- ncol(C)  # markers
# compute M
C <- scale(C, scale=F)
pj <- attr(C,"scaled:center")/2
M <- C/sqrt(pj*(1-pj))
# compute X=MM'
X <- M %*% t(M)
ev <- sort(svd(X)\$d, decr=T)[1:(m-1)]
nprime <- ((m+1)*sum(ev)^2)/(((m-1)*sum(ev^2))-sum(ev)^2)
l <- (m-1)*ev[1]/sum(ev)
# normalize l and compute test statistic
num <- (sqrt(nprime-1)+sqrt(m))
mu <- num^2/nprime
sigma <- num/nprime*(1/sqrt(nprime-1)+1/sqrt(m))^(1/3)
l <- (l-mu)/sigma
# estimate associated p-value
if (require(RMTstat)) pv <- ptw(l, lower.tail=F)
else pv <- NA
return(list(stat=l, pval=pv))
}