Implementation of RVT1 and RVT2 in R and SQL
introduction. RVT1 is a model for finding genes in which more than one rare variant can contribute to a phenotype. I introduced RVT1 in more detail in this post, but here is a quick summary. The idea is that you have called variants on a bunch of case/control samples and you want to test the hypothesis that a number of your cases have different rare variants in the same gene, leading to the same loss of function and same phenotype. This is useful when your sample size is too small to detect the impact of these rare variants individually. Indeed, the ability to look at the impact of multiple variants too rare to be included on SNP arrays is considered to be one of the major benefits of exome sequencing. Formally, Morris and Zeggini 2010 define RVT1 as follows:
Consider a sample of unrelated individuals, phenotyped for a normally distributed trait, and typed for rare variants in a gene or small genomic region. Let ni denote the number of rare variants for which theith individual has been successfully genotyped, and let ri denote the number of these variants at which they carry at least one copy of the minor allele. We can model the phenotype, yi, of the ith individual in a linear regression framework, given by yi = E[yi]+ɛi, where ɛi,σE), and
In this expression, xi denotes a vector of covariate measurements for the ith individual, with corresponding regression coefficients β. The parameter λ is the expected increase in the phenotype for an individual carrying a full complement of minor alleles at rare variants compared to an individual carrying none.
update 2012-10-08: below code now includes RVT2 as well. See Morris and Zeggini 2010 for explanation of that model.
Note that when they say “linear regression,” it’s linear if you have a quantitative phenotype and logistic if you have a dichotomous phenotype. update 2012-10-08: it may also make more sense to do linear regression even with a dichotomous phenotype, and this code has been updated to do that. see comment. end update There’s already an R implementation of RVT1 in AssotesteR, described here and source code available here. But it’s somewhat limited because it only correlates genotype with phenotype; there is no room to add in xi, that vector of covariates, such as principal components of population substructure. And for my purposes, most of the battle was just applying the right filters and getting the data into the right format and into R, then getting it out of R again into a format I can play with. At that point I felt like it made more sense to just do my own implementation end-to-end.
assumptions. This post assumes you’ve already called variants and read them into SQL tables as I specified in this GATK pipeline– and that you’ve also gotten annotations as to the (population, not sample) allele frequency of your variants– you can use annovar as described here. You’ll also need R, RPostgreSQL, and PostgreSQL.
the code. Here is an R script with a SQL script embedded in it. The SQL code aggregates the rare variants by gene, R fits the logistic regression model for each gene, and then puts the results into a new SQL table so you can play with them later.
First, create a table of source data in SQL:
-- create table of underlying data for RVT1 and RVT2 models
drop table if exists rvt_data;
create table rvt_data as
select rvgm.gene_name,
sv.sid,
s.phen_code,
sum(case when ((sv.variant_allele_count > 0 and rvgm.vaf < .125) or (sv.variant_allele_count < 2 and rvgm.vaf > .875)) and use_flag then 1 else 0 end)::numeric ri, -- number of rare variant loci where this sample has a rare variant
sum(case when use_flag then sv.alleles_genotyped else 0 end)::numeric/2.0 ni -- number of rare variant loci genotyped for this sample
from variants v, sample_variants sv, samples s,
(select e.vid, e.gene_name, coalesce(a.n1000g2012feb_all,0.0) vaf
from annovar_genome_summary a, effects e
where e.vid = a.vid
and (a.n1000g2012feb_all < .125 or a.n1000g2012feb_all is null or a.n1000g2012feb_all > .875) -- specify your MAF threshold in this line. I used 12.5%.
and e.effect_impact in ('MODERATE','HIGH') -- sepcify your inclusion criteria here
group by e.vid, e.gene_name, coalesce(a.n1000g2012feb_all,0.0)
order by e.vid, e.gene_name) rvgm -- [rare variant]-[gene] match
where v.vid = sv.vid
and sv.sid = s.sample_id
and v.vid = rvgm.vid
group by rvgm.gene_name, sv.sid, s.phen_code
order by rvgm.gene_name, sv.sid
;
Then pull the contents of that table into R, run the analysis and put the RVT1 and RVT2 results each into a new table:
library(RPostgreSQL) # first load package RPostgresql
drv <- dbDriver("PostgreSQL")
readcon <- dbConnect(drv, dbname="mydb", user="postgres", password="password")
writecon <- dbConnect(drv, dbname="mydb", user="postgres", password="password")
nsamples <- 50
readsql <- "
select gene_name, sid, phen_code, ri, ni
from rvt_data
order by gene_name, sid
;"
rs <- dbSendQuery(readcon,readsql)
writesql <- "
drop table if exists rvt1_results;
create table rvt1_results (
gene_name varchar,
lambda numeric,
p_value numeric
);
drop table if exists rvt2_results;
create table rvt2_results (
gene_name varchar,
lambda numeric,
p_value numeric
);
"
tmp <- dbSendQuery(writecon,writesql)
while (!dbHasCompleted(rs)) { # for each gene
tbl <- fetch(rs,n=nsamples) # get the data for the fifty samples
nonzero <- (tbl$ni != 0) # find the rows where ni is nonzero, i.e. the sample had at least one locus genotyped
tbl <- tbl[nonzero,] # restrict the analysis to rows where ni is nonzero
if(nrow(tbl) < .9*nsamples) next # only analyze genes where at least 90% of samples can be included
gene_name <- tbl$gene_name[1]
# RVT1
rini <- tbl$ri/tbl$ni # ri/ni
if (length(unique(rini)) != 1) { # skip monomorphic cases
model <- lm(tbl$phen_code ~ rini) # you can add your xi vector of covariates to this model here
coeff <- summary(model)$coef["rini","Estimate"]
p_value <- summary(model)$coef["rini","Pr(>|t|)"]
writesql <- paste("insert into rvt1_results (gene_name,lambda,p_value) values('",gene_name,"',",coeff,",",p_value,");",sep="")
tmp <- dbSendQuery(writecon,writesql)
}
# RVT2
Iri <- 1*(tbl$ri > 0) # indicator variable for whether any rare variants are present. multiplying by 1 converts boolean t/f to integer 1/0
if (length(unique(Iri)) != 1) { # skip monomorphic cases
model <- lm(tbl$phen_code ~ Iri)
coeff <- summary(model)$coef["Iri","Estimate"]
p_value <- summary(model)$coef["Iri","Pr(>|t|)"]
writesql <- paste("insert into rvt2_results (gene_name,lambda,p_value) values('",gene_name,"',",coeff,",",p_value,");",sep="")
tmp <- dbSendQuery(writecon,writesql)
}
}
dbDisconnect(readcon)
dbDisconnect(writecon)
dbUnloadDriver(drv)
In writing this I found useful Christopher Manning’s tutorial on logistic regression in R and this stackoverflow discussion of how to extract coefficients from an R model. I also learned that in R it appears there really is no better way to put variables into a SQL statement than by concatenation [1][2][3]. And I discovered that my new least favorite thing about R is that, when results for a variable are undefined, that variable’s coefficient is simply eliminated from the result matrix, thus making indexing impossible. It’s mentioned in this stackoverflow discussion, though the answer marked correct there did not work for me– instead, I just figured out I could use the if/else clause with update 2012-10-08: I now check for monomorphic variants with is.na
as shown in the code above.if (length(unique(rini)) != 1)
. It’s also possible to run the model first and then exclude results with “NA” coefficients with this line of code: if(is.na(coef(model)["rini"]))
. end update Finally,if you run this and encounter some error you will need to dump all the remaining database results before you can close or reuse the connection. That can be achieved with dbClearResult(rs)
.
A note about how this code is structured. In a large dataset it’s likely at least one sample will have zero rare variant loci genotyped for at least one gene that has such loci. So if you divide ri/ni in your SQL code you’ll get a division by zero error. And if you use a where
clause to exclude those rows, then you won’t have exactly nsamples
rows per gene in the table you read into R, which would make it more complicated to iterate through the results. So to avoid that problem I only calcuate ri and ni in SQL, and then do the zero-checking and division in R.
next steps. After you’ve gotten your p values using this script, you can run this matplotlib scriptto QQ plot them. And once you’ve created variables to capture any population substructure you wish to control for, you can add those into your glm model easily, ex. tbl$phenotype ~ rini + covariates
.
update 2012-09-25: the code above has been updated to include variants where the reference genome actually has the minor allele. (Which happens often: in my dataset, about 10% of the variants have a 1000 genomes frequency > 50%, implying the reference has a frequency < 50%). To correctly handle those cases, I increment ri for samples where the “vaf” (variant allele frequency) is < .125 and the sample has at least one variant allele, OR the vaf > .875 and the sample has at least one reference allele.
update 2012-09-27: fixed bug. sum(sv.alleles_genotyped)::numeric
changed to sum(sv.alleles_genotyped)::numeric/2.0
update 2012-10-08: I overhauled all the above code to make the following changes: (1) changed from logistic to linear model (see discussion below), (2) stored model’s input data in a separate table so you can query it later, (3) allowed it to run on the subset of samples with at least one successfully genotyped locus, down to a genotyping rate threshold (currently set to .9) and (4) added the RVT2 model. Also here is a SQL query that provides table showing the RVT1 results along with some summary data about each gene which makes this more intuitive to browse rather than just looking at p values alone:
select r.gene_name,
r.lambda,
r.p_value,
least(r.p_value*tests.n,1) bonf, -- p value after correction for multiple testing
sum(case when d.phen_code = 1 then d.ri else 0 end) case_rv, -- total number of rare variants in cases
sum(case when d.phen_code = 0 then d.ri else 0 end) control_rv, -- total number of rare variants in controls
max(d.ni) n_loci -- number of rare variant loci genotyped
from rvt_data d, rvt1_results r, (select count(*)::numeric n from rvt1_results) tests
where d.gene_name = r.gene_name
group by r.gene_name,
r.lambda,
r.p_value,
tests.n
order by r.p_value asc
;