---
title: "denovolyzeR intro"
author: "James Ware"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{denovolyzeR_intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

## Statistics for *de novo* variant analysis

#### Introduction

This package provides functions to analyse *de novo* genetic variants using the statistical framework described in [Samocha *et al* (2014) Nature Genetics 10.1038/ng.3050](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4222185/). This vignette demonstrates the usage of the `denovolyzeR` package to recapitulate the analyses described in this paper:  

1. What is the overall burden of *de novo* variation in the study cohort: are there more *de novos* than expected?
  As well as looking at the total burden of *de novos*, results are returned for different classes of variant, such as loss-of-function (lof), missense (mis) and synonymous (syn).
2. Do *de novo* variants cluster in specific genes: are there more genes containing multiple *de novos* than expected?
3. Are there any individual genes that contain more *de novos* than expected?

If using the package, please cite [Ware *et al* (2015) Curr Protoc Hum Genet. 10.1002/0471142905.hg0725s87](http://www.ncbi.nlm.nih.gov/pubmed/26439716).  

## Installation
```
# Install the package if you haven't already.
# OPTION 1 - install the latest release from CRAN:
install.packages("denovolyzeR")  

# OPTION 2 - install the latest development version from GitHub.  Either download and install, or use devtools:
if(!"devtools" %in% installed.packages()){
  install.packages("devtools")
}
devtools::install_github("jamesware/denovolyzeR")
```

## Example analysis

We start with a table of de novo variants. An example dataset is provided:

``` {r}
library(denovolyzeR)
# have a look at the example data:
dim(autismDeNovos)
head(autismDeNovos)
```

#### Overall *de novo* burden

First we want to know whether there are more de novos than expected, using the `denovolyzeByClass()` function. These variants were obtained by sequencing 1,078 cases, so we use `nsamples=1078`.


``` {r}
denovolyzeByClass(genes=autismDeNovos$gene,
                  classes=autismDeNovos$class,
                  nsamples=1078)
```

The total number of de novos is almost exactly as our model predicts.  However, we see a statistically significant excess of LOF variants in this population.

#### Genes containing multiple *de novos* 

Next, we look to see if the total number of genes that contain more than one de novo is greater than expected, using the `denovolyzeMultiHits()` function.

``` {r}
denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078)
```

obs = the number of genes in our dataset with >1 *de novo* variant  
expMean = the expected number of genes containing >1 *de novo*: an average obtained by permutation  
expMax = the maximum number of genes containing >1 *de novo* in `nperms` permutations (default `nperms=100`)  
pValue = an empirical p value  
nVars = the total number of *de novo* variants in each class  
Note that the number of observed genes with >1 protein-altering variant does **not** equal the number of genes with >1 lof + number of genes with >1 missense, as genes containing 1 lof + 1 missense will only be counted as "multihits" in the combined analysis.

Here it looks like there may be an excess of genes with >1 lof variant, >1 missense, and >1 protein-altering variant.  We will want to increase the number if permutations here to get a handle on our level of significance.

``` {r}
denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078,
                    nperms=1000)
```

There is another important option here.  The expected number of genes containing >1 hit is obtained by permutation: Given n *de novo* variants, how many genes contain >1 de novo?  There are two options for selecting n: by default it is derived from your data: e.g. in the example above autismDeNovos contains `r sum(autismDeNovos$class %in% c("frameshift","non","splice"))` lof variants, so this is the number used in the permutation. This is controlled by the default parameter `nVars="actual"`
``` {r}
sum(autismDeNovos$class %in% c("frameshift","non","splice"))
```

This is a conservative approach, addressing the question: "given the number of variants in our dataset, do we see more genes with >1 variant than expected?"

An alternative approach simply asks whether there are more genes with >1 variant than our de novo model predicts.  This is accessed by setting `nVars="expected"`.

``` {r}
denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078,
                    nperms=1000,
                    nVars="expected")
```

#### Do any individual genes contain more *de novos* than expected

We see `r library(dplyr); autismDeNovos %>% filter(class %in% c("frameshift","non","splice")) %>% group_by(gene) %>% summarize(n=n()) %>% filter(n>1) %>% nrow` genes containing >1 *de novo* lof variant.  This is more than expected, but are any of these genes individually significant?  We can `denovolyzeByGene()` to find out.

By default this function compares the number of LOF variants against expectation for each gene, and then the total number of protein-altering variants (LOF + missense).  It can also be configured to return other classes if relevant.

``` {r}
head(
denovolyzeByGene(genes=autismDeNovos$gene,
                 classes=autismDeNovos$class,
                 nsamples=1078)
  )
```

Several genes meet statistical significance after correcting for multiple testing.  Default options apply two tests across `r load("../R/sysdata.rda"); length(unique(pDNM$geneName))` genes, so a Bonferroni corrected p-value threshold at $\alpha$ = 0.05 would be $`r signif(0.05*0.5*(1/length(unique(pDNM$geneName))),2)`$.

#### Geneset analysis

The analyses presented so far have been exome-wide.  It may be appropriate to restrict analyses to a geneset of interest - for example, it may be relevant to examine the burden of *de novo* variation in a pathway of interest, or initial variant detection may have been restricted to a set of candidate genes (rather than whole exome sequencing).  All of the above funtions can be targeted to a subset of genes using the `includeGenes` argument.

The package includes as an example a list of `r nrow(fmrpGenes)` genes that interact with the fragile X mental retardation protein (FMRP).  Is this geneset enriched for *de novos*, and recurrent *de novos*, in our autism trios?

```{r geneset}
nrow(fmrpGenes); head(fmrpGenes)

denovolyzeByClass(genes=autismDeNovos$gene,
                  classes=autismDeNovos$class,
                  nsamples=1078,
                  includeGenes=fmrpGenes$geneName)

denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078,
                    nperms=1000,
                    includeGenes=fmrpGenes$geneName)
```

## Other functions

`viewProbabilityTable` provides access to the underlying *de novo* probability tables used to calculate expected *de novo* burdens throughout this package:

```{r viewProbabilityTable}
head(
  viewProbabilityTable()
  )

head(
  viewProbabilityTable(format="long")
  )
```

Most of the core functionality of this package is contained in the `denovolyze` function.  The `denovolyzeByClass` and `denovolyzeByGene` functions used in this vignette are convenience functions that set defaults appropriate to the most common usages of this function.  Full details of additional options and default behaviours are available using `?denovolyze`.