Variant Sets

ASTRID contains results from the analysis of three single-nucleotide variant (SNV) sets representing germline neutral variation (ExAC), known germline pathogenic variation (ClinVar), and somatic mutations recurrent across multiple tumor samples (COSMIC).  

Missense variants from each set were mapped onto representative protein structures via Ensembl transcripts, which were matched to UniProt accessions and Protein Data Bank (PDB) IDs using the cross-reference tables provided by UniProt (PDB accessed on 01-07-2016). Reference protein sequences were aligned to the sequences observed in PDB structures using SIFTS. Discrepancies were corrected by Needleman-Wunsch pairwise alignment with Biopython. Each protein was represented by the subset of minimally overlapping PDB structures described by Kamburov et al.
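The alignment-correction step can be illustrated with a minimal, dependency-free sketch of Needleman-Wunsch global alignment. This is not the Biopython routine used for ASTRID: the scoring values (match, mismatch, gap) and the helper names needleman_wunsch and residue_map are assumptions chosen for illustration. The sketch aligns a reference sequence to an observed structure sequence and derives a map from reference positions to observed positions.

```python
def needleman_wunsch(ref, obs, match=1, mismatch=-1, gap=-1):
    """Global alignment of two sequences (scoring values are illustrative assumptions)."""
    n, m = len(ref), len(obs)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning ref[i-1] with obs[j-1].
        return match if ref[i - 1] == obs[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in the observed sequence
                score[i][j - 1] + gap,            # gap in the reference sequence
            )

    # Trace back from the bottom-right corner, preferring the diagonal move.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(ref[i - 1]); b.append(obs[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(ref[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(obs[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), score[n][m]


def residue_map(aligned_ref, aligned_obs):
    """Map 0-based reference positions to 0-based observed positions."""
    mapping = {}
    i = j = 0
    for r, o in zip(aligned_ref, aligned_obs):
        if r != "-" and o != "-":
            mapping[i] = j
        i += r != "-"
        j += o != "-"
    return mapping
```

For example, aligning a reference sequence to a structure sequence that lacks one residue leaves that reference position unmapped, so a variant at that position would have no structural coordinate.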


Assessments of Spatial Distributions

We examine the spatial distribution of genetic variants in protein structures using an approach based on Ripley’s K statistic.  The univariate (single-dataset) K quantifies the spatial heterogeneity of a set of variants by comparing the proportion of variants within a given distance of one another to the expectation under a random spatial distribution.  Variants are considered clustered if the proportion of neighbors exceeds this expectation and dispersed if it falls below it.  The statistic is computed across a range of distance thresholds (t); at each threshold, K can be interpreted as the proportion of variant pairs within distance t of one another.
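As a rough illustration, the pair-proportion form of K described here can be sketched in a few lines of Python. The function names, the use of 3D residue coordinates as plain tuples, and the null model (drawing the same number of variant positions at random from all residues in the structure) are assumptions made for this sketch, not a description of the ASTRID implementation.

```python
import math
import random
from itertools import combinations


def ripley_k(coords, thresholds):
    """Proportion of variant pairs within each distance threshold t."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    if not dists:
        raise ValueError("at least two variants are required")
    return [sum(d <= t for d in dists) / len(dists) for t in thresholds]


def null_k_curves(residue_coords, n_variants, thresholds, n_perm=999, seed=0):
    """Empirical null K curves.

    Assumption: under the null, variants fall on residues chosen uniformly
    at random (without replacement) from all residues in the structure.
    """
    rng = random.Random(seed)
    return [
        ripley_k(rng.sample(residue_coords, n_variants), thresholds)
        for _ in range(n_perm)
    ]
```

An observed K value above most of the null curves at a given t would suggest clustering at that distance, and a value below them would suggest dispersion.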

To condense these spatial patterns into a single protein-level statistic, we computed the area between the observed K curve and the median empirical null K curve using Simpson’s rule.  We repeated this calculation for each permuted (null) K curve to obtain a permutation P-value and Z-score for each protein-level statistic.  All analyses were adjusted for multiple testing by controlling the false discovery rate (FDR) at 10%.  We also adapted the bivariate D statistic to enable comparisons between variant sets.  The bivariate analysis evaluates whether one set of variants is more or less clustered than another by computing the difference between their univariate K values.
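The protein-level summary can be sketched as follows, assuming K curves evaluated at equally spaced thresholds with an even number of intervals (as composite Simpson's rule requires). The function names, the two-sided form of the permutation P-value, and the add-one correction are assumptions for illustration; the source does not specify these details.

```python
import statistics


def simpson(y, dx):
    """Composite Simpson's rule for equally spaced samples (even interval count)."""
    n = len(y) - 1
    if n < 2 or n % 2:
        raise ValueError("Simpson's rule needs an even number of intervals")
    return dx / 3 * (y[0] + y[-1] + 4 * sum(y[1:-1:2]) + 2 * sum(y[2:-1:2]))


def area_between(curve, reference, dx):
    """Signed area between a K curve and a reference K curve."""
    return simpson([c - r for c, r in zip(curve, reference)], dx)


def permutation_summary(observed, null_curves, dx):
    """Return (area, permutation P-value, Z-score) for one protein."""
    # Median null K at each threshold.
    median_null = [statistics.median(col) for col in zip(*null_curves)]
    obs_area = area_between(observed, median_null, dx)
    # The same area statistic, recomputed for every null curve.
    null_areas = [area_between(c, median_null, dx) for c in null_curves]
    # Assumption: two-sided P-value with an add-one correction.
    extreme = sum(abs(a) >= abs(obs_area) for a in null_areas)
    p_value = (extreme + 1) / (len(null_areas) + 1)
    sd = statistics.stdev(null_areas)
    z = (obs_area - statistics.mean(null_areas)) / sd if sd > 0 else float("nan")
    return obs_area, p_value, z
```

Under this convention a positive area indicates that the observed variants are more clustered than the median null curve, and a negative area indicates dispersion.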


Pathogenic Proximity Scoring

To measure the average proximity of a variant x to a set of known variants Y, we computed the proportion of variants in Y within a distance t of x.  For each protein structure, we selected the distance threshold t at which the bivariate Z-score between pathogenic and neutral variants was most extreme.  The pathogenic proximity (PathProx) score of each variant was then defined as the difference between its average proximity to pathogenic (ClinVar) variation and its average proximity to neutral (ExAC) variation.  We performed leave-one-out cross-validation of ClinVar pathogenic and ExAC missense variants in proteins for which ClinVar pathogenic variants were significantly clustered in both the univariate and bivariate analyses.  We then calculated receiver operating characteristic (ROC) and precision-recall (PR) curves from the PathProx score of each variant and summarized predictive performance by the area under each curve (ROC AUC and PR AUC).  ROC and PR curves are shown on ASTRID entry pages for proteins where PathProx scores were evaluated.
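A minimal sketch of the scoring and leave-one-out evaluation is given below. It assumes variants are represented by 3D coordinates, that the score is pathogenic proximity minus neutral proximity, and that ROC AUC is computed with the rank-based (Mann-Whitney) formulation; all function names are hypothetical, and the threshold t is taken as given rather than selected from the bivariate Z-score.

```python
import math


def proximity(x, variants, t):
    """Proportion of `variants` lying within distance t of point x."""
    if not variants:
        return 0.0
    return sum(math.dist(x, y) <= t for y in variants) / len(variants)


def pathprox(x, pathogenic, neutral, t):
    """Assumed sign convention: proximity to pathogenic minus proximity to neutral."""
    return proximity(x, pathogenic, t) - proximity(x, neutral, t)


def leave_one_out_scores(pathogenic, neutral, t):
    """Score each labeled variant against all others, excluding itself."""
    scores, labels = [], []
    for i, x in enumerate(pathogenic):
        rest = pathogenic[:i] + pathogenic[i + 1:]
        scores.append(pathprox(x, rest, neutral, t))
        labels.append(1)
    for i, x in enumerate(neutral):
        rest = neutral[:i] + neutral[i + 1:]
        scores.append(pathprox(x, pathogenic, rest, t))
        labels.append(0)
    return scores, labels


def roc_auc(scores, labels):
    """ROC AUC as P(pathogenic score > neutral score), counting ties as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With two well-separated groups of variants, every pathogenic variant scores higher than every neutral one and the ROC AUC is 1.0; overlapping groups yield values closer to 0.5.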