CKDGen WES analysis plan

Motivation and Rationale

SAIGE is a software pipeline for running statistical tests on whole-genome and whole-exome sequencing data. The main motivation for using SAIGE is to have one consistent framework for association analysis in the CKDGen consortium that covers both single-variant and gene-level analysis.


Outline of analysis protocol

This is an overview of the protocol for analyzing CKDGen WES datasets using the SAIGE pipeline. We assume that variants in your dataset have already been called (e.g. using GATK) and that you have a VCF file with all the genotype data.

  1. Download and install SAIGE
  2. Prepare VCF file with genotypes
  3. Download the phenotype generation script and get analysis-ready phenotypes
  4. Prepare phenotype file with phenotypes and covariates
  5. Run SAIGE association pipeline
  6. Report association results in appropriate format

Download and install SAIGE

Please follow these recommendations: [SAIGE install ]

Getting started with an example

bgzip and tabix VCF files

The input VCF file must be bgzipped and tabix-indexed before running the association analysis to allow efficient random access to the file. Below is an example of how to convert a plain VCF into a bgzipped and tabix-indexed VCF:

bgzip input.vcf ## this command will produce input.vcf.gz
tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi

Sample IDs in the VCF file must be the same as those in the phenotype (PED) file.
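
A mismatch between the two ID sets is easy to overlook, so it can help to compare them before starting the association run. The following is a minimal Python sketch, not part of the CKDGen or SAIGE tooling. The function names are hypothetical, and the ID column name "IID" is an assumption taken from the example phenotype file in this plan.

```python
import gzip


def vcf_sample_ids(vcf_path):
    """Return the sample IDs from the #CHROM header line of a plain or bgzipped VCF."""
    # bgzip output is a valid multi-member gzip stream, so gzip.open can read it.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; samples follow.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found in " + vcf_path)


def phenotype_sample_ids(pheno_path, id_column="IID"):
    """Return the sample IDs from a space- or tab-delimited phenotype file with a header."""
    with open(pheno_path) as handle:
        header = handle.readline().split()
        idx = header.index(id_column)  # assumption: the ID column is named "IID"
        return [line.split()[idx] for line in handle if line.strip()]


def compare_ids(vcf_ids, pheno_ids):
    """Return (IDs only in the VCF, IDs only in the phenotype file), both sorted."""
    vcf_set, pheno_set = set(vcf_ids), set(pheno_ids)
    return sorted(vcf_set - pheno_set), sorted(pheno_set - vcf_set)
```

Calling compare_ids(vcf_sample_ids("input.vcf.gz"), phenotype_sample_ids("pheno.txt")) returns two lists; both should be empty if the files agree ("pheno.txt" is a placeholder name).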

Download the phenotype generation script and get analysis-ready phenotypes

Download the phenotype generation script from https://github.com/genepi-freiburg/ckdgen-pheno/tree/wes-pheno.

On the page, click on the green “clone or download” button, then select “download ZIP”.

You will need to provide the following additional information: assay used for measurement of blood creatinine (Jaffe vs. enzymatic), year of blood creatinine measurement, and lower limit of detection for the assay used to measure urinary albumin.


Note that the phenotypes to be used in the analysis are contained in the output file "ckdgen-pheno-STUDY-DATE.phenotype.txt".


Prepare phenotype file with phenotypes and covariates

SAIGE needs a phenotype file. The file can be either space- or tab-delimited and must have a header. It must contain one column for sample IDs and one column for the phenotype. It may also contain columns for non-genetic covariates.


An example phenotype file:

IID SEX DISEASE QT AGE 
NA12344 1 1 94.17 66.1
NA12347 1 1 109.54 44.0
NA12348 2 2 119.40 46.6
NA06984 1 2 87.72 39.3
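
The requirements above (header, ID column, phenotype column, consistent field counts) can be checked automatically. Below is a minimal Python sketch under stated assumptions: the function name is hypothetical, the column names "IID" and "QT" are taken from the example file, and fields are split on any whitespace, so an empty tab-separated field would not be detected as such.

```python
def check_phenotype_table(text, id_column="IID", pheno_column="QT"):
    """Return a list of problems found in a space- or tab-delimited phenotype table."""
    # Blank lines are ignored; reported line numbers count non-blank lines only.
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return ["file is empty"]
    header = lines[0].split()
    problems = []
    for name in (id_column, pheno_column):
        if name not in header:
            problems.append("missing column: " + name)
    if problems:
        return problems
    id_idx = header.index(id_column)
    seen = set()
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        if len(fields) != len(header):
            problems.append(
                "line %d: expected %d fields, found %d"
                % (number, len(header), len(fields))
            )
            continue
        if fields[id_idx] in seen:
            problems.append("line %d: duplicate sample ID %s" % (number, fields[id_idx]))
        seen.add(fields[id_idx])
    return problems
```

An empty list means that none of these basic problems were found. It does not validate the phenotype values themselves.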


Run SAIGE association pipeline

ANALYSIS OF QUANTITATIVE OUTCOMES

Table 1 provides an overview of all quantitative outcomes requested from the CKDGen collaborators. Please contribute whichever of these phenotypes are available in your study.

Family- and pedigree- based studies, important note: please use your own custom pipeline for GWAS if preferred. If you would like to use EPACTS, please use linear mixed models to incorporate relatedness information, as implemented in the q.emmax option (see Table 1).

Table 1

Output filename format: replace "study", "ethn", "impPanel", "chr", and "date" following the key given in Table 4.

Outcome | Description of the outcome | Linear regression model: test option | Covariates | Output filename format
eGFR_overall | Age- and sex-adjusted residuals of ln(eGFR) | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_overall_impPanel_chr_date.txt
eGFR_nonDM | Age- and sex-adjusted residuals of ln(eGFR) among those without diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_nonDM_impPanel_chr_date.txt
eGFR_DM | Age- and sex-adjusted residuals of ln(eGFR) among those with diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_DM_impPanel_chr_date.txt
creatinine_overall | Age- and sex-adjusted residuals of ln(crea) | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_creatinine_overall_impPanel_chr_date.txt
UACR_overall | Inverse normal transformed age- and sex-adjusted residuals of ln(UACR) | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_overall_impPanel_chr_date.txt
UACR_nonDM | Inverse normal transformed age- and sex-adjusted residuals of ln(UACR) among those without diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_nonDM_impPanel_chr_date.txt
UACR_DM | Inverse normal transformed age- and sex-adjusted residuals of ln(UACR) among those with diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_DM_impPanel_chr_date.txt
bun_overall | Age- and sex-adjusted residuals of ln(bun) [calculated from urea] | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_bun_overall_impPanel_chr_date.txt
uric_acid_overall | Age- and sex-adjusted residuals of uric acid | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_overall_impPanel_chr_date.txt
uric_acid_men | Age-adjusted residuals of uric acid among men | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_men_impPanel_chr_date.txt
uric_acid_women | Age-adjusted residuals of uric acid among women | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_women_impPanel_chr_date.txt
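
The UACR outcomes in Table 1 are described as inverse normal transformed residuals. The phenotype generation script produces these values, so nothing has to be computed by hand; the Python sketch below only illustrates what a rank-based inverse normal transformation does. The function name is hypothetical, and the rank offset of 3/8 (Blom) is an assumption for illustration that may differ from what the script uses.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.375):
    """Map values to standard normal quantiles according to their ranks.

    Tied values receive the average of their ranks. The offset of 3/8 is an
    assumption for this illustration.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the run of tied values starting at sorted position pos.
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # average 1-based rank within the tie
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    normal = NormalDist()
    # Convert each rank to a probability strictly between 0 and 1,
    # then to the matching standard normal quantile.
    return [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
```

The transformation keeps the ordering of the residuals but forces their distribution to be approximately standard normal, which limits the influence of extreme values.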



Example command to run the linear Wald test using SAIGE:



ANALYSIS OF BINARY OUTCOMES

For the analysis of binary outcomes, we are requesting association analyses using logistic regression (see Table 2).



Family- and pedigree-based studies, important note: to account for relatedness, please include genetic principal components in the logistic regression model; this gives valid beta estimates (and therefore odds ratios) for the meta-analysis. Please do not use linear mixed models as for continuous traits (they provide valid p-values but invalid betas), unless you have software that fits logistic mixed models.

Prospective studies, important note: the required covariate baseline GFR is called "egfr_ckdepi_creat" and is located in the file *out.csv generated by the phenotype generation script, along with age and sex. Note that this is different from the variable "eGFRcrea_overall", which contains the age- and sex-adjusted residuals for the quantitative trait GWAS and is found in the *phenotype.txt file generated by the script.


Table 2 provides an overview of all binary outcomes requested from the CKDGen collaborators. Please contribute whichever of these phenotypes are available in your study.


Table 2

Output filename format: replace "study", "ethn", "impPanel", "chr", and "date" following the key given in Table 4.

Outcome | Description of the outcome | Logistic regression model: test option | Covariates | Output filename format
CKD_overall | CKD as generated from phenotype script | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_overall_impPanel_chr_date.txt
CKD_DM | CKD as generated from phenotype script among those with diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_DM_impPanel_chr_date.txt
CKD_nonDM | CKD as generated from phenotype script among those without diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_nonDM_impPanel_chr_date.txt
MA_overall | MA as generated from phenotype script | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_overall_impPanel_chr_date.txt
MA_DM | MA as generated from phenotype script among those with diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_DM_impPanel_chr_date.txt
MA_nonDM | MA as generated from phenotype script among those without diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_nonDM_impPanel_chr_date.txt
Gout_overall | Gout as generated from phenotype script | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_overall_impPanel_chr_date.txt
Gout_men | Gout as generated from phenotype script among men | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_men_impPanel_chr_date.txt
Gout_women | Gout as generated from phenotype script among women | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_women_impPanel_chr_date.txt


Example command to run the binary Wald test using the EPACTS software:

EPACTS-3.2.6/bin/epacts single --vcf [INPUT VCF FILENAME] --ped [INPUT PED FILENAME] \
--out [OUTPUT FILENAME PREFIX] --test b.wald \
--pheno CKD_overall --cov age --cov sex --anno --min-mac 1 --field EC --run 10

Important: To analyze dosages (not genotypes), you must specify the dosage field with the "--field" option, giving the name of the field in your input file that contains the genotype dosages. If you created the VCF files using QCTOOL, or are working with .vcf files obtained from the Michigan or Sanger imputation servers, use "--field DS"; otherwise, you most likely should use "--field EC". Without this option, you will be analyzing the hard genotypes (i.e. the --field option defaults to "GT", the genotypes)!
The number of CPUs used is specified by the "--run" option and should be set accordingly.
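
If it is unclear which dosage field a VCF file carries, the FORMAT column of any variant record lists the available keys. The Python sketch below is a hypothetical helper (the function name is not part of EPACTS or any other tool) that reports which of the two field names mentioned above is present in a single tab-separated VCF record line.

```python
def dosage_field(vcf_record_line, candidates=("DS", "EC")):
    """Return the first candidate dosage key found in the FORMAT column, or None."""
    columns = vcf_record_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        raise ValueError("record has no FORMAT column")
    # FORMAT is the ninth VCF column; its keys are separated by colons.
    format_keys = columns[8].split(":")
    for key in candidates:
        if key in format_keys:
            return key
    return None  # only hard genotypes (or other fields) are available
```

A return value of None means that neither field is present, in which case only the hard genotypes in "GT" can be analyzed.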

X CHROMOSOME ANALYSES

For the overall phenotypes listed in Table 3 only, run the analyses on chromosome X for males and females separately, using the same commands listed above for quantitative and binary traits. For information about X chromosome genotype imputation, see the Appendix of the CKDGen Round 4 Analysis Plan. As long as imputation accounted for the pseudoautosomal and non-pseudoautosomal regions correctly, the additive genetic model for association works as it does for the autosomes. Remember that the X chromosome should be coded as X (do not use 23 or other codings).
Family- and pedigree-based studies, important note: to account for relatedness for continuous traits, please see Table 3; for binary traits, please include genetic principal components in the logistic regression model (please do not use linear mixed models, as they provide valid p-values but invalid betas, unless you have software that fits logistic mixed models).
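
The chromosome coding rule can be checked on the chromosome column of a results or VCF file before upload. The following Python sketch is a hypothetical helper. It assumes that autosomes are labeled with plain numbers 1 to 22 without a "chr" prefix, which this plan does not state explicitly.

```python
# Assumption: autosomes are labeled "1" to "22" without a "chr" prefix.
AUTOSOMES = {str(n) for n in range(1, 23)}


def invalid_chromosome_labels(labels):
    """Return the sorted distinct labels that are neither 1-22 nor 'X'."""
    return sorted({label for label in labels if label not in AUTOSOMES and label != "X"})
```

Any label returned by this function, such as "23", should be recoded before the files are submitted.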

Table 3: X chromosome analyses

Output filename format: replace "study", "ethn", "impPanel", "chr", and "date" following the key given in Table 4.

Outcome | Regression model | EPACTS test option | Covariates | Output filename format
eGFR_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_overall_impPanel_chrX_F_date.txt, Study_ethn_eGFR_overall_impPanel_chrX_M_date.txt
creatinine_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_creatinine_overall_impPanel_chrX_F_date.txt, Study_ethn_creatinine_overall_impPanel_chrX_M_date.txt
UACR_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_overall_impPanel_chrX_F_date.txt, Study_ethn_UACR_overall_impPanel_chrX_M_date.txt
bun_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_bun_overall_impPanel_chrX_F_date.txt, Study_ethn_bun_overall_impPanel_chrX_M_date.txt
uric_acid_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_overall_impPanel_chrX_F_date.txt, Study_ethn_uric_acid_overall_impPanel_chrX_M_date.txt
CKD_overall | Logistic | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_overall_impPanel_chrX_F_date.txt, Study_ethn_CKD_overall_impPanel_chrX_M_date.txt
MA_overall | Logistic | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_overall_impPanel_chrX_F_date.txt, Study_ethn_MA_overall_impPanel_chrX_M_date.txt
Gout_overall | Logistic | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_overall_impPanel_chrX_F_date.txt, Study_ethn_gout_overall_impPanel_chrX_M_date.txt
eGFRdecline (prospective studies only) | Linear | --test q.linear | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFRdecline_impPanel_chrX_F_date.txt, Study_ethn_eGFRdecline_impPanel_chrX_M_date.txt
Rapid3 (prospective studies only) | Logistic | --test b.wald | age, baseline eGFR ("egfr_ckdepi_creat" from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_rapid3_impPanel_chrX_F_date.txt, Study_ethn_rapid3_impPanel_chrX_M_date.txt
iCKD25 (prospective studies only) | Logistic | --test b.wald | age, baseline eGFR ("egfr_ckdepi_creat" from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_iCKD25_impPanel_chrX_F_date.txt, Study_ethn_iCKD25_impPanel_chrX_M_date.txt



FILENAME CONVENTION

Name all files to be uploaded (GWAS results/GWAS summary statistics and imputation quality files) following the naming key A_B_C_D_E_F.<original_file_extension> as outlined in Table 4:


Table 4

Key | Meaning
A | Your study's name
B | Ethnicity; use EA for European ancestry, AA for African American, AFR for African, EAS for East Asian, SA for South Asian, HIS for Hispanic, IA for Indian ancestry or as applicable
C | The analyzed study trait, e.g. "eGFR_overall"
D | Imputation reference panel, use "1KGPph1v3", "1KGPph3v5" or "HRC", as applicable
E | Chromosome. Autosomes: use "chrXX" (e.g. chr03). X chromosome: use "chrX_F" for females and "chrX_M" for males
F | Date, use YYYYMMDD


Examples: ARIC_AA_eGFR_overall_1KGPph3v5_chr20_20160530.txt, ARIC_AA_1KGPph3v5_20160530.info
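
Because the naming key is mechanical, file names can be generated rather than typed by hand. The Python sketch below is a hypothetical helper that follows the A_B_C_D_E_F key of Table 4 for result files. The function name, the argument names, and the default "txt" extension are assumptions for illustration.

```python
import re


def result_filename(study, ethn, trait, imp_panel, chrom, date, extension="txt", sex=None):
    """Build a result file name following the A_B_C_D_E_F key of Table 4."""
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError("date must be given as YYYYMMDD")
    if str(chrom).upper() == "X":
        # X chromosome results are reported separately for females and males.
        if sex not in ("F", "M"):
            raise ValueError("sex must be 'F' or 'M' for the X chromosome")
        chrom_part = "chrX_" + sex
    else:
        number = int(chrom)
        if not 1 <= number <= 22:
            raise ValueError("autosome number must be between 1 and 22")
        chrom_part = "chr%02d" % number  # zero-padded, e.g. chr03
    return "_".join([study, ethn, trait, imp_panel, chrom_part, date]) + "." + extension
```

For example, result_filename("ARIC", "AA", "eGFR_overall", "1KGPph3v5", 20, "20160530") returns "ARIC_AA_eGFR_overall_1KGPph3v5_chr20_20160530.txt".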

Report results

Please upload all .epacts.gz files, following the naming convention provided in Table 4, to the FTP server: http://ckdgen.eurac.edu/upload/


User name: ckdgenR4

Password: ExcitingScience!


Please inform us by email at ckdgenconsortium@gmail.com when the upload is complete, indicating your study and your name.



Troubleshooting Common Issues

We validated the entire pipeline carefully using different imputation platforms and phenotype scenarios, but we cannot guarantee that it is completely bug free. For this reason, please report any issue you encounter when using this pipeline.


Frequently Asked Questions

Please feel free to ask questions whenever something is not clear. We will reply promptly.