CKDGen Round 4 EPACTS analysis plan

From CKDGen wiki
Revision as of 09:39, 20 February 2017 by Akottgen (talk | contribs) (→‎Convert dosage file into VCF format)

Disclaimer: this document is based on the DIAGRAM EPACTS analysis plan [1].

Motivation and Rationale

EPACTS is a software pipeline developed to perform various statistical tests for analysis of whole-genome / whole-exome sequencing data. The main motivation for using EPACTS is to use a consistent framework for association analysis in the CKDGen consortium.
Before running this pipeline, please read the CKDGen Round 4 Analysis Plan carefully. There you will find information on imputation reference panels and on how to generate the necessary phenotypes for analysis; its "General instructions" section covers minimum sample size, the genetic model, and how to handle multi-ethnic samples.


Outline of analysis protocol

This is an overview of the protocol for analyzing imputed CKDGen datasets using the EPACTS pipeline. We assume that your dataset has been imputed using minimac or Impute2. Starting with minimac or Impute2 output:

  1. Download and install EPACTS
  2. Prepare VCF file with genotypes / dosages
  3. Download the phenotype generation script and get analysis-ready phenotypes
  4. Prepare PED file with phenotypes and covariates
  5. Run EPACTS association pipeline
  6. Report association results in appropriate format

Download and install EPACTS

  • Important: Please download the latest version of EPACTS (including the supporting files) here: https://github.com/statgen/EPACTS
  • This version of EPACTS includes a recent fix that affects how the DS field is read from .vcf files.
  • Uncompress the EPACTS package into the directory where you would like to install it, then type the following commands (shown for EPACTS-3.2.6; the version is now 3.3.0)
tar xzvf EPACTS-3.2.6.tar.gz
cd EPACTS-3.2.6
./configure --prefix [INSTALL DIRECTORY]
make
make install
  • Perform a test run by running the following command assuming EPACTS-3.2.6 is your INSTALL DIRECTORY
EPACTS-3.2.6/bin/test_run_epacts.sh

Getting started with an example

Once installed, test out the software by running a quick example using the test data provided in the "share/EPACTS" directory. The example VCF and PED files are:

EPACTS-3.2.6/share/EPACTS/1000G_exome_chr20_example_softFiltered.calls.vcf.gz
EPACTS-3.2.6/share/EPACTS/1000G_dummy_pheno.ped

Run the single variant Wald test on the example data using this command:

EPACTS-3.2.6/bin/epacts single \
--vcf EPACTS-3.2.6/share/EPACTS/1000G_exome_chr20_example_softFiltered.calls.vcf.gz \
--ped EPACTS-3.2.6/share/EPACTS/1000G_dummy_pheno.ped \
--min-maf 0.001 --chr 20 --pheno DISEASE --cov AGE --cov SEX --test b.wald --anno \
--out {OUTPUT_DIR}/test --run 2 &

This command will run the single variant binary Wald test on the input VCF and PED files, with a minimum MAF threshold of 0.001.  The phenotype is "DISEASE" and the analysis is being adjusted for the covariates AGE and SEX.  The output file directory prefix is {OUTPUT_DIR}/test.  Finally, EPACTS will run the analysis in parallel on 2 CPUs.

A more detailed description of the example can be found here.


Prepare VCF file with genotypes / dosages

EPACTS requires input genotype / dosage information in VCF format.  From Minimac or Impute2, you will start with your imputed dosage file.

Convert dosage file into VCF format

  1. Haplotype Reference Consortium (HRC) imputation server - no conversion needed
  2. Minimac - no conversion needed for Minimac3
  3. Impute2 - http://www.well.ox.ac.uk/~gav/qctool/#tutorial - detailed instructions here: Impute2 VCF conversion

bgzip and tabix VCF files

The input VCF file must be bgzipped and tabixed before running the association analysis, to allow efficient random access to the file. Below is an example command to convert a plain VCF into a bgzipped and tabixed VCF:

bgzip input.vcf ## this command will produce input.vcf.gz
tabix -pvcf -f input.vcf.gz ## this command will produce input.vcf.gz.tbi

If the VCF data are split into one file per chromosome, the name of the chromosome 1 file must contain the string "chr1", and the names of the other files must contain the corresponding chromosome name.
Sample IDs in the VCF file must be the same as those in the PED file.
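
Because samples are matched by ID, it can be worth comparing the two ID lists before starting a run. The following is a minimal sketch, not part of EPACTS: it assumes the VCF is bgzipped (bgzip output can be read with gzip -dc), that the PED file is whitespace-delimited with the individual ID in column 2, and the function names are hypothetical.

```shell
#!/bin/sh
# List sample IDs from the #CHROM header line of a (b)gzipped VCF (columns 10 onward).
vcf_sample_ids() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' | sort
}

# List individual IDs (column 2) from a PED file, skipping "#" header lines.
ped_sample_ids() {
    awk '!/^#/ && NF > 1 { print $2 }' "$1" | sort
}

# Print the IDs that occur in the PED file ($2) but not in the VCF ($1).
ped_ids_missing_from_vcf() {
    vcf_tmp=$(mktemp); ped_tmp=$(mktemp)
    vcf_sample_ids "$1" > "$vcf_tmp"
    ped_sample_ids "$2" > "$ped_tmp"
    comm -13 "$vcf_tmp" "$ped_tmp"     # keep only lines unique to the PED list
    rm -f "$vcf_tmp" "$ped_tmp"
}
```

For example, `ped_ids_missing_from_vcf input.vcf.gz pheno.ped` prints nothing when every PED individual is present in the VCF.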

Download the phenotype generation script and get analysis-ready phenotypes

Download the phenotype generation script from https://github.com/genepi-freiburg/ckdgen-pheno/.

On the page, click on the green “clone or download” button, then select “download ZIP”.

You will need to provide the following additional information: assay used for measurement of blood creatinine (Jaffe vs. enzymatic), year of blood creatinine measurement, and lower limit of detection for the assay used to measure urinary albumin.

Please refer to section 2 of the CKDGen Round 4 Analysis Plan for a detailed explanation of how to create GWAS-ready phenotypes.

Note that the phenotypes to be used in the analysis are contained in the output file "ckdgen-pheno-STUDY-DATE.phenotype.txt".


Prepare PED file for phenotypes and covariates

EPACTS accepts the PED format supported by MERLIN or PLINK to represent the phenotypes and covariates.  You may prepare either (1) a PED file without column headers + accompanying DAT file, or (2) a PED file with column headers, using the phenotypes generated with the phenotype generation script (section above).

The standard PED format has 6 mandatory columns:

  1. Family ID
  2. Individual ID
  3. Paternal ID
  4. Maternal ID
  5. Sex (1=male; 2=female; other=unknown)
  6. Phenotype (1 = control; 2 = case)

Columns 7 onwards are additional covariates and/or phenotypes.  For example:

  1. QT
  2. AGE
  3. etc.

Notes:

  • Categorical covariates must be coded as dummy variables!  For example, Sex cannot be coded as "M" or "F".
  • Depending on their range, continuous covariates (e.g., PCs) might have to be rescaled before use in your pipeline.
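
To illustrate the dummy-variable note, here is a minimal awk sketch that replaces one categorical column with 0/1 indicator columns. It is only an illustration: the function name, the SITE column, and its levels A/B/C are hypothetical, and the level that is not listed serves as the reference category (all zeros).

```shell
#!/bin/sh
# Replace column $2 of the whitespace-delimited file $1 with one 0/1 indicator
# column per level given in the remaining arguments. Any unlisted level is the
# reference category (all zeros); "NA" is passed through as NA.
dummy_code_column() {
    file=$1; col=$2; shift 2
    awk -v col="$col" -v levels="$*" '
        BEGIN { n = split(levels, lev, " ") }
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                if (i != col) {
                    cell = $i
                } else {
                    cell = ""
                    for (j = 1; j <= n; j++) {
                        if (/^#/)            v = $i "_" lev[j]        # header line: new column names
                        else if ($i == "NA") v = "NA"                 # keep missing values missing
                        else                 v = ($i == lev[j]) ? 1 : 0
                        cell = cell (j > 1 ? " " : "") v
                    }
                }
                out = out (i > 1 ? " " : "") cell
            }
            print out
        }' "$file"
}
```

With a hypothetical SITE column in position 9 and levels A, B and C, `dummy_code_column pheno.ped 9 B C > pheno_dummy.ped` would write the columns SITE_B and SITE_C, with A as the reference.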


An example PED file with a header is as follows. Note that the header must start with a "#" symbol.

#FAM_ID IND_ID FAT_ID MOT_ID SEX DISEASE QT AGE 
13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
13281 NA12347 0 0 1 1 109.54 44.0
13281 NA12348 0 0 2 2 119.40 46.6
1328 NA06984 0 0 1 2 87.72 39.3
1328 NA06989 0 0 2 1 100.60 41.7
1328 NA12329 NA06984 NA06989 2 1 100.85 46.4
13291 NA06986 0 0 1 2 91.94 61.9
13291 NA06995 NA07435 NA07037 1 2 104.36 57.4
13291 NA06997 NA06986 NA07045 2 2 107.53 53.1

Alternatively, you can prepare a PED file without a header, and include a corresponding DAT file describing the column headers.

13281 NA12344 NA12347 NA12348 1 1 94.17 66.1
13281 NA12347 0 0 1 1 109.54 44.0
13281 NA12348 0 0 2 2 119.40 46.6
1328 NA06984 0 0 1 2 87.72 39.3
1328 NA06989 0 0 2 1 100.60 41.7
1328 NA12329 NA06984 NA06989 2 1 100.85 46.4
13291 NA06986 0 0 1 2 91.94 61.9
13291 NA06995 NA07435 NA07037 1 2 104.36 57.4
13291 NA06997 NA06986 NA07045 2 2 107.53 53.1

The corresponding DAT file is:

A DISEASE
T QT
C AGE

Key:  A = binary trait; T = quantitative trait; C = covariate
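
The two layouts carry the same information, so a headered PED file can be derived from a headerless PED file plus its DAT file. A minimal sketch, assuming (as in the example above) that the DAT file describes the PED columns from column 6 onward and reusing the first five column names from the example header; the function name is hypothetical.

```shell
#!/bin/sh
# Print a headered PED file built from a headerless PED file ($1) and its DAT file ($2).
add_ped_header() {
    # The names of columns 6..N are the second field of each DAT line.
    names=$(awk 'NF >= 2 { printf " %s", $2 }' "$2")
    printf '#FAM_ID IND_ID FAT_ID MOT_ID SEX%s\n' "$names"
    cat "$1"
}
```

For example, `add_ped_header pheno_noheader.ped pheno.dat > pheno.ped`.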

Run EPACTS association pipeline

For detailed description of options, use:

EPACTS-3.2.6/bin/epacts single -man 

ANALYSIS OF QUANTITATIVE OUTCOMES

For analysis of continuous outcomes, we are requesting association analyses using the Linear Wald Test (option q.linear in EPACTS). For more information on the test, please see EPACTS

Table 1 provides an overview of all quantitative outcomes requested from the CKDGen collaborators. Please contribute whichever of these phenotypes are available in your study.

Family- and pedigree- based studies, important note: please use your own custom pipeline for GWAS if preferred. If you would like to use EPACTS, please use linear mixed models to incorporate relatedness information, as implemented in the q.emmax option (see Table 1).

Table 1

| Outcome | Description of the outcome | Linear regression model: test option | Covariates | Output filename format (replace "study", "ethn", "impPanel", "chr", and "date" following the key given in Table 4) |
|---|---|---|---|---|
| eGFR_overall | Age- and sex-adjusted residuals of ln(eGFR) | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_overall_impPanel_chr_date.txt |
| eGFR_nonDM | Age- and sex-adjusted residuals of ln(eGFR) among those without diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_nonDM_impPanel_chr_date.txt |
| eGFR_DM | Age- and sex-adjusted residuals of ln(eGFR) among those with diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_DM_impPanel_chr_date.txt |
| creatinine_overall | Age- and sex-adjusted residuals of ln(crea) | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_creatinine_overall_impPanel_chr_date.txt |
| UACR_overall | Inverse normal transformed age- and sex-adjusted residuals of ln(UACR) | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_overall_impPanel_chr_date.txt |
| UACR_nonDM | Inverse normal transformed age- and sex-adjusted residuals of ln(UACR) among those without diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_nonDM_impPanel_chr_date.txt |
| UACR_DM | Inverse normal transformed age- and sex-adjusted residuals of ln(UACR) among those with diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_DM_impPanel_chr_date.txt |
| bun_overall | Age- and sex-adjusted residuals of ln(bun) [calculated from urea] | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_bun_overall_impPanel_chr_date.txt |
| uric_acid_overall | Age- and sex-adjusted residuals of uric acid | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_overall_impPanel_chr_date.txt |
| uric_acid_men | Age-adjusted residuals of uric acid among men | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_men_impPanel_chr_date.txt |
| uric_acid_women | Age-adjusted residuals of uric acid among women | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_women_impPanel_chr_date.txt |
| eGFRdecline | Only for prospective studies: age-, sex- and baseline eGFR-adjusted residuals of eGFRdecline | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFRdecline_impPanel_chr_date.txt |
| eGFRdecline_DM | Only for prospective studies: age-, sex- and baseline eGFR-adjusted residuals of eGFRdecline in diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFRdecline_DM_impPanel_chr_date.txt |
| eGFRdecline_nonDM | Only for prospective studies: age-, sex- and baseline eGFR-adjusted residuals of eGFRdecline in non-diabetes | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFRdecline_nonDM_impPanel_chr_date.txt |
| eGFRdecline_CKD | Only for prospective studies: age-, sex- and baseline eGFR-adjusted residuals of eGFRdecline in CKD | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFRdecline_CKD_impPanel_chr_date.txt |


Example command to run the linear Wald test using the EPACTS software:

EPACTS-3.2.6/bin/epacts single --vcf [INPUT VCF FILENAME] --ped [INPUT PED FILENAME] \
--out [OUTPUT FILENAME PREFIX] --test q.linear \
--pheno eGFR_overall --cov PC1 --anno --min-mac 1 --field EC --run 10

Important: To analyze dosages (not genotypes), you must specify the dosage field with the "--field" option, giving the name of the field in your input file that contains the genotype dosages. If you created the VCF files using QCTOOL, or are working with .vcf files obtained from the Michigan or Sanger imputation servers, use "--field DS"; otherwise, you most likely should use "--field EC". Without this option, you will be analyzing the hard genotypes (i.e., the --field option defaults to "GT", or "genotypes")!
The number of CPUs used is specified by the "--run" option and should be set accordingly.
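
If you are unsure which dosage field your VCF carries, you can inspect the FORMAT column of the first data record. The sketch below is not an EPACTS feature; the function names are hypothetical, and preferring DS over EC when both are present is an assumption made only for illustration.

```shell
#!/bin/sh
# Print the FORMAT keys (e.g. GT, DS, EC) of the first data record in a (b)gzipped VCF.
vcf_format_keys() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ { n = split($9, k, ":"); for (i = 1; i <= n; i++) print k[i]; exit }'
}

# Suggest a value for --field: DS if present, else EC if present, else GT (hard genotypes).
suggest_dosage_field() {
    keys=$(vcf_format_keys "$1")
    for f in DS EC; do
        if printf '%s\n' "$keys" | grep -qx "$f"; then
            echo "$f"
            return 0
        fi
    done
    echo "GT"
}
```

For example, `suggest_dosage_field input.vcf.gz` prints EC for a file whose FORMAT column is GT:EC. A result of GT means that no dosage field was found and only hard genotypes are available.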


ANALYSIS OF BINARY OUTCOMES

For analysis of binary outcomes, we are requesting association analyses using the Logistic Wald Test (option --test b.wald in EPACTS). For more information on the test, please see EPACTS

Family- and pedigree-based studies, important note: to account for relatedness, please include genetic principal components in the logistic regression model; in this way we will obtain valid beta estimates (and thus odds ratios) for the meta-analysis. Please do not use linear mixed models as for continuous traits (they provide valid p-values but invalid betas), unless you have software to run logistic mixed models.

Prospective studies, important note: the necessary covariate, baseline GFR, is called “egfr_ckdepi_creat” and is located in the file *out.csv generated by the phenotype generation script, along with age and sex. Note that this is different from the variable “eGFRcrea_overall”, which holds the age- and sex-adjusted residuals for the quantitative trait GWAS and is contained in the *phenotype.txt file generated by the script.


Table 2 provides an overview of all binary outcomes requested from the CKDGen collaborators. Please contribute whichever of these phenotypes are available in your study.


Table 2

| Outcome | Description of the outcome | Logistic regression model: test option | Covariates | Output filename format (replace "study", "ethn", "impPanel", "chr", and "date" following the key given in Table 4) |
|---|---|---|---|---|
| CKD_overall | CKD as generated from phenotype script | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_overall_impPanel_chr_date.txt |
| CKD_DM | CKD as generated from phenotype script among those with diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_DM_impPanel_chr_date.txt |
| CKD_nonDM | CKD as generated from phenotype script among those without diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_nonDM_impPanel_chr_date.txt |
| MA_overall | MA as generated from phenotype script | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_overall_impPanel_chr_date.txt |
| MA_DM | MA as generated from phenotype script among those with diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_DM_impPanel_chr_date.txt |
| MA_nonDM | MA as generated from phenotype script among those without diabetes | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_nonDM_impPanel_chr_date.txt |
| Gout_overall | Gout as generated from phenotype script | --test b.wald | age, sex; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_overall_impPanel_chr_date.txt |
| Gout_men | Gout as generated from phenotype script among men | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_men_impPanel_chr_date.txt |
| Gout_women | Gout as generated from phenotype script among women | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_women_impPanel_chr_date.txt |
| Rapid3 | Only for prospective studies: Rapid3 as generated from phenotype script | --test b.wald | age, sex, baseline eGFR (“egfr_ckdepi_creat” from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_rapid3_impPanel_chr_date.txt |
| Rapid3_DM | Only for prospective studies: Rapid3 as generated from phenotype script among those with diabetes | --test b.wald | age, sex, baseline eGFR (“egfr_ckdepi_creat” from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_rapid3_DM_impPanel_chr_date.txt |
| Rapid3_nonDM | Only for prospective studies: Rapid3 as generated from phenotype script among those without diabetes | --test b.wald | age, sex, baseline eGFR (“egfr_ckdepi_creat” from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_rapid3_nonDM_impPanel_chr_date.txt |
| iCKD25 | Only for prospective studies: incident CKD as generated from phenotype script | --test b.wald | age, sex, baseline eGFR (“egfr_ckdepi_creat” from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_iCKD25_impPanel_chr_date.txt |


Example command to run the binary Wald test using the EPACTS software:

EPACTS-3.2.6/bin/epacts single --vcf [INPUT VCF FILENAME] --ped [INPUT PED FILENAME] \
--out [OUTPUT FILENAME PREFIX] --test b.wald \
--pheno CKD_overall --cov age --cov sex --anno --min-mac 1 --field EC --run 10

Important: To analyze dosages (not genotypes), you must specify the dosage field with the "--field" option, giving the name of the field in your input file that contains the genotype dosages. If you created the VCF files using QCTOOL, or are working with .vcf files obtained from the Michigan or Sanger imputation servers, use "--field DS"; otherwise, you most likely should use "--field EC". Without this option, you will be analyzing the hard genotypes (i.e., the --field option defaults to "GT", or "genotypes")!
The number of CPUs used is specified by the "--run" option and should be set accordingly.

X CHROMOSOME ANALYSES

Only for the phenotypes listed in Table 3, run the analyses on chromosome X for males and females separately, using the same commands listed above for quantitative and binary traits. For information about X chromosome genotype imputation, see the Appendix of the CKDGen Round 4 Analysis Plan. As long as the imputation accounted for the pseudoautosomal and non-pseudoautosomal regions correctly, the additive genetic model for association will work as it does for the autosomes. Remember that the X chromosome should be coded as X (do not use 23 or other codings).
Family- and pedigree-based studies, important note: to account for relatedness for continuous traits, please see Table 3; for binary traits, please include genetic principal components in the logistic regression model (please do not use linear mixed models, as they provide valid p-values but invalid betas, unless you have software to fit logistic mixed models).
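
One way to set up the sex-stratified runs is to write one PED file per sex and pass each to "--ped" in turn. This is a minimal sketch, assuming the standard PED sex coding in column 5 (1 = male, 2 = female) and assuming that the analysis is restricted to the individuals present in the PED file; the function and file names are hypothetical.

```shell
#!/bin/sh
# Print the header line(s) plus all individuals with the given sex code (column 5)
# from PED file $1. $2 = 1 for males, 2 for females.
ped_subset_by_sex() {
    awk -v sex="$2" '/^#/ || $5 == sex' "$1"
}
```

For example, `ped_subset_by_sex pheno.ped 1 > pheno_M.ped` and `ped_subset_by_sex pheno.ped 2 > pheno_F.ped` give the inputs for the chrX_M and chrX_F result files.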

Table 3: X chromosome analyses

| Outcome | Regression model | EPACTS test option | Covariates | Output filename format (replace "study", "ethn", "impPanel", "chr", and "date" following the key given in Table 4) |
|---|---|---|---|---|
| eGFR_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFR_overall_impPanel_chrX_F_date.txt; Study_ethn_eGFR_overall_impPanel_chrX_M_date.txt |
| creatinine_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_creatinine_overall_impPanel_chrX_F_date.txt; Study_ethn_creatinine_overall_impPanel_chrX_M_date.txt |
| UACR_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_UACR_overall_impPanel_chrX_F_date.txt; Study_ethn_UACR_overall_impPanel_chrX_M_date.txt |
| bun_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_bun_overall_impPanel_chrX_F_date.txt; Study_ethn_bun_overall_impPanel_chrX_M_date.txt |
| uric_acid_overall | Linear | --test q.linear (--test q.emmax for family-based studies) | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_uric_acid_overall_impPanel_chrX_F_date.txt; Study_ethn_uric_acid_overall_impPanel_chrX_M_date.txt |
| CKD_overall | Logistic | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_CKD_overall_impPanel_chrX_F_date.txt; Study_ethn_CKD_overall_impPanel_chrX_M_date.txt |
| MA_overall | Logistic | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_MA_overall_impPanel_chrX_F_date.txt; Study_ethn_MA_overall_impPanel_chrX_M_date.txt |
| Gout_overall | Logistic | --test b.wald | age; if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_gout_overall_impPanel_chrX_F_date.txt; Study_ethn_gout_overall_impPanel_chrX_M_date.txt |
| eGFRdecline (prospective studies only) | Linear | --test q.linear | If needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_eGFRdecline_impPanel_chrX_F_date.txt; Study_ethn_eGFRdecline_impPanel_chrX_M_date.txt |
| Rapid3 (prospective studies only) | Logistic | --test b.wald | age, baseline eGFR (“egfr_ckdepi_creat” from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_rapid3_impPanel_chrX_F_date.txt; Study_ethn_rapid3_impPanel_chrX_M_date.txt |
| iCKD25 (prospective studies only) | Logistic | --test b.wald | age, baseline eGFR (“egfr_ckdepi_creat” from *out.csv file); if needed: study-specific covariates (e.g., study site, PCs, etc.) | Study_ethn_iCKD25_impPanel_chrX_F_date.txt; Study_ethn_iCKD25_impPanel_chrX_M_date.txt |



FILENAME CONVENTION

Name all files to be uploaded (GWAS results/GWAS summary statistics and imputation quality files) following the naming key A_B_C_D_E_F.<original_file_extension> as outlined in Table 4:


Table 4

| Key | Meaning |
|---|---|
| A | Your study’s name |
| B | Ethnicity; use EA for European ancestry, AA for African American, AFR for African, EAS for East Asian, SA for South Asian, HIS for Hispanic, IA for Indian ancestry or as applicable |
| C | The analyzed study trait, e.g. “eGFR_overall” |
| D | Imputation reference panel; use “1KGPph1v3”, “1KGPph3v5” or “HRC”, as applicable |
| E | Chromosome. Autosomes: use “chrXX” (e.g., chr03). X chromosome: use "chrX_F" for females and "chrX_M" for males |
| F | Date; use YYYYMMDD |


Examples: ARIC_AA_eGFR_overall_1KGPph3v5_chr20_20160530.txt, ARIC_AA_1KGPph3v5_20160530.info
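
To reduce naming mistakes when many result files are produced, the names can be generated rather than typed. The helper below is a hypothetical sketch of the A_B_C_D_E_F key, not part of the official pipeline; it assumes autosomes are passed as plain numbers and the X chromosome as X_F or X_M.

```shell
#!/bin/sh
# Build a file name A_B_C_D_E_F.<ext> following the naming key.
# Arguments: study, ethnicity, trait, imputation panel, chromosome, date (YYYYMMDD), extension.
ckdgen_filename() {
    case "$5" in
        X_F|X_M) chr="chr$5" ;;                          # sex-specific X chromosome files
        *)       chr=$(printf 'chr%02d' "${5#0}") ;;     # autosomes, zero-padded to two digits
    esac
    printf '%s_%s_%s_%s_%s_%s.%s\n' "$1" "$2" "$3" "$4" "$chr" "$6" "$7"
}
```

For example, `ckdgen_filename ARIC AA eGFR_overall 1KGPph3v5 20 20160530 txt` prints ARIC_AA_eGFR_overall_1KGPph3v5_chr20_20160530.txt.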

Report results

Please upload all .epacts.gz files, following the naming convention provided in Table 4, to the FTP server: http://ckdgen.eurac.edu/upload/


User name: ckdgenR4

Password: ExcitingScience!


Along with the EPACTS .epacts.gz files, please also upload:

  1. .info files that contain imputation quality statistics from imputation (*.info for minimac and “*_info” for IMPUTE2). Information from all chromosomes can be concatenated into one file, but separate files per chromosome are preferred.
  2. the two summary .summary.pdf and .summary.txt files from the phenotype generation script, e.g. "ckdgen-pheno-SHIP-0-201606210921.summary.txt", "ckdgen-pheno-SHIP-0-201606210921.summary.pdf"


Please, inform us with an email to ckdgenconsortium@gmail.com when upload is complete, indicating your study and your name.



Troubleshooting Common Issues

We validated the entire pipeline carefully using different imputation platforms and phenotype scenarios, but we can never guarantee that the pipeline is completely bug-free. For this reason, please report to us any issue you encounter when using this pipeline.

EPACTS installation errors

  No issues reported so far.

EPACTS running errors

  No issues reported so far.

ERROR: No overlapping IDs between VCF and PED file. Cannot proceed.

  Check that the individual IDs in your PED file are the same as those in your VCF file.

  For example, if your VCF individual IDs include the family IDs (e.g., ABCD->ABCD001), the individual IDs in the PED file must match them exactly.

Estimated allele frequencies and analysis results do not exactly match results from my existing association software

  Check that you have included the same set of covariates (with categorical variables encoded as dummy variables).   Check that the same sample size is being analyzed.   Finally, check that you used dosages by adding the appropriate "--field" option.  For example, suppose your VCF is:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A001 B001 C001
11 180567 11:180567 C G 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
11 186458 11:186458 G A 0 PASS . GT:EC 1/1:1.9850 1/1:1.9750 1/1:1.9840
11 186462 11:186462 C A 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
11 192958 11:192958 G T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
11 192995 11:192995 C T 0 PASS . GT:EC 1/1:2.0000 1/1:2.0000 1/1:2.0000
11 193065 11:193065 G A 0 PASS . GT:EC 1/1:1.9980 1/1:1.9990 1/1:1.9960
11 193096 11:193096 C T 0 PASS . GT:EC 0/1:0.7840 0/1:0.6280 1/1:1.6550
11 193146 11:193146 G A 0 PASS . GT:EC 1/1:1.8550 1/1:1.8460 1/1:1.7940

  The genotype information has FORMAT "GT:EC".  For the first SNP (chr11:180567) and individual A001, the genotype is 1/1 and the dosage is 2.0000.  To access the dosages, you must specify the option "--field EC".
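
When comparing allele frequencies with another program, it can help to recompute them directly from the dosage field. The sketch below is an independent check, not the EPACTS calculation: it assumes diploid (autosomal) data, that the dosage counts the ALT allele, and the function name is hypothetical.

```shell
#!/bin/sh
# For each variant of an uncompressed VCF read from stdin, print CHROM:POS and the
# allele frequency computed from the named FORMAT field (default EC) as
# sum(dosage) / (2 * number of samples with a value).
dosage_allele_freq() {
    awk -v field="${1:-EC}" '
        /^#/ { next }                                   # skip header lines
        {
            n = split($9, key, ":"); idx = 0
            for (i = 1; i <= n; i++) if (key[i] == field) idx = i
            if (idx == 0) next                          # field absent in this record
            sum = 0; cnt = 0
            for (s = 10; s <= NF; s++) {
                split($s, val, ":")
                if (val[idx] != "" && val[idx] != ".") { sum += val[idx]; cnt++ }
            }
            if (cnt > 0) printf "%s:%s %.4f\n", $1, $2, sum / (2 * cnt)
        }'
}
```

For example, `gzip -dc input.vcf.gz | dosage_allele_freq EC | head` prints the dosage-based frequencies for the first variants, which can then be compared with the values reported by each program.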


Frequently Asked Questions

Please feel free to ask any questions when things are not clear. We will reply promptly.

How do I code the INDEL variant names and alleles?

Please use the variant name and the allele name directly from IMPUTE or minimac. Please do NOT recode variant names or alleles. We will do this step in the analysis for consistency.

What if I already uploaded files with recoded INDELs?

ACTION: If you have recoded your INDEL alleles, please tell us so that we can remove your file, and let us know when you can re-upload with the original variant and allele names.