Impute2 VCF conversion

From CKDGen wiki

The workflow and scripts below convert IMPUTE2 *.gen files into *.vcf files that carry the dosage of the alternate allele in a single field, which EPACTS needs in order to use dosages for association testing:

  • QCtool converts IMPUTE2 genotype files (*.impute2/*.gen) to VCF.
  • It might be necessary to set the correct chromosome number in the VCF file (it can appear as NA when it is missing).
  • The VCF file must be sorted by variant position; the resort-vcf.sh script below does this.
  • To allow EPACTS to make use of the genotype dosages, a dosage field (default DS) needs to be added to the VCF file, based on the genotype call probabilities (default GP). The AddVcfDosage script performs this final step.
  • When running EPACTS, specify the dosage field with the "--field" option.
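
To make the dosage step concrete, the following is a minimal awk sketch of how a DS value can be derived from GP. It is not the AddVcfDosage implementation; it assumes that GP lists the three probabilities in the order hom-ref, het, hom-alt, and the toy record and file names are made up for illustration.

```shell
# Scratch directory and a toy VCF record with FORMAT GT:GP and two samples
# (all values are invented for this example).
tmp=$(mktemp -d)
printf '1\t100\trs1\tA\tG\t.\t.\t.\tGT:GP\t0/1:0.1,0.8,0.1\t1/1:0,0.25,0.75\n' > "$tmp/toy.vcf"

# Append DS = P(het) + 2 * P(hom-alt) to every sample column.
awk 'BEGIN { FS = OFS = "\t" }
/^#/ { print; next }                      # pass header lines through
{
    n = split($9, fmt, ":")
    gp = 0
    for (i = 1; i <= n; i++) if (fmt[i] == "GP") gp = i
    if (gp == 0) { print; next }          # no GP field: leave record unchanged
    $9 = $9 ":DS"
    for (s = 10; s <= NF; s++) {
        split($s, val, ":")               # per-sample FORMAT values
        split(val[gp], p, ",")            # the three genotype probabilities
        $s = $s ":" sprintf("%.3f", p[2] + 2 * p[3])
    }
    print
}' "$tmp/toy.vcf" > "$tmp/toy.ds.vcf"

cat "$tmp/toy.ds.vcf"
```

For the two toy samples this gives dosages of 1.000 and 1.750, and the FORMAT column becomes GT:GP:DS.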

CAVEAT: Processing huge files can consume considerable amounts of disk space (temporary files!) and memory. Alternative directories for temporary files created by the sort command can be specified by its -T option.
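
As a small illustration of the -T option, the sketch below sorts a toy table by its second column while sending sort's temporary files to a chosen directory. The directory created with mktemp stands in for a scratch volume with enough free space; the variable name SORT_TMP and the sample records are assumptions made for this example.

```shell
# Hypothetical scratch directory; in practice point this at a large volume.
SORT_TMP=$(mktemp -d)

# Unsorted toy records (CHROM, POS, ID); values are made up.
printf '1\t300\trs3\n1\t100\trs1\n1\t200\trs2\n' > "$SORT_TMP/body.tsv"

# -T redirects sort's temporary files; -k2,2n sorts numerically on POS only.
sort -T "$SORT_TMP" -k2,2n "$SORT_TMP/body.tsv" > "$SORT_TMP/sorted.tsv"

cat "$SORT_TMP/sorted.tsv"
```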


Example script (excerpts):

   # GEN -> VCF
   qctool \
       -g ../GCKD_Common_Clean-chr${CHR}.gen.gz \
       -s ../GCKD_Common_Clean-chr${CHR}.sample \
       -og GCKD_Common_Clean-chr${CHR}.vcf
   
   bgzip GCKD_Common_Clean-chr${CHR}.vcf
   
   # set chromosome number (replace NA with the chromosome number in case it is missing)
   zcat GCKD_Common_Clean-chr${CHR}.vcf.gz | \
       awk -v chr=$CHR 'BEGIN { FS = "\t"; OFS = "\t" } ; { if ($1 == "NA") { $1 = chr } ; print }' | \
       bgzip > GCKD_Common_Clean-chr${CHR}.renumber.vcf.gz
   
   # sort VCF file by variant position
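   # run as a child process: the script calls "exit", which would end this shell if it were sourced with "."
   bash resort-vcf.sh GCKD_Common_Clean-chr${CHR}.renumber.vcf.gz \
       GCKD_Common_Clean-chr${CHR}.sorted.vcf.gz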
   
   # add dosage field to VCF
   add-vcf-dosage.sh GCKD_Common_Clean-chr${CHR}.sorted.vcf.gz \
       GCKD_Common_Clean-chr${CHR}.final.vcf.gz
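
The effect of the chromosome renumbering step can be checked on a few toy lines before running it on real data. The sketch below applies the same awk logic to made-up input; the chromosome value 7 and the records are only examples.

```shell
CHR=7   # example chromosome number

# Header lines pass through unchanged; only a first column of "NA" is replaced.
renumbered=$(
    printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\nNA\t100\trs1\n7\t200\trs2\n' |
        awk -v chr="$CHR" 'BEGIN { FS = OFS = "\t" } $1 == "NA" { $1 = chr } { print }'
)

printf '%s\n' "$renumbered"
```

Both data records should come out with 7 in the first column, while the two header lines stay as they were.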

This is the "resort-vcf.sh" script (https://github.com/genepi-freiburg/gwas/blob/master/utils/resort-vcf.sh):

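   INFN=$1
   OUTFN=$2
   
   # abort with a non-zero status if the input file is missing
   if [ ! -f "$INFN" ]
   then
           echo "Input file does not exist: $INFN" >&2
           exit 1
   fi
   
   # never overwrite an existing output file
   if [ -f "$OUTFN" ]
   then
           echo "Output file must not exist: $OUTFN" >&2
           exit 1
   fi
   
   echo "Sort VCF: $INFN"
   # header lines first, then the records sorted numerically by position (column 2)
   (zgrep "^#" "$INFN" ; zgrep -v "^#" "$INFN" | sort -k2,2n) | bgzip > "$OUTFN"
   
   echo "Index VCF: $OUTFN"
   tabix -p vcf "$OUTFN"
   
   echo "Done"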
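
After sorting, it can be worth confirming that the positions really are in ascending order before handing the file to later steps. The helper below is a hypothetical sketch, not part of the linked repository: check_sorted reads VCF text from standard input or from a file and fails if the position in column 2 ever decreases. It assumes one chromosome per file, as in the workflow above.

```shell
# Hypothetical helper: succeeds only if POS (column 2) never decreases.
check_sorted() {
    awk -F '\t' '
        BEGIN { prev = -1 }
        /^#/  { next }                                  # skip header lines
        $2 + 0 < prev {
            printf "out of order at line %d: %s\n", NR, $2
            bad = 1
            exit                                        # jumps to END
        }
        { prev = $2 + 0 }
        END { exit bad }' "$@"
}

# Toy input; for a real file use e.g.: zcat sorted.vcf.gz | check_sorted
printf '#CHROM\tPOS\n1\t100\n1\t250\n' | check_sorted && echo "sorted"
printf '#CHROM\tPOS\n1\t250\n1\t100\n' | check_sorted || echo "not sorted"
```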

The "add-vcf-dosage.sh" script is part of AddVcfDosage and available at https://github.com/genepi-freiburg/AddVcfDosage