phase-snps

Note: To run this step, you must first run download-panel to download the reference-based phasing panel, and specify its location via the argument -D, --refpaneldir. The download-panel command only needs to be run once per system.

This step of HATCHet phases genotypes found in VCF files. If reads were aligned to a version of the reference genome (e.g. hg38) that differs from the version used by the reference panel (e.g. hg19 for the 1000 Genomes Project), the step reconciles the coordinates automatically using the liftover utility within Picard: genotypes are lifted over to the panel's coordinates, phased, and then lifted back to the coordinates of the genome used for alignment. Liftover is skipped if the two versions are the same. Finally, to account for differences in chromosome naming conventions, the "chr" prefix is added or removed so that chromosome names match those used in the reference panel (without "chr" for the 1000 Genomes Project hg19 panel).
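The chromosome renaming described above can be illustrated with a minimal Python sketch. This is not HATCHet's implementation; the function names are hypothetical, and the sketch only rewrites the CHROM column of data lines, leaving header lines (including any `##contig` entries) untouched.

```python
def strip_chr(name: str) -> str:
    """Return the name without a leading 'chr' (e.g. 'chr1' -> '1')."""
    return name[3:] if name.startswith("chr") else name


def add_chr(name: str) -> str:
    """Return the name with a leading 'chr' (e.g. '1' -> 'chr1')."""
    return name if name.startswith("chr") else "chr" + name


def rename_vcf_line(line: str, to_panel: bool) -> str:
    """Rewrite the CHROM column of one VCF data line.

    to_panel=True  -> remove 'chr' (naming of the hg19 panel, per the text)
    to_panel=False -> add 'chr' back (naming of the user's alignment)
    Header lines (starting with '#') are returned unchanged.
    """
    if line.startswith("#"):
        return line
    chrom, sep, rest = line.partition("\t")
    chrom = strip_chr(chrom) if to_panel else add_chr(chrom)
    return chrom + sep + rest
```

Both directions are idempotent, so applying a conversion to names that already follow the target convention leaves them unchanged.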

Input

phase-snps takes one or more VCF files containing heterozygous SNP positions as input, specified using -L, --snps. These are typically produced by genotype_snps.

The following parameters are required to specify and describe the input data:

| Name | Description | Usage |
|------|-------------|-------|
| -L, --snps | A list of VCF files to phase, one per chromosome | Specify a list using a path along with a wildcard, e.g. /path/to/snps/*.vcf.gz |
| -D, --refpaneldir | Path to the reference panel | The location to which the 1000 Genomes Project reference panel was downloaded by download-panel |
| -g, --refgenome | Path to the reference genome used to align reads | Path should include the filename |
| -V, --refversion | Version of the reference genome used to align reads | Specify the human reference genome used for aligning reads: hg19 or hg38 |
| -N, --chrnotation | Chromosome names contain "chr" (e.g., "chr1" instead of "1") | Set this flag if and only if your BAM files/reference genome include "chr" in chromosome names |
| -o, --outdir | Output folder for phased VCFs | Specify an absolute or relative path |
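As an illustration of how these arguments fit together, the following Python sketch assembles an argument list for the command. The function name is hypothetical, and the `hatchet phase-snps` invocation form is an assumption inferred from the `hatchet check` form used elsewhere on this page; only the flags listed in the table are used.

```python
from typing import List, Optional, Sequence


def build_phase_snps_command(
    vcfs: Sequence[str],
    refpaneldir: str,
    refgenome: str,
    refversion: str,
    outdir: str,
    chr_notation: bool = False,
    processes: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for phase-snps (illustrative helper only)."""
    if refversion not in ("hg19", "hg38"):
        raise ValueError("refversion must be 'hg19' or 'hg38'")
    if not vcfs:
        raise ValueError("at least one VCF file is required")
    # Assumption: the command is invoked as `hatchet phase-snps`.
    cmd = ["hatchet", "phase-snps", "-L", *vcfs,
           "-D", refpaneldir, "-g", refgenome,
           "-V", refversion, "-o", outdir]
    if chr_notation:
        cmd.append("-N")  # only when chromosome names include "chr"
    if processes is not None:
        cmd += ["-j", str(processes)]
    return cmd
```

The list of VCF files could be obtained with `sorted(glob.glob("/path/to/snps/*.vcf.gz"))`, which mirrors the shell wildcard shown in the table, and the resulting list could then be passed to `subprocess.run`.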

Output

The following files will be placed in the directory indicated by -o, --outdir:

| Name | Description |
|------|-------------|
| phased.vcf.gz | VCF file containing phased genotypes for all chromosomes |
| phased.log | Table showing how many SNPs were present before and after phasing; SNPs may be lost in the process for various reasons, e.g. they were not present in the reference panel, so there is no information to phase them |
| *_alignments.log | Log files from the shapeit phasing program, one per chromosome |
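To check independently how many SNPs survive phasing, one option is to count the records in the input and output VCF files. The sketch below is a hypothetical helper, not part of HATCHet, and it does not parse phased.log, whose exact layout is not described here. It assumes that compressed files end in `.gz`; Python's `gzip` module can read bgzip-compressed files.

```python
import gzip


def count_vcf_records(path: str) -> int:
    """Count data lines (non-header, non-empty) in a plain or gzipped VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Comparing the total over the input VCFs with `count_vcf_records` on phased.vcf.gz gives a rough view of how many SNPs were dropped.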

Main Parameters

If HATCHet is installed via conda, the dependencies (shapeit, picard, bcftools, and bgzip) should be installed automatically and available on the PATH. You can verify that these dependencies are available by running the check command, i.e., hatchet check.
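As a complement to `hatchet check`, a quick look at the PATH can be sketched in Python. This is not what `hatchet check` does internally; the executable names below are assumptions taken from the dependency list above, and the names on a given system may differ (for example, picard may be available only as a JAR file).

```python
import shutil

# Assumed executable names, taken from the dependency list above.
REQUIRED_TOOLS = ("shapeit", "picard", "bcftools", "bgzip")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `find_missing_tools()` means every name was found on the PATH; it says nothing about whether the installed versions are compatible.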

If HATCHet is installed from source, you may need to install them yourself (e.g., via conda or from source) and/or specify their locations using the following arguments:

| Name | Description | Usage |
|------|-------------|-------|
| -j, --processes | Number of parallel jobs | Parallel jobs independently phase VCF files, which are split up by chromosome |
| -si, --shapeit | Path to the shapeit executable | shapeit is required to run this command |
| -pc, --picard | Path to the picard executable or JAR file | picard is required to run this command |
| -bt, --bcftools | Path to the bcftools executable | bcftools is required to run this command |
| -bg, --bgzip | Path to the bgzip executable | bgzip is required to run this command |