phase-snps¶

Note: To run this step, you must first run download-panel to download the reference-based phasing panel, and specify its location via the argument -D, --refpaneldir. The download-panel command only needs to be run once per system.

This step of HATCHet phases genotypes found in VCF files. It automatically takes care of differences in coordinates if the user has aligned their reads to a version of the reference genome (e.g. hg38) that is different from the version used in the reference panel (e.g. hg19 for the 1000 genomes project), using the liftover utility within Picard. Once genotypes are lifted over and phased, we again perform one last liftover to the coordinates of the genome used for alignment. Liftover is skipped if the version of the reference genome used for alignments corresponds to the same version used in the reference panel. Lastly, in order to account for differences in naming conventions for chromosomes, with or without the “chr” prefix, we also add or remove these prefixes so that chromosome names correspond to those used in the reference panel (without “chr” for the 1000 genomes project hg19 panel).

Input¶

phase-snps takes one or more VCF files containing heterozygous SNP positions as input, specified using -L, --snps. These are typically produced by genotype_snps.

The following parameters are required to specify and describe the input data:

Name	Description	Usage
`-L`, `--snps`	A list of VCF files to phase, one per chromosome	Specify a list using a path along with a wildcard, e.g. /path/to/snps/*.vcf.gz
`-D`, `--refpaneldir`	Path to the Reference Panel	This is the location where the 1000 genome project reference panel will be downloaded
`-g`, `--refgenome`	Path to the reference genome used to align reads	Path should include the filename
`-V`, `--refversion`	Version of reference genome used to align reads	Specify the human reference genome used for aligning reads; hg19 or hg38
`-N`, `--chrnotation`	Chromosome names contain "chr" (e.g., "chr1" instead of "1")	Set this flag if and only if your BAM files/reference genome include "chr" in chromosome names
`-o`, `--outdir`	Output folder for phased VCFs	Specify a path or relative path

Output¶

The following files will be placed in the directory indicated by -o, --outdir:

Name	Description
phased.vcf.gz	VCF file containing phased genotypes for all chromosomes
phased.log	Table showing how many SNPs were present before and after phasing; SNPs may be lost in the process due to various reasons, e.g. they were not present in the reference panel so there is no information to phase them
*_alignments.log	Log files from the shapeit phasing program, one per chromosome

Main Parameters¶

If HATCHet is installed via conda, the dependencies (shapeit, picard, bcftools, and bgzip) should be installed automatically and available on the PATH. You can verify that these dependencies are available by running the check command, i.e., hatchet check.

If HATCHet is installed from source, you may need to install them youself (i.e., via conda or from source) and/or specify their locations using the following arguments:

Name	Description	Usage
`-j`, `--processes`	Number of parallel jobs	Parallel jobs independently phase VCF files, which are split up by chromosome
`-si`, `--shapeit`	Path to the `shapeit` executable	`shapeit` is required to run this command
`-pc`, `--picard`	Path to the `picard` executable or JAR file	`picard` is required to run this command
`-bt`, `--bcftools`	Path to the `bcftools` executable	`bcftools` is required to run this command
`-bg`, `--bgzip`	Path to the `bgzip` executable	`bgzip` is required to run this command