phase-snps
Note: To run this step, you must first run download-panel to download the reference-based phasing panel, and specify its location via the argument -D, --refpaneldir. The download-panel command only needs to be run once per system.
This step of HATCHet phases genotypes found in VCF files. If reads were aligned to a version of the reference genome (e.g. hg38) that differs from the version used by the reference panel (e.g. hg19 for the 1000 Genomes Project), the coordinates are reconciled automatically using the liftover utility within Picard: genotypes are lifted over to the coordinates of the panel, phased, and then lifted over one last time to the coordinates of the genome used for alignment. Liftover is skipped if the two versions are the same. Lastly, to account for differences in chromosome naming conventions (with or without the “chr” prefix), these prefixes are added or removed so that chromosome names correspond to those used in the reference panel (without “chr” for the 1000 Genomes Project hg19 panel).
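The chromosome-name adjustment described above can be illustrated with a minimal Python sketch. It is not HATCHet's own implementation; the function name normalize_chrom and its use_chr_prefix parameter are hypothetical and exist only for this example.

```python
def normalize_chrom(name: str, use_chr_prefix: bool) -> str:
    """Return `name` with or without a leading "chr", as requested.

    Hypothetical helper for illustration only; not part of HATCHet.
    """
    # Strip an existing prefix first so the function is idempotent.
    bare = name[3:] if name.startswith("chr") else name
    return f"chr{bare}" if use_chr_prefix else bare


# Match a panel that names chromosomes without "chr".
print([normalize_chrom(c, use_chr_prefix=False) for c in ["chr1", "chr22", "X"]])

# Convert back to the "chr" convention of the aligned reads.
print([normalize_chrom(c, use_chr_prefix=True) for c in ["1", "22", "chrX"]])
```

Stripping any existing prefix before deciding whether to add one means the function can be applied safely to names that already follow the target convention.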
Input
phase-snps takes one or more VCF files containing heterozygous SNP positions as input, specified using -L, --snps. These are typically produced by genotype_snps.
The following parameters are required to specify and describe the input data:
Name | Description | Usage |
---|---|---|
-L, --snps | A list of VCF files to phase, one per chromosome | Specify a list using a path along with a wildcard, e.g. /path/to/snps/*.vcf.gz |
-D, --refpaneldir | Path to the reference panel | The location to which download-panel downloaded the 1000 Genomes Project reference panel |
-g, --refgenome | Path to the reference genome used to align reads | Path should include the filename |
-V, --refversion | Version of the reference genome used to align reads | Specify the human reference genome used for aligning reads: hg19 or hg38 |
-N, --chrnotation | Chromosome names contain "chr" (e.g., "chr1" instead of "1") | Set this flag if and only if your BAM files/reference genome include "chr" in chromosome names |
-o, --outdir | Output folder for phased VCFs | Specify an absolute or relative path |
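As a rough illustration of how the required arguments fit together, the following Python sketch assembles a command line from the flags in the table. It rests on two assumptions: that the command is invoked as hatchet phase-snps (by analogy with hatchet check), and that -L accepts several space-separated paths, as a shell-expanded wildcard would produce. The helper name build_phase_snps_args and all paths are hypothetical placeholders.

```python
import shlex


def build_phase_snps_args(snps, refpaneldir, refgenome, refversion,
                          outdir, chr_notation=False):
    """Assemble an argument list from the documented flags.

    Hypothetical helper; the "hatchet phase-snps" entry point is an
    assumption made by analogy with "hatchet check".
    """
    if refversion not in ("hg19", "hg38"):
        raise ValueError("refversion must be 'hg19' or 'hg38'")
    args = ["hatchet", "phase-snps",
            "-L", *snps,
            "-D", refpaneldir,
            "-g", refgenome,
            "-V", refversion,
            "-o", outdir]
    if chr_notation:
        # Only set when chromosome names contain "chr".
        args.append("-N")
    return args


# All paths below are placeholders, not real locations.
args = build_phase_snps_args(
    snps=["/path/to/snps/chr1.vcf.gz", "/path/to/snps/chr2.vcf.gz"],
    refpaneldir="/path/to/panel",
    refgenome="/path/to/hg38.fa",
    refversion="hg38",
    outdir="/path/to/phase",
    chr_notation=True,
)
print(shlex.join(args))
```

Building the arguments as a list rather than a single string avoids quoting problems if the list is later passed to subprocess.run.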
Output
The following files will be placed in the directory indicated by -o, --outdir:
Name | Description |
---|---|
phased.vcf.gz | VCF file containing phased genotypes for all chromosomes |
phased.log | Table showing how many SNPs were present before and after phasing; SNPs may be lost in the process due to various reasons, e.g. they were not present in the reference panel so there is no information to phase them |
*_alignments.log | Log files from the shapeit phasing program, one per chromosome |
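To inspect a phased VCF independently of phased.log, one option is to count the genotypes that use the phased separator. In the VCF format, a phased genotype is written with "|" (e.g. 0|1) and an unphased one with "/" (e.g. 0/1). The sketch below is an assumption-laden illustration rather than part of HATCHet: the function name count_phased is hypothetical, only the first sample column is examined, and the demo file is synthetic.

```python
import gzip
import os
import tempfile


def count_phased(vcf_path):
    """Count phased and unphased genotypes in the first sample column.

    Hypothetical helper for illustration; reads a gzip-compressed VCF.
    """
    phased = unphased = 0
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 10:
                continue  # no FORMAT/sample columns
            keys = fields[8].split(":")
            if "GT" not in keys:
                continue
            genotype = fields[9].split(":")[keys.index("GT")]
            if "|" in genotype:
                phased += 1
            else:
                unphased += 1
    return phased, unphased


# Synthetic two-record VCF, used only to demonstrate the function.
demo = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n"
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0|1\n"
    "1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write(demo)
    print(count_phased(path))
```

Files compressed with bgzip are valid multi-member gzip streams, so Python's gzip module can read them without additional libraries.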
Main Parameters
If HATCHet is installed via conda, the dependencies (shapeit, picard, bcftools, and bgzip) should be installed automatically and available on the PATH.
You can verify that these dependencies are available by running the check command, i.e., hatchet check.
If HATCHet is installed from source, you may need to install them yourself (i.e., via conda or from source) and/or specify their locations using the following arguments:
Name | Description | Usage |
---|---|---|
-j, --processes | Number of parallel jobs | Parallel jobs independently phase VCF files, which are split up by chromosome |
-si, --shapeit | Path to the shapeit executable | shapeit is required to run this command |
-pc, --picard | Path to the picard executable or JAR file | picard is required to run this command |
-bt, --bcftools | Path to the bcftools executable | bcftools is required to run this command |
-bg, --bgzip | Path to the bgzip executable | bgzip is required to run this command |
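When installing the dependencies yourself, a quick way to see which of them are already on the PATH is to look each one up with Python's shutil.which. This is only a minimal sketch and not a replacement for hatchet check: the helper name find_tools is hypothetical, and a picard installation that exists only as a JAR file will not be found this way, so its path would still need to be given via -pc, --picard.

```python
import shutil

# Executable names taken from the dependency list above.
TOOLS = ("shapeit", "picard", "bcftools", "bgzip")


def find_tools(tools=TOOLS):
    """Map each tool name to its path on the PATH, or None if absent."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

Any tool reported as missing can either be installed or pointed to explicitly with the corresponding argument from the table.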