Command for running the HATCHet workflow
The entire end-to-end HATCHet pipeline can be run with the hatchet run command. This command requires an ini file from which it reads its configuration values. A sample hatchet.ini file is provided in this folder to get you started. You can name this file anything you like and pass it to hatchet run, but we will use hatchet.ini in the writeup below.
Set variables
Set all variables in hatchet.ini to appropriate values. You will likely not need to modify anything other than the paths to the reference genome, the paths to the normal and tumor BAM files, and the unique names for the tumor samples, all in the run section of hatchet.ini:
reference = "/path/to/reference.fa"
normal = "/path/to/normal.bam"
bams = "/path/to/tumor1.bam /path/to/tumor2.bam"
samples = "Primary Met"
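A quick sanity check on these four values can catch mistakes before a long run. The sketch below is not part of HATCHet: the helper names parse_run_inputs and find_problems are hypothetical, and it assumes that bams and samples are whitespace-separated lists of equal length (one unique name per tumor BAM), as the example above suggests. Paths containing spaces would not survive this simple split.

```python
import configparser

# Sample text mirroring the run section shown above.
SAMPLE_INI = '''
[run]
reference = "/path/to/reference.fa"
normal = "/path/to/normal.bam"
bams = "/path/to/tumor1.bam /path/to/tumor2.bam"
samples = "Primary Met"
'''


def parse_run_inputs(ini_text):
    """Extract the four input values from the run section of ini text."""
    # interpolation=None so that '%' characters in paths are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    run = parser["run"]

    def unquote(value):
        # The sample file wraps values in double quotes; drop them.
        return value.strip().strip('"')

    return {
        "reference": unquote(run["reference"]),
        "normal": unquote(run["normal"]),
        "bams": unquote(run["bams"]).split(),
        "samples": unquote(run["samples"]).split(),
    }


def find_problems(inputs):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if len(inputs["bams"]) != len(inputs["samples"]):
        problems.append(
            "%d tumor BAM(s) but %d sample name(s)"
            % (len(inputs["bams"]), len(inputs["samples"]))
        )
    if len(set(inputs["samples"])) != len(inputs["samples"]):
        problems.append("sample names are not unique")
    return problems


inputs = parse_run_inputs(SAMPLE_INI)
print(inputs["samples"], find_problems(inputs))
```

To check a real file, pass its contents instead of SAMPLE_INI, for example parse_run_inputs(open("hatchet.ini").read()).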
Optionally, if you wish to run the HATCHet pipeline only on select chromosome(s), specify their name(s) under the ‘chromosomes’ key, separated by whitespace. For example:
chromosomes = chr21 chr22
This can be very useful for validating your pipeline relatively quickly before running it on all chromosomes. As an example, this should be set to chr22 for HATCHet Demo data.
To run the pipeline on all chromosomes, leave the key blank:
chromosomes =
Run HATCHet without phasing
Use the following command to run HATCHet without phasing:
hatchet run hatchet.ini
As explained above, you can leave all values at their defaults, but you will want to override the reference, normal, bams and samples values in the ini file.
Run HATCHet with phasing
Running HATCHet with phasing is currently a two-part process. It is a little more labor-intensive but may produce cleaner results.
First run hatchet run hatchet.ini, but enable only the first 3 steps of the HATCHet pipeline in hatchet.ini:
genotype_snps = True
count_alleles = True
count_reads = True
combine_counts = False
cluster_bins = False
plot_bins = False
compute_cn = False
plot_cn = False
After the run finishes, go to the snps subdirectory within the output directory (output/ by default) specified in hatchet.ini. Here you will find a collection of VCF files, one for each chromosome. These must then be phased (e.g. using the Michigan Imputation Server), and the location of the phased VCF file must be specified in hatchet.ini as the phase variable under the combine_counts section. If you use the Michigan Imputation Server:
- You may have to use bcftools annotate to convert between chromosome names (e.g. chr20 -> 20).
- Results are always returned in hg19 coordinates, so you may need to convert coordinates back to hg38 using e.g. Picard’s LiftoverVcf.
- The by-chromosome phased VCF files you receive must be combined with the bcftools concat command to give HATCHet a single phased VCF file.
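The renaming and concatenation steps above can be sketched as follows. This is an illustration under stated assumptions, not a HATCHet utility: the helper names and all file names are hypothetical, and the commands are only assembled and printed, not executed. It relies on the bcftools annotate --rename-chrs option, which takes a file of "old_name new_name" pairs, one per line, and on bcftools concat accepting the input files in the order they should appear in the output.

```python
import shlex


def rename_map_lines(chromosomes):
    """Build 'old_name new_name' lines that strip a leading 'chr' prefix."""
    lines = []
    for name in chromosomes:
        stripped = name[3:] if name.startswith("chr") else name
        lines.append(f"{name} {stripped}")
    return lines


def annotate_command(map_path, in_vcf, out_vcf):
    """bcftools annotate call that renames chromosomes using map_path."""
    return ["bcftools", "annotate", "--rename-chrs", map_path,
            "-O", "z", "-o", out_vcf, in_vcf]


def concat_command(phased_vcfs, out_vcf):
    """bcftools concat call that joins per-chromosome VCFs into one file."""
    return ["bcftools", "concat", "-O", "z", "-o", out_vcf, *phased_vcfs]


# Hypothetical file names, for illustration only.
print("\n".join(rename_map_lines(["chr20", "chr21", "chr22"])))
print(shlex.join(annotate_command("rename.txt", "chr20.vcf.gz", "20.vcf.gz")))
print(shlex.join(concat_command(
    ["chr20.phased.vcf.gz", "chr21.phased.vcf.gz", "chr22.phased.vcf.gz"],
    "phased.vcf.gz",
)))
```

Write the map lines to the file passed as map_path before running the annotate command. If the phased results need the chr prefix restored, swap the two columns of the map.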
Also in hatchet.ini, under the combine_counts section, is a blocklength parameter, which is the haplotype block size used for combining SNPs when estimating B-allele frequencies. This should ideally be 20kb to 50kb. While larger haplotype block sizes allow you to combine more SNPs, the accuracy of phasing declines as the block size increases (e.g. see this paper comparing various phasing methods).
Then, run the HATCHet workflow again using hatchet run hatchet.ini, after enabling only the remaining steps of the HATCHet pipeline. This should have a shorter runtime than the first 3 steps did:
genotype_snps = False
count_alleles = False
count_reads = False
combine_counts = True
cluster_bins = True
plot_bins = True
compute_cn = True
plot_cn = True
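Editing the eight flags by hand for each pass is easy to get wrong, so the two configurations above can also be generated. The sketch below is a hypothetical helper, not something HATCHet provides. It assumes the step flags live in the run section (the section name is a parameter in case they do not) and optionally sets the phase variable under combine_counts. Note that Python's configparser drops comments when it writes a file, so write the result to a copy rather than over your annotated hatchet.ini.

```python
import configparser
import io

# The eight step flags listed above, in pipeline order.
STEPS = [
    "genotype_snps", "count_alleles", "count_reads", "combine_counts",
    "cluster_bins", "plot_bins", "compute_cn", "plot_cn",
]
FIRST_PASS = set(STEPS[:3])   # steps run before phasing
SECOND_PASS = set(STEPS[3:])  # steps run after phasing


def set_steps(ini_text, enabled, section="run", phased_vcf=None):
    """Return ini text with each step set to True only if it is in `enabled`."""
    # interpolation=None so that '%' characters in values are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        parser.add_section(section)
    for step in STEPS:
        parser.set(section, step, str(step in enabled))
    if phased_vcf is not None:
        # Location of the single phased VCF, as described above.
        if not parser.has_section("combine_counts"):
            parser.add_section("combine_counts")
        parser.set("combine_counts", "phase", phased_vcf)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


base = "[run]\ngenotype_snps = True\n"
first = set_steps(base, FIRST_PASS)
# "/path/to/phased.vcf.gz" is a placeholder path.
second = set_steps(first, SECOND_PASS, phased_vcf="/path/to/phased.vcf.gz")
print(second)
```

Each returned string can be saved under its own file name and passed to hatchet run in turn.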