# Command for running the HATCHet workflow

The entire end-end `HATCHet` pipeline can be run by using the `hatchet run` command. This command requires an *ini-file*
from which it gets its configuration values. A sample
[hatchet.ini](https://raw.githubusercontent.com/raphael-group/hatchet/master/script/hatchet.ini) file is provided in
this folder for you to get started. You can name this file anything you want and specify it during `hatchet run`, but we
will use `hatchet.ini` in the writeup below.

## Set variables

Set all variables in `hatchet.ini` with appropriate values. You will likely not need to modify anything at all other
than the paths to the reference genome, paths to the normal and tumor bam files, and unique names for the tumor
samples, all in the `run` section of `hatchet.ini`:

```
reference = "/path/to/reference.fa"
normal = "/path/to/normal.bam"
bams = "/path/to/tumor1.bam /path/to/tumor2.bam"
samples = "Primary Met"
```

Optionally, if you wish to run the HATCHet pipeline only on select chromosome(s), specify their name(s) under the
'chromosomes' key, separated by whitespace. For example:

```
chromosomes = chr21 chr22
```

This can be very useful when trying to validate your pipeline relatively quickly before running it on all chromosomes.
As an example, this should be set to `chr22` for [HATCHet Demo data](https://zenodo.org/record/4046906).
To run the pipeline on all chromosomes, leave the key blank.

```
chromosomes =
```

## Run HATCHet without phasing

Use the following command To run HATCHet without phasing:

```
hatchet run hatchet.ini
```

As explained above, you can leave all values to their defaults, but you will want to override the `reference`, `normal`,
`bams` and `samples` values in the ini file.

## Run HATCHet with phasing

Running HATCHet with phasing is currently a two part process. It's a little more labor intensive but may produce cleaner
results.

First run `hatchet run hatchet.ini`, but **enable only the first 3 steps** of the HATCHet pipeline in `hatchet.ini`:

```
genotype_snps = True
count_alleles = True
count_reads = True
combine_counts = False
cluster_bins = False
plot_bins = False
compute_cn = False
plot_cn = False
```

After the run finishes, go to the `snps` subdirectory within the output directory (`output/` by default) specified in
`hatchet.ini`. Here you will find a collection of VCF files, one for each chromosome. These must then be phased (e.g.
using the [Michigan Imputation Server](https://imputationserver.sph.umich.edu/index.html#!)), and the location of the
phased VCF file specified in `hatchet.ini` as the `phase` variable under the `combine_counts` section. If you use the
Michigan imputation server:

1. You may have to use `bcftools annotate` to convert between chromosome names (e.g. chr20 -> 20)
2. Results are always returned in hg19 coordinates, so you may need to convert coordinates back to hg38 using e.g.
   Picard's [LiftoverVcf](https://broadinstitute.github.io/picard/command-line-overview.html#LiftoverVcf)
3. The by-chromosome phased VCF files you receive must be combined with the `bcftools concat` command to give HATCHet a
   single phased VCF file.

Also in `hatchet.ini`, under the `combine_counts` section is a `blocklength` parameter, which is the haplotype block
size used for combining SNPs when estimating B-Allele frequencies. This should ideally be 20kb to 50kb. While larger
haplotype block sizes allow you to combine more SNPs, the accuracy of phasing declines with the block size used (e.g.
see this [paper](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007308) comparing various
phasing methods).

Then, run the HATCHet workflow again using `hatchet run hatchet.ini`, after enabling only the remaining steps of
the HATCHet pipeline. This should have a shorter runtime than when you ran the first 3 steps:

```
genotype_snps = False
count_alleles = False
count_reads = False
combine_counts = True
cluster_bins = True
plot_bins = True
compute_cn = True
plot_cn = True
```