# combine-counts

This step constructs variable-length bins such that each bin contains at least `--msr` SNP-covering reads and at least `--mtr` total reads in each sample. It then combines the read counts and the allele counts for the identified germline SNPs to compute the read-depth ratio (RDR) and B-allele frequency (BAF) of every genomic bin.
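
The binning idea can be illustrated with a minimal sketch. It is not the implementation used by `combine-counts`: the function name `adaptive_bins`, the greedy left-to-right strategy, and the handling of leftover intervals are all assumptions made for illustration. The sketch takes small candidate intervals with per-sample read counts and grows each bin until both thresholds are met in every sample.

```python
def adaptive_bins(snp_reads, total_reads, msr=5000, mtr=5000):
    """Greedily merge consecutive candidate intervals into bins (illustrative only).

    snp_reads[i][s]   -- SNP-covering reads of interval i in sample s
    total_reads[i][s] -- total reads of interval i in sample s
    Returns a list of (first_interval, last_interval) index pairs, inclusive.
    """
    if not snp_reads:
        return []
    n_samples = len(snp_reads[0])
    bins = []
    start = 0
    acc_snp = [0] * n_samples
    acc_total = [0] * n_samples
    for i in range(len(snp_reads)):
        for s in range(n_samples):
            acc_snp[s] += snp_reads[i][s]
            acc_total[s] += total_reads[i][s]
        # Close the bin only when every sample satisfies both thresholds.
        if all(c >= msr for c in acc_snp) and all(c >= mtr for c in acc_total):
            bins.append((start, i))
            start = i + 1
            acc_snp = [0] * n_samples
            acc_total = [0] * n_samples
    # Assumption: trailing intervals that never reach the thresholds
    # are folded into the last completed bin (or form a single bin if none exists).
    if start < len(snp_reads):
        if bins:
            bins[-1] = (bins[-1][0], len(snp_reads) - 1)
        else:
            bins.append((start, len(snp_reads) - 1))
    return bins
```

Raising `--msr` or `--mtr` in this picture produces fewer, longer bins, because more intervals must be merged before both thresholds are satisfied.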

## Input

`combine-counts` takes as input the output of `count-reads` (i.e., two gzipped files `ch.total.gz` and `ch.thresholds.gz` for each chromosome `ch`). Use the `-A, --array` argument to specify the directory containing these input files.

It also requires a tab-separated file (specified by `-b, --baffile`) with the allele counts for heterozygous germline SNPs from all tumor samples. This file is typically produced by the `count-alleles` command and has the following fields:

| Field | Description |
|-------|-------------|
| `CHR` | Name of a chromosome |
| `POS` | Genomic position corresponding to a heterozygous germline SNP in `CHR` |
| `SAMPLE` | Name of a tumor sample |
| `REF_COUNT` | Count of reads covering `POS` with reference allele |
| `ALT_COUNT` | Count of reads covering `POS` with alternate allele |
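
As a rough illustration of this format, the following sketch parses such a file with the Python standard library. It assumes the file has no header line and that the columns appear in the order listed above; the helper name `read_allele_counts` and the example rows are hypothetical.

```python
import csv
import io

# Column order as listed in the table above (assumed; no header line expected).
FIELDS = ["CHR", "POS", "SAMPLE", "REF_COUNT", "ALT_COUNT"]

def read_allele_counts(handle):
    """Parse a five-column TSV into a list of dicts with integer positions and counts."""
    rows = []
    for rec in csv.reader(handle, delimiter="\t"):
        if not rec:
            continue  # skip empty lines
        row = dict(zip(FIELDS, rec))
        for key in ("POS", "REF_COUNT", "ALT_COUNT"):
            row[key] = int(row[key])
        rows.append(row)
    return rows

# Made-up example content: one SNP observed in two tumor samples.
example = io.StringIO("chr1\t10583\tTumor1\t12\t9\nchr1\t10583\tTumor2\t7\t11\n")
rows = read_allele_counts(example)
```

For a real file, an open file object (for example from `open(path, newline="")`) can be passed in place of the `io.StringIO` object.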

Finally, `combine-counts` requires a TSV file (`-t, --totalcounts`) specifying the total number of reads aligned in each sample (also typically produced by `count-reads`).

In summary, **the following arguments are required to specify input**:

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-A`, `--array`  | Directory containing intermediate files | Typically populated by `count-reads`. For each chromosome `ch`, this directory should contain files `ch.total.gz` and `ch.thresholds.gz` (as well as `samples.txt` indicating sample names) |  |
| `-b, --baffile`  | Tab-separated file with allele counts | Typically produced by `count-alleles`. See description above. |  |
| `-t, --totalcounts`  | Tab-separated file with total aligned reads for each sample | Typically produced by `count-reads`. | |
| `-V, --refversion` | Reference genome version | Either "hg19" or "hg38". This argument is used to select which centromere locations to use. |  |


## Output

`combine-counts` produces a tab-separated file (`-o, --outfile`) with the following fields.

| Field | Description |
|-------|-------------|
| `CHR` | Name of a chromosome |
| `START` | Starting genomic position of a genomic bin in `CHR` |
| `END` | Ending genomic position of a genomic bin in `CHR` |
| `SAMPLE` | Name of a tumor sample |
| `RD` | RDR of the bin in `SAMPLE` (corrected by the total reads in `SAMPLE` vs. the total reads in the matched normal sample) |
| `#SNPS` | Number of SNPs present in the bin in `SAMPLE` |
| `COV` | Average coverage in the bin in `SAMPLE` |
| `ALPHA` | Alpha parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from A allele |
| `BETA` | Beta parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from B allele |
| `BAF` | BAF of the bin in `SAMPLE` |
| `TOTAL_READS` | Total number of reads in the bin in `SAMPLE` |
| `NORMAL_READS` |  Total number of reads in the bin in the matched normal sample |
| `CORRECTED_READS` |  Total number of reads in the bin in `SAMPLE`, corrected by the total reads in `SAMPLE` vs. the total reads in matched normal. |

Currently, it produces one such file that excludes sex chromosomes (for use in HATCHet), and one that includes sex chromosomes (for future use).
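
One plausible reading of the `RD` and `CORRECTED_READS` descriptions above is sketched below. The exact formula used by `combine-counts` is not stated in this document, so the scaling shown here (tumor bin reads rescaled by the ratio of total normal reads to total tumor reads, then divided by the normal bin reads) is an assumption, and the function names are hypothetical.

```python
def corrected_reads(bin_tumor, total_tumor, total_normal):
    """Rescale tumor bin reads to the sequencing depth of the matched normal (assumed formula)."""
    return bin_tumor * total_normal / total_tumor

def read_depth_ratio(bin_tumor, bin_normal, total_tumor, total_normal):
    """Ratio of depth-corrected tumor reads to normal reads in the same bin (assumed formula)."""
    return corrected_reads(bin_tumor, total_tumor, total_normal) / bin_normal

# A tumor sample sequenced twice as deeply as the normal, with twice the reads
# in a bin, gives a ratio of 1 under this assumption.
rdr = read_depth_ratio(bin_tumor=200, bin_normal=100,
                       total_tumor=2_000_000, total_normal=1_000_000)
```

Under this reading, the per-sample totals passed via `-t, --totalcounts` supply `total_tumor` and `total_normal`.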

## Main parameters

`combine-counts` has two main parameters. Their default values work well for most datasets, but they can be tuned to accommodate the features of special datasets.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `--msr`  | Minimum SNP-covering reads for each bin | Each bin constructed by this command must have at least this many reads covering heterozygous SNPs in each sample | 5000 |
| `--mtr`  | Minimum total reads for each bin | Each bin constructed by this command must have at least this many total reads in each sample | 5000 |

## Phasing parameters

To apply reference-based phasing, a phased VCF file must be provided via the `-p, --phase` argument. The remaining parameters control the degree to which the phasing information is used.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-p`, `--phase`  | vcf.gz with phasing for all het. SNPs | File containing phasing data for germline SNPs, typically `phased.vcf.gz` if using the HATCHet pipeline. | None (no phasing is performed) |
| `-s`, `--blocksize`  | Maximum phasing block size | Maximum distance (in bp) between a pair of SNPs included in the same phasing block (ignored if `-p, --phase` is not used) | 25000 |
| `-m`, `--max_spb`  | Maximum number of SNPs per phased block | No more than this many SNPs can be included in the same phasing block (included to minimize phasing errors in high-LD regions) | 10 |
| `-a`, `--alpha`  | Significance threshold to allow adjacent SNPs to be merged | If adjacent SNPs have significantly different BAFs (at this significance level) after taking the phasing into account, they are not merged a priori. Higher means less trust in phasing. | 0.1 |
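
The interaction between `--blocksize` and `--max_spb` can be pictured with the sketch below. It is only an illustration under stated assumptions: the function name `phase_blocks` is hypothetical, distance is measured from the first SNP of the current block (so that every pair of SNPs in a block is within `blocksize`), and the significance test controlled by `--alpha` is omitted entirely.

```python
def phase_blocks(positions, blocksize=25000, max_spb=10):
    """Group sorted SNP positions into blocks limited by span and SNP count (illustrative only)."""
    blocks = []
    current = []
    for pos in positions:
        too_far = bool(current) and pos - current[0] > blocksize
        too_many = len(current) >= max_spb
        if too_far or too_many:
            # Start a new block once either limit would be exceeded.
            blocks.append(current)
            current = []
        current.append(pos)
    if current:
        blocks.append(current)
    return blocks

# Hypothetical SNP positions on one chromosome.
blocks = phase_blocks([100, 200, 300, 30000, 30100], blocksize=25000, max_spb=2)
```

With these toy inputs the first split is caused by the SNP-count limit and the second by the distance limit, which shows that either parameter alone can end a block.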

## Other optional parameters

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-j`, `--processes` | Number of parallel processes to use |  | 1 |
| `-z, --not_compressed`  | Indicates that intermediate files are not compressed | For compatibility with legacy versions of previous step -- set this flag if your `.total` and `.thresholds` files are plaintext rather than gzipped. |  |

## Example usage

`hatchet combine-counts -b baf/bulk.1bed -o abin/bulk.bb -j 24 -V hg19 -A array -t array/total.tsv`