# combine-counts

This step constructs variable-length bins such that each bin contains at least `--msr` SNP-covering reads and at least `--mtr` total reads. It then combines the read counts and the allele counts for the identified germline SNPs to compute the read-depth ratio (RDR) and B-allele frequency (BAF) of every genomic bin.

## Input

`combine-counts` takes as input the output of `count-reads` (i.e., two gzipped files `ch.total.gz` and `ch.thresholds.gz` for each chromosome `ch`). Use the `-A, --array` argument to specify the directory containing these input files.

It also requires (specified by the flag `-b, --baffile`) a tab-separated file specifying the allele counts for heterozygous germline SNPs from all tumor samples. This file is typically produced by the `count-alleles` command and has the following fields:

| Field | Description |
|-------|-------------|
| `CHR` | Name of a chromosome |
| `POS` | Genomic position corresponding to a heterozygous germline SNP in `CHR` |
| `SAMPLE` | Name of a tumor sample |
| `REF_COUNT` | Count of reads covering `POS` with reference allele |
| `ALT_COUNT` | Count of reads covering `POS` with alternate allele |

Finally, `combine-counts` requires a TSV file (`-t, --totalcounts`) specifying the total number of reads aligned in each sample (also typically produced by `count-reads`).

In summary, **the following arguments are required to specify input**:

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-A`, `--array` | Directory containing intermediate files | Typically populated by `count-reads`. For each chromosome `ch`, this directory should contain files `ch.total.gz` and `ch.thresholds.gz` (as well as `samples.txt` indicating sample names) | |
| `-b`, `--baffile` | Tab-separated file with allele counts | Typically produced by `count-alleles`. See description above. | |
| `-t`, `--totalcounts` | Tab-separated file with total aligned reads for each sample | Typically produced by `count-reads`. | |
| `-V`, `--refversion` | Reference genome version | Either "hg19" or "hg38". This argument is used to select which centromere locations to use. | |

## Output

`combine-counts` produces a tab-separated file (`-o, --outfile`) with the following fields.

| Field | Description |
|-------|-------------|
| `CHR` | Name of a chromosome |
| `START` | Starting genomic position of a genomic bin in `CHR` |
| `END` | Ending genomic position of a genomic bin in `CHR` |
| `SAMPLE` | Name of a tumor sample |
| `RD` | RDR of the bin in `SAMPLE` (corrected by the total reads in `SAMPLE` vs. the total reads in the matched normal sample) |
| `#SNPS` | Number of SNPs present in the bin in `SAMPLE` |
| `COV` | Average coverage in the bin in `SAMPLE` |
| `ALPHA` | Alpha parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from A allele |
| `BETA` | Beta parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically total number of reads from B allele |
| `BAF` | BAF of the bin in `SAMPLE` |
| `TOTAL_READS` | Total number of reads in the bin in `SAMPLE` |
| `NORMAL_READS` | Total number of reads in the bin in the matched normal sample |
| `CORRECTED_READS` | Total number of reads in the bin in `SAMPLE`, corrected by the total reads in `SAMPLE` vs. the total reads in the matched normal sample |

Currently, it produces one such file that excludes sex chromosomes (for use in HATCHet), and one that includes sex chromosomes (for future use).

## Main parameters

`combine-counts` has two main parameters. Their default values work for most datasets, but they can be tuned to accommodate the features of special datasets.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `--msr` | Minimum SNP-covering reads for each bin | Each bin constructed by this command must have at least this many reads covering heterozygous SNPs in each sample | 5000 |
| `--mtr` | Minimum total reads for each bin | Each bin constructed by this command must have at least this many total reads in each sample | 5000 |

## Phasing parameters

A phased VCF file must be given via the argument `-p, --phase` to apply reference-based phasing. The remaining parameters control the degree to which the phasing information is used.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-p`, `--phase` | vcf.gz with phasing for all het. SNPs | File containing phasing data for germline SNPs, typically `phased.vcf.gz` if using the HATCHet pipeline. | None (no phasing is performed) |
| `-s`, `--blocksize` | Maximum phasing block size | Maximum distance (in bp) between a pair of SNPs included in the same phasing block (ignored if `-p, --phase` is not used) | 25000 |
| `-m`, `--max_spb` | Maximum number of SNPs per phased block | No more than this many SNPs can be included in the same phasing block (included to minimize phasing errors in high-LD regions) | 10 |
| `-a`, `--alpha` | Significance threshold to allow adjacent SNPs to be merged | If adjacent SNPs have significantly different BAFs (at this significance level) after taking the phasing into account, they are not merged a priori. A higher value means less trust in the phasing. | 0.1 |

## Other optional parameters

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-j`, `--processes` | Number of parallel processes to use | | 1 |
| `-z`, `--not_compressed` | Indicates that intermediate files are not compressed | For compatibility with legacy versions of the previous step: set this flag if your `.total` and `.thresholds` files are plaintext rather than gzipped. | |

## Example usage

`hatchet combine-counts -b baf/bulk.1bed -o abin/bulk.bb -j 24 -V hg19 -A array -t array/total.tsv`
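
To make the role of `--msr` and `--mtr` more concrete, here is a minimal Python sketch of one way variable-length bins could be formed: consecutive fixed windows are merged greedily until every sample reaches both thresholds. This is an illustration only and is not taken from the HATCHet source; the function name `greedy_bins`, the per-window count layout, and the rule for trailing windows are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def greedy_bins(
    total_reads: Sequence[Sequence[int]],
    snp_reads: Sequence[Sequence[int]],
    mtr: int = 5000,
    msr: int = 5000,
) -> List[Tuple[int, int]]:
    """Greedily merge consecutive windows into bins (hypothetical helper).

    total_reads[i][s] and snp_reads[i][s] are assumed per-window counts for
    window i and sample s. A bin is closed as soon as every sample has at
    least `mtr` total reads and `msr` SNP-covering reads in it.
    Returns half-open (first_window, end_window) index pairs.
    """
    if not total_reads:
        return []
    n_samples = len(total_reads[0])
    bins: List[Tuple[int, int]] = []
    start = 0
    tot = [0] * n_samples
    snp = [0] * n_samples
    for i, (t_row, s_row) in enumerate(zip(total_reads, snp_reads)):
        # Accumulate counts for the bin currently being built.
        tot = [a + b for a, b in zip(tot, t_row)]
        snp = [a + b for a, b in zip(snp, s_row)]
        if min(tot) >= mtr and min(snp) >= msr:
            bins.append((start, i + 1))
            start = i + 1
            tot = [0] * n_samples
            snp = [0] * n_samples
    if start < len(total_reads):
        # Assumption: trailing windows that never reach the thresholds are
        # appended to the previous bin (or form a single bin if none exists).
        if bins:
            last_start, _ = bins.pop()
            bins.append((last_start, len(total_reads)))
        else:
            bins.append((start, len(total_reads)))
    return bins


# Four windows, two samples (made-up counts).
example_total = [[3000, 2500], [3000, 2600], [6000, 7000], [1000, 1000]]
example_snp = [[2000, 3000], [3500, 2500], [5000, 5500], [100, 100]]
example_bins = greedy_bins(example_total, example_snp)
print(example_bins)  # [(0, 2), (2, 4)]
```

In this toy input, the first two windows must be merged before both samples reach 5000 total and SNP-covering reads, while the third window qualifies on its own and absorbs the sparse final window. Raising either threshold in this sketch yields fewer, wider bins.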