Multi-Dendrix accepts several classes of input files. This page describes each file and gives an overview of its format. Unless otherwise noted, all files are space-separated and lines starting with ‘#’ are ignored.
The only required file for Multi-Dendrix is the mutation matrix. The mutation matrix is an \(m \times n\) matrix in which each row represents a patient and each column represents a gene or mutation class.
Thus the mutation matrix maps each patient to the genes or mutation classes that are mutated in that patient. Given just a mutation matrix, Multi-Dendrix will run on the entire set of patients and genes (files for restricting the patients or genes input to Multi-Dendrix are explained below). Unlike the other input files, the mutation matrix file is tab-separated: each line lists a patient ID followed by that patient's mutated genes or mutation classes.
Mutation matrix file format example
#Patient | Mutation classes (tab-separated) |
TCGA-01 | G1 G2 G3 |
TCGA-02 | G3 G5 |
TCGA-03 | G8 G9 G14 |
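As an illustration of the format above, here is a minimal sketch of reading a mutation matrix into a dictionary that maps each patient to a set of mutated genes. The function name `load_mutation_matrix` is hypothetical and is not part of the Multi-Dendrix API; the sketch only assumes what is stated above (tab-separated fields, patient ID first, lines starting with ‘#’ ignored).

```python
import io


def load_mutation_matrix(handle):
    """Parse a mutation matrix into a dict: patient ID -> set of genes.

    Hypothetical helper for illustration; not the Multi-Dendrix loader.
    """
    patient_to_genes = {}
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        patient, genes = fields[0], fields[1:]
        # Drop empty fields that a trailing tab would produce.
        patient_to_genes[patient] = {g for g in genes if g}
    return patient_to_genes


# The example file from above, written with real tab characters.
example = io.StringIO(
    "#Patient\tMutation classes\n"
    "TCGA-01\tG1\tG2\tG3\n"
    "TCGA-02\tG3\tG5\n"
    "TCGA-03\tG8\tG9\tG14\n"
)
matrix = load_mutation_matrix(example)
all_genes = set().union(*matrix.values())
print(len(matrix), "patients,", len(all_genes), "genes")
```

A dictionary of sets is used here because the file lists only the mutated genes per patient, which is a sparse encoding of the binary matrix.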
The gene and patient whitelists are used to restrict the genes and patients that are analyzed from a mutation matrix. They are both lists, with one gene (respectively patient) per line. If a gene / patient whitelist is given, Multi-Dendrix will consider only the genes / patients that appear in the whitelist. Genes / patients in the whitelist but not in the mutation matrix will be ignored.
The patient whitelist is required for analyzing (sub)type-specific mutations. In that case, each line of the patient whitelist must also give the patient's (sub)type after the patient ID, e.g.
#Patient | (Sub)type |
TCGA-01 | Luminal A |
TCGA-02 | Basal-like |
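A patient whitelist with (sub)types could be read along the following lines. This is a sketch under assumptions: the name `load_patient_subtypes` is hypothetical, and because a (sub)type such as "Luminal A" may itself contain a space, the sketch treats the first whitespace-delimited token as the patient ID and the remainder of the line as the (sub)type.

```python
import io


def load_patient_subtypes(handle):
    """Parse a patient whitelist into a dict: patient ID -> (sub)type or None.

    Hypothetical helper for illustration; not the Multi-Dendrix loader.
    """
    subtypes = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        # Split once on whitespace: ID first, the rest is the (sub)type.
        parts = line.split(None, 1)
        subtypes[parts[0]] = parts[1].strip() if len(parts) > 1 else None
    return subtypes


example = io.StringIO(
    "#Patient\t(Sub)type\n"
    "TCGA-01\tLuminal A\n"
    "TCGA-02\tBasal-like\n"
)
subtypes = load_patient_subtypes(example)
print(subtypes)
```

A plain whitelist with one patient ID per line parses with the same function; each patient then maps to `None`.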
The gene and patient blacklists are the natural counterpart to the gene and patient whitelists explained above. They have the exact same file format (one line per gene / patient), and any blacklisted genes / patients that appear in the mutation matrix will be removed. Genes / patients in the blacklist but not in the mutation matrix will be ignored.
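The combined effect of whitelists and blacklists can be sketched as a filter over a patient-to-genes mapping. The function `filter_matrix` and its parameters are hypothetical names chosen for this example. The sketch also assumes that a patient left with no mutations after filtering is kept with an empty set, which may differ from what Multi-Dendrix does.

```python
def filter_matrix(matrix, gene_whitelist=None, patient_whitelist=None,
                  gene_blacklist=(), patient_blacklist=()):
    """Return a filtered copy of a dict mapping patient -> set of genes.

    Hypothetical helper. A whitelist of None means "no restriction".
    Whitelist or blacklist entries absent from the matrix have no effect.
    """
    filtered = {}
    for patient, genes in matrix.items():
        if patient_whitelist is not None and patient not in patient_whitelist:
            continue  # patient not whitelisted
        if patient in patient_blacklist:
            continue  # patient explicitly removed
        kept = set(genes)
        if gene_whitelist is not None:
            kept &= set(gene_whitelist)  # keep whitelisted genes only
        kept -= set(gene_blacklist)      # then remove blacklisted genes
        # Assumption: patients with no remaining mutations are kept.
        filtered[patient] = kept
    return filtered


matrix = {
    "TCGA-01": {"G1", "G2", "G3"},
    "TCGA-02": {"G3", "G5"},
    "TCGA-03": {"G8", "G9", "G14"},
}
result = filter_matrix(
    matrix,
    gene_whitelist={"G1", "G3", "G99"},          # G99 is not in the matrix
    patient_blacklist={"TCGA-03", "TCGA-77"},    # TCGA-77 is not in the matrix
)
print(result)
```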
The main file format for loading PPI networks in the Multi-Dendrix pipeline is the edge list. The edge list is used to create a graph via the NetworkX module’s read_edgelist function. The basic format used here lists one edge per line as a space-separated pair of genes. You can read more about the formats accepted by NetworkX in the documentation of the parse_edgelist function.
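For reference, the following sketch parses a small edge list with NetworkX's `parse_edgelist`, which takes an iterable of lines; `read_edgelist` accepts the same `comments` and `nodetype` arguments but reads from a file path. The gene names are made up for the example, and this is not necessarily how Multi-Dendrix itself calls NetworkX.

```python
import networkx as nx

# Hypothetical edge list: one edge per line, genes separated by a space.
lines = [
    "# gene1 gene2",
    "G1 G2",
    "G2 G3",
    "G5 G8",
]

# Lines starting with the comment character are skipped; nodes are
# kept as strings.
ppi = nx.parse_edgelist(lines, comments="#", nodetype=str)

print(ppi.number_of_nodes(), "genes,", ppi.number_of_edges(), "interactions")
print(sorted(ppi.neighbors("G2")))
```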