# ANOVA¶

The Anova (Analysis of variance) method is used to determine which genes exhibit statistically significant variability of insertion counts across multiple conditions. Unlike other methods which take a comma-separated list of wig files as input, the method takes a combined_wig file (which combined multiple datasets in one file) and a samples_metadata file (which describes which samples/replicates belong to which experimental conditions).

## How does it work?¶

The method performs the One-way anova test for each gene across conditions. It takes into account variability of normalized transposon insertion counts among TA sites and among replicates, to determine if the differences among the mean counts for each condition are significant.

## Example¶

> python3 transit.py anova <combined wig file> <samples_metadata file> <annotation .prot_table> <output file> [Optional Arguments]
Optional Arguments:
-n <string>         :=  Normalization method. Default: -n TTR
--exclude-conditions <cond1,...> :=  Comma separated list of conditions to ignore for the analysis. Default: None
--include-conditions <cond1,...> :=  Comma separated list of conditions to include for the analysis. Default: All
--ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions) (by default, LFCs for each condition are computed relative to the grandmean across all condintions)
-iN <float> :=  Ignore TAs occurring within given percentage (as integer) of the N terminus. Default: -iN 0
-iC <float> :=  Ignore TAs occurring within given percentage (as integer) of the C terminus. Default: -iC 0
-PC <N>     := Pseudocounts to use in calculating LFCs. Default: -PC 5
-alpha <N>  := value added to MSE in F-test for moderated anova (makes genes with low counts less significant). Default: -alpha 1000
-winz       :=  winsorize insertion counts for each gene in each condition
(replace max count in each gene with 2nd highest; helps mitigate effect of outliers)


The output file generated by ANOVA identifies which genes exhibit statistically significant variability in counts across conditions (see Output and Diagnostics below).

Note: the combined_wig input file can be generated from multiple wig files through the Transit GUI (File->Export->Selected_Datasets->Combined_wig), or via the ‘export’ command on the command-line (see combined_wig_).

Format of the samples metadata file: a tab-separated file (which you can edit in Excel) with 3 columns: Id, Condition, and Filename (it must have these headers). You can include other columns of info, but do not include additional rows. Individual rows can be commented out by prefixing them with a ‘#’. Here is an example of a samples metadata file: The filenames should match what is shown in the header of the combined_wig (including pathnames, if present).

ID      Condition    Filename
glyc1   glycerol     /Users/example_data/glycerol_rep1.wig
glyc2   glycerol     /Users/example_data/glycerol_rep2.wig
chol1   cholesterol  /Users/example_data/cholesterol_rep1.wig
chol2   cholesterol  /Users/example_data/cholesterol_rep2.wig
chol2   cholesterol  /Users/example_data/cholesterol_rep3.wig


## Parameters¶

The following parameters are available for the ANOVA method:

• –include-conditions: Includes the given set of conditions from the ZINB test. Conditions not in this list are ignored. Note: this is useful for specifying the order in which the columns are listed in the output file.
• –exclude-conditions: Can use this to drop conditions not of interest.
• –ref: Specify which condition to use as a reference for computing LFCs. By default, LFCs for each gene in each condition are calculated with respect to the grand mean count across all conditions (so conditions with higher counts will be balanced with conditions with lower counts). However, if there is a defined reference condition in the data, it may be specified using –ref (in which case LFCs for that condition will be around 0, and will be positive or negative for the other conditions, depending on whether counts are higher or lower than the reference condition. If there is more than one condition to use as reference (i.e. pooled), they may be given as a comma-separated list.
• -n: Normalization Method. Determines which normalization method to use when comparing datasets. Proper normalization is important as it ensures that other sources of variability are not mistakenly treated as real differences. See the Normalization section for a description of normalization method available in TRANSIT. Default: -n TTR
• -PC <N>: Pseudocounts to use in calculating LFCs (see below). Default: -PC 5
• -alpha <N>: Value added to MSE in F-test for moderated ANOVA: F = MSR/(MSE+alpha). This is helpful because genes with very low counts are occasionally ranked as significant by traditional ANOVA, even though the apparent variability is probably due to noise. Setting alpha to a number like 1000 helps filter out these irrelevant genes by reducing their significance. If you want to emulate the standard ANOVA test, you can set alpha to 0. Default: -alpha 1000
• -winz: winsorize insertion counts for each gene in each condition. Replace max count in each gene with 2nd highest. This can help mitigate effect of outliers.

## Output and Diagnostics¶

The anova method outputs a tab-delimited file with results for each gene in the genome. P-values are adjusted for multiple comparisons using the Benjamini-Hochberg procedure (called “q-values” or “p-adj.”). A typical threshold for conditional essentiality on is q-value < 0.05.

Orf Gene ID.
Name Name of the gene.
TAs Number of TA sites in Gene
Means… Mean readcounts for each condition
LFCs… Log-fold-changes of counts in each condition vs mean across all conditions
MSR Mean-squared residual
MSE+alpha Mean-squared error, plus moderation value
p-value P-value calculated by the Anova test.
LFC = log2((mean_insertions_in_condition + PC)/(mean_of_means_across_all_conditions + PC))