ANOVA

The ANOVA (analysis of variance) method is used to determine which genes exhibit statistically significant variability of insertion counts across multiple conditions. Unlike other methods, which take a comma-separated list of wig files as input, this method takes a combined_wig file (which combines multiple datasets in one file) and a samples_metadata file (which describes which samples/replicates belong to which experimental conditions).


How does it work?

The method performs a one-way ANOVA test for each gene across conditions. It takes into account the variability of normalized transposon insertion counts among TA sites and among replicates to determine whether the differences among the mean counts for each condition are significant.
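As an illustration of the test described above, here is a minimal pure-Python sketch of the one-way ANOVA F statistic for a single gene. It is not TRANSIT's own code: the function name `one_way_anova_f` and the example counts are hypothetical, and the sketch assumes that the observations for a condition are the normalized counts pooled over that gene's TA sites and replicates. Converting F to a p-value requires the F distribution (for example via `scipy.stats.f_oneway`), which is omitted here to keep the example dependency-free.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations.

    Each group is assumed to hold the normalized insertion counts observed
    for one condition (pooled over TA sites and replicates of one gene).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need >= 2 non-empty groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability of the condition means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability of individual counts around their own condition mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        # No within-condition variability: F is unbounded (or undefined).
        f_stat = float("nan") if ss_between == 0 else float("inf")
        return f_stat, df_between, df_within

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical counts for one gene in two conditions.
f_stat, df_b, df_w = one_way_anova_f([[10, 12, 14], [20, 22, 24]])
print(f_stat, df_b, df_w)  # F = 37.5 with (1, 4) degrees of freedom
```

A large F means the condition means differ by more than the spread of counts within each condition would explain.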

Example

python3 transit.py anova <combined wig file> <samples_metadata file> <annotation .prot_table> <output file> [Optional Arguments]
      Optional Arguments:
      -n <string>         :=  Normalization method. Default: -n TTR
      --exclude-conditions <cond1,...> :=  Comma separated list of conditions to ignore for the analysis. Default: None
      --include-conditions <cond1,...> :=  Comma separated list of conditions to include for the analysis. Default: All
      --ref <cond> := which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions) (by default, LFCs for each condition are computed relative to the grand mean across all conditions)
      -iN <float> :=  Ignore TAs occurring within given percentage (as integer) of the N terminus. Default: -iN 0
      -iC <float> :=  Ignore TAs occurring within given percentage (as integer) of the C terminus. Default: -iC 0
      -PC         := Pseudocounts to use in calculating LFCs. Default: -PC 5
      -winz       :=  winsorize insertion counts for each gene in each condition
                      (replace max count in each gene with 2nd highest; helps mitigate effect of outliers)

The output file generated by ANOVA identifies which genes exhibit statistically significant variability in counts across conditions (see Output and Diagnostics below).

Note: the combined_wig input file can be generated from multiple wig files through the Transit GUI (File->Export->Selected_Datasets->Combined_wig), or via the ‘export’ command on the command line (see combined_wig).

Format of the samples metadata file: a tab-separated file (which you can edit in Excel) with 3 columns: Id, Condition, and Filename (it must have these headers). You can include other columns of info, but do not include additional rows. Individual rows can be commented out by prefixing them with a ‘#’. The filenames should match what is shown in the header of the combined_wig (including pathnames, if present). Here is an example of a samples metadata file:

ID      Condition    Filename
glyc1   glycerol     /Users/example_data/glycerol_rep1.wig
glyc2   glycerol     /Users/example_data/glycerol_rep2.wig
chol1   cholesterol  /Users/example_data/cholesterol_rep1.wig
chol2   cholesterol  /Users/example_data/cholesterol_rep2.wig
chol3   cholesterol  /Users/example_data/cholesterol_rep3.wig
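The format rules above can be sketched as a small reader that groups filenames by condition. This is an illustrative sketch, not TRANSIT's parser: the function name `read_samples_metadata` and the short filenames are hypothetical, and matching the required headers case-insensitively is an assumption made here so that both "Id" and "ID" are accepted.

```python
import csv
import io

REQUIRED_HEADERS = ("id", "condition", "filename")


def read_samples_metadata(handle):
    """Map each condition to the list of filenames assigned to it."""
    # Skip blank lines and rows commented out with a leading '#'.
    rows = [line.rstrip("\n") for line in handle
            if line.strip() and not line.lstrip().startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")

    # Assumption: headers are compared case-insensitively.
    headers = {name.strip().lower(): name for name in (reader.fieldnames or [])}
    missing = [h for h in REQUIRED_HEADERS if h not in headers]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))

    by_condition = {}
    for row in reader:
        condition = row[headers["condition"]].strip()
        filename = row[headers["filename"]].strip()
        by_condition.setdefault(condition, []).append(filename)
    return by_condition


# Hypothetical metadata; the third data row is commented out and ignored.
example = (
    "Id\tCondition\tFilename\n"
    "glyc1\tglycerol\tglycerol_rep1.wig\n"
    "glyc2\tglycerol\tglycerol_rep2.wig\n"
    "#chol0\tcholesterol\tdiscarded_run.wig\n"
    "chol1\tcholesterol\tcholesterol_rep1.wig\n"
)
groups = read_samples_metadata(io.StringIO(example))
print(groups)
```

Extra columns are simply carried along and ignored, which matches the note that additional columns are allowed.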

Parameters

The following parameters are available for the ANOVA method:

  • --include-conditions: Restricts the ANOVA test to the given set of conditions; conditions not in this list are ignored. Note: this is also useful for specifying the order in which the columns are listed in the output file.
  • --exclude-conditions: Drops the given conditions from the analysis, for example conditions that are not of interest.
  • --ref: Specifies which condition to use as a reference for computing LFCs. By default, LFCs for each gene in each condition are calculated with respect to the grand mean count across all conditions (so conditions with higher counts will be balanced with conditions with lower counts). However, if there is a defined reference condition in the data, it may be specified using --ref, in which case LFCs for that condition will be around 0, and will be positive or negative for the other conditions, depending on whether counts are higher or lower than in the reference condition. If there is more than one condition to use as reference (i.e. pooled), they may be given as a comma-separated list.
  • -n: Normalization method. Determines which normalization method to use when comparing datasets. Proper normalization is important as it ensures that other sources of variability are not mistakenly treated as real differences. See the Normalization section for a description of the normalization methods available in TRANSIT.
  • -PC: Pseudocounts to use in calculating LFCs (see below). Default: -PC 5
  • -winz: Winsorize insertion counts for each gene in each condition by replacing the max count in each gene with the 2nd highest. This can help mitigate the effect of outliers.
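To make the -winz option concrete, here is a minimal sketch of the idea of replacing the maximum count with the second-highest one. It is an illustration rather than TRANSIT's implementation: the helper name `winsorize_max` is hypothetical, and it assumes that "2nd highest" means the second-highest distinct value and that every occurrence of the maximum is replaced.

```python
def winsorize_max(counts):
    """Replace the largest count with the second-highest distinct count."""
    distinct = sorted(set(counts))
    if len(distinct) < 2:
        # Nothing to clip when all counts are equal (or the list is empty).
        return list(counts)
    highest, second_highest = distinct[-1], distinct[-2]
    return [second_highest if c == highest else c for c in counts]


# Hypothetical counts at the TA sites of one gene in one condition:
# a single outlier of 400 is pulled down to the next-highest value.
print(winsorize_max([0, 3, 5, 400]))  # [0, 3, 5, 5]
```

Clipping a single extreme site in this way keeps one outlier from dominating the mean count for the gene.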

Output and Diagnostics

The ANOVA method outputs a tab-delimited file with results for each gene in the genome. P-values are adjusted for multiple comparisons using the Benjamini-Hochberg procedure (the adjusted values are called “q-values” or “p-adj”). A typical threshold for conditional essentiality is q-value < 0.05.

Column Header   Column Definition
Orf             Gene ID.
Name            Name of the gene.
TAs             Number of TA sites in the gene.
Means…          Mean read counts for each condition.
LFCs…           Log-fold-changes of counts in each condition vs. the mean across all conditions.
p-value         P-value calculated by the ANOVA test.
p-adj           Adjusted p-value controlling for the FDR (Benjamini-Hochberg).
status          Debug information (if any).
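For readers unfamiliar with how a p-adj column is derived, the standard Benjamini-Hochberg adjustment can be sketched as follows. This is a textbook version written for illustration, with a hypothetical function name and made-up p-values; it is not taken from TRANSIT's source.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by n / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in padj]
print(padj, significant)
```

Genes whose adjusted value falls below the chosen threshold (0.05 in the text above) would be reported as significantly varying.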

LFCs (log-fold-changes): For each condition, the LFC is calculated as the log-base-2 of the ratio of the mean insertion count in that condition to the mean of means across all conditions. Pseudocounts are incorporated to reduce the impact of noise on LFCs, based on the formula below, and can be adjusted using the -PC flag. Increasing the pseudocounts can reduce the artifactual appearance of genes that have high-magnitude LFCs but small overall counts (which are susceptible to noise). However, changing the pseudocounts does not affect the analysis of statistical significance, and hence does not change the number of varying genes.

LFC = log2((mean_insertions_in_condition + PC)/(mean_of_means_across_all_conditions + PC))
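The formula above can be tried out with a short sketch. The function name `compute_lfcs` and the example means are hypothetical; the calculation follows the default case in which the reference is the mean of means across all conditions, and the pseudocount of 0 is used only to show the contrast within this sketch.

```python
import math


def compute_lfcs(condition_means, pseudocount=5.0):
    """Return log2 fold-changes of each condition mean vs. the mean of means."""
    mean_of_means = sum(condition_means.values()) / len(condition_means)
    return {
        condition: math.log2((mean + pseudocount) / (mean_of_means + pseudocount))
        for condition, mean in condition_means.items()
    }


# Hypothetical gene with very low counts in both conditions.
means = {"glycerol": 1.0, "cholesterol": 3.0}

no_pc = compute_lfcs(means, pseudocount=0.0)
with_pc = compute_lfcs(means, pseudocount=5.0)
print(no_pc)    # glycerol: log2(1/2) = -1.0
print(with_pc)  # glycerol: log2(6/7), much closer to 0
```

With so few counts, the difference between 1 and 3 is mostly noise, and the pseudocount shrinks the corresponding LFCs toward 0 as described above.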

Run-time

A typical run of the anova method takes less than 1 minute for a combined wig file with 6 conditions, 3 replicates per condition.