HMM

The HMM method assesses essentiality across the entire genome, site by site, rather than gene by gene as the other methods do. It can identify regions with unusually high or unusually low read counts (i.e. growth-advantage or growth-defect regions), in addition to the more common essential and non-essential categories.

Note

Intended only for Himar1 datasets.

How does it work?

For a formal description of how this method works, see our paper [DeJesus2013HMM]:

DeJesus, M.A., Ioerger, T.R. A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data. BMC Bioinformatics. 2013. 14:303

Usage

> python3 transit.py hmm <comma-separated .wig files> <annotation .prot_table or GFF3> <output_BASE_filename>
      (will create 2 output files: BASE.sites.txt and BASE.genes.txt)

      Optional Arguments:
          -r <string>     :=  How to handle replicates. Sum, Mean. Default: -r Mean
          -l              :=  Perform LOESS Correction; Helps remove possible genomic position bias. Default: Off.
          -iN <float>     :=  Ignore TAs occurring at given percentage (as integer) of the N terminus. Default: -iN 0
          -iC <float>     :=  Ignore TAs occurring at given percentage (as integer) of the C terminus. Default: -iC 0
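
When many runs are launched from a script, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of TRANSIT: the helper build_hmm_command and the file names are hypothetical, and only the arguments and flags listed in the usage text above are used.

```python
import shlex


def build_hmm_command(wig_files, annotation, output_base,
                      replicates="Mean", loess=False,
                      ignore_n=0, ignore_c=0):
    """Assemble the argument list for a `transit.py hmm` run (hypothetical helper)."""
    cmd = [
        "python3", "transit.py", "hmm",
        ",".join(wig_files),      # replicates go in as one comma-separated argument
        annotation,
        output_base,
        "-r", replicates,         # Sum or Mean, as in the usage text
    ]
    if loess:
        cmd.append("-l")          # LOESS correction is off unless requested
    if ignore_n:
        cmd += ["-iN", str(ignore_n)]
    if ignore_c:
        cmd += ["-iC", str(ignore_c)]
    return cmd


# File names below are placeholders, not files shipped with TRANSIT.
cmd = build_hmm_command(["rep1.wig", "rep2.wig"], "genome.prot_table",
                        "hmm_run", loess=True, ignore_n=5)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains transit.py.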

Parameters

The HMM method automatically estimates the necessary statistical parameters from the datasets. You can change how the method handles replicate datasets:

Replicates: Determines how the HMM deals with replicate datasets, either by averaging or by summing the read-counts across datasets. For regular datasets (i.e. mean read-count > 100), the recommended setting is to average the read-counts. For sparse datasets, summing read-counts may produce more accurate results.
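
The difference between the two replicate settings can be illustrated with a small sketch. This is not TRANSIT's internal code: the function combine_replicates and the example counts are hypothetical, and each replicate is assumed to be a list of read counts over the same TA sites in the same order.

```python
def combine_replicates(replicates, how="Mean"):
    """Combine per-site read counts from several replicates into one list.

    `replicates` is a list of equal-length lists, one per dataset.
    `how` mirrors the -r option: "Mean" averages, "Sum" adds.
    """
    if not replicates:
        raise ValueError("at least one replicate is required")
    n_sites = len(replicates[0])
    if any(len(rep) != n_sites for rep in replicates):
        raise ValueError("all replicates must cover the same TA sites")

    # zip(*replicates) yields, for each site, its counts across replicates.
    totals = [sum(site_counts) for site_counts in zip(*replicates)]
    if how == "Sum":
        return totals
    if how == "Mean":
        return [total / len(replicates) for total in totals]
    raise ValueError("how must be 'Sum' or 'Mean'")


# Invented counts for three TA sites in two replicates.
rep1 = [0, 10, 4]
rep2 = [2, 30, 0]
mean_counts = combine_replicates([rep1, rep2], how="Mean")
sum_counts = combine_replicates([rep1, rep2], how="Sum")
```

Summing keeps more signal at sites where individual replicates have few reads, which is why it may suit sparse datasets better.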

Output and Diagnostics

The HMM method outputs two files. The first file (BASE.sites.txt) provides the most likely state assignment for every TA site in the genome. Each site belongs to one of the following states: “ES” (Essential), “GD” (Growth-Defect), “NE” (Non-Essential), or “GA” (Growth-Advantage). In addition, the output includes the probability of the site belonging to each state. The columns of this file are defined as follows:

Column #	Column Definition
1	Coordinate of TA site
2	Observed Read Counts
3	Probability for ES state
4	Probability for GD state
5	Probability for NE state
6	Probability for GA state
7	State Classification (ES = Essential, GD = Growth-Defect, NE = Non-Essential, GA = Growth-Advantage)
8	Gene(s) that share(s) the TA site.
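
The per-site output can be post-processed to recover contiguous regions that share a state. The sketch below is an illustration only. It assumes the file is tab-delimited, that any header or comment lines start with "#", and that the columns appear in the order listed above; none of these details are confirmed here, so check them against your own output. The names Site, parse_sites and state_regions are hypothetical, and the sample lines are invented.

```python
from collections import namedtuple
from itertools import groupby

Site = namedtuple("Site", "coord reads p_es p_gd p_ne p_ga state genes")


def parse_sites(lines):
    """Parse lines of a sites file into Site records (assumed layout)."""
    sites = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed '#' header lines
        f = line.split("\t")
        genes = f[7] if len(f) > 7 else ""  # gene column may be empty
        sites.append(Site(int(f[0]), float(f[1]), float(f[2]), float(f[3]),
                          float(f[4]), float(f[5]), f[6], genes))
    return sites


def state_regions(sites):
    """Group consecutive sites with the same state into regions.

    Returns tuples of (state, first coordinate, last coordinate, site count).
    """
    regions = []
    for state, run in groupby(sites, key=lambda s: s.state):
        run = list(run)
        regions.append((state, run[0].coord, run[-1].coord, len(run)))
    return regions


# Invented example lines, not real TRANSIT output.
sample = [
    "# hypothetical header line",
    "60\t0\t0.95\t0.03\t0.02\t0.00\tES\tgeneA",
    "72\t0\t0.90\t0.06\t0.04\t0.00\tES\tgeneA",
    "102\t35\t0.01\t0.09\t0.88\t0.02\tNE\t",
]
sites = parse_sites(sample)
regions = state_regions(sites)
```

For a real run, pass an open file handle for BASE.sites.txt to parse_sites in place of the sample list.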

The second file (BASE.genes.txt) provides a gene-level classification for all the genes in the genome. Genes are classified as “ES” (Essential), “GD” (Growth-Defect), “NE” (Non-Essential), or “GA” (Growth-Advantage) depending on the number of sites within the gene that belong to those states.

Column Header	Column Definition
Orf	Gene ID
Name	Gene Name
Desc	Gene Description
N	Number of TA sites
n0	Number of sites labeled ES (Essential)
n1	Number of sites labeled GD (Growth-Defect)
n2	Number of sites labeled NE (Non-Essential)
n3	Number of sites labeled GA (Growth-Advantage)
Avg. Insertions	Mean insertion rate within the gene
Avg. Reads	Mean read count within the gene
State Call	State Classification (ES = Essential, GD = Growth-Defect, NE = Non-Essential, GA = Growth-Advantage)
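
A quick way to summarize the gene-level output is to tally the state calls. The sketch below makes the same assumptions as any simple parser of this file would: tab-delimited rows, header or comment lines starting with "#", and the eleven columns in the order listed above. These are assumptions, not confirmed details of the format. The names GENE_COLUMNS, parse_genes and count_state_calls are hypothetical, and the sample rows are invented.

```python
from collections import Counter

# Short keys for the eleven columns described in the table above.
GENE_COLUMNS = ["orf", "name", "desc", "n_sites", "n_es", "n_gd",
                "n_ne", "n_ga", "avg_insertions", "avg_reads", "state"]


def parse_genes(lines):
    """Parse lines of a genes file into dictionaries (assumed layout)."""
    genes = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed '#' header lines
        fields = line.split("\t")
        if len(fields) != len(GENE_COLUMNS):
            raise ValueError("unexpected column count: %d" % len(fields))
        genes.append(dict(zip(GENE_COLUMNS, fields)))
    return genes


def count_state_calls(genes):
    """Count how many genes received each state call."""
    return Counter(gene["state"] for gene in genes)


# Invented example rows, not real TRANSIT output.
sample = [
    "# hypothetical header line",
    "gene0001\tabcA\thypothetical protein\t10\t9\t1\t0\t0\t0.0\t0.0\tES",
    "gene0002\tabcB\thypothetical protein\t8\t0\t0\t8\t0\t0.75\t120.5\tNE",
    "gene0003\t-\thypothetical protein\t5\t0\t0\t5\t0\t0.8\t98.0\tNE",
]
genes = parse_genes(sample)
calls = count_state_calls(genes)
```

A tally dominated by GD calls is one sign that the library may be too sparse for the HMM.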

Note: Libraries that are too sparse (e.g. insertion density < 30%) or that contain very low read-counts may be problematic for the HMM method, causing it to label too many genes as Growth-Defect.
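
Before running the HMM, it may be worth checking how sparse a library is. The following sketch assumes that sparsity means the fraction of TA sites with at least one read, and it reports the mean read count over non-zero sites only; both definitions are assumptions made for illustration. The function library_stats and the example counts are hypothetical, and the 30% cutoff is the example figure from the note above rather than a hard limit.

```python
def library_stats(read_counts):
    """Return (insertion density, mean read count over non-zero sites)."""
    n_sites = len(read_counts)
    if n_sites == 0:
        raise ValueError("no TA sites provided")
    nonzero = [count for count in read_counts if count > 0]
    density = len(nonzero) / n_sites
    mean_nonzero = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return density, mean_nonzero


# Invented read counts for ten TA sites.
counts = [0, 0, 12, 0, 0, 0, 88, 0, 0, 0]
density, mean_nonzero = library_stats(counts)
too_sparse = density < 0.30  # example threshold taken from the note above
```

If a library falls below the threshold, the gene-level Growth-Defect calls deserve extra scrutiny.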

Run-time

The HMM method takes less than 10 minutes to complete. The parameters of the method should not affect the running-time.