tnseq_stats

This command is useful for calculating some metrics on input wig (or combined_wig files) for assessement of data quality.

Similar information can be generated through the GUI via the menu options ‘View->Quality Control’ (select samples in main window first).

Usage:

> usage: python3 src/transit.py tnseq_stats <file.wig>+ [-o <output_file>]
         python3 src/transit.py tnseq_stats -c <combined_wig> [-o <output_file>]

It generates a table (tab-separated text file that can be opened in Excel) with the following statistics in it:

Column Header

Column Definition

Comments

dataset

Name of sample (wig file)

density

Fraction of sites with insertions.

“Well-saturated” Himar1 datasets have >30% saturation. Below this, statistical methods may have trouble.

mean_ct

Average read-count over all TA sites.

NZMean

Average read-count, excluding empty sites.

A value between 30-200 is usually good for Himar1 datasets. Too high or too low can indicate problems.

NZMedian

Median read-count, excluding empty sites.

As read-counts can often have spikes, median serves as a good robust estimate.

max_ct

Largest read-count at any TA site

Useful to determine whether there are outliers/spikes, which may indicate sequencing issues.

total_cts

Sum of total read-counts in the sample.

Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets.

skewness

3rd-order moment of read-count distribution.

Sharp peak? Large skew may indicate issues with a dataset. Typically a skew < 50 is desired. May be higher when

kurtosis

4th-order moment of read-counts distribution

Lop-sided peak?

PTI

Pickand Tail Index

Another statistical measure that indicates presence of individual TA sites with high outlier counts. PTI>1.0 is not good.

Here is an example:

> python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol_combined.dat
dataset density mean_ct NZmean  NZmedian        max_ct  total_cts       skewness        kurtosis        pickands_tail_index
src/pytransit/data/cholesterol_H37Rv_rep1.wig   0.439   139.6   317.6   147     125355.5        10414005        54.8    4237.7  0.973
src/pytransit/data/cholesterol_H37Rv_rep2.wig   0.439   171.4   390.5   148     704662.8        12786637        105.8   14216.2 1.529
src/pytransit/data/cholesterol_H37Rv_rep3.wig   0.359   173.8   484.2   171     292294.8        12968502        42.2    2328.0  1.584
src/pytransit/data/glycerol_H37Rv_rep1.wig      0.419   123.3   294.5   160     8813.3  9195672 4.0     33.0    0.184
src/pytransit/data/glycerol_H37Rv_rep2.wig      0.516   123.8   240.1   127     8542.5  9235984 4.0     33.5    0.152

In this example, you can see the 5 samples have saturations in the range of 35.9-51.6% (which is decent). The NZMeans are in the range 123-139, but this is post-normalization. (TTR normalization had already been applied to this combined_wig file, so the means can be expected to be scaled to around 100.) If you want to see the NZmeans for the raw data, re-generate the combined_wig file using ‘-n nonorm’, to skip the automatic normalization step. These samples also exhibit skewness that is on the high side (33.0-105.8). This is probably related to the fact that some individual TA sites have very high counts. For example, rep2 of cholesterol has a max count of 704662 at a single TA site, representing over 5% of the 12.78M total insertion counts. TTR is supposed to be robust by ignoring the top 5% of most abundant sites during normalization, but still the rest of the distribution of counts could be skewed. This sample also has a high Pickands’ tail index of 1.53 (which is above 1.0), also suggesting skew. While, we currently don’t have recommendations for hard cutoffs to use for identifying bad samples (e.g. that might need to be re-sequenced), I would say that skew>30 and/or PTI>1.0 are signs that a sample might be noisy or lower-quality. See Quality Control for more discussion about assessing quality of TnSeq datasets. Nonetheless, doing resampling on this data still yielded insights into many genes required for cholesterol metabolism in M. tuberculosis (Griffin et al, 2009).