tnseq_stats
This command is useful for calculating some metrics on input wig (or combined_wig files) for assessement of data quality.
Similar information can be generated through the GUI via the menu options ‘View->Quality Control’ (select samples in main window first).
Usage:
> usage: python3 src/transit.py tnseq_stats <file.wig>+ [-o <output_file>]
python3 src/transit.py tnseq_stats -c <combined_wig> [-o <output_file>]
It generates a table (tab-separated text file that can be opened in Excel) with the following statistics in it:
Column Header |
Column Definition |
Comments |
---|---|---|
dataset |
Name of sample (wig file) |
|
density |
Fraction of sites with insertions. |
“Well-saturated” Himar1 datasets have >30% saturation. Below this, statistical methods may have trouble. |
mean_ct |
Average read-count over all TA sites. |
|
NZMean |
Average read-count, excluding empty sites. |
A value between 30-200 is usually good for Himar1 datasets. Too high or too low can indicate problems. |
NZMedian |
Median read-count, excluding empty sites. |
As read-counts can often have spikes, median serves as a good robust estimate. |
max_ct |
Largest read-count at any TA site |
Useful to determine whether there are outliers/spikes, which may indicate sequencing issues. |
total_cts |
Sum of total read-counts in the sample. |
Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets. |
skewness |
3rd-order moment of read-count distribution. |
Sharp peak? Large skew may indicate issues with a dataset. Typically a skew < 50 is desired. May be higher when |
kurtosis |
4th-order moment of read-counts distribution |
Lop-sided peak? |
PTI |
Pickand Tail Index |
Another statistical measure that indicates presence of individual TA sites with high outlier counts. PTI>1.0 is not good. |
Here is an example:
> python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol_combined.dat
dataset density mean_ct NZmean NZmedian max_ct total_cts skewness kurtosis pickands_tail_index
src/pytransit/data/cholesterol_H37Rv_rep1.wig 0.439 139.6 317.6 147 125355.5 10414005 54.8 4237.7 0.973
src/pytransit/data/cholesterol_H37Rv_rep2.wig 0.439 171.4 390.5 148 704662.8 12786637 105.8 14216.2 1.529
src/pytransit/data/cholesterol_H37Rv_rep3.wig 0.359 173.8 484.2 171 292294.8 12968502 42.2 2328.0 1.584
src/pytransit/data/glycerol_H37Rv_rep1.wig 0.419 123.3 294.5 160 8813.3 9195672 4.0 33.0 0.184
src/pytransit/data/glycerol_H37Rv_rep2.wig 0.516 123.8 240.1 127 8542.5 9235984 4.0 33.5 0.152
In this example, you can see the 5 samples have saturations in the range of 35.9-51.6% (which is decent). The NZMeans are in the range 123-139, but this is post-normalization. (TTR normalization had already been applied to this combined_wig file, so the means can be expected to be scaled to around 100.) If you want to see the NZmeans for the raw data, re-generate the combined_wig file using ‘-n nonorm’, to skip the automatic normalization step. These samples also exhibit skewness that is on the high side (33.0-105.8). This is probably related to the fact that some individual TA sites have very high counts. For example, rep2 of cholesterol has a max count of 704662 at a single TA site, representing over 5% of the 12.78M total insertion counts. TTR is supposed to be robust by ignoring the top 5% of most abundant sites during normalization, but still the rest of the distribution of counts could be skewed. This sample also has a high Pickands’ tail index of 1.53 (which is above 1.0), also suggesting skew. While, we currently don’t have recommendations for hard cutoffs to use for identifying bad samples (e.g. that might need to be re-sequenced), I would say that skew>30 and/or PTI>1.0 are signs that a sample might be noisy or lower-quality. See Quality Control for more discussion about assessing quality of TnSeq datasets. Nonetheless, doing resampling on this data still yielded insights into many genes required for cholesterol metabolism in M. tuberculosis (Griffin et al, 2009).