.. _tnseq_stats: tnseq_stats ======== This command is useful for calculating some metrics on input wig (or combined_wig files) for assessement of data quality. Similar information can be generated through the GUI via the menu options 'View->Quality Control' (select samples in main window first). Usage: ------ :: > usage: python3 src/transit.py tnseq_stats + [-o ] python3 src/transit.py tnseq_stats -c [-o ] It generates a table (tab-separated text file that can be opened in Excel) with the following statistics in it: ============= ============================================== ============================================================================================================= Column Header Column Definition Comments ============= ============================================== ============================================================================================================= dataset Name of sample (wig file) density Fraction of sites with insertions. "Well-saturated" Himar1 datasets have >30% saturation. Below this, statistical methods may have trouble. mean_ct Average read-count over all TA sites. NZMean Average read-count, excluding empty sites. A value between 30-200 is usually good for Himar1 datasets. Too high or too low can indicate problems. NZMedian Median read-count, excluding empty sites. As read-counts can often have spikes, median serves as a good robust estimate. max_ct Largest read-count at any TA site Useful to determine whether there are outliers/spikes, which may indicate sequencing issues. total_cts Sum of total read-counts in the sample. Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets. skewness 3rd-order moment of read-count distribution. Sharp peak? Large skew may indicate issues with a dataset. Typically a skew < 50 is desired. May be higher when kurtosis 4th-order moment of read-counts distribution Lop-sided peak? PTI Pickand Tail Index Another statistical measure that indicates presence of individual TA sites with high outlier counts. PTI>1.0 is not good. ============= ============================================== ============================================================================================================= Here is an example: :: > python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol_combined.dat dataset density mean_ct NZmean NZmedian max_ct total_cts skewness kurtosis pickands_tail_index src/pytransit/data/cholesterol_H37Rv_rep1.wig 0.439 139.6 317.6 147 125355.5 10414005 54.8 4237.7 0.973 src/pytransit/data/cholesterol_H37Rv_rep2.wig 0.439 171.4 390.5 148 704662.8 12786637 105.8 14216.2 1.529 src/pytransit/data/cholesterol_H37Rv_rep3.wig 0.359 173.8 484.2 171 292294.8 12968502 42.2 2328.0 1.584 src/pytransit/data/glycerol_H37Rv_rep1.wig 0.419 123.3 294.5 160 8813.3 9195672 4.0 33.0 0.184 src/pytransit/data/glycerol_H37Rv_rep2.wig 0.516 123.8 240.1 127 8542.5 9235984 4.0 33.5 0.152 In this example, you can see the 5 samples have saturations in the range of 35.9-51.6% (which is decent). The NZMeans are in the range 123-139, but this is *post-normalization*. (TTR normalization had already been applied to this combined_wig file, so the means can be expected to be scaled to around 100.) If you want to see the NZmeans for the *raw data*, re-generate the combined_wig file using '-n nonorm', to skip the automatic normalization step. These samples also exhibit *skewness* that is on the high side (33.0-105.8). This is probably related to the fact that some individual TA sites have very high counts. For example, rep2 of cholesterol has a max count of 704662 at a single TA site, representing over 5% of the 12.78M total insertion counts. TTR is supposed to be robust by ignoring the top 5% of most abundant sites during normalization, but still the rest of the distribution of counts could be skewed. This sample also has a high *Pickands' tail index* of 1.53 (which is above 1.0), also suggesting skew. While, we currently don't have recommendations for hard cutoffs to use for identifying bad samples (e.g. that might need to be re-sequenced), I would say that skew>30 and/or PTI>1.0 are signs that a sample might be noisy or lower-quality. See :ref:`Quality Control ` for more discussion about assessing quality of TnSeq datasets. Nonetheless, doing resampling on this data still yielded insights into many genes required for cholesterol metabolism in *M. tuberculosis* `(Griffin et al, 2009) `_.