Example Scripts (for Developers)
Here are two example python scripts for developers showing how to use the pytransit package.
Example 1: src/pytransit_resampling.py
This is a stand-alone script that uses the pytransit library to do resampling between two groups of wig files, analogous to running “python3 transit.py resampling…”. It is a simplified version of src/pytransit/analysis/resampling.py, which has more options.
To run this, cd into your transit/src/ directory. It prints output to stdout. Be sure to put transit/src/ in your PYTHON_PATH (e.g. using export in bash, or setenv in csh).
usage: python3 pytransit_resampling.py <comma_separated_wigs_condA> <comma_separated_wigs_condB> <prot_table>
> cd transit/src/
> python3 pytransit_resampling.py pytransit/data/glycerol_H37Rv_rep1.wig,pytransit/data/glycerol_H37Rv_rep2.wig pytransit/data/cholesterol_H37Rv_rep1.wig,pytransit/data/cholesterol_H37Rv_rep2.wig,pytransit/data/cholesterol_H37Rv_rep3.wig pytransit/genomes/H37Rv.prot_table
The important functions illustrated in this script are:
transit_tools.get_validated_data()
norm_tools.normalize_data()
stat_tools.resampling()
# transit/src/pytransit_resampling.py
import sys,numpy
import pytransit.transit_tools as transit_tools
import pytransit.tnseq_tools as tnseq_tools
import pytransit.stat_tools as stat_tools
import pytransit.norm_tools as norm_tools
from statsmodels.stats.multitest import fdrcorrection
if len(sys.argv)!=4:
print("usage: python3 pytransit_resampling.py <comma_separated_wigs_condA> <comma_separated_wigs_condB> <prot_table>")
sys.exit(0)
wigsA = sys.argv[1].split(',')
wigsB = sys.argv[2].split(',')
nA = len(wigsA)
prot_table = sys.argv[3]
(counts,sites) = transit_tools.get_validated_data(wigsA+wigsB) # data is a DxN numpy array, D=num datasets, N=num TA sites
print(counts.shape)
print("normalizing data with TTR")
(data, factors) = norm_tools.normalize_data(counts, "TTR")
# this divides the raw data (counts at all TA sites) into a smaller array of counts for each gene based on its coordinates
genes = tnseq_tools.Genes(None,prot_table,data=data, position=sites)
results,pvals = [],[]
for gene in genes:
ii = gene.reads # coordinates of TA sites in gene
if gene.n==0: continue # skip genes with 0 TA sites
data1 = gene.reads[:nA] # counts at TA sites in gene for wigs for condition A
data2 = gene.reads[nA:]
stats = stat_tools.resampling(data1, data2, adaptive=True) # use defaults for other params
(test_obs, mean1, mean2, log2FC, pval_ltail, pval_utail, pval_2tail, testlist) = stats # unpack
vals = [gene.orf,gene.name,gene.n,round(mean1,1),round(mean2,1),round(log2FC,3),pval_2tail]
results.append(vals)
# do multiple-tests correction
pvals = [x[-1] for x in results]
qvals = fdrcorrection(pvals)[1]
print('\t'.join("orf gene numTA meanA meanB LFC Pval Qval".split()))
for vals,qval in zip(results,qvals):
print('\t'.join([str(x) for x in vals+[qval]]))
print("%s/%s hits (qval<0.05)" % (sum([q<0.05 for q in qvals]),len(results)))
Example 2: src/allpairs_resampling.py
While the example above shows how to read-in and process individual wig files, this examples shows how to work with combined_wig and sample metadata files. It does resampling between each pair of conditions, and prints out a matrix of hits (statistically-signficant conditionally-essential genes).
To run this, cd into your transit/src/ directory. It prints output to stdout. Be sure to put transit/src/ in your PYTHON_PATH (e.g. using export in bash, or setenv in csh).
The important parts illustrated in this example script are:
the tnseq_tools.read_combined_wig() function
how to read the metadata file
selecting counts from the data matrix for the samples associated with each condition
import sys,numpy
import pytransit.transit_tools as transit_tools
import pytransit.tnseq_tools as tnseq_tools
import pytransit.stat_tools as stat_tools
import pytransit.norm_tools as norm_tools
from statsmodels.stats.multitest import fdrcorrection
# this is a stand-alone script that uses the pytransit library to do resampling between all pairs of conditions in a combined_wig file
# metadata file indicates which replicates belong to which conditions
# put transit/src/ in your PYTHON_PATH (e.g. using export in bash, or setenv in csh)
# prints output to stdout
if len(sys.argv)!=4:
print("usage: python3 allpairs_resampling.py <combined_wig_file> <metadata_file> <prot_table>")
sys.exit(0)
combined_wig_file = sys.argv[1]
metadata_file = sys.argv[2]
prot_table = sys.argv[3]
#################################
print("reading data")
(sites, data, filenamesInCombWig) = tnseq_tools.read_combined_wig(combined_wig_file)
print("data.shape =",data.shape)
print("normalizing using TTR")
(data, factors) = norm_tools.normalize_data(data, "TTR")
# there is a tnseq_tools.read_samples_metadata(), but it is kind of complicated
# so just read through tab-separated file and collect sample Filenames associated with Conditions
# metadata files have at least 3 columns: Id, Filename, Condition
# the columns in the combined_wig are cross-referenced by Filename (from the original wigs)
Conditions,Samples = [],{}
CondCol,FnameCol = -1,-1
for line in open(metadata_file):
w = line.rstrip().split('\t') # tab-separated
if CondCol==-1: CondCol,FnameCol = w.index("Condition"),w.index("Filename"); continue # error if headers not found
cond,fname = w[CondCol],w[FnameCol]
if cond not in Conditions: Conditions.append(cond)
if cond not in Samples: Samples[cond] = []
Samples[cond].append(fname)
print("\nConditions\t: Samples")
print("-------------------------")
for i,cond in enumerate(Conditions):
print("%s:%-8s" % (i+1,cond),"\t: ",", ".join(Samples[cond]))
#################################
genes = tnseq_tools.Genes(None,prot_table,data=data, position=sites) # divides counts at all TAs sites into groups by orf
print()
print("running resampling on each pair of conditions...")
print("reporting number of conditionally essential genes (Qval<0.05)")
for i,condA in enumerate(Conditions):
for j,condB in enumerate(Conditions):
if i<j:
pvals = []
for gene in genes:
if gene.n==0: continue # skip genes with 0 TA sites
idxA = [filenamesInCombWig.index(s) for s in Samples[condA]]
idxB = [filenamesInCombWig.index(s) for s in Samples[condB]]
data1 = gene.reads[idxA] # counts at TA sites in gene for wigs for condition A
data2 = gene.reads[idxB]
stats = stat_tools.resampling(data1, data2, adaptive=True) # use defaults for other params
(test_obs, mean1, mean2, log2FC, pval_ltail, pval_utail, pval_2tail, testlist) = stats # unpack
pvals.append(pval_2tail)
qvals = fdrcorrection(pvals)[1]
numhits = sum([q<0.05 for q in qvals])
vals = ["%s:%-8s" % (i+1,condA),"%s:%-8s" % (j+1,condB),numhits]
print("\t".join([str(x) for x in vals]))
Here is the output of this script for data from growth of M. tuberculosis H37Rv on media containing iron supplied by different vehicles (e.g. mycobactin, carboxymycobactin, hemin, hemoglobin…), which requires genes in different pathways for uptake (Zhang et al., 2020). The raw data (wig files, with insertion counts at TA sites) have been combined into a combined_wig file and a metatdata file that describes which samples belong to which condition. These files (iron_combined_wig4.txt and iron_samples_metadata.txt) can be found in the transit data directory, transit/src/pytransit/data/. You can also compare this to the heatmap shown on the page for corrplot.
> cd transit/src/
> python3 allpairs_resampling.py pytransit/data/iron_combined_wig4.txt pytransit/data/iron_samples_metadata.txt pytransit/genomes/H37Rv.prot_table
reading data
data.shape = (14, 74605)
normalizing using TTR
Conditions : Samples
-------------------------
1:HighFeMBT : HighFeMBT.wig, HighFeMBT2.wig, HighFeMBT3.wig
2:LowFeMBT : LowFeMBT.wig, LowFeMBT2.wig
3:FeCMBT : FeCMBT.wig, FeCMBT2b.wig
4:Hemin : Hemin.wig, Hemin2b.wig, Hemin3b.wig
5:Hemoglobin : Hemoglobin.wig, Hemoglobin2b.wig
6:HeminMBT : HeminMBT.wig, HeminMBT2.wig
running resampling on each pair of conditions...
reporting number of conditionally essential genes (Qval<0.05)
1:HighFeMBT 2:LowFeMBT 38
1:HighFeMBT 3:FeCMBT 10
1:HighFeMBT 4:Hemin 136
1:HighFeMBT 5:Hemoglobin 134
1:HighFeMBT 6:HeminMBT 30
2:LowFeMBT 3:FeCMBT 35
2:LowFeMBT 4:Hemin 70
2:LowFeMBT 5:Hemoglobin 71
2:LowFeMBT 6:HeminMBT 5
3:FeCMBT 4:Hemin 104
3:FeCMBT 5:Hemoglobin 106
3:FeCMBT 6:HeminMBT 44
4:Hemin 5:Hemoglobin 4
4:Hemin 6:HeminMBT 77
5:Hemoglobin 6:HeminMBT 89
Note
Note, in allpairs_resampling.py, the FDR correction to adjust P-values for testing all genes in parallel is applied within each pairwise comparison. This correction is then repeated independently for each pair of conditions analyzed. Formally, it would be more rigorous to apply the FDR correction one time at the end, to adjust P-values over all pairs and over all genes (which would be G*N*(N-1)/2 probabilities, where G in the number of genes in the genome, and N is the number of conditions). This would likely further reduce the number of significant conditionally-essential genes. But for simplicity, this example script does not do that, because it is just designed to illustrate using the pytransit package.