Implemented: Jun Ding
Date: 03/02/2016

How to extend TarPmiR (TarPmiR_extend)

This extension directory explains how to extend TarPmiR to new miRNA-mRNA binding experiment data or to new sets of features.
TarPmiR is implemented in an object-oriented style, so users can easily extend the 'binding' class for different purposes.
This 'How2Extend' document describes the 'binding' class. It also provides a concrete example demonstrating how to extend TarPmiR.

====================================================================================================
1. 'binding' class in TarPmiR

The following is the description of the 'binding' class.
-----------------------------------------------------------
class binding:
    def __init__(self,mir,m,pos):
        self.mir=mir    # miRNA class
        self.m=m        # mRNA class
        self.p=pos      # binding site, e.g. 1024,1045; the potential binding sites can be obtained using the 'Interactions' class.

        # features of each miRNA-mRNA binding site
        # self.en: energy, self.sp: # of pairings in the seed region,
        # self.bb: base-pairings in the miRNA, self.bm: base-pairings in the mRNA.
        [self.en,self.sp,self.bb,self.bm]=self.mfe()
        self.seed=self.hasSeed()    # self.seed: 1/0, whether there is a seed
        self.au=self.AU_content()   # self.au: AU-content score
        # self.nc: the largest number of consecutive pairs, self.pnc: position of the longest run of consecutive pairs.
        [self.nc,self.pnc]=self.consecutive_pairs()
        self.me=self.me_motif()     # self.me: m/e motif score
        self.ac=self.acc()          # self.ac: accessibility score for the given binding site (position)
        self.np=self.nbp()          # self.np: total number of base-pairings for the given binding site
        self.bl=self.bl()           # self.bl: total binding length (nts) in the mRNA sequence for the given binding site
        # self.pe: total number of base-pairings in the 3' end region,
        # self.dpse: difference in # of base-pairings between the seed region and the 3' end region.
        [self.pe,self.dpse]=self.prThreeEnd()
        # self.phys: average phyloP score in the stem region of the given binding site,
        # self.phyf: average phyloP score in the flanking region of the given binding site.
        [self.phys,self.phyf]=self.cv()

    def getFeatures(self):          # return all features calculated in the 'binding' class
    def mfe(self):                  # return the energy and the pairing counts (en, sp, bb, bm)
    def hasSeed(self):              # return seed (1/0)
    def AU_content(self):           # return AU content
    def consecutive_pairs(self):    # return the length and position of the longest run of consecutive pairs
    def me_motif(self):             # return m/e score
    def acc(self):                  # return accessibility score for the given binding site
    def nbp(self):                  # return total number of base-pairings for the given binding site
    def bl(self):                   # return the binding length in the mRNA for the given binding site
    def prThreeEnd(self):           # return the number of base-pairings in the 3' end region, and the difference between the seed region and the 3' end region
    def cv(self):                   # return the phyloP scores in the stem region and the flanking region for the given binding site
-----------------------------------------------------------

(1) mir : miRNA class

'miRNA' class
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class miRNA:
    def __init__(self,mi,mi_seq):
        self.name=mi        # miRNA ID
        self.seq=mi_seq     # miRNA sequence
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

For example: A=miRNA('miR-20a', 'TAAAGTGCTTATAGTGCAGGTAG').
Note that 'T' is used here instead of 'U'.
-----------------------------------------------------------

(2) m : mRNA class

'mRNA' class
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class mRNA:
    def __init__(self,m,seq):
        self.name=m         # mRNA ID
        self.seq=seq        # mRNA sequence
        self.acc=self.accessibility_plfold()
        self.phy=self.phyloP()

    def accessibility_plfold(self):     # function, used to calculate the accessibility score
    def phyloP(self):                   # function, used to calculate the phyloP score
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The potential binding sites (the positions used to instantiate the 'binding' class) are given by the 'Interactions' class.

(3) Interactions class
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class Interactions:
    def __init__(self,mir,m):
        self.mir=mir    # miRNA class
        self.m=m        # mRNA class

    def cand_site(self):    # return the potential exact binding sites between the given miRNA and mRNA
    def seed_site(self):    # return the seed_site for the given miRNA and mRNA
    def energy_site(self):  # return the energy_site for the given miRNA and mRNA
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

====================================================================================================
2. An example showing how to use the 'binding' class to add/remove/calculate features and predict miRNA-mRNA binding sites

The detailed example can be found in 'extend_TarPmiR.py'. We provide one example:
A. We added one more feature (an artificial feature, which may not be useful) that calculates the AU percentage in the seed region.
Users can add/remove features in the same way as we did in 'extend_TarPmiR.py'.
Search for "!UPDATE THIS SECTION!" in 'extend_TarPmiR.py' to locate the description of how to add a feature.
The following is the general overview:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class ebinding(binding):    # extend the binding class
    #--------------------------------------------------------------------------------
    def add_feature1(self):     # define how to add a feature
        # A/U fraction; 'T' is counted too because sequences are written with 'T' instead of 'U'.
        # This overview computes it over the whole miRNA sequence.
        seq=self.mir.seq
        per=float(seq.count('A')+seq.count('T')+seq.count('U'))/len(seq)
        return per

    def UpdatedFeatures(self):  # define how to update the feature list. You can add or remove features in this function
        NewFeatureList=[]
        nf=self.add_feature1()      # add new features to the feature list
        NewFeatureList.append(nf)
        #----------------------------------------------
        F=self.getFeatures()
        NF=F+NewFeatureList
        return NF
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

After you extend the 'binding' class, you can use it to calculate all features easily.
Then you can use a trained model to predict from the feature list returned by the extended class.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
bijk=ebinding(mir,m,k)              # extended 'binding' class for miRNA mir, mRNA m and site k
F=bijk.UpdatedFeatures()            # updated features
pb=RR.predict_proba([F])[0][1]      # predict using the trained model RR (one row of features)
if pb>pb_cut:                       # output if it is larger than the given cutoff
    print(mir.name,m.name,k,pb)     # output bijk, e.g. the IDs, the site and the probability
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
An implementation example can be found in 'extend_TarPmiR.py'.

====================================================================================================
3. How to get an updated trained model

If you add/remove/modify features, you will also need an updated trained model to do the prediction.
You cannot use the provided 'Human.pkl' model, as it was trained on the original feature list and the CLASH dataset.
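As a minimal sketch of why the model has to be rebuilt, the snippet below uses a stand-in 'binding' class. The stand-in is hypothetical: it only stores placeholder feature values instead of computing them as TarPmiR does, and its constructor arguments are an assumption made for this illustration. It shows that the extended class returns one more value per site than the original class, so the feature list no longer matches what the original model was trained on.

```python
class miRNA:
    def __init__(self, mi, mi_seq):
        self.name = mi      # miRNA ID
        self.seq = mi_seq   # miRNA sequence ('T' is used instead of 'U')


class binding:
    """Stand-in for TarPmiR's 'binding' class (hypothetical, for illustration only)."""

    def __init__(self, mir, original_features):
        self.mir = mir
        self._features = list(original_features)  # placeholder values, not real features

    def getFeatures(self):
        return list(self._features)


class ebinding(binding):
    def add_feature1(self):
        # A/U fraction of the miRNA; 'T' is counted because of the T-for-U convention
        seq = self.mir.seq.upper()
        return (seq.count('A') + seq.count('T') + seq.count('U')) / len(seq)

    def UpdatedFeatures(self):
        # original features followed by the new one
        return self.getFeatures() + [self.add_feature1()]


mir = miRNA('miR-20a', 'TAAAGTGCTTATAGTGCAGGTAG')
site = ebinding(mir, [-27.5, 1, 0.676])   # three placeholder feature values
old = site.getFeatures()
new = site.UpdatedFeatures()
print(len(old), len(new))   # the extended list is one element longer
```

A model trained on rows with len(old) features cannot score rows with len(new) features, which is the reason for retraining after any change to the feature list.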
If you have a new data set and new feature sets, you should use the tool 'build_model.py' in the extension tools to re-build the model.
To re-build the trained model, you will also need a training data set. Its format can be found in the 'training_dataset_example' under the 'extension_tools' directory.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1 -27.5 1 0.1007376 0.676 0.507363636364 0.29915 -3.9375744462 20 24 13 1 6 2 0.676
1 -25.5 1 0.05264692 0.603 4.16411111111 2.331475 -5.18008091453 17 20 9 1 4 4 0.603
1 -23.8 1 0.03623946 0.574 4.53307692308 2.433025 -6.10695976654 19 22 13 8 7 1 0.574
1 -23.4 1 0.02417334 0.647 4.17341666667 2.826225 -8.55010550196 19 22 12 9 8 1 0.647
1 -23.9 1 0.01739953 0.559 1.71728571429 1.9150875 -4.76182347791 17 22 13 1 4 4 0.559
......
0 -20.0 1 0.0005203946 0.456 3.39796551724 3.5192875 -4.90607308676 19 37 9 21 5 3 0.456
0 -16.8 0 6.75476e-07 0.382 3.2555 2.2803375 -11.0639087645 16 21 6 15 6 1 0.382
0 -18.6 0 7.518154e-05 0.559 1.43833333333 1.9252375 -8.30603545648 18 24 12 11 6 1 0.559
0 -17.2 1 3.30903e-09 0.25 -0.07825 0.2025 -6.34000884684 18 23 5 18 6 0 0.25
0 -18.0 1 0.02086142 0.706 1.407 1.1798 -6.1995071877 16 19 14 1 5 1 0.706
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1st col: True/False binding site (1/0)
2nd to last col: all features

You can use 'build_model.py' to get the trained model on your training data set.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
build_model.py
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

====================================================================================================
4. A step-by-step example illustrating how to apply TarPmiR to new experiment data with an updated feature list

(1) Get the training data set ready.
    Prepare your training data set based on your experiment data. As an example, please see 'training_dataset_example' under the 'extension_tools' directory.
(2) Build the Random Forest model.
    build_model.py
(3) Update the feature list by extending TarPmiR.py -> extend_TarPmiR.py
(4) Run extend_TarPmiR.py to get the final predictions.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
extend_TarPmiR.py -a test/test_miR.txt -b test/test_mRNA.txt -m extension_tools/Forest.pkl -p 0.5
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
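
Before building a model, it can help to check that every row of the training data set has the label in the first column and the same number of features after it. The following is only a sketch of such a check, not part of TarPmiR: the function name load_training_set is hypothetical, and the code assumes whitespace-separated columns with 0/1 labels, as in the example rows shown in this document.

```python
def load_training_set(text):
    """Split training rows into labels (1st column) and feature lists (remaining columns)."""
    labels, features = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        label = int(fields[0])
        if label not in (0, 1):
            raise ValueError("line %d: label must be 0 or 1" % line_no)
        row = [float(x) for x in fields[1:]]
        # every row must have as many features as the first row
        if features and len(row) != len(features[0]):
            raise ValueError("line %d: expected %d features, found %d"
                             % (line_no, len(features[0]), len(row)))
        labels.append(label)
        features.append(row)
    return labels, features


# Two rows copied from the training data example in this document
example = """\
1 -27.5 1 0.1007376 0.676 0.507363636364 0.29915 -3.9375744462 20 24 13 1 6 2 0.676
0 -16.8 0 6.75476e-07 0.382 3.2555 2.2803375 -11.0639087645 16 21 6 15 6 1 0.382
"""

labels, features = load_training_set(example)
print(labels, len(features[0]))   # prints: [1, 0] 14
```

The number of features per row reported here should equal the length of the list returned by UpdatedFeatures() in the extended class; otherwise the trained model and the predictions will not agree on the feature list.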