Implemented by: Jun Ding
Date: 03/02/2016 

How to extend TarPmiR
This extension directory explains how to extend TarPmiR to new miRNA-mRNA binding experiment data or to new sets of features.
TarPmiR is implemented in an object-oriented style, so users can easily extend the 'binding' class for different purposes.
This 'How2Extend' document describes the 'binding' class. It also provides a concrete example demonstrating how to extend TarPmiR.
============================================================================================================================================================

1. 'binding' class in TarPmiR

The following is the description of the 'binding' class.

-----------------------------------------------------------
class binding:
	def __init__(self,mir,m,pos):
		self.mir=mir                                 # miRNA class
		self.m=m                                     # mRNA class 
		self.p=pos                                   # binding site, e.g. 1024,1045; the potential binding sites can be obtained using the 'Interactions' class.

		# features of each miRNA-mRNA binding site
		[self.en,self.sp,self.bb,self.bm]=self.mfe() # self.en : energy, self.sp: # of pairings in the seed-region, self.bb: base-pairings in the miRNA, self.bm: base-pairings in the mRNA. 
		self.seed=self.hasSeed()                     # self.seed: 1/0, whether there's a seed 
		self.au=self.AU_content()                    # self.au: AU-content score
		[self.nc,self.pnc]=self.consecutive_pairs()  # self.nc: the largest number of consecutive pairs, self.pnc: position of the largest consecutive pairs.
		self.me=self.me_motif()                      # self.me: m/e motif score.
		self.ac=self.acc()                           # self.ac: accessibility score for the given binding site (position).
		self.np=self.nbp()                           # self.np: total number of base-pairings for the given binding site.
		self.bl=self.bl()                            # self.bl: total binding length (nts.) in the mRNA sequence for given binding site
		[self.pe,self.dpse]=self.prThreeEnd()        # self.pe: total number of base-pairing in the 3'end region, self.dpse: difference of # of base-pairings between the seed region and 3'end region.
		[self.phys,self.phyf]=self.cv()              # self.phys: average phyloP score in the stem region of the given binding site, self.phyf: average phyloP score in the flanking region of the given binding site.
		
	def getFeatures(self)          # Return all features calculated in 'binding' class
		
	def mfe(self)                  # return energy
	
	def hasSeed(self)              # return seed
		
	def AU_content(self)           # return AU content
		
	def consecutive_pairs(self)    # return the length and position of the longest consecutive pairs
		
	def me_motif(self)             # return m/e score
		
	def acc(self)                  # return accessibility score for the given binding site

	def nbp(self)                  # return total number of base-pairings for given binding site
		
	def bl(self)                   # return the binding length in mRNA for given binding site
		
	
	def prThreeEnd(self)           # return number of base-pairings in the 3'end region; and the difference between the seed region and 3'end region
		
	def cv(self)                   # return the PhyloP score in stem region and flanking region for given binding site

-----------------------------------------------------------


(1) mir : an instance of the 'miRNA' class
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class miRNA:
	def __init__(self,mi,mi_seq):
		self.name=mi # miRNA ID
		self.seq=mi_seq # miRNA sequence

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<	
For example:
A=miRNA('miR-20a', 'TAAAGTGCTTATAGTGCAGGTAG')
Note that the sequence is written with 'T' instead of 'U'.
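
If your miRNA sequences are written in the RNA alphabet, they can be converted before building the 'miRNA' object. The following is only a minimal sketch; the helper name 'to_dna_alphabet' is hypothetical and is not part of TarPmiR.

```python
def to_dna_alphabet(seq):
    """Upper-case a sequence and write uracil ('U') as 'T'."""
    return seq.strip().upper().replace('U', 'T')


# miR-20a written with 'u', converted to the 'T' form used in this document
mir_seq = to_dna_alphabet('uaaagugcuuauagugcagguag')
print(mir_seq)
```

The converted string can then be passed as the second argument of miRNA(...).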

--------------------------------------------------------------------------------------------
(2) m : an instance of the 'mRNA' class

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class mRNA:		
	def __init__(self,m,seq):
		self.name=m
		self.seq=seq
		self.acc=self.accessibility_plfold() 
		self.phy=self.phyloP()
		
	def accessibility_plfold(self)	# function, used to calculate the accessibility score
 
	def phyloP(self)                # function, used to calculate the phyloP score

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The potential binding sites (the positions used to instantiate the 'binding' class) are given by the 'Interactions' class.

(3) Interactions class
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class Interactions:
	def __init__(self,mir,m):
		self.mir=mir          # miRNA class
		self.m=m	      # mRNA class
 
	def cand_site(self)           # return the potential exact binding sites between given miRNA and mRNA. 
		
		
	def seed_site(self)           # return the seed_site for given miRNA and mRNA
		
				
	def energy_site(self)         # return the energy_site for given miRNA and mRNA

==============================================================================================================================================


2. An example showing how to use the 'binding' class to add/remove/calculate features and predict miRNA-mRNA binding sites

The detailed example can be found in 'extend_TarPmiR.py'. In this example:
A. We added one more feature (an artificial feature, which may not be useful) that calculates an AU percentage.
Users can add or remove features in the same way as we did in 'extend_TarPmiR.py'.

Search for "!UPDATE THIS SECTION!" in 'extend_TarPmiR.py' to locate the description of how to add a feature.


The following is a general overview:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
class ebinding(binding):                                 # extend the binding class 
	#--------------------------------------------------------------------------------
	def add_feature1(self):                          # define how to compute the new feature
		seq=self.mir.seq                         # sequences are written with 'T' instead of 'U', so count both
		per=float(seq.count('A')+seq.count('U')+seq.count('T'))/len(seq)
		return per
		
	def UpdatedFeatures(self):                      # define how to update the feature list. You can add or remove features in this function
		NewFeatureList=[]
		nf=self.add_feature1() # Add new features in the featureList
		NewFeatureList.append(nf)
		#----------------------------------------------
		F=self.getFeatures()
		NF=F+NewFeatureList
		return NF

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
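
The same pattern can also be used to remove features or to restrict a feature to the seed region. The following is only a sketch: the 'binding' class here is a small stand-in (NOT the real TarPmiR class), the seed is assumed to be miRNA positions 2-8, and the names 'seed_au_fraction' and 'drop' are hypothetical.

```python
class binding(object):
    """Stand-in for TarPmiR's 'binding' class, used only to make this sketch runnable."""
    def __init__(self, mir_seq, features):
        self.mir_seq = mir_seq            # hypothetical: plain miRNA sequence string
        self._features = list(features)   # hypothetical: precomputed original features

    def getFeatures(self):
        return list(self._features)


class ebinding(binding):
    def seed_au_fraction(self):
        # Assumption: the seed is taken as miRNA positions 2-8 (1-based).
        seed = self.mir_seq[1:8].upper()
        # Count 'T' as well, because sequences are written with 'T' instead of 'U'.
        return sum(seed.count(b) for b in 'AUT') / float(len(seed))

    def UpdatedFeatures(self, drop=()):
        # drop: indices of original features to leave out of the new list
        kept = [f for i, f in enumerate(self.getFeatures()) if i not in drop]
        return kept + [self.seed_au_fraction()]


b = ebinding('TAAAGTGCTTATAGTGCAGGTAG', [-27.5, 1, 0.676])
print(b.UpdatedFeatures())            # original features plus the new one
print(b.UpdatedFeatures(drop=(1,)))   # same, with the feature at index 1 removed
```

Whatever list UpdatedFeatures() returns, the trained model must have been built on exactly the same features in the same order.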


After you extend the 'binding' class, you can use it to calculate all features easily. Then, you can use a trained model to predict on the feature list given by the extended class.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
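bijk=ebinding(mir,m,k)  # instantiate the extended 'binding' class for candidate site k
F=bijk.UpdatedFeatures() # updated feature list
pb=RR.predict_proba([F])[0][1] # predict using the trained model RR; F is wrapped in a list because recent scikit-learn versions expect a 2-D input
if pb>pb_cut:                # output if it's larger than the given cutoff
	pass                 # output bijk here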

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
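
The prediction step can be sketched for several candidate sites at once. This is only an illustration under stated assumptions: 'StubModel' is a toy stand-in for the trained model (it is NOT the TarPmiR model), 'filter_sites' is a hypothetical helper, and sites are assumed to be (start, end) pairs.

```python
class StubModel(object):
    """Toy stand-in for a trained classifier exposing predict_proba."""
    def predict_proba(self, rows):
        # Returns one [P(class 0), P(class 1)] pair per input row.
        # Toy rule: call a site positive when feature 0 (energy) is below -20.
        out = []
        for row in rows:
            p1 = 0.9 if row[0] < -20.0 else 0.2
            out.append([1.0 - p1, p1])
        return out


def filter_sites(model, site_features, pb_cut):
    """Return (site, probability) pairs whose positive-class probability exceeds pb_cut."""
    kept = []
    for site, feats in site_features:
        pb = model.predict_proba([feats])[0][1]   # one row of a 2-D input
        if pb > pb_cut:
            kept.append((site, pb))
    return kept


candidates = [((1024, 1045), [-27.5, 1, 0.676]),
              ((2210, 2231), [-16.8, 0, 0.382])]
hits = filter_sites(StubModel(), candidates, pb_cut=0.5)
print(hits)
```

In real use, the model would be the one loaded from your own trained model file, and the feature lists would come from UpdatedFeatures().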


An implementation example can be found in 'extend_TarPmiR.py'.

===========================================================================================================================================================

3. How to get an updated trained model

If you add/remove/modify features, you will also need an updated trained model to do the prediction.
You can't use the provided 'Human.pkl' model, as it was trained on the original feature list and the CLASH dataset.
If you have a new data set or a new feature set, use the tool 'build_model.py' in the extension tools to rebuild the model.

To rebuild the trained model, you will also need a training data set. Its format is shown in 'training_dataset_example' under the 'extension_tools' directory:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

1	-27.5	1	0.1007376	0.676	0.507363636364	0.29915	-3.9375744462	20	24	13	1	6	2	0.676
1	-25.5	1	0.05264692	0.603	4.16411111111	2.331475	-5.18008091453	17	20	9	1	4	4	0.603
1	-23.8	1	0.03623946	0.574	4.53307692308	2.433025	-6.10695976654	19	22	13	8	7	1	0.574
1	-23.4	1	0.02417334	0.647	4.17341666667	2.826225	-8.55010550196	19	22	12	9	8	1	0.647
1	-23.9	1	0.01739953	0.559	1.71728571429	1.9150875	-4.76182347791	17	22	13	1	4	4	0.559
......
0	-20.0	1	0.0005203946	0.456	3.39796551724	3.5192875	-4.90607308676	19	37	9	21	5	3	0.456
0	-16.8	0	6.75476e-07	0.382	3.2555	2.2803375	-11.0639087645	16	21	6	15	6	1	0.382
0	-18.6	0	7.518154e-05	0.559	1.43833333333	1.9252375	-8.30603545648	18	24	12	11	6	1	0.559
0	-17.2	1	3.30903e-09	0.25	-0.07825	0.2025	-6.34000884684	18	23	5	18	6	0	0.25
0	-18.0	1	0.02086142	0.706	1.407	1.1798	-6.1995071877	16	19	14	1	5	1	0.706

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 

1st col: label (1 = true binding site, 0 = false binding site)
2nd to last col: all features
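
Before rebuilding a model, it can help to check that every row of the training file has the same number of features. The following is a minimal sketch, assuming whitespace/tab-separated columns as in the example above; 'parse_training_lines' is a hypothetical helper and is not part of 'build_model.py'.

```python
def parse_training_lines(lines):
    """Split training rows into labels (1st column) and feature lists (remaining columns)."""
    labels, rows = [], []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] not in ('0', '1'):
            continue  # skip blank lines and anything that is not a data row
        labels.append(int(fields[0]))
        rows.append([float(x) for x in fields[1:]])
    widths = set(len(r) for r in rows)
    if len(widths) > 1:
        raise ValueError('rows have different numbers of features: %s' % sorted(widths))
    return labels, rows


# Two rows copied from the example above
sample = [
    "1\t-27.5\t1\t0.1007376\t0.676\t0.507363636364\t0.29915\t-3.9375744462\t20\t24\t13\t1\t6\t2\t0.676",
    "0\t-16.8\t0\t6.75476e-07\t0.382\t3.2555\t2.2803375\t-11.0639087645\t16\t21\t6\t15\t6\t1\t0.382",
]
labels, rows = parse_training_lines(sample)
print(labels)
print(len(rows[0]))   # number of features per row
```

For a real file, pass the open file object instead of 'sample'. The number of features per row must match the length of the list returned by UpdatedFeatures().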


You can use 'build_model.py' to train a model on your training data set.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
build_model.py <training_dataset_sample>

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


4. A step-by-step example illustrating how to apply TarPmiR to new experiment data with an updated feature list.

(1) Get the training data set ready.
Prepare your training data set from your experiment data.
As an example, see 'training_dataset_example' under the 'extension_tools' directory.

(2) Build Random Forest model

build_model.py <training_dataset_example>

(3) Update the feature list by extending TarPmiR.py into extend_TarPmiR.py

(4) Run extend_TarPmiR.py to get the final predictions


<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
extend_TarPmiR.py -a test/test_miR.txt -b test/test_mRNA.txt -m extension_tools/Forest.pkl -p 0.5

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<