Project: Computational Analysis of microRNA Binding
The project aims to develop novel computational methods and tools to study microRNA binding interactions and the role of microRNAs in gene regulation. MicroRNAs are small (~22 nucleotide) non-coding RNAs that regulate genes involved in key aspects of animal development and physiology through binding interactions with their mRNA targets. Since the first discovery of microRNAs in C. elegans in 1993, a large number of microRNAs have been discovered in metazoans, plants and viruses. Today, microRNAs are known to be expressed ubiquitously in almost all cell types, to be evolutionarily conserved in most metazoan and plant species, and potentially to regulate more than 30% of mammalian gene products. Understanding the regulatory functions of microRNAs in fundamental biological processes is thus essential to gaining a global view of gene regulation, but this understanding is still at an early stage despite the rapid advances in microRNA biology.
Major Methods and Tools
*miRModule*
	*miRModule* is a software tool for the systematic discovery of miRNA modules from a set of predefined miRNA target sites. Given a set of miRNA binding sites, miRModule efficiently identifies groups of miRNAs whose binding sites significantly co-occur in the same set of target mRNAs, and reports them as putative miRNA modules. It works for both experimentally determined miRNA-mRNA binding sites (e.g. from CLASH) and computationally predicted miRNA-mRNA binding sites (e.g. from miRanda); the only required input is the miRNA-mRNA binding information. We provide both Linux and Windows versions of the miRModule software. *Download*
			The pipeline to predict miRNA modules. (A) MiRNA-mRNA interaction data from CLASH. Each line represents a target mRNA and each box represents a miRNA target site, with different shapes representing different miRNAs. (B) Identify miRNA groups whose target sites frequently co-occur in common mRNAs. (C) Identify miRNA module candidates by binomial tests. (D) Predict miRNA modules based on hypergeometric tests.
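The co-occurrence idea behind steps (B) to (D) can be illustrated with a small hypergeometric test. This is a minimal sketch, not the miRModule implementation: the function names, the use of plain Python sets of mRNA identifiers, and the choice to test a single pair of miRNAs are all assumptions made for illustration. The binomial test in step (C) is not shown.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N items, K marked, n drawn)."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / total


def cooccurrence_pvalue(targets_a, targets_b, all_mrnas):
    """Test whether two miRNAs share more target mRNAs than expected by chance.

    targets_a, targets_b: sets of mRNA ids bound by miRNA a and miRNA b
    all_mrnas: set of all mRNA ids in the dataset (hypothetical background)
    """
    N = len(all_mrnas)
    K = len(targets_a)
    n = len(targets_b)
    k = len(targets_a & targets_b)  # mRNAs bound by both miRNAs
    return k, hypergeom_sf(k, N, K, n)


# Toy example: 10 mRNAs, two miRNAs that share 3 of their 4 targets each.
background = set(range(10))
shared, p_value = cooccurrence_pvalue({0, 1, 2, 3}, {0, 1, 2, 9}, background)
print(shared, p_value)
```

A small p-value suggests that the two miRNAs bind common mRNAs more often than chance would explain; in practice the p-values would also need correction for the many miRNA groups tested.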
	We studied miRNA modules based on experimentally determined miRNA target sites. We predicted 181 miRNA modules and 306 potential miRNA modules. We demonstrated that miRNA modules preferred to bind weak sites and favoured a combination of all unconventional sites. We also observed that miRNA modules preferred to bind in CDSs and favoured the first and the last exons. We confirmed that more than 70% of miRNA modules bound sites within specific ranges, with enrichment in two previously known ranges. However, many more adjacent sites bound by miRNA modules were >130 nucleotides apart. We further showed that unconventional target sites of miRNA modules were often within shorter distances than other combinations of target sites. Our study shed new light on miRNA binding.
	The majority of adjacent target sites of miRNA modules were >130 nucleotides apart, which contradicted previous observations (Brennecke et al., 2005; Doench and Sharp, 2004; Kloosterman et al., 2004; Saetrom et al., 2007; Vella et al., 2004). To understand what caused the different observations, we focused on target sites of the 181 miRNA modules in 3′ UTRs. We found that even when we considered only target sites in 3′ UTRs, more than 75% of adjacent target sites of miRNA modules were >130 nucleotides apart. We also predicted miRNA module candidates using only the 6096 CLASH target sites in 3′ UTRs and then studied the distances of adjacent target sites of these candidates. We still observed that the majority of adjacent target sites of these candidates were >130 nucleotides apart (Supplementary File S4). Therefore, the different observations were unlikely to arise from our use of target sites in entire mRNA regions while previous studies used only target sites in 3′ UTRs. Instead, they may be due to the small number of experimentally determined sites in previous experimental studies and the limited quality of predicted sites in the previous computational study, compared with the 18 514 high-quality experimentally determined sites we used.
	We predicted (potential) miRNA modules on the condition that they downregulated target genes significantly more than some of their miRNA subsets. We further checked whether these (potential) modules downregulated their target genes significantly more than any subset contained in the modules. We confirmed that for all (potential) miRNA modules, their target genes were significantly more downregulated than the target genes of any of their subsets. We also discovered 201 non-synergistic modules. The non-synergistic modules may also play important roles in regulating target genes, as supported by GO and pathway analyses, order preference, and the literature. Moreover, these non-synergistic modules may be competitive miRNA modules that are worth further investigation (Khan et al., 2009).
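The adjacent-site distance analysis discussed above can be sketched as follows. This is an illustrative reading only: the input layout (a dict from mRNA id to site start positions) and the use of start-to-start distances are assumptions, since this page does not state how the distance between two sites was measured.

```python
def adjacent_site_distances(site_positions):
    """Distances (in nt) between neighbouring target sites on one mRNA."""
    ordered = sorted(site_positions)
    return [right - left for left, right in zip(ordered, ordered[1:])]


def fraction_far_apart(sites_by_mrna, threshold=130):
    """Fraction of adjacent site pairs that are more than `threshold` nt apart."""
    distances = [
        d
        for sites in sites_by_mrna.values()
        for d in adjacent_site_distances(sites)
    ]
    if not distances:
        return 0.0
    return sum(d > threshold for d in distances) / len(distances)


# Hypothetical site positions for the miRNAs of one module on two mRNAs.
toy_sites = {"mRNA1": [100, 150, 400], "mRNA2": [20, 500]}
print(fraction_far_apart(toy_sites))
```

The 130-nucleotide threshold is the one quoted in the text; it is exposed as a parameter so that other ranges can be examined in the same way.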

	*TarPmiR* is a software tool for predicting miRNA target sites from CLASH (cross-linking, ligation and sequencing of hybrids) data.
	The identification of microRNA (miRNA) target sites is fundamentally important for studying gene regulation. Dozens of computational methods are available for miRNA target site prediction, yet we still cannot reliably identify miRNA target sites, partially due to our limited understanding of their characteristics. The recently published CLASH data provide an unprecedented opportunity to study the characteristics of miRNA target sites and improve miRNA target site prediction methods. Applying four different machine learning approaches to the CLASH data, we identified seven new features of miRNA target sites. Combining these new features with those commonly used by existing miRNA target prediction algorithms, we developed an approach called TarPmiR for miRNA target site prediction. Testing on two human and one mouse non-CLASH datasets, we showed that TarPmiR predicted more than 74.2% of true miRNA target sites in each dataset. Compared with three existing approaches, TarPmiR achieved both better recall and better precision. Although TarPmiR is based on the published CLASH data, users can easily apply TarPmiR to any new dataset by extending the 'binding' class. Please check 'How to extend TarPmiR' for more details. Download
	We identified seven new features together with six conventional features of miRNA target sites. Based on these 13 selected features, we developed a new approach called TarPmiR to predict miRNA target sites. We tested TarPmiR on a human CLASH dataset, two human PAR-CLIP datasets, a mouse HITS-CLIP dataset and a general dataset from TarBase 7.0, and showed that TarPmiR performed as well as or better than three existing approaches.
	Not all new features were completely new. We claimed some features as new because they were not used by most of the existing tools, such as miRanda (Enright et al., 2004), TargetScan (Friedman et al., 2009; Grimson et al., 2007), DIANA-microT-CDS (Maragkakis et al., 2009; Paraskevopoulou et al., 2013), rna22-gui (Loher and Rigoutsos, 2012), TargetMiner (Bandyopadhyay and Mitra, 2009), PITA (Kertesz et al., 2007) and RNAhybrid (Krüger and Rehmsmeier, 2006). However, several new features were mentioned in previous studies directly or indirectly. For instance, Thomson et al. (2011) stated that ‘some validated miRNA target sites do not have a complete seed match but instead exhibit 11–12 continuous base pairs in the central region of the miRNA’. We observed similar target sites in the CLASH dataset and proposed the feature ‘The length and position of the longest consecutive pairs’.
	The selected new features significantly improved the prediction accuracy of TarPmiR. To show the contribution of the new features to the accuracy of TarPmiR, we removed the seven new features and retrained the random forests in TarPmiR. Compared with the original TarPmiR with 13 features, the recall and precision of the modified TarPmiR dropped by 8.6% and 9.7%, respectively. We also compared the true target sites predicted by different approaches (Supplementary File S4). TarPmiR had the largest number of predicted true sites shared by other tools. However, the percentage of shared true target sites predicted by TarPmiR was lower than that of other tools, suggesting that TarPmiR complements existing tools by predicting sites that cannot be predicted by other tools. In fact, there are 2090 ‘non-seed-matching’ sites in the first CLASH test dataset. TarPmiR was able to identify 1585 (75.8%) of those sites, whereas miRanda and TargetScan were only able to predict 173 (8.28%) and 34 (1.6%) sites, respectively. This also suggests that traditional tools such as TargetScan and miRanda can hardly predict non-seed-matching binding sites.
	It is also worth mentioning that CLASH experiments may pick up both direct and indirect miRNA target sites. The Argonaute proteins are guided by miRNAs to bind mRNAs, which is referred to as miRNA-dependent recruitment and results in direct miRNA target sites. There is also a miRNA-independent Argonaute protein recruitment mechanism, in which Argonaute proteins are recruited to target mRNAs by protein–protein interaction with RNA-binding proteins and thus miRNAs do not interact with the mRNAs directly (Meister, 2013). In the future, one may want to distinguish these two types of target sites in the CLASH experiments before training predictors for target site prediction. In this way, we may also obtain better features and improve the prediction accuracy. Because of the existence of indirect target sites in CLASH data, the recall of TarPmiR on the CLASH testing datasets may be underestimated. In fact, TarPmiR had a much higher recall on the three independent human and mouse datasets, suggesting that TarPmiR may have a recall larger than 74%. On the other hand, TarPmiR had a much lower precision on the independent datasets, which may be underestimated as well. This was because we treated all segments other than the CCRs or identified miRNA target sites in these independent datasets as true negative target sites, which may not be the case.
	At the time of this study, only one CLASH dataset was publicly available (Helwak et al., 2013). This human CLASH dataset was used to train TarPmiR. We applied TarPmiR to human and mouse datasets and demonstrated that it works well on these datasets. In the future, with more CLASH datasets available, more important miRNA target site features, including tissue-specific features, may be discovered and the accuracy of TarPmiR, especially its precision, may be further improved.
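To make the feature ‘The length and position of the longest consecutive pairs’ concrete, here is a minimal sketch of how such a feature could be computed from a duplex. The input encoding (one character per miRNA position, `|` for paired and `.` for unpaired), the 1-based position, and the function name are assumptions for illustration; they are not taken from the TarPmiR code, and whether G-U wobble pairs count as pairs is left to whoever builds the pairing string.

```python
def longest_consecutive_pairs(pairing):
    """Return (length, start) of the longest run of paired positions.

    pairing: string over miRNA positions, '|' = paired, '.' = unpaired
             (hypothetical encoding). start is 1-based; (0, 0) if no pairs.
    """
    best_len, best_start = 0, 0
    run_len, run_start = 0, 0
    for pos, symbol in enumerate(pairing, start=1):
        if symbol == "|":
            if run_len == 0:
                run_start = pos  # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0  # run broken by an unpaired position
    return best_len, best_start


# A centered site in the spirit of Thomson et al. (2011): no full seed match,
# but 11 continuous pairs in the central region of a 22-nt miRNA.
duplex = "." + "|" + ".." + "|" * 11 + "..|||.."
print(longest_consecutive_pairs(duplex))
```

A seed-based predictor would likely miss a site like this one, which is the kind of non-seed-matching site the feature is meant to capture.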
	Features selected by four different methods


	*CCmiR* is a software tool for predicting miRNA target sites by considering miRNA cooperation and competition.
	The identification of microRNA (miRNA) target sites is an important and challenging problem. In the past decade, dozens of computational methods have been developed to predict miRNA target sites. Yet hardly any method considers the well-known competition and cooperation among miRNAs when attempting to discover target sites. To fill this gap, we developed a new approach called CCmiR, which takes the cooperation and competition of multiple miRNAs into account in a statistical model to predict their target sites. Tested on four different types of datasets, CCmiR predicted miRNA target sites with a high recall and a reasonable precision. Moreover, we demonstrated that CCmiR identified known and new cooperative and competitive miRNAs supported by the literature. Compared with three state-of-the-art computational methods, CCmiR had a higher recall and a higher precision. Download

MDPS algorithm
	Considering the position dependency of neighboring pairings, we used a Markov model to learn the position-wise binding patterns for a given miRNA and its targets. We first defined five states for the pairings in the alignment of a given miRNA sequence and one of its target sequences: match, mismatch, G-U wobble match, bulge in target, and bulge in miRNA (Figure 1).
	Figure 1. Five states in a miRNA-target interaction.
	With the five states, we designed a 5 by 5 transition matrix t that describes the transition probabilities between the five states, and a weight matrix w that describes the probability that a miRNA position prefers each state. For a miRNA sequence of length n, its weight matrix is a 4 by n matrix, in which each column corresponds to one position in this miRNA, each row corresponds to one of the four states match, mismatch, G-U wobble match and bulge in miRNA, and each number gives the probability that the corresponding miRNA position prefers the corresponding state. The bulge-in-target state does not correspond to any miRNA position and thus was not considered in the weight matrix w.
	We calculated the transition and the weight matrices using the two training datasets. In brief, to create the weight matrix, we counted the number of occurrences of each of the four states at each miRNA position in all miRNA-target interactions in a dataset. To create the transition matrix, we counted the number of times each transition occurred in the interactions. We added a small pseudocount of 0.0001 to all entries in the matrices and then normalized the numbers in each row so that they sum to 1. Both w and t were calculated in the 5' to 3' direction of the miRNAs, using the aligned miRNA-target sequences in the training data. We defined two types of models: miRNA-specific and miRNA-general. The miRNA-specific model was learned by calculating the transition and weight matrices from the pairing information of a specific miRNA and its targets.
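The counting, pseudocount and row normalization described above can be sketched for the transition matrix as follows. This is an illustration under stated assumptions rather than the MDPS code: the state labels, the representation of each miRNA-target interaction as a list of states read 5' to 3' along the miRNA, and the function name are all hypothetical. Only the pseudocount of 0.0001 and the row normalization come from the text.

```python
# Hypothetical labels for the five states named in the text.
STATES = ["match", "mismatch", "wobble", "bulge_target", "bulge_mirna"]
PSEUDOCOUNT = 0.0001  # value stated in the text


def estimate_transition_matrix(state_paths):
    """Estimate a 5 x 5 transition matrix from aligned interactions.

    state_paths: list of state sequences, one per miRNA-target interaction.
    Returns a list of rows; row r holds P(next state | current state r).
    """
    index = {state: i for i, state in enumerate(STATES)}
    # Start every entry at the pseudocount so that no probability is zero.
    counts = [[PSEUDOCOUNT] * len(STATES) for _ in STATES]
    for path in state_paths:
        for prev, nxt in zip(path, path[1:]):
            counts[index[prev]][index[nxt]] += 1
    # Normalize each row so that it sums to 1.
    return [[c / sum(row) for c in row] for row in counts]


toy_paths = [
    ["match", "match", "mismatch"],
    ["match", "wobble"],
]
t = estimate_transition_matrix(toy_paths)
for state, row in zip(STATES, t):
    print(state, [round(p, 4) for p in row])
```

Rows for states that never occur in the training paths fall back to a uniform distribution, which is the practical effect of adding the pseudocount before normalizing.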
The miRNA-general model was trained on the pairing information of all available miRNAs and their targets. Note that a miRNA-general model is parametrized by only one transition matrix and one weight matrix, which are the unweighted averages of the transition and weight matrices of all the involved miRNA-specific models, respectively.
MDPS scoring strategy
MDPS selects miRNA target sites by scoring miRNA-target interactions using a dynamic programming (DP) algorithm. For a given miRNA with its calculated weight matrix and transition matrix, the following DP algorithm scores a target RNA sequence to determine whether it may be a potential target site of this miRNA. We first define two notations: a best-score function indexed by (i, j, k), and state(i,j). The best-score function gives the best score of the alignment between miRNA(1…i) and target RNA(1…j) whose last alignment position is at the k-th posture. Here miRNA(1…i) represents the miRNA sequence from position 1 to position i. Similarly, target RNA(1…j) represents the target sequence from position 1 to position j. There are three possibilities for the last alignment position. When k=0, the last alignment position is at one of the pairing states, which we call posture 0. When k=1 or k=2, the last alignment position is at one of the two bulge states, which we call posture 1 and posture 2, respectively. We also define state(i,j) as the state of the pairing of the i-th miRNA position and the j-th target position. Since two actual bases are involved, state(i,j) can only be one of the pairing states: match, mismatch or G-U wobble match. The best scores are initialized from the entries of the weight matrix of this miRNA and are then calculated for any i and j by one iteration formula per posture, each with its own initialization (the formulas are not reproduced here).
With the three types of iterations, we obtain the maximum of the best score over all j and k at i = n, where n is the length of the miRNA under consideration. This maximum value is regarded as the score of the alignment of this miRNA and the target under consideration. The alignment that yields this score describes the pairing between this miRNA and this target. Using the CLASH training datasets, we generated MDPS models that consisted of the average w and t matrices and a score cutoff that gave the best predictions in cross-validation on the corresponding CLASH training dataset. We generated these models for both the target-enriched dataset and the energy-filtered dataset using 10-fold cross-validation on the corresponding 80% training data. Since the number of columns of a w matrix equals the length of the corresponding miRNA, the number of columns of the average w matrix in a model was the length of the longest miRNA in the training datasets. If the score of a sequence was larger than a given cutoff, this sequence was called a target of this miRNA. We tested five different cutoffs and chose Average score + 2 × Standard deviation as the cutoff for the final MDPS models, where the Average score and the Standard deviation are the mean and the standard deviation of the alignment scores of the miRNA-target duplexes in the training datasets, respectively.
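Because the recurrences themselves are not shown on this page, the following is only one plausible reading of a three-posture DP of this kind, not the authors' formulas. Assumptions: scores are sums of log probabilities; the target string is already oriented so that target position j faces miRNA position i; an alignment may start at any target position; `w` and `t` are dictionaries keyed by hypothetical state labels; and the standard deviation is the sample standard deviation. The cutoff rule (mean plus two standard deviations) is the one stated in the text.

```python
import math
from statistics import mean, stdev

BT, BM = "bulge_target", "bulge_mirna"  # hypothetical state labels
STATES = ("match", "mismatch", "wobble", BT, BM)
NEG_INF = float("-inf")
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_state(m_base, t_base):
    """State of a miRNA base facing a target base (posture 0 states)."""
    if (m_base, t_base) in WATSON_CRICK:
        return "match"
    return "wobble" if (m_base, t_base) in WOBBLE else "mismatch"


def mdps_like_score(mirna, target, w, t):
    """Best log-score over alignments that consume the whole miRNA.

    S[i][j][k]: best score for mirna[:i] vs target[:j], last column in
    posture k (0 = pairing, 1 = bulge in target, 2 = bulge in miRNA).
    """
    n, m = len(mirna), len(target)
    S = [[[NEG_INF] * 3 for _ in range(m + 1)] for _ in range(n + 1)]

    def last_state(i, j, k):
        if k == 0:
            return pair_state(mirna[i - 1], target[j - 1])
        return BT if k == 1 else BM

    def arrive(i, j, nxt):
        """Best score ending at (i, j) plus the log transition into nxt."""
        if i == 0:
            return 0.0  # assumption: free start at any target position
        scores = [S[i][j][k] + math.log(t[last_state(i, j, k)][nxt])
                  for k in range(3) if S[i][j][k] > NEG_INF]
        return max(scores, default=NEG_INF)

    for i in range(1, n + 1):
        for j in range(m + 1):
            if j >= 1:
                st = pair_state(mirna[i - 1], target[j - 1])
                S[i][j][0] = arrive(i - 1, j - 1, st) + math.log(w[st][i - 1])
                S[i][j][1] = arrive(i, j - 1, BT)  # no weight: no miRNA position
            S[i][j][2] = arrive(i - 1, j, BM) + math.log(w[BM][i - 1])
    return max(S[n][j][k] for j in range(m + 1) for k in range(3))


def score_cutoff(training_scores):
    """Average score + 2 x standard deviation of training duplex scores."""
    return mean(training_scores) + 2 * stdev(training_scores)


def toy_parameters(n):
    """Made-up matrices for illustration only (not trained values)."""
    probs = {"match": 0.7, "mismatch": 0.1, "wobble": 0.1, BM: 0.1}
    w = {state: [p] * n for state, p in probs.items()}
    t = {a: {b: 0.2 for b in STATES} for a in STATES}
    return w, t


w, t = toy_parameters(4)
print(mdps_like_score("AAAA", "UUUU", w, t))  # fully complementary toy duplex
print(mdps_like_score("AAAA", "CCCC", w, t))  # all mismatches, lower score
```

Keeping three postures rather than five states per cell works because the pairing state at (i, j) is fixed by the two bases involved, so the previous state needed for the transition term can always be recovered from the posture and the position.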
Educational Materials		Download
References:
	· Li X, Hu H. Improving miRNA target prediction using CLASH data. In A. Lagana (Ed.), microRNA Target Identification, Springer Nature, New York, NY, pp. 75-83. doi: 10.1007/978-1-4939-9207-2_6. 2019.
	· Ding J, Li X, Hu H. CCmiR: a computational approach for competitive and cooperative microRNA binding prediction. Bioinformatics. doi: 10.1093/bioinformatics/btx606. 2017.
	· Wang Y, Goodison S, Li X, Hu H. Prognostic cancer gene signatures share common regulatory motifs. Scientific Reports. doi: 10.1038/s41598-017-05035-3. 2017.
	· Ding J, Li X, Hu H. TarPmiR: a new approach for microRNA target site prediction. Bioinformatics. doi: 10.1093/bioinformatics/btw318. 2016.
	· Ding J, Li X, Hu H. MicroRNA modules prefer to bind weak and unconventional target sites. Bioinformatics, 31 (9): 1366-1374. doi: 10.1093/bioinformatics/btu833. 2015.