Gene Transcription Initiation

Gene transcriptional regulation refers to any process by which a cell regulates the expression of its genes. Properly regulated gene expression is crucial for ensuring that biological processes are carried out accurately, particularly for genes that contribute to development, proliferation, programmed cell death (apoptosis), aging, and differentiation. Gene expression begins when mRNA synthesis starts at the point on the gene where transcription initiates.

To understand the regulation of gene expression, it is essential to discover the transcription initiation mechanisms used under various conditions and how these mechanisms lead to different outcomes, or phenotypes. High-throughput sequencing of the complete sets of RNA synthesized in cells has produced large datasets, but the matching large-scale computational studies needed to understand phenotype-relevant transcription initiation mechanisms are still at an early stage. We are developing computational algorithms and tools to discover the associations between transcription initiation and gene regulation mechanisms, with the aim of advancing our understanding of gene transcriptional regulation.

A Novel Approach, FlexSLiM, for Short Linear Motif Discovery in Protein Sequences

Abstract

Short linear motifs are peptide patterns, 3 to 11 amino acids long, that play important regulatory roles in modulating protein activities. Although they are abundant in proteins, they are often difficult to discover experimentally because of the low-affinity binding and transient interaction of short linear motifs with their partners. Moreover, available computational methods cannot effectively predict short linear motifs because of their short and degenerate nature. Here we developed a novel approach, FlexSLiM, for reliable discovery of short linear motifs in protein sequences. By testing on simulated data and benchmark experimental data, we demonstrated that FlexSLiM identifies short linear motifs more effectively than existing methods. We provide a general tool that will advance the understanding of short linear motifs and facilitate research on protein targeting signals, protein post-translational modifications, and many other topics.
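To make the idea of a short, degenerate motif concrete, the sketch below scans protein sequences for one motif written as a regular expression. It is only an illustration of motif matching, not the FlexSLiM discovery algorithm: the pattern `[RK].L.[FL]`, the function name `find_motif`, and the toy sequences are all assumptions made up for this example.

```python
import re

# Illustrative degenerate motif (an assumption, not taken from FlexSLiM):
# R or K, any residue, L, any residue, then F or L.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=([RK].L.[FL]))")


def find_motif(sequences, pattern=MOTIF):
    """Return {sequence_id: [(start, matched_peptide), ...]} with 0-based starts."""
    hits = {}
    for seq_id, seq in sequences.items():
        found = [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]
        if found:
            hits[seq_id] = found
    return hits


# Toy sequences, made up for illustration only.
toy = {
    "protA": "MSTKRALAFGGS",
    "protB": "MGGSAAPPTT",
    "protC": "AAKELQLRRLSLGG",
}

print(find_motif(toy))
```

Matching a known pattern is the easy direction. Discovering an unknown motif is harder because a five-residue pattern with wildcard positions matches many sequences by chance, which is the short and degenerate nature the abstract refers to.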

Software Download

Education Materials

Computational annotation of miRNA transcription start sites

Abstract

Motivation: MicroRNAs (miRNAs) are small noncoding RNAs that play important roles in gene regulation and phenotype development. The identification of miRNA transcription start sites (TSSs) is critical to understanding the functional roles of miRNA genes and their transcriptional regulation. Unlike the TSSs of protein-coding genes, miRNA TSSs are not directly detectable from conventional RNA-Seq experiments because of the miRNA-specific biogenesis process. In the past decade, large-scale genome-wide TSS-Seq and transcription activation marker profiling data have become available, and many computational methods have been developed based on them. These methods have greatly advanced genome-wide miRNA TSS annotation. Results: In this study, we summarized recent computational methods and their results on miRNA TSS annotation. We collected miRNA TSS annotations from 14 representative studies and performed a comparative analysis of them. We further compiled a robust set of miRNA TSSs (RSmirT) that are supported by multiple studies. Integrative genomic and epigenomic data analysis of RSmirT revealed the genomic and epigenomic features of miRNA TSSs as well as their relations to protein-coding and long non-coding genes.
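One simple way to think about a set of TSSs "supported by multiple studies" is to group nearby annotations for the same miRNA and keep the groups reported by enough independent studies. The sketch below does that under stated assumptions: the 100-nucleotide window, the two-study threshold, the record layout, and the function name `robust_tss` are hypothetical choices for illustration and are not the criteria used to build RSmirT.

```python
from collections import defaultdict


def robust_tss(annotations, window=100, min_studies=2):
    """Keep TSS clusters supported by at least `min_studies` studies.

    annotations: {study: [(mirna, chrom, strand, position), ...]}
    Returns [(mirna, chrom, strand, representative_position, n_studies), ...].
    `window` and `min_studies` are illustrative assumptions.
    """
    by_key = defaultdict(list)
    for study, records in annotations.items():
        for mirna, chrom, strand, pos in records:
            by_key[(mirna, chrom, strand)].append((pos, study))

    robust = []
    for key in sorted(by_key):
        entries = sorted(by_key[key])
        # Chain sorted positions together while consecutive gaps stay within the window.
        clusters, cluster = [], [entries[0]]
        for entry in entries[1:]:
            if entry[0] - cluster[-1][0] <= window:
                cluster.append(entry)
            else:
                clusters.append(cluster)
                cluster = [entry]
        clusters.append(cluster)

        for c in clusters:
            studies = {study for _, study in c}
            if len(studies) >= min_studies:
                positions = sorted(pos for pos, _ in c)
                # Middle position (upper median when the count is even).
                representative = positions[len(positions) // 2]
                robust.append((*key, representative, len(studies)))
    return robust


# Toy annotations, made up for illustration only.
toy = {
    "study1": [("mir-a", "chr1", "+", 1000), ("mir-b", "chr2", "-", 5000)],
    "study2": [("mir-a", "chr1", "+", 1040), ("mir-b", "chr2", "-", 9000)],
    "study3": [("mir-a", "chr1", "+", 1080)],
}

print(robust_tss(toy))
```

In this toy input, the three `mir-a` annotations fall into one cluster and are kept, while the two `mir-b` annotations are 4,000 nucleotides apart, so each has single-study support and is dropped.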

A Two-Stream Convolutional Neural Network for microRNA Transcription Start Site Feature Integration and Identification

Abstract

MicroRNAs (miRNAs) play important roles in post-transcriptional gene regulation and phenotype development. Understanding the regulation of miRNA genes is critical to understanding gene regulation. One of the challenges in studying miRNA gene regulation is the lack of condition-specific annotation of miRNA transcription start sites (TSSs). Unlike the TSSs of protein-coding genes, miRNA TSSs can be tens of thousands of nucleotides away from the precursor miRNAs, and they are hard to detect with conventional RNA-Seq experiments. A number of studies have attempted to computationally predict miRNA TSSs. However, high-resolution condition-specific miRNA TSS prediction remains a challenging problem. Recently, deep learning models have been successfully applied to various bioinformatics problems but have not been effectively developed for condition-specific miRNA TSS prediction. Here we created a two-stream deep learning model called D-miRT for computational prediction of condition-specific miRNA TSSs. D-miRT is a natural fit for the integration of low-resolution gene transcription activation markers, such as DNase-Seq and histone modification data, with high-resolution sequence features. We trained the D-miRT model by integrating genome-scale CAGE experiments and transcription activation marker data across multiple cell lines. Compared with alternative computational models on different sets of training data, D-miRT outperformed all baseline models and demonstrated high accuracy for condition-specific miRNA TSS prediction tasks. Compared with the most recent approaches to cell-specific miRNA TSS identification, using cell lines that were unseen during model training, D-miRT also showed superior performance.
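The two kinds of input described above can be pictured as two differently shaped arrays for the same genomic window: a per-nucleotide encoding of the sequence and a coarser, binned summary of each marker track. The sketch below prepares both in plain Python. It does not reproduce D-miRT's architecture or preprocessing; the function names, the bin size of 4, and the track names `dnase` and `h3k4me3` are assumptions for illustration.

```python
def one_hot(sequence):
    """High-resolution stream: one [A, C, G, T] row per nucleotide."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:  # unknown bases such as N stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows


def bin_signal(signal, bin_size):
    """Low-resolution stream: mean per-base signal in consecutive bins."""
    binned = []
    for i in range(0, len(signal), bin_size):
        chunk = signal[i:i + bin_size]
        binned.append(sum(chunk) / len(chunk))
    return binned


def two_stream_inputs(sequence, marker_tracks, bin_size=4):
    """Return (sequence_stream, marker_stream) for one genomic window."""
    seq_stream = one_hot(sequence)
    marker_stream = {
        name: bin_signal(track, bin_size) for name, track in marker_tracks.items()
    }
    return seq_stream, marker_stream


# Toy window, made up for illustration only.
sequence = "ACGTNACG"
tracks = {
    "dnase": [0, 0, 4, 4, 8, 8, 8, 8],
    "h3k4me3": [1, 1, 1, 1, 0, 0, 0, 0],
}

seq_stream, marker_stream = two_stream_inputs(sequence, tracks)
print(len(seq_stream), marker_stream)
```

A two-stream model would feed each array to its own sub-network and merge the learned representations before the final prediction, which is why the two inputs do not need to share a resolution.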

RSmiRT database

Abstract

The RSmiRT database integrates miRNA TSS datasets published in 14 recent studies.

Data and Website

Identify Transcription Start Sites from CAGE Data

Abstract

Gene transcription start site (TSS) identification is important to understanding transcriptional gene regulation. Cap Analysis Gene Expression (CAGE) experiments have recently become common practice for direct measurement of TSSs. The CAGE data currently available in public databases have created unprecedented opportunities to study gene transcription initiation mechanisms under various cellular conditions. However, because of the potential transcriptional noise inherent in CAGE data, in silico methods are required to distinguish bona fide TSSs from noise. Here we present a computational approach, dlCAGE, an end-to-end deep neural network that identifies TSSs from CAGE data. dlCAGE incorporates de novo DNA regulatory motif features discovered with the DeepBind model architecture, as well as existing sequence and structural features. Tests of dlCAGE in several cell lines, in comparison with current state-of-the-art approaches, showed its superior performance and promise for TSS identification from CAGE experiments.
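To show what separating signal from noise in CAGE data can look like at its simplest, the sketch below groups nearby tag positions into clusters and discards clusters with few tags. This is a naive count-threshold baseline written for illustration, not dlCAGE, which uses a deep neural network; the function name `call_tss_candidates` and the `min_tags` and `max_gap` defaults are assumptions.

```python
def call_tss_candidates(tag_counts, min_tags=5, max_gap=2):
    """Cluster CAGE tag positions and keep well-supported clusters.

    tag_counts: {position: number of CAGE tag 5' ends} for one strand.
    Returns [(start, end, dominant_position, total_tags), ...].
    `min_tags` and `max_gap` are illustrative assumptions.
    """
    positions = sorted(p for p, count in tag_counts.items() if count > 0)
    if not positions:
        return []

    # Merge positions separated by at most `max_gap` nucleotides.
    clusters, current = [], [positions[0]]
    for p in positions[1:]:
        if p - current[-1] <= max_gap:
            current.append(p)
        else:
            clusters.append(current)
            current = [p]
    clusters.append(current)

    candidates = []
    for cluster in clusters:
        total = sum(tag_counts[p] for p in cluster)
        if total >= min_tags:  # weakly supported clusters are treated as noise
            dominant = max(cluster, key=lambda p: tag_counts[p])
            candidates.append((cluster[0], cluster[-1], dominant, total))
    return candidates


# Toy tag counts, made up for illustration only.
toy_counts = {100: 1, 101: 6, 102: 3, 250: 1, 400: 2, 403: 4}

print(call_tss_candidates(toy_counts))
```

A fixed count threshold cannot tell a weakly expressed genuine TSS from background, which is the kind of case where sequence and structural features become useful.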

The Potential Application of Deep Learning Models to MicroRNA Transcription Start Site Identification

Abstract

MicroRNAs (miRNAs) are RNAs, ~22 nucleotides long, that play important roles in regulating gene expression. Understanding the transcriptional regulation of miRNAs is critical to understanding gene regulation. However, it is often difficult to precisely identify miRNA transcription start sites (TSSs) because of miRNA-specific biogenesis. Existing computational methods cannot effectively predict miRNA TSSs. Here, we employed deep learning architectures incorporating Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) techniques to detect miRNA TSSs in regions of accessible chromatin. By testing on benchmark experimental data, we demonstrated that deep learning models outperform support vector machines and can accurately distinguish miRNA TSSs.

Education Materials