=========================================================================================== EPIP Author: Amlan Talukder Date: May 28,2018 =========================================================================================== EPIP is a software used to identify target genes of enhancers in human genome. It is developed by the computational System biology group at UCF. INSTALLATION -------------------------------------------------------------------------------------------- (1) Install Python 2.7 (2) Install pybedtools, scikit learn, numpy and pandas packages for Python. (3) Install bedtools EXECUTION -------------------------------------------------------------------------------------------------------------------------------------- (1) cd to the "Codes.ExtractFeatures" directory of the software For example, if you put XXXX under your home directory ( "cd /home/YOURNAME/XXXX") YOURNAME is the your home directory name. (2) You can run the software by running the script "EPIP.py" with the following command: ---------------------------------------------------------------------------------------- usage: EPIP.py [-h] -e ENHANCER_FILE_PATH [-c CELL_LINE] [-g GENE_FILE_PATH] [-s CHROM_SIZES_FILE_PATH] [-d INTEGER] [-f FEATURE_DIR_PATH] Find enhancer targets optional arguments: -h, --help show this help message and exit -c CELL_LINE Comma delimited cell lines among {GM12878,HELA,HMEC,HUVEC,IMR90,K562,KBM7,NHEK}.In order to predict for a cell line that is not in the list above, user must provide the file path of the features to use for that cell line with "-f" command -g GENE_FILE_PATH Path for human genes in bed file format (default: Use human Gencode 19) -s CHROM_SIZES_FILE_PATH Path for chromosome sizes file of human genome in tab delimited file format (default: Use human Gencode 19 chrom sizes) -d INTEGER Maximum distance between enhancers and promoters (default: 2 Mbp) -f FEATURE_DIR_PATH Path of the directory where the feature peak files for each cell line are specified in the following formats feature___peaks.[bed|narrowPea k|broadPeak]. must be among the following features. {H3k4me1,H3k4me2,H3k4me3,H3k9ac,H3 k27ac,H3k27me3,H3k36me3,H3k79me2,H4k20me1,Ctcf,DNaseI, Pol2,Rad21,Smc3} required arguments: -e ENHANCER_FILE_PATH Path for hg19 enhancers in bed file format Example: python EPIP.py -e test_enhancers.bed -g test_genes.bed Required inputs --------------------------------------------------------------------------------------------- The tool takes the file path of the enhancer bed file as mandatory parameters. Optional inputs --------------------------------------------------------------------------------------------- The cell lines, gene file paths, chromosome sizes file path, distance and feature dir paths are the optional input parameters. In case of multiple cell lines as input, the cell lines input has to be comma delimited. Any cell line except the listed cell lines, can be taken as input. For the new cell line, feature files should be stored in a directory specified by the user using the FEATURE_DIR_PATH parameter. A example feature directory "Example_feature_dir" is given in the current folder. If no features specified by the user for the new cell line, only the basic features (distance, corr and css) will be calculated for that cell line. The features specified for the new cell line must be among the list of following features {H3k4me1,H3k4me2,H3k4me3,H3k9ac,H3k27ac,H3k27me3,H3k36me3,H3k79me2,H4k20me1,Ctcf, DNaseI,Pol2,Rad21,Smc3}. Although user can specify different data file for a feature among the listed features, using the feature reference file parameter. If the genes bed file is provided as an input, the promoters are calculated using 1kbp upstream and 100bp downstream around the tss of the genes. The enhancer promoter pairs are considered within a specific distance (default: 2Mbp). This distance can also be taken as an input. MODEL ---------------------------------------------------------------------------------------------------------------------------------- The model pkl file is stored under "TrainedModels" directory. RESULTS ---------------------------------------------------------------------------------------------------------------------------------- The result files are stored under "Results" directory. The result files are created by the name of the cell lines. The output format is tab delimited enhancer regions and target gene ids predicted by the model. The example of the output format is following, chr1_2989176_2989511 ENSG00000169717.5_ACTRT2 chr1_2230678_2230895 ENSG00000116151.9_MORN1 chr1_2059932_2059994 ENSG00000234396.3_RP11-181G12.4 chr1_3581269_3581645 ENSG00000272088.1_RP11-168F9.2 chr1_1978731_1978937 ENSG00000169885.5_CALML6 chr1_2231691_2232185 ENSG00000178642.5_AL513477.1 chr1_1005293_1005547 ENSG00000207607.1_MIR200A chr1_1136075_1136463 ENSG00000260179.1_RP5-902P8.12 chr1_2246556_2246763 ENSG00000243558.1_RP11-181G12.5 LICENSE & CREDITS ------------------------------------------------------------------------------------------------- The software is a freely available for academic use. plase contact xiaoman shawn li (xiaoman@mail.ucf.edu) for further information. CONTACT INFO ------------------------------------------------------------------------------------------------- If you are encountering any problem regarding to EPIP, please refer the manual first. If problem still can not be solved, please feel free to contact us: Amlan Talukder (amlan@knights.ucf.edu) Nancy Haiyan Hu (haihu@cs.ucf.edu) Xiaoman Shawn Li (xiaoman@mail.ucf.edu)