Software designed for motif and motif module discovery in high-throughput sequencing data sets

SIOMICS 3.0 Manual

1. Prerequisites
To use this software, prepare the following:

(1) Python must be installed (Python 2.7.x).
Python can be downloaded from the official Python website.
You also need the Tkinter (Python 2) or tkinter (Python 3) module to enable the GUI.
For Windows users, the module is already included in the Windows Python installer.
For Linux users, please refer to the Tkinter installation instructions for your distribution.


2. Parameters
(1) Description of parameters

python SIOMICS.py

(2) Recommended parameters for SIOMICS
Note: If you cannot get the desired outputs under the recommended parameters, you may change them based on your specific needs. For example, if 1% of the total number of sequences is much larger than 100 (support too large), it might be better to use 100 instead of 1%. If 1% of the total number of sequences is smaller than 3 (support too small), it might be better to use 3 as the support. Users can always adjust the support based on their specific application. If you want more stringent results, you may use 2% or 5% of the total number of sequences (rounded to an integer) as the cutoff for the -s parameter.
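The rule of thumb in the note above can be written as a small helper. This is only a sketch: the function name recommended_support and its arguments are hypothetical and are not part of SIOMICS. It takes a percentage of the number of input sequences and keeps the result between 3 and 100.

```python
def recommended_support(num_sequences, percent=1, lower=3, upper=100):
    """Suggest a value for -s (hypothetical helper, not shipped with SIOMICS).

    Takes `percent` % of the number of input sequences (integer arithmetic,
    rounded down) and keeps the result within [lower, upper].
    """
    support = num_sequences * percent // 100
    return max(lower, min(upper, support))
```

For a data set of 2000 sequences this suggests 20, which matches the -s 20 used in the examples of this manual if the example data set has about that many sequences. For stricter results, pass percent=2 or percent=5.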
3. Software Usage

For example, to identify motifs from the provided example_seq TF peak dataset located under the "example" directory using the monad+dyad mode, we can use one of the following commands:

A. No motif comparison
python SIOMICS3.py -i example/example_seq -o example_output -m 2 -n 100 -s 20 -c 0.01 -t 0 -d 0 -e 0

B. STAMP with default database
python SIOMICS3.py -i example/example_seq -o example_output -m 2 -n 100 -s 20 -c 0.01 -t 1 -d default -e 1e-5

C. STAMP with user-defined database
python SIOMICS3.py -i example/example_seq -o example_output -m 2 -n 100 -s 20 -c 0.01 -t 1 -d <Path_to_your_database> -e 1e-5

The following examples show how to use SIOMICS (both command line and GUI).

(1) Command line example
For example, suppose we want to identify all motifs (-m 2) from the provided "example_seq" dataset under the "example" directory using the default strategy (-r 0). The predicted motifs will be compared to the JASPAR2016 motif database under a STAMP E-value cutoff of 1E-5.
We can use the following command:
python SIOMICS3.py -i example/example_seq -o example_output -m 2 -s 20 -c 0.01 -r 0 -n 100 -t 1 -d default -e 1e-5

The meaning of the above parameters:
Identify motifs with mode 2 (-m 2) and the default strategy (-r 0), with corrected p-value < 0.01 (-c 0.01). Motifs need to co-occur in at least 20 sequences (-s 20) to be reported as modules. The maximal number of predicted motifs is 100 (-n 100). The predicted motifs are compared (-t 1) to the default motif database JASPAR2016 (-d default) under a STAMP E-value cutoff of 1E-5 (-e 1e-5).
Note: The input sequences must be in FASTA format.
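When SIOMICS3 is called from another Python script, it can help to assemble the command line in one place. The sketch below only rebuilds the command shown above from the flags documented in this manual. The function name build_siomics_command and its keyword names are hypothetical.

```python
def build_siomics_command(input_path, output_dir, mode=2, support=20,
                          pvalue=0.01, strategy=0, max_motifs=100,
                          compare=1, database="default", evalue="1e-5"):
    """Return the SIOMICS3 command as an argument list (hypothetical helper)."""
    return [
        "python", "SIOMICS3.py",
        "-i", input_path,        # input sequences in FASTA format
        "-o", output_dir,        # output directory
        "-m", str(mode),         # motif mode
        "-s", str(support),      # minimal support for a module
        "-c", str(pvalue),       # corrected p-value cutoff
        "-r", str(strategy),     # strategy (0 = default)
        "-n", str(max_motifs),   # maximal number of predicted motifs
        "-t", str(compare),      # 1 = compare with known motifs using STAMP
        "-d", database,          # motif database ("default" or a path)
        "-e", str(evalue),       # STAMP E-value cutoff
    ]
```

The returned list can be passed to subprocess.call from the Python standard library, or joined with spaces to print the command for checking.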

If you do not want to specify every parameter yourself, you can use the "batch_siomics.py" script we provide. This script runs SIOMICS3 on a batch of peak sequence files with default parameters.
Take the "example" folder included in the software as an example:
We can get the predictions for all datasets under the "example" folder by using the following command:
python batch_siomics.py example

SIOMICS3 will be run sequentially on each sequence file under the "example" folder.
The output files will be put into directories named <DatasetName_out>.

(2) GUI example
To run the GUI version of SIOMICS, just double-click "SIOMICS_GUI.py". See the following GUI example:




4. SIOMICS Results
While the software is running, "Running..." is shown at the bottom of the GUI. On Windows the GUI might show "Not Responding" while SIOMICS is running; this is expected. "done" is shown at the bottom of the GUI once the results have been obtained. SIOMICS writes the result files to the output directory you specified.

SIOMICS3 Results:
Result files:
1. X.PWM (predicted motifs in PWM format)
2. X.PWM.TFs (Corresponding TFs for predicted motifs under given E-value cutoff)
This result is only available under Linux, since it is based on STAMP and STAMP is not available under Windows.
Although the STAMP standalone tool cannot be used under Windows, the STAMP online server can be used to compare the predicted motifs with the known motifs in the provided database. The online comparison results are provided as X.STAMP.pdf, which is downloaded from the STAMP server.
3. X.STAMP.pdf (STAMP comparison details in PDF for predicted motifs)
4. X.mc (Predicted motif combinations)
5. X.tfbs (The TFBSs for predicted motifs)
6. X.trans (The transaction file generated for pattern mining)
7. X.sif (Simple interaction file generated for visualization of TF-TF interactions using Cytoscape)
8. running.log

X.PWM is the result file of the predicted motifs in Position Weight Matrix (PWM) format.
X.PWM.TFs: the predicted motifs are compared with the known motif database using STAMP under the given E-value cutoff. The similar known motifs for each predicted motif are listed.
X.mc is the result file of the predicted motif modules.
X.STAMP.pdf: the predicted motifs are compared with the known motif database using STAMP under the given E-value cutoff. This result file presents the comparison details.
X.tfbs contains the TFBSs of the predicted motifs.
X.trans is the transaction file generated for pattern mining. It records the motifs found in each input sequence.
X.sif is the interaction input for Cytoscape. The ".sif" format is explained in the Cytoscape documentation.
You can use the X.sif file to build the TF-TF interaction network in Cytoscape.
Instructions on how to load a simple interaction network (.sif) into Cytoscape can be found in the Cytoscape documentation.
running.log records the commands, parameters, and running time of the SIOMICS3 software.

The following examples describe the format of the result files.

See the following example for the format of X.PWM:
			>1	0.0	(J_MA0139.1_CTCF,J_MA0065.2_PPARG_RXRA,J_MA0159.1_RXR_RAR_DR5)
			0.971153846154 0.00961538461538 0.00961538461538 0.00961538461538
			0.00961538461538 0.00961538461538 0.971153846154 0.00961538461538
			0.971153846154 0.00961538461538 0.00961538461538 0.00961538461538
			0.00961538461538 0.00961538461538 0.971153846154 0.00961538461538
			0.209887098285 0.00961538461538 0.770882132484 0.00961538461538
			0.00961538461538 0.00961538461538 0.971153846154 0.00961538461538
			0.00961538461538 0.971153846154 0.00961538461538 0.00961538461538
			0.971153846154 0.00961538461538 0.00961538461538 0.00961538461538
			
The first line contains:
ID of the motif: 1
p-value: 0.0
Names in parentheses: the similar known motifs (as listed in X.PWM.TFs)
The remaining lines give the frequencies of A, C, G, T at each position.
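The X.PWM layout described above can be read with a few lines of Python. This is a minimal sketch, not code shipped with SIOMICS: it assumes the header fields are separated by whitespace and that the four columns are ordered A, C, G, T. The names parse_pwm and consensus are hypothetical.

```python
def parse_pwm(text):
    """Parse X.PWM-style text into a list of motif dictionaries."""
    motifs = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: >ID  p-value  (comma-separated similar known motifs)
            fields = line[1:].split(None, 2)
            known = []
            if len(fields) > 2:
                known = [n for n in fields[2].strip("()").split(",") if n]
            current = {
                "id": fields[0] if fields else "",
                "pvalue": float(fields[1]) if len(fields) > 1 else None,
                "known_motifs": known,
                "rows": [],
            }
            motifs.append(current)
        elif current is not None:
            # One matrix row per motif position: frequencies of A, C, G, T
            current["rows"].append([float(v) for v in line.split()])
    return motifs


def consensus(rows, alphabet="ACGT"):
    """Return the most frequent base at each position (assumes A,C,G,T order)."""
    return "".join(alphabet[row.index(max(row))] for row in rows)
```

Applied to the example above, consensus gives AGAGGGCA for motif 1.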

See the following example for the format of X.PWM.TFs:
>	1
J_MA0139.1_CTCF	3.0274e-05	---TGCCCTCT--------	NNSYGCCMCCTRSTGGNNR
J_MA0065.2_PPARG_RXRA	3.6966e-05	TGCCCTCT-------	TGMCCTTTGNCCYNN
J_MA0159.1_RXR_RAR_DR5	3.8927e-05	TGCCCTCT---------	TGACCTNYNNNTGAMCY
			
>1 means it is the motif 1
The following rows list the similar known motifs in the given database.
For example, the first row shows that the motif is similar to J_MA0139.1_CTCF with a STAMP E-value of 3.0274e-05.
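A file in this layout can be turned into a dictionary that maps each predicted motif to its similar known motifs. The sketch below is an illustration only and assumes whitespace-separated columns with the known motif name first and the E-value second. The name parse_pwm_tfs is hypothetical.

```python
def parse_pwm_tfs(text):
    """Map each motif ID to a list of (known motif name, STAMP E-value)."""
    matches = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: ">" followed by the motif ID
            current = line[1:].strip()
            matches[current] = []
        elif current is not None:
            fields = line.split()
            # The remaining columns (the two alignment strings) are ignored here
            matches[current].append((fields[0], float(fields[1])))
    return matches
```

For the example above, the entry for motif "1" holds three known motifs, the first being J_MA0139.1_CTCF with E-value 3.0274e-05.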

See the following example to see the meaning of X.mc:
			M66 M21 (58)	(6.7476668697e-10)
			
This denotes that M66 and M21 are regarded as a motif module (they co-occur in 58 sequences). The corrected p-value is 6.7476668697e-10.
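A line of X.mc can be split into its three parts with a regular expression. The sketch below assumes every line follows the layout of the example: motif IDs, then the support in parentheses, then the corrected p-value in parentheses. The name parse_mc_line is hypothetical.

```python
import re

# motif IDs, then "(support)", then "(corrected p-value)"
_MC_LINE = re.compile(
    r"^(?P<motifs>.+?)\s*\((?P<support>\d+)\)\s*\((?P<pvalue>[^()]+)\)\s*$"
)


def parse_mc_line(line):
    """Return (list of motif IDs, support, corrected p-value) for one X.mc line."""
    match = _MC_LINE.match(line.strip())
    if match is None:
        raise ValueError("unexpected X.mc line: %r" % line)
    return (
        match.group("motifs").split(),
        int(match.group("support")),
        float(match.group("pvalue")),
    )
```

For the example line, this gives the motifs M66 and M21, a support of 58, and a corrected p-value of 6.7476668697e-10.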
An example for X.STAMP.pdf

Note: The above comparison between the predicted motifs and the known motif database was performed using STAMP with the default parameters. You can also compare the predicted motifs with motifs from other sources, or with different parameters, using STAMP: simply use the predicted motifs as the "Input Motifs" to STAMP, then choose the source of the motifs to compare against and the parameters. For details about how to use STAMP, please refer to the STAMP help.

See the following example for X.tfbs:
>M0:
M0,mm8_ct_UserTrack_3545_MACS_peak_19 range=chr1:13112912-13113786 5'pad=0 3'pad=0 strand=+ repeatMasking=N,609	GGGGGGGG
M0,mm8_ct_UserTrack_3545_MACS_peak_20 range=chr1:13645559-13646424 5'pad=0 3'pad=0 strand=+ repeatMasking=N,305	GGGTGGGG
M0,mm8_ct_UserTrack_3545_MACS_peak_32 range=chr1:36005234-36006169 5'pad=0 3'pad=0 strand=+ repeatMasking=N,42	GGGTGGGG
M0,mm8_ct_UserTrack_3545_MACS_peak_47 range=chr1:52897802-52898706 5'pad=0 3'pad=0 strand=+ repeatMasking=N,544	GGGGGGGG
M0,mm8_ct_UserTrack_3545_MACS_peak_72 range=chr1:72161911-72162847 5'pad=0 3'pad=0 strand=+ repeatMasking=N,356	GGGTGGGG
M0,mm8_ct_UserTrack_3545_MACS_peak_77 range=chr1:77336476-77337340 5'pad=0 3'pad=0 strand=+ repeatMasking=N,704	GGGTGGGG
			
This is the TFBSs information for predicted motif M0. Let's explain its format by using the first line:
M0,mm8_ct_UserTrack_3545_MACS_peak_19 range=chr1:13112912-13113786 5'pad=0 3'pad=0 strand=+ repeatMasking=N,609	GGGGGGGG
			
The above result shows that M0 has an instance in the peak (chr1:13112912-13113786, strand=+). The instance starts at relative position 609, and its sequence is GGGGGGGG.
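The fields described above can be extracted from one X.tfbs line as follows. This is a sketch under the assumption that the motif ID ends at the first comma, the relative position follows the last comma, and the instance is the last whitespace-separated field. The name parse_tfbs_line is hypothetical.

```python
def parse_tfbs_line(line):
    """Return (motif ID, sequence header, relative position, instance)."""
    # The instance is the last whitespace-separated field on the line
    location, instance = line.strip().rsplit(None, 1)
    # The motif ID ends at the first comma, the position follows the last one
    motif_id, rest = location.split(",", 1)
    header, position = rest.rsplit(",", 1)
    return motif_id, header, int(position), instance
```

Header lines such as ">M0:" do not follow this layout and should be skipped before calling the function.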

The following is an example for X.trans
			2 12 35 56 57 59 62 70
			10 11 23 36 53 54 84
			1 5 23 30 34 36 39 41 48 85
			...
			
Each row lists the motifs that have instances in one input sequence (one row corresponds to one input sequence/peak).
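Because each row of X.trans stands for one input sequence, the number of sequences in which a set of motifs co-occurs can be counted directly from this file. The sketch below assumes that the numbers in a row are motif IDs. The function names are hypothetical, and this is not necessarily how SIOMICS computes its support values.

```python
def read_transactions(text):
    """Return one set of motif IDs per non-empty row of X.trans-style text."""
    return [set(line.split()) for line in text.splitlines() if line.strip()]


def cooccurrence_support(transactions, motifs):
    """Count the rows (input sequences) that contain all of the given motifs."""
    wanted = set(motifs)
    return sum(1 for row in transactions if wanted <= row)
```

In the three example rows above, motifs 23 and 36 co-occur in two rows (the second and the third).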
The following is an example for X.sif
				M1 pp M2
				M13 pp M12
				M13 pp M10
				M2 pp M1
				M7 pp M9
				M13 pp M2
				M7 pp M6
				M7 pp M5
				M7 pp M8
				M7 pp M3
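Besides loading X.sif into Cytoscape, the interactions can be inspected in Python. The sketch below reads lines of the form "source relation target" (the SIF format also allows several targets on one line) and builds an undirected neighbor table. The name sif_neighbors is hypothetical.

```python
from collections import defaultdict


def sif_neighbors(text):
    """Build an undirected neighbor table from SIF-style text."""
    neighbors = defaultdict(set)
    for raw_line in text.splitlines():
        fields = raw_line.split()
        if len(fields) < 3:
            continue  # skip blank lines and isolated nodes
        source, targets = fields[0], fields[2:]
        for target in targets:
            neighbors[source].add(target)
            neighbors[target].add(source)
    return neighbors
```

Using sets means that an interaction listed in both directions, such as "M1 pp M2" and "M2 pp M1" above, is stored only once.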
			

The following is an example for running.log
Program starts:
2015-11-20 22:28:09.973436
Running command:
python SIOMICS.py 
-i seq_800/Ctcf
-m 2 
-n 100 
-s 396 
-c 0.01 
-o Ctcf_out 
-t 1 
-d /home/jding/projects/SIOMICS3/SIOMOCIS3_20151108/Clique_Dyad/STAMP/jaspar2010.motifs
-e 0.0001
Motif prediction Success!
Motif module prediction Success!
Program ends:
2015-11-21 02:28:42.444177

			
You can retrieve the command you used and the running time of the software from this running.log file.
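The running time can be computed from the two timestamps in running.log. The sketch below assumes the layout of the example above, with each timestamp on the line directly after "Program starts:" or "Program ends:". The name running_time is hypothetical.

```python
from datetime import datetime

_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def running_time(log_text):
    """Return the elapsed time between program start and end as a timedelta."""
    lines = [line.strip() for line in log_text.splitlines()]
    # Each timestamp sits on the line right after its label
    start = datetime.strptime(lines[lines.index("Program starts:") + 1], _TIME_FORMAT)
    end = datetime.strptime(lines[lines.index("Program ends:") + 1], _TIME_FORMAT)
    return end - start
```

For the example log above, the result is a little over four hours (4:00:32.470741).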
5. Contact Info
If you have any questions regarding the SIOMICS software or have found any bugs, please feel free to contact us at xiaoman@mail.ucf.edu. For any non-academic use of this software, please also contact xiaoman@mail.ucf.edu.