Motif Detectives project for the CSHL "Programming for Biology" course, 2022
Brianda Lopez Aviña / Marleny García Lozano / Bana Abolibdeh / Chrissi Heil / Mina Peyton / Aparna Thomas
Collaboration with TA Cynthia Cardinault (Centro de Investigación Científica de Yucatán)
The purpose of this project is to offer the user a program that identifies a specific motif/consensus sequence (for example, a transcription factor binding site) in the genome of their organism of interest.
Transcription factors (TFs) are proteins that activate or repress gene expression by binding to consensus sequences located near the start of a gene (the promoter). Locating TF-binding sequences helps identify the direct target genes of a TF in a genome, which is one of the most challenging problems in molecular biology and bioinformatics.
INPUTS (links below):
- GENOME.fa
- GENOME.gff
- selected MOTIF
OUTPUT:
- FILE.txt
To test the code, we use the C. elegans genome (specifically, chromosome I) and the Retinoic Acid Response Element motif (RARE-DR5).
INPUT FILES:
- FASTA FILE Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa.gz
- GFF FILE Caenorhabditis_elegans.WBcel235.108.chromosome.I.gff3.gz
MOTIF regular expression - RARE/DR5 ([AG]G[GT]T[CG]A.....[AG]G[GT]T[CG]A)
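As a minimal sketch of how this motif behaves as a Python regular expression: inside a character class the alternatives are listed without `|` (for example `[AG]`), and the five dots are a 5-nucleotide spacer between the two half-sites. The names `HALF_SITE`, `RARE_DR5` and `toy_seq` are hypothetical and the toy sequence is made up for illustration; this is not taken from the project scripts.

```python
import re

# RARE-DR5: two half-sites separated by a 5-nucleotide spacer.
# Inside a character class the alternatives are listed without "|".
HALF_SITE = "[AG]G[GT]T[CG]A"
RARE_DR5 = re.compile(HALF_SITE + ".{5}" + HALF_SITE)

# Toy sequence (made up): one DR5 site flanked by filler nucleotides.
toy_seq = "TTTT" + "AGGTCA" + "CCCCC" + "GGTTCA" + "TTTT"

match = RARE_DR5.search(toy_seq)
if match:
    # match.start() is 0-based; add 1 to report 1-based, inclusive coordinates.
    print(match.group(), match.start() + 1, match.end())
```

The pattern only scans the strand given in the sequence string; searching the reverse strand would need a separate pass.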
Figure 1. Motif finder pipeline
1. FASTA parser: extract the data fields from the GENOME.fa.gz file -> included in motif_finder_version2_BA.py. To run this script, download md_fasta_parser.py (the source of the FASTA parser function) into the same directory.
2. Search for the motif sequence in the genome FASTA sequence and extract the motif coordinates (start nucleotide, end nucleotide) -> motif_finder_version2_BA.py. To run this script, download md_fasta_parser.py (the source of the FASTA parser function) into the same directory.
STDOUT from Step 2 = motif_hit_out.txt
3. GFF parser: extract the chromosome number, feature type (exon, CDS, mRNA, etc.), feature coordinates (start nucleotide, end nucleotide) and description (gene_ID, protein_ID) -> included in gff3_motif_analyzer.py
4. Determine which genes in the genome contain the motif: pair the motif coordinates extracted in Step 2 with the feature coordinates extracted in Step 3 to determine where on the chromosome the motif is located. This returns a list of motifs associated with each gene_ID -> included in gff3_motif_analyzer.py (STDOUT = mapped_motif_hits.out)
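The four steps above can be sketched roughly as follows. This is an illustrative outline, not the code in motif_finder_version2_BA.py or gff3_motif_analyzer.py: all function names are hypothetical, the toy FASTA and GFF3 lines are made up, and the sketch assumes that the sequence ID in the FASTA header matches the first GFF3 column, that a hit belongs to a feature when their coordinates overlap, and that only the forward strand is scanned.

```python
import re

def parse_fasta(lines):
    """Step 1: return {sequence_id: sequence} from an iterable of FASTA lines."""
    parts, seq_id = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].split()[0]  # first word of the header
            parts[seq_id] = []
        elif seq_id is not None:
            parts[seq_id].append(line.upper())
    return {name: "".join(chunks) for name, chunks in parts.items()}

def find_motif_hits(seqs, pattern):
    """Step 2: return (seq_id, start, end, match) with 1-based inclusive coordinates."""
    regex = re.compile(pattern)
    hits = []
    for seq_id, seq in seqs.items():
        # finditer reports non-overlapping matches only
        for m in regex.finditer(seq):
            hits.append((seq_id, m.start() + 1, m.end(), m.group()))
    return hits

def parse_gff(lines, feature_type="gene"):
    """Step 3: return (seq_id, start, end, attributes) for one feature type."""
    features = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        features.append((cols[0], int(cols[3]), int(cols[4]), cols[8]))
    return features

def map_hits_to_features(hits, features):
    """Step 4: pair each hit with every overlapping feature on the same sequence."""
    mapped = []
    for seq_id, h_start, h_end, motif in hits:
        for f_seq, f_start, f_end, attrs in features:
            if seq_id == f_seq and h_start <= f_end and h_end >= f_start:
                mapped.append((seq_id, h_start, h_end, motif, attrs))
    return mapped

# Toy inputs (made up for illustration)
fasta_lines = [">I toy chromosome", "TTTTAGGTCACCCCC", "GGTTCATTTT"]
gff_lines = [
    "##gff-version 3",
    "I\ttoy\tgene\t1\t22\t.\t+\t.\tID=gene:toy1",
    "I\ttoy\tgene\t23\t25\t.\t-\t.\tID=gene:toy2",
]

seqs = parse_fasta(fasta_lines)
hits = find_motif_hits(seqs, "[AG]G[GT]T[CG]A.{5}[AG]G[GT]T[CG]A")
features = parse_gff(gff_lines)
mapped = map_hits_to_features(hits, features)
for row in mapped:
    print(*row, sep="\t")
```

For the real gzipped inputs, the same functions could be fed from `gzip.open(path, "rt")`. The nested loop in the last step compares every hit with every feature, which is simple but may be slow on a full chromosome; sorting the features by start position would be one way to speed it up.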
Additional step: applying our scripts to differentially expressed genes (DEGs) from RNA-seq data
Is your motif present in differentially expressed genes?
To answer that question, we propose using RNA-seq data from two developmental stages of C. elegans, the oocyte and the one-cell-stage embryo, from the paper "Global characterization of the oocyte-to-embryo transition in Caenorhabditis elegans uncovers a novel mRNA clearance mechanism".
INPUTS for DEGs Analysis:
- list of DEGs from the database (already prepared by the authors): up_genes_1cell_embryo.tx
- list of motif hits obtained in Step 4: mapped_motif_hits.out
This analysis runs with a single script -> SCRIPT NAME: [op_genes_motifs.py](https://github.com/cyntsc/Motif_Detectives/blob/main/op_genes_motifs.py)
OUTPUTS:
- list of DEGs that have the motif in their sequence (STDOUT= [up_genes_match_motif_out.txt](https://github.com/cyntsc/Motif_Detectives/blob/main/up_genes_match_motif_out.txt))
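One way to picture this step is as a set intersection between the DEG list and the genes that carry a motif hit. The sketch below is an assumption-laden illustration, not the logic of op_genes_motifs.py: the function names are hypothetical, the gene IDs are made-up placeholders, and it assumes the DEG file holds one gene ID per line while the motif-hit file is tab-separated with the gene ID in a known column.

```python
def read_gene_ids(lines, column=0):
    """Collect gene IDs from one tab-separated column, skipping blanks and '#' lines."""
    ids = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if column < len(fields):
            ids.add(fields[column].strip())
    return ids

def degs_with_motif(deg_lines, hit_lines, hit_gene_column=0):
    """Return the sorted gene IDs present in both the DEG list and the motif hits."""
    degs = read_gene_ids(deg_lines)
    genes_with_hits = read_gene_ids(hit_lines, column=hit_gene_column)
    return sorted(degs & genes_with_hits)

# Toy inputs with placeholder gene IDs (made up for illustration)
deg_lines = ["WBGene00000001", "WBGene00000002", "WBGene00000003"]
hit_lines = ["WBGene00000002\tI\t5\t21", "WBGene00000004\tI\t100\t116"]

print(degs_with_motif(deg_lines, hit_lines))
```

If the gene ID sits in a different column of the motif-hit file, `hit_gene_column` can be adjusted accordingly.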
To make this program user-friendly, we implemented a graphical user interface (GUI) on top of the previous scripts. Running the GUI script opens a window where the user types the motif sequence of interest, and the GUI displays the same output as Step 4.
SCRIPT = md_gui.py