A lab member recently ran a ChIP-Seq experiment on Pol-II and gave me the sequencing files to annotate. I generated an annotated peak file in the following format:
PeakID Chr Start End Strand Peak Score Focus Ratio/Region Size Annotation Detailed Annotation Distance to TSS Nearest PromoterID Entrez ID Nearest Unigene Nearest Refseq Nearest Ensembl Gene Name Gene Alias Gene Description Gene Type
Pol-II-Chip.MACS2_peak_77166 chr6 74229770 74231637 + 7081 NA promoter-TSS (EEF1A1) promoter-TSS (NM_001402) 52 EEF1A1 1915 Hs.745122 XM_005248666 ENSG00000156508 EEF1A1 CCS-3|CCS3|EE1A1|EEF-1|EEF1A|EF-Tu|EF1A|GRAF-1EF|HNGC:16303|LENG7|PTI1|eEF1A-1 eukaryotic translation elongation factor 1 alpha 1 protein-coding
However, he wants to generate a metagene plot of only the exon regions. I tried simply sorting the file and removing all the rows that were not exons, but after I did this my Python script could no longer detect peak-file-formatted lines. Is there a way to select only the exon annotations from a txt file like this? I've realized that any sorting done in Excel throws my script off and prevents it from reading the peak-file-formatted lines.
If this isn't possible, how could I get only exon annotations from a .narrowPeak file?
Could you clarify a little, for example by stating the hypothesis you want to test, such as "Pol-II will be enriched at the start of each exon" vs. "Pol-II will be enriched in exons vs. introns"? I think I could help more if you made your goals clearer.
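In the meantime, on the narrower question of keeping only the exon rows without going through Excel: one option is to filter the file line by line in Python and write the kept lines back out unchanged, so the formatting your script expects is never touched. The sketch below is only that, a sketch. It assumes the file is tab-delimited, that the header contains a column named exactly `Annotation`, and that exon peaks have an annotation beginning with `exon` (the sample row above begins with `promoter-TSS`, so this seems plausible, but check it against your own file). The function name and the file names in the usage comment are made up for illustration.

```python
def filter_exon_peaks(in_path, out_path, column="Annotation", prefix="exon"):
    """Write the header plus every row whose `column` value starts with `prefix`.

    Assumptions (not confirmed by the original post): the file is
    tab-delimited and its first line is the header.
    Returns the number of data rows kept.
    """
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        header = src.readline()
        dst.write(header)  # keep the header line exactly as it was
        names = header.rstrip("\n").split("\t")
        idx = names.index(column)  # raises ValueError if the column is missing
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Skip short or blank lines; compare case-insensitively.
            if len(fields) > idx and fields[idx].strip().lower().startswith(prefix):
                dst.write(line)  # write the original line untouched
                kept += 1
    return kept


# Hypothetical usage (file names are placeholders):
# n = filter_exon_peaks("annotated_peaks.txt", "annotated_peaks.exons.txt")
# print(n, "exon peaks kept")
```

Because each kept line is copied verbatim, the output should stay readable by whatever parsed the original file. Note that this keeps whole peaks whose annotation is an exon; it does not trim peaks to exon boundaries. If you need the latter, or want to start from the .narrowPeak file, the usual route is to intersect the peaks with an exon BED file (for example with bedtools intersect) rather than filter by annotation.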