I downloaded a genome file and a GFF file from the NCBI FTP site. While extracting promoter sequences upstream of my genes of interest, I found a few sequences that are basically just Ns. The window I set was 1000 bp, and when I tried a larger range such as 3 kb using bedops, the same sequences were still there. Is this common? Will it affect downstream analysis using MEME, TOMTOM and enrichment? Should I keep them or remove them? If I should remove them, may I have some suggestions? Thank you very much.
For eukaryotes with complex genomes, these gaps are common. The Ns are filled in by the assembler, scaffolder, or contig orderer (or some other software, e.g., for optical map integration) when the order and orientation of contigs and scaffolds can be inferred, but there is an undetermined region between them. You will have to read the genome metadata, and possibly check the genome AGP file, to know if this is the case.
MEME will treat N as any base (see their DNA alphabet table). If you can avoid including these regions in your input, you should get cleaner search results. Starting with good MEME matrices will be useful if you go on to do TOMTOM searches for matches with published TFs, and further enrichment calculations based on those matches.
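One way to keep N-rich regions out of the MEME input is to filter the extracted promoter FASTA before running the search. The following is only a minimal sketch in plain Python: the function names (`read_fasta`, `n_fraction`, `filter_promoters`) and the 10% cutoff are illustrative assumptions, not part of MEME or bedops, and the cutoff should be tuned to your data.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq: str) -> float:
    """Fraction of the sequence that is N (case-insensitive)."""
    if not seq:
        return 1.0  # treat an empty sequence as uninformative
    return seq.upper().count("N") / len(seq)


def filter_promoters(
    records: Iterable[Tuple[str, str]],
    max_n_fraction: float = 0.1,  # hypothetical cutoff, adjust as needed
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose N content is at or below the cutoff."""
    for name, seq in records:
        if n_fraction(seq) <= max_n_fraction:
            yield name, seq


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders):
    #   python filter_n.py < promoters.fa > promoters.filtered.fa
    for name, seq in filter_promoters(read_fasta(sys.stdin)):
        print(f">{name}\n{seq}")
```

Sequences that are entirely Ns are dropped, while promoters with only a few ambiguous bases are kept. It may be worth recording which genes were removed, since they will be missing from any later enrichment calculation.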