Dear all,
As part of a project to identify and describe the members of a specific gene family, we used several routine steps to identify, annotate, and analyze gene sequences whose gene models were already published and available in public sequence databases. Looking at the analyses in more detail, we noticed that some of these sequences contain stretches of N's in their genomic sequences (almost all in introns). The N's make up between 4% and 22% of the whole genomic sequence, which makes me a bit uncomfortable, since these sequences were supposed to be high-confidence gene predictions (according to the genome paper that annotated them).

What would you suggest? Should we discard them, or keep them, given that the N's do not affect the protein domain characteristic of this gene family?

Thanks in advance for any tips.
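
P.S. For anyone who wants to run the same kind of check on their own sequences, here is a minimal Python sketch of how the N fraction and the positions of the N runs in a sequence can be measured. The function name `n_stats` and the toy sequence are made up for illustration, and this is not necessarily how the percentages above were obtained.

```python
import re

def n_stats(seq):
    """Return (fraction of N bases, list of (start, end) N runs) for a nucleotide string.

    Positions are 0-based and end-exclusive; soft-masked lowercase 'n' counts as N.
    """
    seq = seq.upper()
    if not seq:
        return 0.0, []
    # Each match of N+ is one uninterrupted stretch of N's
    runs = [(m.start(), m.end()) for m in re.finditer(r"N+", seq)]
    n_total = sum(end - start for start, end in runs)
    return n_total / len(seq), runs

# Toy sequence (hypothetical): 20 bases with a single run of 4 N's
toy = "ATGGCNNNNTTAGCCGTAAC"
frac, runs = n_stats(toy)
print(f"{frac:.1%} N, runs at {runs}")
```

Comparing the run coordinates with the exon and intron coordinates of the gene model would show whether any N stretch touches the coding sequence.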