Assalam o alaikum everyone,
I have extracted a CDS sequence from the dog whole-genome sequence, which was downloaded from NCBI. The CDS comes in parts, as shown below, e.g.:
$ cat FABP5_CDS
>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA
I then processed and concatenated these parts, and found that the CDS does not start with a start codon (ATG). This is due to the difference between the 0-based and 1-based coordinate systems (BED is 0-based; my BAM file is 1-based).
I have to add one base at the start of my CDS part, e.g.:
Before adding one base:
>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA
After adding one base (A), so that it now starts with ATG:
>chr29:28363101-28363137
ATGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA
My question is: should I add one base at the start of each CDS part, or only at the start of the first CDS part? I'm very confused. Any idea how to fix it?
What genome build is this from?
According to Ensembl, FABP5 is a pseudogene in dog (CanFam3.1) with 3 exons.
FABP5 is not listed as a pseudogene in the information NCBI gives for the dog genome.
Are you certain those are CDS features? They don't start with canonical start codons, nor do they look like they all have stop codons.
Yes, I'm certain about it. In the above example, all the sequences are parts of a single CDS, and there is a stop codon (TAA) at the end of the last part.
Oh, it's a single CDS? That's an odd way to depict the sequence. Then, to answer your question, you should only add an A to the first part of the sequence, where the ATG would be.
Yes, it's a single CDS. I extracted these sequences from the whole genome. Actually, I'm confused because of these parts. I have a coordinates file for extracting a whole CDS, like below, and this file format is 1-based.
My point is: why add a base only for the first coordinate, and not for all the parts?
Because if, as you say, each sequence is PART of the CDS and not the CDS itself, then only the first part carries the start codon, since genes start with an ATG. If the 0-based numbering affects the later sequences too, you don't know which base needs adding, so you can't just put an A in there. You have an additional problem: if they have all had their first base deleted, you won't know what to replace it with, and if your sequences aren't each a multiple of 3, there will be frameshifts too.
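A quick way to check the frame concerns raised here is to join the parts and look at the result as a whole. The sketch below is only an illustration: `check_cds` and the toy parts are made up for this example and are not real dog sequence. Note that for a complete CDS it is the joined sequence whose length should be a multiple of 3; individual parts can split a codon across an intron.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(parts):
    """Join CDS parts in order and report basic reading-frame checks."""
    cds = "".join(parts).upper()
    return {
        "length": len(cds),
        "starts_with_ATG": cds.startswith("ATG"),
        "multiple_of_3": len(cds) % 3 == 0,
        "ends_with_stop": cds[-3:] in STOP_CODONS,
    }

# Toy parts (hypothetical, not real sequence). The parts themselves are
# 5, 4 and 3 bases long, but the joined CDS is 12 bases.
toy_parts = ["ATGAA", "ACCC", "TAA"]
report = check_cds(toy_parts)
print(report)

# The same parts with the first base of the first part missing.
broken = check_cds(["TGAA", "ACCC", "TAA"])
print(broken)
```

If the joined sequence fails the multiple-of-3 check after only the first part has been patched, that would hint that other parts are also missing a base.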
Thank you for the reply.
Actually, which base to add is not the problem, because we can find the correct base by changing the first coordinate, e.g.
the first coordinate is 28491275 -> 28491274, so by reducing it by one we can find the correct base. I have tested this for the ATG and it is always A, so I put an A there.
But I'm not clear on whether I should add one base for the other parts or not. Do you have any idea how I can test it?
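One possible test is to compare the span given in each FASTA header with the length of the sequence under it. This is a minimal sketch, assuming the headers hold 1-based inclusive coordinates in the form `>chrom:start-end`; the helper name `expected_vs_actual` is made up for this example. The two records are the first and last parts posted above.

```python
import re

def expected_vs_actual(header, seq):
    """Return (expected_length, actual_length) for one FASTA record.

    Assumes the header is '>chrom:start-end' with 1-based inclusive
    coordinates, so the expected length is end - start + 1.
    """
    m = re.match(r">?([^:]+):(\d+)-(\d+)$", header.strip())
    if m is None:
        raise ValueError("unexpected header format: " + header)
    start, end = int(m.group(2)), int(m.group(3))
    return end - start + 1, len(seq)

# First and last parts as posted in this thread.
records = [
    (">chr29:28363101-28363137", "TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG"),
    (">chr29:28492441-28492494",
     "AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA"),
]

results = [expected_vs_actual(h, s) for h, s in records]
for (header, _), (expected, actual) in zip(records, results):
    print(header, "expected", expected, "got", actual,
          "missing", expected - actual)
```

Under that assumption, both of these parts come out one base shorter than their header span (37 vs 36 and 54 vs 53), which would suggest the offset is not limited to the first part. Running the same check on every record shows which parts are affected.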
I'm still not really seeing the problem - my apologies. Maybe I'm being really stupid.
I'm not sure I can really help you unless you know a priori whether or not the off-by-one error affects them all. It might be easy to fix, depending on your dataset.
The last sequence in your example is around 100 kilobases away from the first sequence in the sample, so it seems pretty unlikely to me that they're part of the same CDS. I'm no eukaryote expert, but that seems like a lot even taking introns into account.
Do you have a fasta sequence of the whole, uninterrupted sequence we can see so that we can understand what these sequences represent?
I generated a consensus FASTA from a BAM file. My BAM file is aligned against CanFam3.1, so I downloaded the CanFam3.1 annotation file from NCBI and used its coordinates for CDS extraction.
If this is a published genome, why not just download the GFF or GenBank file and extract the CDSs from that as one continuous sequence?
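For what it's worth, pulling a CDS out as one continuous sequence from 1-based inclusive coordinates (the convention GFF uses) can be sketched in a few lines. Everything here is illustrative: `extract_cds`, `revcomp`, the toy chromosome and the intervals are hypothetical and not taken from any real annotation. The key step is that a 1-based inclusive interval `start..end` corresponds to the Python slice `[start - 1:end]`, and that this shift applies to every interval, not only the first.

```python
def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]

def extract_cds(chrom_seq, intervals, strand="+"):
    """Join CDS parts into one sequence.

    intervals: list of (start, end) pairs, 1-based inclusive.
    Each start is shifted by one to get a 0-based Python slice.
    """
    parts = [chrom_seq[start - 1:end] for start, end in sorted(intervals)]
    cds = "".join(parts)
    return revcomp(cds) if strand == "-" else cds

# Toy chromosome: three coding parts separated by filler bases.
toy_chrom = "CC" + "ATGAA" + "GTTTT" + "ACCC" + "GG" + "TAA" + "CC"
toy_intervals = [(3, 7), (13, 16), (19, 21)]  # 1-based inclusive

cds = extract_cds(toy_chrom, toy_intervals)
print(cds)

# Forgetting the shift drops the first base of every part.
shifted = "".join(toy_chrom[start:end] for start, end in toy_intervals)
print(shifted)
```

A tool that expects BED input reads the start column as 0-based, so feeding it unmodified 1-based starts would give the second result rather than the first.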
No, this is not a published genome.
But you said you downloaded it from NCBI?
Ohhh, sorry for that :o The above example is not from a published genome.
But I have also tried it with the dog genome that is available on NCBI, and I get the same problem with the published genome.