Assalam o aliakum everyone,
I have a BAM file of a dog genome and have generated a consensus FASTA from it. The BAM is aligned against CanFam3.1, so I used the CanFam3.1 annotation file (GFF3) from NCBI to extract the CDS from the consensus FASTA. First, I fetched the coordinates of my gene.
Sample coordinates for a single CDS:
NC_006611.3 Gnomon CDS 28363101 28363137 . + 0 ID=cds39781;Parent=rna47080;Dbxref=GeneID:477923,Genbank:XP_013964800.1;Name=XP_013964800.1;gbkey=CDS;gene=FABP5;product=fatty acid-binding protein%2C epidermal;protein_id=XP_013964800.1
NC_006611.3 Gnomon CDS 28491275 28491447 . + 2 ID=cds39781;Parent=rna47080;Dbxref=GeneID:477923,Genbank:XP_013964800.1;Name=XP_013964800.1;gbkey=CDS;gene=FABP5;product=fatty acid-binding protein%2C epidermal;protein_id=XP_013964800.1
NC_006611.3 Gnomon CDS 28491806 28491907 . + 0 ID=cds39781;Parent=rna47080;Dbxref=GeneID:477923,Genbank:XP_013964800.1;Name=XP_013964800.1;gbkey=CDS;gene=FABP5;product=fatty acid-binding protein%2C epidermal;protein_id=XP_013964800.1
NC_006611.3 Gnomon CDS 28492441 28492494 . + 0 ID=cds39781;Parent=rna47080;Dbxref=GeneID:477923,Genbank:XP_013964800.1;Name=XP_013964800.1;gbkey=CDS;gene=FABP5;product=fatty acid-binding protein%2C epidermal;protein_id=XP_013964800.1
I used the above coordinates to fetch the corresponding sequences from the consensus FASTA.
Sample sequences for a single CDS:
>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAAACTGAGAGCACTTTGAAAACAACACAGTTTTCGTGTAATCTGGGAGAGAAGTTTGAAGAAACTACAGCTGATGGCAGAAAAACTCAG
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA
I will concatenate these parts of the CDS later, but as you can see in the example above, the base A of the start codon (ATG) is missing. How can I fix it?
I have several questions (I can't work out where the problem actually is):
Did this happen because of the 0-based/1-based coordinate systems?
Should I add one base (off-by-one) at the start of each starting coordinate? (I checked this for the first coordinate only: when I reduce the start coordinate by 1, I get the base A.)
Should I reduce the start coordinate by one for each part of the CDS?
How can I check whether my BAM file is 0-based or 1-based?
How? How did you get the FASTA sequences?
A BAM file stores positions 0-based internally.
A SAM file is always 1-based.
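To make the two conventions concrete, here is a minimal sketch on a made-up toy sequence (the helper name `one_based_to_zero_based` is hypothetical, not part of any library). GFF3 and SAM text use 1-based, fully inclusive coordinates, while BED and BAM use 0-based, half-open ones, so only the start shifts by one when converting:

```python
def one_based_to_zero_based(start, end):
    """Convert a 1-based inclusive interval (GFF3, SAM text)
    to a 0-based half-open interval (BED, BAM)."""
    return start - 1, end


# Toy sequence (invented for illustration): 'ATG' sits at 1-based positions 3..5.
chrom = "CCATGACT"
gff_start, gff_end = 3, 5

# Correct: shift the start by one, keep the end.
bed_start, bed_end = one_based_to_zero_based(gff_start, gff_end)
correct = chrom[bed_start:bed_end]

# Wrong: using the 1-based numbers directly as a 0-based half-open interval
# silently drops the first base, which matches the symptom in the question.
shifted = chrom[gff_start:gff_end]

print(correct)  # ATG
print(shifted)  # TG
```

The same arithmetic explains why the first extracted piece in the question is 36 bases long, although 28363101-28363137 spans 37 bases when both ends are counted.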
Sorry for the late reply!
I fetched columns 4 and 5 from the GFF3 (annotation) file, made a BED6 file, and then used bedtools getfasta to get the FASTA sequences.
I downloaded the BAM file and then generated the consensus FASTA from it using samtools. What is the format of my FASTA file now?
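A possible sketch of the GFF3-to-BED6 step described above, with the start coordinate reduced by one for every CDS part. A FASTA file itself carries no coordinate system; the convention belongs to the interval file handed to the extraction tool, and BED is 0-based, half-open. The function name `gff3_cds_to_bed6`, the shortened attribute strings, and the `NC_006611.3` to `chr29` mapping (inferred from the FASTA headers in the question) are assumptions for illustration:

```python
def gff3_cds_to_bed6(gff_lines, chrom_map=None):
    """Turn GFF3 CDS rows (1-based, inclusive) into BED6 rows (0-based, half-open)."""
    chrom_map = chrom_map or {}
    bed_rows = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # keep only well-formed CDS features
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
        attr = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        name = attr.get("gene", attr.get("ID", "."))
        # Only the start shifts by one; the end stays the same.
        bed_rows.append(
            (chrom_map.get(seqid, seqid), int(start) - 1, int(end), name, 0, strand)
        )
    return bed_rows


# CDS coordinates from the question; attributes shortened for readability.
gff = [
    "NC_006611.3\tGnomon\tCDS\t28363101\t28363137\t.\t+\t0\tID=cds39781;gene=FABP5",
    "NC_006611.3\tGnomon\tCDS\t28491275\t28491447\t.\t+\t2\tID=cds39781;gene=FABP5",
    "NC_006611.3\tGnomon\tCDS\t28491806\t28491907\t.\t+\t0\tID=cds39781;gene=FABP5",
    "NC_006611.3\tGnomon\tCDS\t28492441\t28492494\t.\t+\t0\tID=cds39781;gene=FABP5",
]

bed = gff3_cds_to_bed6(gff, chrom_map={"NC_006611.3": "chr29"})
for row in bed:
    print("\t".join(map(str, row)))

# Sanity check: a complete CDS should have a length divisible by 3.
total = sum(end - start for _chrom, start, end, *_rest in bed)
print(total, total % 3)  # 366 0
```

With the corrected starts the four parts add up to 366 bases (122 codons), whereas the uncorrected intervals give 362, which is not a multiple of 3. The printed rows could be written to a file and passed to bedtools getfasta; for genes on the minus strand, its `-s` option reverse-complements each feature, and the parts would also need to be joined in reverse order.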