Hi all, I have a somewhat basic question.
The inputs for my analysis are the reference genome (hg38) and my sample's VCF file.
I extracted the CDS regions of a gene from the hg38 GFF file.
For example,
128937678-128937818
128936538-128936649
128935037-128935044
128937678-128937818
128936538-128936649
128935591-128935700
128937678-128937818
After that, I extracted the consensus sequence of each CDS region from my sample's VCF file.
Ultimately, I want to get the amino acid sequence.
I wonder whether the nucleotide sequences of the CDS regions extracted above can be concatenated and translated into an amino acid sequence.
For example,
128937678-128937818 -> GAAGTG
128936538-128936649 -> GAGGCATCTCTGA
128935037-128935044 -> GAGCGAG
128937678-128937818 -> ATCTTCGG
128936538-128936649 -> CCTTCGATG
128935591-128935700 -> TTGACAACATCT
128937678-128937818 -> AGCATTTCCTC
Combination -> GAAGTGGAGGCATCTCTGAGAGCGAGATCTTCGGCCTTCGATGTTGACAACATCTAGCATTTCCTC -> Convert to amino acid sequence
Can I get the amino acid sequence like this?
If the GFF format is correct, try gffread with -y (-y: write a protein FASTA file with the translation of the CDS for each record).
Do you have to do this for only a few CDS or for plenty of them?
If only a few: copy-paste the DNA sequence into a translation tool (e.g. EMBOSS transeq).
Your example, however, does not really look like a valid CDS (it does not start with an ATG, for instance).
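If there turn out to be more than a handful, the join-and-translate step can also be scripted. Below is a minimal Python sketch, not a definitive implementation. It assumes each segment is given as (genomic start, forward-strand sequence) and that the strand comes from column 7 of the GFF. The names join_cds and translate are made up for illustration, and the standard genetic code (NCBI table 1) is assumed. For a minus-strand gene the joined sequence has to be reverse-complemented before translation.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T/N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def join_cds(segments, strand="+"):
    """Join CDS segments into one coding sequence.

    segments: list of (genomic_start, forward_strand_sequence) tuples
              (hypothetical input layout, one tuple per CDS row).
    strand:   "+" or "-", as given in the GFF.
    """
    # Sort by genomic position, concatenate, then flip for minus-strand genes.
    joined = "".join(seq.upper() for _, seq in sorted(segments))
    return reverse_complement(joined) if strand == "-" else joined


def translate(cds):
    """Translate codon by codon; '*' is a stop, 'X' an unknown codon.

    A trailing incomplete codon (1 or 2 bases) is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Tiny usage example with made-up sequences:
print(translate("ATGGCCTAA"))                               # MA*
print(translate(join_cds([(200, "CAT"), (100, "TTA")], "-")))  # M*
```

If the translation does not start with M or contains internal stops, the segment order, strand, or reading frame is probably off.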
This is theoretically not too difficult to do, but since these are discontinuous ranges, I'm guessing they've had the introns removed?
How do you define where the first real CDS starts and ends, and where the next one begins, if all of your data looks like that?
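One way to settle that is to keep the transcript ID next to each range. A rough Python sketch, assuming the input is standard 9-column GFF3 where each CDS row carries a Parent attribute; the function name cds_by_transcript and the IDs in the example are invented for illustration, and rows with several comma-separated parents are not split here:

```python
from collections import defaultdict


def cds_by_transcript(gff_lines):
    """Group CDS intervals by their Parent transcript.

    Returns {parent_id: (strand, [(start, end), ...])} with the intervals
    in 5'->3' order (ascending for '+', descending for '-').
    """
    groups = defaultdict(list)
    strands = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS feature rows are of interest
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parent = attrs.get("Parent", "unknown")
        groups[parent].append((int(cols[3]), int(cols[4])))
        strands[parent] = cols[6]
    return {
        parent: (strands[parent],
                 sorted(ivs, reverse=(strands[parent] == "-")))
        for parent, ivs in groups.items()
    }


# Example rows using coordinates from the question; the chromosome name,
# strand, and transcript IDs (tx1, tx2) are placeholders, not real values.
example = [
    "chrN\tsrc\tCDS\t128936538\t128936649\t.\t-\t0\tID=c2;Parent=tx1",
    "chrN\tsrc\tCDS\t128937678\t128937818\t.\t-\t0\tID=c1;Parent=tx1",
    "chrN\tsrc\tCDS\t128935037\t128935044\t.\t-\t2\tID=c3;Parent=tx1",
    "chrN\tsrc\tCDS\t128937678\t128937818\t.\t-\t0\tID=c4;Parent=tx2",
    "chrN\tsrc\tCDS\t128935591\t128935700\t.\t-\t0\tID=c5;Parent=tx2",
]
for parent, (strand, intervals) in cds_by_transcript(example).items():
    print(parent, strand, intervals)
```

Each group then corresponds to one coding sequence, so the ranges of different transcripts are never mixed when they are joined.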