Question

How to extract coding sequences (CDS) from a BAM file or VCF file?

2

5.9 years ago

sunnykevin97 ▴ 990

Hi, is it possible to extract the CDS (coding sequences) from an aligned BAM file or from a VCF file? If not, what is the best way to extract the CDS from a WGS dataset?

I'm interested in a positive selection scan that compares different subgroups.

Suggestions please.

alignment SNP • 4.7k views

ADD COMMENT • link updated 5.2 years ago by anna11 ▴ 20 • written 5.9 years ago by sunnykevin97 ▴ 990

1

If you have a reference FASTA, a corresponding annotation file with CDS coordinates, and a VCF, you can use getfasta from the bedtools suite to get the CDS sequence. You can also use the bcftools consensus command to build a sequence from the VCF. samtools or bamutils can help you extract regions of interest from the BAM.

ADD REPLY • link 5.9 years ago by cpad0112 21k
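
To make the coordinate handling in the reply above concrete, here is a minimal Python sketch of interval extraction from a FASTA-like sequence using BED coordinates. It is not bedtools itself; the function names and the toy genome are hypothetical, and it assumes BED6-style lines (0-based, half-open). Note that bedtools getfasta only reverse-complements minus-strand features when run with `-s`, whereas this sketch always honours the strand column.

```python
# Sketch of BED-based sequence extraction. All names and data are illustrative.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_bed(lines):
    """Yield (chrom, start, end, name, strand) from tab-separated BED lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        yield chrom, start, end, name, strand


def extract_intervals(genome, bed_lines):
    """Return {name: sequence} for each BED interval (0-based, half-open)."""
    out = {}
    for chrom, start, end, name, strand in parse_bed(bed_lines):
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        out[name] = seq
    return out


# Toy data: one chromosome and the same interval on both strands.
genome = {"chr1": "AAATGGCCTAAGGG"}
bed = [
    "chr1\t2\t11\tgeneA_cds\t0\t+",
    "chr1\t2\t11\tgeneA_rev\t0\t-",
]
cds = extract_intervals(genome, bed)
print(cds)
```

For a multi-exon gene, the per-exon pieces still have to be concatenated in transcript order before they form a complete CDS.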

0

Thanks for the suggestions. I don't have an annotation file with CDS for all genomes (only for the reference genome). This is how I started:

1) I downloaded BAM files of the different subgroups, called variants using GATK4, and generated VCFs.
2) As of now, I have only one BED file with CDS coordinates, for my reference genome. Using bedtools, I extracted the CDS for the reference genome.
3) I don't have annotation files for the other genomes. How should I proceed with further analysis?

"I need to extract the CDS from 22 subgroup genomes; I have only BAM files for all of these genomes."

My plan:

1) Extract the CDS from the 22 different subgroup populations.
2) Perform an MSA across these CDS.
3) Feed the MSA alignment file to PAML to estimate the dN/dS ratio and build a positive selection scan model.

Suggestions please.

ADD REPLY • link 5.9 years ago by sunnykevin97 ▴ 990
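
One possible way around the missing annotations, sketched below under explicit assumptions: if all subgroup BAMs were aligned to the same reference and each per-sample consensus contains only SNPs (no indels), every consensus stays in reference coordinates. The reference CDS coordinates then apply to each sample, and the extracted CDS are already position-aligned. The sample names, sequences, and exon coordinates here are made up, and only plus-strand genes are handled.

```python
# Sketch: slice the same reference CDS coordinates out of per-sample
# consensus sequences. Assumes SNP-only consensus, so lengths match.


def join_cds(chrom_seq, exons):
    """Concatenate 0-based, half-open exon intervals (plus strand only)."""
    return "".join(chrom_seq[start:end] for start, end in sorted(exons))


def build_alignment(consensus_by_sample, exons):
    """Return {sample: CDS}; fail loudly if the SNP-only assumption breaks."""
    aln = {name: join_cds(seq, exons) for name, seq in consensus_by_sample.items()}
    lengths = {len(seq) for seq in aln.values()}
    if len(lengths) != 1:
        raise ValueError("CDS lengths differ: indels present, a real MSA is needed")
    (length,) = lengths
    if length % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return aln


def to_fasta(aln):
    """Serialise the alignment as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in aln.items())


# Toy gene with two exons separated by a 2 bp 'intron'.
exons = [(2, 8), (10, 16)]
consensus = {
    "reference": "GGATGGCCGTTTTTAAGG",
    "subgroup1": "GGATGGCAGTTTTTAAGG",  # SNP at 0-based index 7 (C>A)
    "subgroup2": "GGATGTCCGTTTTTAAGG",  # SNP at 0-based index 5 (G>T)
}
aln = build_alignment(consensus, exons)
print(to_fasta(aln), end="")
```

If indels are included in the consensus, the coordinates shift and this shortcut no longer holds; a codon-aware aligner would be needed before running PAML.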

0

Does the BAM file have to be converted into a FASTA file first in order to use getfasta? Or will getfasta create FASTA files directly from the BAM files?

ADD REPLY • link 5.2 years ago by DNAngel ▴ 250

0

Hello sunnykevin97 ,

are you interested in:

variants that are located in a CDS?
the consensus sequence of the CDS, which is made by integrating called variants into the reference sequence?
something different?

In any case you need the coordinates of your CDS before you can start.

fin swimmer

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Thanks for the suggestions.

I'm looking for variants in CDS among (~22) different subgroups.

1) Extract the CDS from the different subgroup populations.
2) Perform an MSA across these CDS.
3) Feed the MSA alignment file to PAML to estimate the dN/dS ratio and build a positive selection scan model.

Is my approach correct, or is there a simpler way to do it?

ADD REPLY • link 5.9 years ago by sunnykevin97 ▴ 990

score 2 · Answer 1 · 2019-10-02

You can use FastaAlternateReferenceMaker from GATK. It reads in a VCF and a reference genome and outputs a FASTA sequence that is the reference genome sequence synthetically mutated to include the variants in the VCF. Obviously this is not a "real" sequence; it is only as good as the variant caller that produced the VCF.
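
The idea of a "synthetically mutated" reference can be illustrated with a small Python sketch. This is not GATK's implementation; it is a simplified stand-in that applies only single-base substitutions, skips indels and multi-allelic records, and ignores genotypes entirely (so heterozygous calls are treated like homozygous ones). The function names and the toy VCF lines are hypothetical.

```python
# Sketch: apply SNPs from VCF-style lines to a reference sequence.
# Simplified illustration only; real tools handle many more cases.


def parse_vcf_snps(vcf_lines):
    """Yield (chrom, pos_1based, ref, alt) for single-base, single-ALT records."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
            yield chrom, int(pos), ref, alt


def make_alternate_reference(genome, vcf_lines):
    """Return a copy of `genome` ({chrom: sequence}) with the SNPs applied."""
    mutated = {chrom: list(seq) for chrom, seq in genome.items()}
    for chrom, pos, ref, alt in parse_vcf_snps(vcf_lines):
        idx = pos - 1  # VCF positions are 1-based
        if mutated[chrom][idx].upper() != ref.upper():
            raise ValueError(f"REF allele mismatch at {chrom}:{pos}")
        mutated[chrom][idx] = alt
    return {chrom: "".join(bases) for chrom, bases in mutated.items()}


# Toy data: one SNP that is applied and one insertion that is skipped.
genome = {"chr1": "AAATGGCCTAAGGG"}
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t8\t.\tC\tA\t50\tPASS\t.",
    "chr1\t10\t.\tA\tAT\t50\tPASS\t.",
]
alt_genome = make_alternate_reference(genome, vcf)
print(alt_genome["chr1"])
```

Because only substitutions are applied, the output has the same length as the reference, so the reference CDS coordinates remain valid for it.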

score 1 · Answer 2 · 2019-09-28

1

5.2 years ago

DNAngel ▴ 250

Have you figured this out? I am also trying to extract CDS from my BAM files, and I have the .gff file for my reference genome. I just cannot figure out how to convert my BAM files into appropriate FASTA files to use bedtools properly...

ADD COMMENT • link 5.2 years ago by DNAngel ▴ 250