Hi everyone,
I have an alignment of genome sequencing reads mapped to a reference genome. I created a VCF file from this alignment and filtered it on PHRED quality score (>20). Now I want to extract every CDS sequence annotated on the reference genome, but carrying the variants present in my individual mapped to this genome.
I have a GFF3 file with the annotations, the FASTA file of the reference genome and a VCF file of my individual's variants.
I have seen similar questions in which people used bedtools getfasta to extract sequences, but it only returns sequences exon by exon and does not concatenate them into a full CDS sequence (this tool seems fine for extracting transcript sequences, but not CDS).
Does anyone have an idea how to do this? Should I first create a whole-genome consensus sequence from the alignment and then use a tool that extracts CDS sequences using this consensus as the reference? (And which tool can do this properly?)
Thanks a lot,
Maxime Policarpo
Just curious, why would you need to do this?
I would like to build multiple alignments of each CDS with the reference sequence and the alternate sequence (for downstream analyses such as dN/dS, analysis of non-synonymous mutations...).
I'm not sure it is what you want, but I have a Perl script that extracts CDS properly (it concatenates the segments). It is called
gff3_sp_extract_sequences.pl
and you can find it in the GAAS repository. It extracts CDS by default.
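For readers who want to see what concatenating CDS segments involves, here is a minimal Python sketch. It is not the GAAS script and does not reproduce its behaviour; the function names (parse_cds_segments, build_cds) and the toy genome and annotation are made up for illustration. It assumes each CDS line has a single Parent attribute, and it ignores the phase column.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_cds_segments(gff3_lines):
    """Group CDS features by their Parent attribute (assumes one parent each)."""
    segments = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        # Store (seqid, start, end, strand); GFF3 coordinates are 1-based, inclusive.
        segments[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return segments


def build_cds(genome, segments):
    """Concatenate the CDS segments of each parent into one sequence."""
    cds = {}
    for parent, segs in segments.items():
        segs = sorted(segs, key=lambda s: s[1])  # ascending genomic order
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in segs)
        if segs[0][3] == "-":
            # Reverse-complementing the whole concatenation restores transcript order.
            seq = reverse_complement(seq)
        cds[parent] = seq
    return cds


# Toy data, hypothetical: one chromosome and two transcripts.
GENOME = {"chr1": "AAATGGCCTTTAAGGGTAACCC"}
GFF3 = [
    "##gff-version 3",
    "chr1\ttoy\tCDS\t3\t5\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "chr1\ttoy\tCDS\t12\t14\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "chr1\ttoy\tCDS\t17\t19\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "chr1\ttoy\tCDS\t6\t8\t.\t-\t0\tID=cds2;Parent=mRNA2",
    "chr1\ttoy\tCDS\t20\t22\t.\t-\t0\tID=cds2;Parent=mRNA2",
]

cds = build_cds(GENOME, parse_cds_segments(GFF3))
for name, seq in cds.items():
    print(name, seq)
```

On the toy data this gives ATGAAGTAA for mRNA1 and GGGGGC for mRNA2. On a real genome you would load the FASTA with a proper parser rather than a dictionary literal.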
Well, this is not very convenient, because any indel in the individual mapped to the genome will shift the coordinates, so the GFF3 annotation will no longer be in sync with the consensus sequence.
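One possible way around the coordinate shift, sketched here in Python under several assumptions, is to never build a whole-genome consensus: cut each CDS segment out of the reference first, then apply only the variants that fall inside that segment, working from right to left so an indel cannot shift a position that still has to be edited. The names apply_variants and variant_cds and the toy data are hypothetical. The sketch assumes homozygous, non-overlapping variants with a single ALT allele, given as (1-based position, REF, ALT) as in a VCF, and it simply skips variants that span a segment boundary.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def apply_variants(segment_seq, segment_start, variants):
    """Apply variants to one segment; positions are reference coordinates."""
    seq = segment_seq
    # Descending position order: edits on the right never move bases on the left.
    for pos, ref, alt in sorted(variants, reverse=True):
        offset = pos - segment_start
        if offset < 0 or offset + len(ref) > len(segment_seq):
            continue  # variant is not fully inside this segment
        if seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


def variant_cds(chrom_seq, segments, variants, strand="+"):
    """Build the CDS of one transcript carrying the individual's variants."""
    parts = []
    for start, end in sorted(segments):
        ref_segment = chrom_seq[start - 1:end]  # GFF3 is 1-based, inclusive
        parts.append(apply_variants(ref_segment, start, variants))
    cds = "".join(parts)
    if strand == "-":
        cds = cds.translate(COMPLEMENT)[::-1]
    return cds


# Toy data, hypothetical: one chromosome, one transcript with three CDS segments.
REFERENCE = "AAATGGCCTTTAAGGGTAACCC"
SEGMENTS = [(3, 5), (12, 14), (17, 19)]
VARIANTS = [
    (8, "CT", "C"),     # intronic deletion, would shift a whole-genome consensus
    (13, "A", "C"),     # SNP inside the second segment
    (14, "G", "GCAT"),  # in-frame insertion inside the second segment
]

ref_cds = variant_cds(REFERENCE, SEGMENTS, [])
alt_cds = variant_cds(REFERENCE, SEGMENTS, VARIANTS)
print(ref_cds)
print(alt_cds)
```

Here the reference CDS is ATGAAGTAA and the variant CDS is ATGACGCATTAA. The intronic deletion is ignored and the GFF3 coordinates stay valid because they are only ever used against the reference.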