Extract every CDS sequences from a VCF file
2
1
Entering edit mode
5.2 years ago

Hi everyone,

I have an alignment of genome sequencing reads mapped to a reference genome. I created a VCF file from this alignment and filtered it depending on the PHRED score (>20). Now, I want to extract every CDS sequences that are annotated on the reference genome but with the variants present on my individual mapped to this genome.

I have a gff3 file with annotations, the fasta file of the reference genome and a VCF file of my individual variants.

I have seen similar questions on which people were using bedtools getfasta to extract sequences but it only returns sequences exons by exons and it does not concatenate them in a full CDS sequence (This tool seems nice to extract transcripts sequences but not CDS).

Does anyone have an idea how to do it ? Should i first create a whole genome consensus sequence from the alignment and then use a tool that extract CDS sequences using this consensus sequence as reference ? (And which tool can do it properly ?)

Thanks a lot,

Maxime Policarpo

alignment VCF CDS • 5.4k views
ADD COMMENT
0
Entering edit mode

just curious, why would you need to do this ?

ADD REPLY
0
Entering edit mode

I would like to make multiple alignments of CDS with the reference sequence and the alternate sequence (For downstream analysis such as dn/ds, analysis of non-syn mutations...)

ADD REPLY
0
Entering edit mode

I'm not sure it is what you want but I have a perl script to extract CDS properly (concatenate them). Is is called gff3_sp_extract_sequences.pl and you can find in the GAAS repository.

gff3_sp_extract_sequences.pl --fasta genome.fa --gff annotation.gff -o cds.fa

It extracts CDS per default.

ADD REPLY
0
Entering edit mode

Well this is not very convenient because any indel of the individual mapped to the genome will cause the gff3 to not be phased anymore

ADD REPLY
4
Entering edit mode
5.2 years ago

I found a way to extract the sequences i wanted. For those who wonder how I did :

I first extracted the exon strucrure from the gff3 file and placed it on a file called "Gene.structure"

I then used bedtools getfasta to extract the corresponding sequence on the Reference genome

I then used bcftools mpileup and bcftools consensus to extract the corresponding sequence on the individual mapped.

It worked perfectly fine for every genes that I needed !

ADD COMMENT
2
Entering edit mode
5.2 years ago

i'm late for this one; I wrote something: http://lindenb.github.io/jvarkit/Biostar398854.html

it uses a vcf+gtf+ref fasta to generate all the transcripts.

not fully tested, I'm not even sure it generates what you needed :-D

 java -jar ${JVARKIT_DIST}/biostar398854.jar \
    --gtf input.gtf.gz \
    -R ref.fasta input.vcf > out.fasta
ADD COMMENT
0
Entering edit mode

Hi Pierre !

I will try that if I ever need to make this kind of analyses later and I will tell you if it works ! Thanks a lot,

Max

ADD REPLY
0
Entering edit mode

Thanks for sharing this tools. But i failed in generate the sequence, everything looks well include compile and installation, but I got an empty output.

ADD REPLY
0
Entering edit mode

sorry for that, feel free to submit an issue if you have time: https://github.com/lindenb/jvarkit/issues/

ADD REPLY

Login before adding your answer.

Traffic: 1715 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6