Question

Can gffread extract specific region sequences from vcf or fasta file?

0

Entering edit mode

5.9 years ago

MatthewP ★ 1.4k

Hello, I have NGS sequencs result(vcf and fasta) file, I download reference sequence(GFF3 format) from NCBI, this link..
Here shows a few part of GFF3 file:

##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=45949
KX254564.1      Genbank region  1       18117   .       +       .       ID=KX254564.1:1..18117;Dbxref=taxon:45949;Name=MT;gbkey=Src;genome=mitochondrion;mol_type=genomic DNA
KX254564.1      Genbank gene    1       1614    .       +       .       ID=gene-COX1;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding
KX254564.1      Genbank CDS     1       1614    .       +       0       ID=cds-ANP25568.1;Parent=gene-COX1;Dbxref=NCBI_GP:ANP25568.1;Name=ANP25568.1;gbkey=CDS;gene=COX1;product=cytochrome coxidase subunit I;protein_id=ANP25568.1;transl_table=5
KX254564.1      Genbank gene    1673    2992    .       +       .       ID=gene-COX2;Name=COX2;gbkey=Gene;gene=COX2;gene_biotype=protein_coding
KX254564.1      Genbank CDS     1673    2992    .       +       0       ID=cds-ANP25569.1;Parent=gene-COX2;Dbxref=NCBI_GP:ANP25569.1;Name=ANP25569.1;gbkey=CDS;gene=COX2;product=cytochrome coxidase subunit II;protein_id=ANP25569.1;transl_table=5
KX254564.1      Genbank gene    2998    3060    .       +       .       ID=gene-trnG(TCC);Name=trnG(TCC);gbkey=Gene;gene=trnG(TCC);gene_biotype=tRNA
KX254564.1      Genbank tRNA    2998    3060    .       +       .       ID=rna-trnG(TCC);Parent=gene-trnG(TCC);gbkey=tRNA;gene=trnG(TCC);product=tRNA-Gly
KX254564.1      Genbank exon    2998    3060    .       +       .       ID=exon-trnG(TCC)-1;Parent=rna-trnG(TCC);gbkey=tRNA;gene=trnG(TCC);product=tRNA-Gly
KX254564.1      Genbank gene    3063    3124    .       +       .       ID=gene-trnR(TCG);Name=trnR(TCG);gbkey=Gene;gene=trnR(TCG);gene_biotype=tRNA
KX254564.1      Genbank tRNA    3063    3124    .       +       .       ID=rna-trnR(TCG);Parent=gene-trnR(TCG);gbkey=tRNA;gene=trnR(TCG);product=tRNA-Arg
KX254564.1      Genbank exon    3063    3124    .       +       .       ID=exon-trnR(TCG)-1;Parent=rna-trnR(TCG);gbkey=tRNA;gene=trnR(TCG);product=tRNA-Arg

What I want: extract specific parts sequence from vcf or fasta file.
example: all tRNA, all gene, specific gene.

I google and find gffread tool and see -h option, it seems this tool can't handle my job? I hope I can get some help here, thanks.

gffread vcf gff3 • 2.7k views

ADD COMMENT • link updated 5.9 years ago by shubhra.bhattacharya ▴ 140 • written 5.9 years ago by MatthewP ★ 1.4k

0

Entering edit mode

I can write python script get those regions than get sequence from vcf file, but this needs much time, I hope there is tools already.

ADD REPLY • link 5.9 years ago by MatthewP ★ 1.4k

1

Entering edit mode

awk -F "\t" '{if($3=="tRNA")print}' file.gff > new_file.gff

ADD REPLY • link 5.9 years ago by shubhra.bhattacharya ▴ 140

0

Entering edit mode

Hi, if you need the entire information for each of those, you could easily grep them out.

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

score 1 · Accepted Answer · 2018-12-25

1

Entering edit mode

5.9 years ago

shubhra.bhattacharya ▴ 140

I am not sure about gffread but extracting specific gene or all genes etc can be done using bedtools.

https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

ADD COMMENT • link 5.9 years ago by shubhra.bhattacharya ▴ 140

0

Entering edit mode

It seems we can't get fasta file directly from NGS sequence VCF result? I actually get my fasta sequence from VCF by writting python script myself. Because we may get more than one variants at the same position in VCF.