Extracting subset from GFF3 file based on sequence headers from a FASTA file
1
1
Entering edit mode
4.5 years ago
adityabandla ▴ 30

I have a FASTA file of gene's like this

>CS1RZP_12510_67 # 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.675
ATGGCCGTCGTCACCATGAAGCAGATGCTTGATTCCGGCGTGCATTTCGG
GCATCAGACCCGCCGCTGGAACCCGAAGATGAAGCGCTACATCCTGACCG
AGCGCAACGGGATCTACATCATCGACCTGCGGCAGACGCTCACCTACGTC

I would like to extract the corresponding lines for these specific genes from a GFF3 file which contains info on other genes as well which aren't in the above FASTA.

CS1RZP_12510    Prodigal_v2.6.3 CDS 3   617 134.5   -   0   ID=1311_67;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.675;conf=100.00;score=133.83;cscore=137.59;sscore=-3.75;rscore=-3.31;uscore=-4.34;tscore=4.54;
CS1RZPR4D_68    Prodigal_v2.6.3 CDS 1000    1218    56.4    +   0   ID=1_2;partial=01;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.612;conf=100.00;score=56.40;cscore=44.46;sscore=11.94;rscore=7.30;uscore=0.10;tscore=4.54;

I am looking for an AWK solution. Grep using -wf is extremely slow

AWK FASTA GFF3 • 2.0k views
ADD COMMENT
0
Entering edit mode

can you give an example how you are going about it when using grep ?

the IDs in your fasta file seems to be a combination of the geneID in the gff and the contig they are on, correct?

Can we assume that the "# 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif " part is present in all fasta headers?

ADD REPLY
0
Entering edit mode

Can we assume that the "# 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif " part is present in all fasta headers? Yes, that's right, these are the nucleotide files from prodigal using the -d option

grep -oP '(?<=ID=).*?(?=;)' sample.fa | grep -wf - sample.gff
ADD REPLY
0
Entering edit mode

Does it have to be in a single cmdline?

first creating the list of IDs you need and then query them from your gff will be more efficient I think

ADD REPLY
0
Entering edit mode

Ideally, looking for a one-liner as this will go into a bigger script. Looking to avoid creating temp files or variables as much as possible

ADD REPLY
0
Entering edit mode

fair enough, but not making it any easier ;-)

ADD REPLY
2
Entering edit mode
4.5 years ago
adityabandla ▴ 30

Figured it out with a little help from a friend, if anyone finds this useful

awk '/^>/,NR==FNR{split(substr($1,2),a,"_");split($9,b,";");HEADER[a[1]"_"a[2]]=b[1]} \
{split($9,c,";");if($1 in HEADER && c[1] == HEADER[$1]) print $0}' sample.fa sample.gff
ADD COMMENT

Login before adding your answer.

Traffic: 2445 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6