Entering edit mode
4.5 years ago
adityabandla
▴
30
I have a FASTA file of gene's like this
>CS1RZP_12510_67 # 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.675
ATGGCCGTCGTCACCATGAAGCAGATGCTTGATTCCGGCGTGCATTTCGG
GCATCAGACCCGCCGCTGGAACCCGAAGATGAAGCGCTACATCCTGACCG
AGCGCAACGGGATCTACATCATCGACCTGCGGCAGACGCTCACCTACGTC
I would like to extract the corresponding lines for these specific genes from a GFF3 file which contains info on other genes as well which aren't in the above FASTA.
CS1RZP_12510 Prodigal_v2.6.3 CDS 3 617 134.5 - 0 ID=1311_67;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.675;conf=100.00;score=133.83;cscore=137.59;sscore=-3.75;rscore=-3.31;uscore=-4.34;tscore=4.54;
CS1RZPR4D_68 Prodigal_v2.6.3 CDS 1000 1218 56.4 + 0 ID=1_2;partial=01;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.612;conf=100.00;score=56.40;cscore=44.46;sscore=11.94;rscore=7.30;uscore=0.10;tscore=4.54;
I am looking for an AWK solution. Grep using -wf is extremely slow
can you give an example how you are going about it when using grep ?
the IDs in your fasta file seems to be a combination of the geneID in the gff and the contig they are on, correct?
Can we assume that the "# 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif " part is present in all fasta headers?
Can we assume that the "# 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif " part is present in all fasta headers? Yes, that's right, these are the nucleotide files from prodigal using the -d option
Does it have to be in a single cmdline?
first creating the list of IDs you need and then query them from your gff will be more efficient I think
Ideally, looking for a one-liner as this will go into a bigger script. Looking to avoid creating temp files or variables as much as possible
fair enough, but not making it any easier ;-)