Question

How to extract genome files based on genome ID

1

4.7 years ago

Bioinfonext ▴ 470

I have a list of good-quality genome IDs for 54,000 genomes, like below:

I also have all 74,000 genome sequence files, compressed, in the fna/ folder, like below:

cd fna/

G001284865.fna.bz2  G002910165.fna.bz2  G009390615.fna.bz2
G001284885.fna.bz2  G002910195.fna.bz2  G009390655.fna.bz2

Could you please help me extract the 54,000 genome sequence files that match the above genome IDs from the fna/ folder?

linux R BASH • 2.4k views

ADD COMMENT • link updated 4.7 years ago by bas1993 ▴ 60 • written 4.7 years ago by Bioinfonext ▴ 470

0

bunzip2 <genome>.fna.bz2

or are you looking for a 'bash script' to process all files automatically? (if so, this is not clear from your post)

ADD REPLY • link 4.7 years ago by lieven.sterck 15k

1

Thanks lieven, I have updated the post. I want to extract 54,000 genome files, selected by genome ID, from the fna/ folder, which contains 74,000 individual genome files in compressed form.

Many thanks

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 470

0

Thanks lieven. Yes, it would be great to have a bash script that uncompresses all the files automatically.

All the compressed files are in the filtered/ folder. I am thinking of using the loop below, but I am not sure if it is correct:

for i in $(cat filtered/ ); do  bunzip2 "$i".fna.bz2; done

or can I just use

bunzip2  *.fna.bz2

Many thanks

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 470

1

The latter should normally work, and it is the simplest approach.

The bash loop will not work as written; change it to:

for i in filtered/*.fna.bz2; do bunzip2 "$i"; done
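The loop can also be written with the shell's own glob expansion rather than `ls`, which avoids word-splitting problems and does not depend on how many file names fit on one command line. Below is a minimal sketch on a throwaway directory. The scratch path and the two genome IDs are invented for the demonstration, and the demo is skipped if bzip2 is not installed.

```shell
# Demo in a scratch directory so no real data is touched (paths and IDs are hypothetical).
demo=$(mktemp -d)
mkdir "$demo/filtered"

if command -v bzip2 >/dev/null 2>&1; then
    # Create two tiny compressed FASTA files to stand in for real genomes.
    for id in G000000001 G000000002; do
        printf '>%s_contig1\nACGT\n' "$id" > "$demo/filtered/$id.fna"
        bzip2 "$demo/filtered/$id.fna"
    done

    # Let the shell expand the glob directly instead of parsing ls output;
    # quoting "$f" keeps unusual file names intact.
    for f in "$demo"/filtered/*.fna.bz2; do
        bunzip2 -k "$f"    # -k keeps the .bz2 file; drop it to replace the archive
    done

    ls "$demo/filtered"
else
    echo "bzip2 not found, skipping demo" >&2
fi
```

Dropping `-k` gives the same behaviour as the plain `bunzip2` call, where each .bz2 file is replaced by its uncompressed version.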

ADD REPLY • link 4.7 years ago by lieven.sterck 15k

lieven.sterck · Accepted Answer · 2020-08-12

2

4.7 years ago

bas1993 ▴ 60

for i in $(cat list.txt); do mv "$i".fna.bz2 fna/filtered/; done

Where list.txt is your list of high quality genomes and filtered/ is a new directory.

ADD COMMENT • link updated 4.7 years ago by lieven.sterck 15k • written 4.7 years ago by bas1993 ▴ 60

0

Thanks a lot. All the compressed genome files are in the fna/ folder; would it be possible to include the path to the fna/ folder in the command?

thanks for this help.

Many thanks

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 470

1

You can change the command line above with the full path.

 for i in $(cat list.txt); do mv fna/"$i".fna.bz2 fna/filtered/; done

And if you need to uncompress your genome files also then you can use what Lieven Sterck wrote.
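With a list of 54,000 IDs it can help to know which IDs had no matching file. The following is a sketch of the same move step that reads the list line by line and records missing IDs. The scratch directory, the IDs, and the `missing.txt` file name are all invented for the demonstration, and it assumes one ID per line in list.txt.

```shell
# Scratch setup; every path and ID below is invented for the demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/fna/filtered"
touch "$demo/fna/G000000001.fna.bz2" "$demo/fna/G000000002.fna.bz2" "$demo/fna/G000000003.fna.bz2"
printf '%s\n' G000000001 G000000003 G000000009 > "$demo/list.txt"

# Read one ID per line; move the file if it exists, otherwise log the ID.
missing=0
while IFS= read -r id; do
    [ -n "$id" ] || continue                 # skip blank lines
    src="$demo/fna/$id.fna.bz2"
    if [ -f "$src" ]; then
        mv "$src" "$demo/fna/filtered/"
    else
        echo "$id" >> "$demo/missing.txt"    # keep a record of IDs without a file
        missing=$((missing + 1))
    fi
done < "$demo/list.txt"

echo "moved: $(ls "$demo/fna/filtered" | wc -l), missing: $missing"
```

Replacing `mv` with `cp` would leave the original fna/ folder untouched, at the cost of extra disk space.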

ADD REPLY • link 4.7 years ago by bas1993 ▴ 60

0

Thank you so much, the above script worked well after I created the filtered directory within the fna/ folder.

Many thanks

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 470

0

Thank you so much for all help.

Now I need to create a BLAST database from all the fna files. Can I use this script for that?

#!/bin/bash

files=$(find . -name "*.fna")
create="cat $files > all.fna"
eval $create

makeblastdb -dbtype nucl -in all.fna -out genome_db

I am not sure whether I should keep this line in the script or not:

eval $create

Many thanks

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 470

0

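In the script that you showed above, I think you do need the line with eval, because the cat command is stored as a string in a variable and is only executed when eval runs it.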

If you use the command line below you can see what "eval" does:

help eval

But for creating a blast database with all the fna files you don't really need a script as you could also just type out the two lines that you need (the ones with cat and makeblastdb).
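The concatenation step can also be done without `eval`. Here is a minimal sketch, assuming the .fna files sit in one directory tree and the combined file is written somewhere outside it. The scratch directory and the two tiny FASTA files are invented for the demonstration.

```shell
# Scratch setup with two stand-in genomes (invented data).
demo=$(mktemp -d)
mkdir "$demo/genomes"
printf '>G1_contig1\nACGT\n' > "$demo/genomes/G1.fna"
printf '>G2_contig1\nTTGA\n' > "$demo/genomes/G2.fna"

# find passes the file names straight to cat, so no command string or eval is needed.
# Writing all.fna outside the searched directory keeps it from being read as its own input.
find "$demo/genomes" -name '*.fna' -exec cat {} + > "$demo/all.fna"

grep -c '^>' "$demo/all.fna"    # number of sequences in the combined file
```

`-exec ... {} +` batches the file names itself, so this form is not affected by the command-line length limit that a very large `*.fna` expansion could run into.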

ADD REPLY • link 4.7 years ago by bas1993 ▴ 60

0

OK, thanks. I am thinking of making a single file with the cat command below, then running makeblastdb to build the database.

cat *.fna > all.fna

makeblastdb -dbtype nucl -in all.fna   -parse_seqids   -out genome_db
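One thing worth checking before running makeblastdb with `-parse_seqids`: to my understanding, that option expects every sequence ID in the input to be unique, and contig names can repeat when many genomes are concatenated. The sketch below looks for repeated IDs first. The scratch file and its contents are invented, and the makeblastdb call only runs if the tool is installed and no duplicates are found.

```shell
demo=$(mktemp -d)
# Two stand-in genomes that happen to reuse the same contig name (invented data).
printf '>contig1 genome A\nACGT\n>contig2\nGGCC\n' >  "$demo/all.fna"
printf '>contig1 genome B\nTTGA\n'                 >> "$demo/all.fna"

# Take the first word of each header (the sequence ID) and list any that repeat.
dups=$(grep '^>' "$demo/all.fna" | cut -d' ' -f1 | sort | uniq -d)

if [ -n "$dups" ]; then
    echo "duplicate sequence IDs:"
    echo "$dups"
elif command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -dbtype nucl -in "$demo/all.fna" -parse_seqids -out "$demo/genome_db"
fi
```

If duplicates turn up, one option is to prefix each header with its genome ID before concatenating; another is to build the database without `-parse_seqids`.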

many thanks

ADD REPLY • link 4.7 years ago by Bioinfonext ▴ 470