blasting a nucleotide query file (about 2000 seq) over a group of transcriptomes files
1
1
Entering edit mode
9.1 years ago
Nano ▴ 20

I have a query file containing around 2600 nucleotide sequences. and another group of transctiptomes files (over 150 files) containing hundreds of thousands of sequences in each file. I need to blast the query file over the transcriptomes files to determine which gene in the query file is found (expressed) in the transcriptomes files. Does anyone have any ideas how I can go about it?

gene blast alignment • 2.1k views
ADD COMMENT
0
Entering edit mode

One way is by looping the blast command in a shell script

ADD REPLY
0
Entering edit mode

Do you have any idea or hints on how i can go about that..? I will really appreciate

ADD REPLY
0
Entering edit mode

Make a blast database of your 2600 sequences using makeblastdb. As TriS suggested below, you have to loop in a way it takes each fasta file from your 150 and blast it against 2600.

For eg: your file names are file1.fasta, file2.fasta...........file149.fasta, file150.fasta

for((n=1; n<=150; n=n+1));
do
    blastn -query file"$n".fasta -db <your_2600_sequence_Database> -out file"$n"_blastn.txt ;
done

* I have cut short the blast command, you may have to include other parameters also.

ADD REPLY
0
Entering edit mode
9.1 years ago
TriS ★ 4.7k

building up from Prakki suggestion, lots of good examples are available in Google for for loops

if you want to save each blast in a different file the general idea is:

for ((i=0; i<2600; i++))
  do
    blast your sequence + keep output
  done

so, to make sure, the 2600 are the sequences that you query per EACH of the 150 file. and the 150 are files containing expression data. so you want to know the expression levels of the genes you blasted in each of the 150 files?

IF that's what you want to do a general (slow) idea would be to use a double for loop, in pseudocode:

for i in 1 --> 150
  for j in 1 --> 2600
    blast sequence[j] 
    extract significant (top?) gene names
    check the expression of the gene names in file[i] 
  store in fileOut[i].txt

in this way you have 150 files containing the expression levels of each of the 2600 sequences (IF expression data are available)

ADD COMMENT
0
Entering edit mode

If we use 2600 seq to loop over 150 fasta files, we have to create that many databases also to run. Can also loop makeblastdb to create database file though. :)

ADD REPLY
0
Entering edit mode

Thanks so much guys

ADD REPLY

Login before adding your answer.

Traffic: 1829 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6