Hello, everyone! I am new to bioinformatics with biology-and-genetics background. So I desperately need a good advice.
I am trying to retrieve a "spike protein" gene from SRA file. Please let me know what I am doing wrong. And if you have suggestions on how to correct the idea, I would be happy to hear it.
I thought about the following idea:
- if I have a reference sequence of a gene (spike protein nucleotide sequence, in fasta), I will be able to find its position in the raw genomic info (from SRA), if I convert raw data to fasta and then create local blast database, and, finally, perform blast between gene and genome
Here is what I came up with to retrieve a gene sequence: 1) Downloaded S surface glycoprotein in fasta. Saved as "sequence.fasta".
2) Downloaded genome sequencing data. Saved as "SRR17592110".
3) Converted SRR file to fasta by typing:
**fastq-dump --fasta 70 SRR17592110**
4) Created a local database using:
**makeblastdb -in SRR17592110.fasta -dbtype nucl**
5) Performed blast by typing:
**blastn -query sequence.fasta -db SRR17592110.fasta -evalue 1e-6 -num_threads 4 -out blastn_out.txt**
6) Examined blastn_out.txt.
It showed me a lot of hits, but they do not correspond expected positions of the gene in the genome, and there are so many hits I am struggling to understand how to interpret it and find the gene sequence in the given genome.
Could you please give me an advice? How is it usually done when you have a raw genome file and you want to find a desired gene, reference sequence of which you have?
Thank you very much for your suggestion! I will study the options and try them out.
Could you please explain how I can get BAM file out of SRA file? This is unclear for me, to be honest
When you use a next-gen sequence aligner like
bwa
(or even Magic-BLAST) you will be able to get a SAM file after aligning the SRA data to reference genome. SAM file can then be easily converted to BAM (binary representation).thank you very much for your answer!
I don't mean to be intrusive, but I think I need to keep track of the process for those who may be interested in such question.
I performed the Magic-BLAST with S protein DNA sequence against my SRR fasta blast database and I struggle with interpreting the results. I know maybe there are too many questions, but why are here 6 sequences in the output? The interesting thing is 4 of them are the same, as well as the other 2 are the same too.
The command I used:
The output:
Interesting thing is the last 4 are exactly the same as my "sequence.fasta", reference DNA sequence of S protein.
"..." are added for brevity, so as to not share the whole sequence: it is a bit too long.
How do I interpret such an output? Is there any way I can think there is a gene finally found in the SRR?