Hello,
I aligned my fastq files to a fusion genome made by the human GRCh38 and a virus genome that I called chrV. I extracted the starting position of the reads and the ending point, so that I can extract the sequence of interest with:
samtools view -h -q 10 -f67 -F1408 $r_file $r_locus | grep -v @ | cut -f 10
where $r_file is the deduplicated bam file and $r_locus is the string that defines the position of the read, in this case "chrV:254748988-254749089"
. The string that I get out is: "GTGGGGCCGGTCGGGGTACGGCGTGAACGTGCGCCGCGCTTCATGCGGCGATCTCCAGCGTCCAGCCCCGACCGGGGTCGTGGACGACCCGATGCCCCGCG\n"
which I have saved on a file called query.fa
.
I then ran command line blastn against the dabase build upon the reference genome, then against the human genome alone and then in the remote nt collection. I was expecting a hit with the reference genome, zero on the second and a hit on the organism identified by the mapping with BWA (in this case Actinoplanes phage phiAsp2
). Instead, I got nothing:
$ blastn -db ~/refSeq/fusion/fusion38-10k.fa -query ~/Downloads/h32/confirm/query.fa
BLASTN 2.9.0+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.
Database: fusion38-10k.fa
196 sequences; 3,469,986,645 total letters
Query=
Length=101
***** No hits found *****
Lambda K H
1.33 0.621 1.12
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 256778620122
Database: fusion38-10k.fa
Posted date: Aug 13, 2019 3:54 PM
Number of letters in database: 3,469,986,645
Number of sequences in database: 196
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
$ blastn -db ~/REM_refseq/blastDB/nt -query ~/Downloads/h32/confirm/query.fa
BLASTN 2.9.0+
Database: Nucleotide collection (nt)
52,478,804 sequences; 218,532,353,722 total letters
Query=
Length=101
***** No hits found *****
Lambda K H
1.33 0.621 1.12
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 14962859207586
Database: Nucleotide collection (nt)
Posted date: Jun 15, 2019 6:40 PM
Number of letters in database: 218,532,353,722
Number of sequences in database: 52,478,804
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
$ blastn -db nt -remote -query ~/Downloads/h32/confirm/query.fa
BLASTN 2.9.0+
Database: Nucleotide collection (nt)
53,494,200 sequences; 234,541,331,245 total letters
Query=
Length=101
RID: N5H8D8TW014
***** No hits found *****
Lambda K H
1.33 0.621 1.12
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 16089245153517
Database: Nucleotide collection (nt)
Posted date: Aug 11, 2019 12:00 AM
Number of letters in database: 234,541,331,245
Number of sequences in database: 53,494,200
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
Also, it is weird that Query=
given that the file contains the entry:
$ cat ~/Downloads/h32/confirm/query.fa
GTGGGGCCGGTCGGGGTACGGCGTGAACGTGCGCCGCGCTTCATGCGGCGATCTCCAGCGTCCAGCCCCGACCGGGGTCGTGGACGACCCGATGCCCCGCG
In addition, if I run blastn online I get a series of hits (different from the expectation):
Score E Max
Sequences producing significant alignments: (Bits) Value ident
CT573213.2 Frankia alni str. ACN14A chromosome, complete sequence 45.5 0.59 96%
CP000249.1 Frankia casuarinae strain CcI3, complete genome 45.5 0.59 96%
CP041061.1 Micromonospora sp. HM134 chromosome, complete genome 44.6 0.59 91%
LR132003.1 Gouania willdenowi genome assembly, chromosome: 3 44.6 0.59 80%
...
What am I getting wrong? why I can't find the same hit with the command line version?
I see
-db ~/refSeq/fusion/fusion38-10k.fa
and-db ~/REM_refseq/blastDB/nt
in the two searches. Or are you referring tont
being used locally and via online blast page?If you click on
Algorithm parameters
option on online blast page you can see defaults blast uses for additional parameters. You may need to reproduce them locally. Most likely you have to add-task blastn-short
.That is
right, fusion38-10k.fa
is built on the reference file i used for the BWA alignment,blastDB/nt
is a local build of the human genome alone and then-db nt -remote
is in remote. I'll try-task blastn-short
even if id did not used that option online...This is entire GenBank so keep that in mind, if
So they are definitely not the same database.