Hey, I am just playing around a little with blastn, and so far it has worked quite well. I am picking random sequences from CDS regions (or close to CDS regions), and for most of them I find around 1 to 3 hits. However, for some of those sequences I find more than 70k or even 100k perfect matches in the human genome. Is this possible, or am I doing something wrong? I am using the db from the NCBI FTP server. My sequences are 20 bp long and I am looking for perfect matches (-dust no -word_size 20 should do that).
This is one of these sequences:
$ printf '>seq\nGGGGTTTCACCATGTTGGCC\n' | blastn -db GCF_000001405.39_top_level -task blastn -dust no -word_size 20 -outfmt 6 | wc
79092 949104 5306875
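If you want to double-check that the reported hits really are perfect, full-length matches, a rough sketch like the one below might help. It assumes the default -outfmt 6 column order (qseqid, sseqid, pident, length, mismatch, gapopen, ...); the function name count_perfect_hits is made up for illustration, and the query length defaults to 20.

```shell
# Count, per query, the hits that span the whole query with no mismatches or gaps.
# Assumes default -outfmt 6 columns: 1=qseqid, 4=alignment length, 5=mismatch, 6=gapopen.
# count_perfect_hits is a hypothetical helper name, not part of BLAST+.
count_perfect_hits() {
    # First argument: query length in bp (defaults to 20).
    awk -F'\t' -v qlen="${1:-20}" '
        $4 == qlen && $5 == 0 && $6 == 0 { n[$1]++ }
        END { for (q in n) printf "%s\t%d\n", q, n[q] }
    ' | sort
}
```

It reads the tabular output on stdin, so it would be used by piping the blastn command above into count_perfect_hits 20 instead of wc. Queries whose count is far above the usual 1 to 3 could then be dropped.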
Is there any way to filter those out? I also blasted against a BLAST db built from the latest Ensembl reference genome (makeblastdb -in Homo_sapiens.GRCh38.dna.primary_assembly.fa -dbtype nucl -parse_seqids) and checked here. Most of the hits are from the "normal" chromosomes. I always thought the additional contigs (like KI270755.1) were the patches, or are those appended to the ends of chromosomes 1-22, X, Y, MT?
That's really interesting. With the official db the number of hits in patches, fixes and unplaced scaffolds is much higher than with my self-created db. Do you have any idea why this could happen?
OK, I withdraw that comment. Using the official db I get 73939 hits on the main chromosomes (NC_) and 5153 on others. Using my db I get exactly the same 73939 hits on the main chromosomes, but only 102 on the others. So the main hits are the same. Why do you think these are "Patches, Fixes and unplaced scaffolds", @i.sudbery?
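For anyone who wants to reproduce that split, here is a minimal sketch of how such a tally could be done. It assumes the subject IDs in column 2 of the -outfmt 6 output are RefSeq accessions, so that assembled chromosomes (and MT) start with NC_, optionally behind a ref| prefix; with Ensembl-style names (1, 2, X, ...) the pattern would need changing. The function name count_hits_by_class is made up for illustration.

```shell
# Tally BLAST tabular (-outfmt 6) hits into main chromosomes vs. everything else.
# Assumes column 2 (sseqid) holds RefSeq accessions; NC_ marks assembled chromosomes and MT.
# count_hits_by_class is a hypothetical helper name, not part of BLAST+.
count_hits_by_class() {
    awk -F'\t' '
        NF < 2             { next }          # skip blank or malformed lines
        $2 ~ /(^|\|)NC_/   { main++; next }  # NC_000001.11 or ref|NC_000001.11|
                           { other++ }       # patches, alt loci, unplaced scaffolds, ...
        END { printf "main\t%d\nother\t%d\n", main, other }
    '
}
```

Piping the blastn output into count_hits_by_class instead of wc prints the two counts, one per line.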
To be honest, I ran the search online, at NCBI and at Ensembl, and that's all that came up in the top pages of results. And Ensembl's overview schematic didn't show any hits on the main chromosomes.
Interestingly, BLAT doesn't find any hits, but I'm not sure whether BLAT is repeat-masked.