Does this sequence really have more than 70k perfect matches in the human genome?
25 days ago
gernophil ▴ 90

Hey, I am just playing around a little with blastn and so far it has worked quite well. I am picking random sequences from CDS regions (or close to CDS regions), and for most of them I find around 1 to 3 hits. However, for some of those sequences I find more than 70k or even 100k perfect matches in the human genome. Is this possible, or am I doing something wrong? I am using the db from the NCBI FTP server. My sequences are 20 bp long and I am looking for perfect matches (-dust no -word_size 20 should do that). This is one of these sequences:

$ printf '>seq\nGGGGTTTCACCATGTTGGCC\n' | blastn -db GCF_000001405.39_top_level -task blastn -dust no -word_size 20 -outfmt 6 | wc
   79092  949104 5306875
24 days ago

It's perfectly possible. If the sequence comes from a repeat element, such as a LINE1 or an Alu (these two make up nearly 34% of the genome between them), it could appear millions of times.

That said, all the matches I can see to this sequence come from patches, fixes and unplaced scaffolds.


Any way to filter those out? I also blasted against a BLAST db built from the latest Ensembl reference genome (makeblastdb -in Homo_sapiens.GRCh38.dna.primary_assembly.fa -dbtype nucl -parse_seqids) and checked here. Most of the hits are on the "normal" chromosomes. I always thought the additional contigs (like KI270755.1) were the patches, or are those appended to the end of chromosomes 1-22, X, Y, MT?


That's really interesting. With the official db, the number of hits in patches, fixes and unplaced scaffolds is much higher than with my self-created db. Do you have any idea why this could happen?


OK, I withdraw that comment. Using the official db I get 73939 hits in the main chromosomes (NC_) and 5153 in the others. Using my db I get exactly the same 73939 hits in the main chromosomes, but only 102 in the others. So the main hits are the same. Why do you think these are "patches, fixes and unplaced scaffolds", @i.sudbery?
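
A split like the one above (main chromosomes versus everything else) can be tallied directly from the tabular output, which also gives a way to filter the extra sequences out. This is a minimal sketch, assuming default -outfmt 6 columns (subject ID in column 2) and RefSeq-style accessions, where NC_ marks the chromosomes and NT_/NW_ mark scaffolds, alternate loci and patches. The function names and the hits.tsv file name are hypothetical, and the prefix test would need changing for an Ensembl-built db, whose IDs look different.

```shell
# Tally tabular BLAST hits (-outfmt 6) by RefSeq accession prefix.
# Reads from the files given as arguments, or from stdin if none are given.
tally_hits_by_prefix() {
    awk -F '\t' '
        {
            # Assumption: RefSeq IDs start with two capitals and an underscore.
            if ($2 ~ /^[A-Z][A-Z]_/) prefix = substr($2, 1, 3)
            else                     prefix = "other"
            count[prefix]++
        }
        END { for (p in count) print p, count[p] }
    ' "$@" | sort
}

# Keep only hits whose subject is a main chromosome (NC_ accessions).
keep_main_chromosomes() {
    awk -F '\t' '$2 ~ /^NC_/' "$@"
}
```

Possible usage: save the search with `blastn ... -outfmt 6 > hits.tsv`, then run `tally_hits_by_prefix hits.tsv` for the per-prefix counts, or `keep_main_chromosomes hits.tsv | wc -l` for the chromosome-only count.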


To be honest, I ran the search online, at NCBI and at Ensembl, and that's all that came up in the top pages of results. And Ensembl's overview schematic didn't show any hits on the main chromosomes.

Interestingly, BLAT doesn't find any hits, but I'm not sure if BLAT is repeat masked.

24 days ago

I find more than 70k or even 100k perfect matches in the human genome. Is this possible or am I doing something wrong?

Yep, there are already 50288 perfect matches in hs37d5.fasta, even without counting occurrences that are split across line breaks.

$ grep -iEo '(GGCCAACATGGTGAAACCCC|GGGGTTTCACCATGTTGGCC)' hs37d5_all_chr.fasta  | wc -l
50288
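
The line-break caveat can be worked around by joining each record's wrapped sequence lines before searching. This is a minimal sketch, assuming an ordinary multi-line FASTA file with Unix line endings; `count_kmer_hits` is a hypothetical helper name, not part of any tool. Each chromosome becomes one very long line, so grep needs enough memory to hold it, and `grep -o` only reports non-overlapping matches.

```shell
# Count occurrences of the 20-mer or its reverse complement in a FASTA file,
# including occurrences that span a line break in the wrapped sequence.
count_kmer_hits() {
    fasta="$1"
    # Print each record's sequence on a single line: a header starts a new
    # output line, and sequence lines are emitted without their newline.
    awk '/^>/ { printf "\n"; next }
         { printf "%s", $0 }
         END { printf "\n" }' "$fasta" |
        grep -iEo 'GGCCAACATGGTGAAACCCC|GGGGTTTCACCATGTTGGCC' |
        wc -l | tr -d ' '
}
```

Possible usage: `count_kmer_hits hs37d5_all_chr.fasta`. The result should be at least as large as the plain grep count, since every match found line by line is still found after joining.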