Question

Does this sequence really have more than 70k perfect matches in the human genome?

8 weeks ago

gernophil ▴ 120

Hey, I am just playing around a little with blastn, and so far it has worked quite well. I am picking random sequences from CDS regions (or close to CDS regions), and for most of them I find around 1 to 3 hits. For some of those sequences, however, I find more than 70k or even 100k perfect matches in the human genome. Is this possible, or am I doing something wrong? I am using the db from the NCBI FTP server. My sequences are 20 bp long and I am looking for perfect matches (-dust no -word_size 20 should do that). This is one of these sequences:

$ printf ">seq\nGGGGTTTCACCATGTTGGCC\n" | blastn -db GCF_000001405.39_top_level -task blastn -dust no -word_size 20 -outfmt 6 | wc
   79092  949104 5306875
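
The expectation that every reported hit is a full-length perfect match can be checked directly on the tabular output. The following is a minimal sketch, assuming the default -outfmt 6 column order (qseqid, sseqid, pident, length, mismatch, gapopen, ...); the function name perfect_hits and the demo rows are made up for illustration.

```shell
# Keep only full-length, 100%-identity hits from BLAST tabular output.
# Assumed default -outfmt 6 columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
perfect_hits() {
    # $1: expected alignment length (20 for the query in this post)
    awk -F'\t' -v len="$1" '$3 == 100 && $4 == len && $5 == 0 && $6 == 0'
}

# Hypothetical demo rows (values invented): only the first row is a
# full-length hit with 100% identity, no mismatches and no gaps.
demo=$(printf '%s\n' \
    "seq	NC_000001.11	100.000	20	0	0	1	20	1000	1019	0.005	40.1" \
    "seq	NC_000002.12	95.000	20	1	0	1	20	500	519	0.1	31.0" \
    "seq	NT_187361.1	100.000	19	0	0	1	19	77	95	0.02	35.6")

# Count the rows that pass the filter.
n_perfect=$(printf '%s\n' "$demo" | perfect_hits 20 | wc -l | tr -d ' ')
echo "perfect full-length hits: $n_perfect"
```

In a real run, the blastn output would be piped into perfect_hits 20 in place of the demo rows; if the count equals the total number of lines, all hits are perfect matches.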

BLAST • 455 views

ADD COMMENT • link updated 8 weeks ago by i.sudbery 20k • written 8 weeks ago by gernophil ▴ 120

score 0 · Answer 1 · 2024-10-27

8 weeks ago

i.sudbery 20k

It's perfectly possible. If the sequence comes from a repeat element, such as a LINE1 or an Alu (these two make up nearly 34% of the genome between them), it could appear millions of times.

That said, all the matches I can see to this sequence come from patches, fixes and unplaced scaffolds.

ADD COMMENT • link 8 weeks ago by i.sudbery 20k

Is there any way to filter those out? I also blasted against a BLAST db built from the latest Ensembl reference genome (makeblastdb -in Homo_sapiens.GRCh38.dna.primary_assembly.fa -dbtype nucl -parse_seqids) and checked here. Most of the hits are on the "normal" chromosomes. I always thought the additional contigs (like KI270755.1) were the patches, or are those appended to the ends of chromosomes 1-22, X, Y, MT?

ADD REPLY • link 8 weeks ago by gernophil ▴ 120

That's really interesting. With the official db, the number of hits in patches, fixes and unplaced scaffolds is much higher than with my self-created db. Do you have any idea why this could happen?

ADD REPLY • link 8 weeks ago by gernophil ▴ 120

OK, I withdraw that comment. Using the official db, I get 73939 hits in the main chromosomes (NC_) and 5153 in the others. Using my db, I get exactly the same 73939 hits in the main chromosomes, but only 102 in the others. So the main hits are the same. Why do you think these are "patches, fixes and unplaced scaffolds", @i.sudbery?
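
A split like the one above can be produced by tallying the subject column of the tabular output. This is only a sketch, assuming RefSeq-style subject IDs in column 2, where NC_ accessions are the assembled chromosomes (plus the mitochondrial genome) and other prefixes such as NT_ and NW_ cover the remaining sequences. It would not apply as written to a db built from the Ensembl FASTA, where sequences are named 1-22, X, Y, MT and so on. The function name tally_hits and the demo rows are made up.

```shell
# Tally BLAST -outfmt 6 hits by subject accession (column 2):
# "main" for NC_ accessions, "other" for everything else.
tally_hits() {
    awk -F'\t' '
        $2 ~ /NC_[0-9]+/ { main++; next }
        { other++ }
        END { printf "main\t%d\nother\t%d\n", main, other }
    '
}

# Hypothetical demo rows (only the first two columns matter here).
demo=$(printf '%s\n' \
    "seq	NC_000001.11	100.000	20" \
    "seq	NC_000007.14	100.000	20" \
    "seq	NT_187361.1	100.000	20" \
    "seq	NW_012132914.1	100.000	20")

tally=$(printf '%s\n' "$demo" | tally_hits)
printf '%s\n' "$tally"
```

To keep only the hits on the main chromosomes instead of counting them, the same pattern can be used as a plain awk filter on column 2.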

ADD REPLY • link 8 weeks ago by gernophil ▴ 120

To be honest, I ran the search online, at NCBI and at Ensembl, and that's all that came up in the top pages of results. Ensembl's overview schematic also didn't show any hits on the main chromosomes.

Interestingly, BLAT doesn't find any hits, but I'm not sure whether BLAT is repeat masked.

ADD REPLY • link 8 weeks ago by i.sudbery 20k

score 0 · Answer 2 · 2024-10-27

I find more than 70k or even 100k perfect matches in the human genome. Is this possible or am I doing sth. wrong?

Yep, there are already 50288 perfect matches in hs37d5.fasta, even without counting the occurrences that are split across line ends.

$ grep -iEo '(GGCCAACATGGTGAAACCCC|GGGGTTTCACCATGTTGGCC)' hs37d5_all_chr.fasta  | wc -l
50288
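
The occurrences missed at line ends could be recovered by joining each FASTA record into a single line before searching. The following is a rough sketch rather than a tuned tool: the function name count_kmer and the demo file are invented, and holding a whole chromosome in one awk string may be slow or memory-hungry on a real genome.

```shell
# Count motif occurrences per FASTA file after joining wrapped sequence lines,
# so that matches spanning a line break are also found.
count_kmer() {
    # $1: FASTA file, $2: extended regular expression for the motif
    awk '/^>/ { if (seq != "") print seq; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1" |
        grep -iEo "$2" | wc -l | tr -d ' '
}

motif='(GGCCAACATGGTGAAACCCC|GGGGTTTCACCATGTTGGCC)'

# Hypothetical demo file: the first record has the motif split across two lines,
# the second record contains the reverse complement on a single line.
tmp=$(mktemp)
printf '%s\n' \
    ">demo1" \
    "AAAAGGGGTTTCAC" \
    "CATGTTGGCCAAAA" \
    ">demo2" \
    "TTGGCCAACATGGTGAAACCCCTT" > "$tmp"

joined=$(count_kmer "$tmp" "$motif")                        # search on joined records
linewise=$(grep -iEo "$motif" "$tmp" | wc -l | tr -d ' ')   # plain line-by-line grep
echo "joined: $joined, line-by-line: $linewise"
rm -f "$tmp"
```

On the demo file, the joined search finds one more occurrence than the line-by-line grep, namely the one broken by the line end in the first record. Like the original grep, this counts non-overlapping matches on both strands.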