Question

Find reference fasta based on M5/MD5 string

0

14 months ago

WouterDeCoster 47k

I have downloaded CRAM files, but I don't know which exact version of hg38 was used to align the reads. Is it possible to find the corresponding FASTA from the M5 (MD5) strings in the header? For now, it seems I can only try candidate references and see which one works. Of course, the obvious solution is to ask the person who generated the files, but that does not always work out. Part of the header looks like this:

@SQ     SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd     UR:/scratch/hg38.fa
@SQ     SN:chr2 LN:242193529    M5:f98db672eb0993dcfdabafe2a882905c     UR:/scratch/hg38.fa
@SQ     SN:chr3 LN:198295559    M5:76635a41ea913a405ded820447d067b0     UR:/scratch/hg38.fa
@SQ     SN:chr4 LN:190214555    M5:3210fecf1eb92d5489da4346b3fddc6e     UR:/scratch/hg38.fa
@SQ     SN:chr5 LN:181538259    M5:a811b3dc9fe66af729dc0dddf7fa4f13     UR:/scratch/hg38.fa
@SQ     SN:chr6 LN:170805979    M5:5691468a67c7e7a7b5f2a3a683792c29     UR:/scratch/hg38.fa
@SQ     SN:chr7 LN:159345973    M5:cc044cc2256a1141212660fb07b6171e     UR:/scratch/hg38.fa
@SQ     SN:chr8 LN:145138636    M5:c67955b5f7815a9a1edfaa15893d3616     UR:/scratch/hg38.fa
@SQ     SN:chr9 LN:138394717    M5:1b79085d423b806957b7564497cac5e4     UR:/scratch/hg38.fa
@SQ     SN:chr10        LN:133797422    M5:c0eeee7acfdaf31b770a509bdaa6e51a     UR:/scratch/hg38.fa
@SQ     SN:chr11        LN:135086622    M5:1511375dc2dd1b633af8cf439ae90cec     UR:/scratch/hg38.fa
@SQ     SN:chr12        LN:133275309    M5:96e414eace405d8c27a6d35ba19df56f     UR:/scratch/hg38.fa
@SQ     SN:chr13        LN:114364328    M5:787e7eb2d9187bbc20334062332569d4     UR:/scratch/hg38.fa
@SQ     SN:chr14        LN:107043718    M5:e0f0eecc3bcab6178c62b6211565c807     UR:/scratch/hg38.fa
@SQ     SN:chr15        LN:101991189    M5:f036bd11158407596ca6bf3581454706     UR:/scratch/hg38.fa
@SQ     SN:chr16        LN:90338345     M5:9adbaf8ef0094c71470e87eb18e9b5d4     UR:/scratch/hg38.fa
@SQ     SN:chr17        LN:83257441     M5:f9a0fb01553adb183568e3eb9d8626db     UR:/scratch/hg38.fa
@SQ     SN:chr18        LN:80373285     M5:11eeaa801f6b0e2e36a1138616b8ee9a     UR:/scratch/hg38.fa

reference fasta • 1.1k views

1

Googling the checksum values leads to ENA Browser pages for the relevant chromosomes, which also list the MD5 sums. For example:

https://www.ebi.ac.uk/ena/browser/view/CM000664
https://www.ebi.ac.uk/ena/browser/view/CM000679

ADD REPLY • link 14 months ago by GenoMax 150k

0

Great, that led me to GCA_000001405... However, it does not match for every chromosome. For example, chromosome 13 there (https://www.ebi.ac.uk/ena/browser/view/CM000675.2) has an MD5 checksum of a5437debe2ef9c9ef8f3ea2874ae1d82, while my CRAM has 787e7eb2d9187bbc20334062332569d4 :-(

I found someone on Twitter to point me to the right one (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1KG_ONT_VIENNA/reference/1KG_ONT_VIENNA_hg38.fa.gz).

Not sure if there could be a better way :)
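
One way to check a candidate FASTA (such as the one linked above) against the CRAM header is to recompute the M5 value of every sequence and compare. The sketch below is only an illustration, not a tested pipeline. It assumes the M5 definition from the SAM specification: the MD5 of the sequence in uppercase, with every character outside the printable range '!' to '~' removed. The function names `fasta_m5`, `open_fasta` and `compare` are made up for this example.

```python
import gzip
import hashlib

# Bytes outside '!'..'~' (newlines, spaces, control characters) are not
# part of the M5 checksum, so they are dropped before hashing.
_DROP = bytes(b for b in range(256) if not 33 <= b <= 126)


def fasta_m5(lines):
    """Yield (name, md5_hex) for each record in an iterable of FASTA byte lines."""
    name, digest = None, None
    for raw in lines:
        if raw.startswith(b">"):
            if name is not None:
                yield name, digest.hexdigest()
            fields = raw[1:].split()
            # The sequence name is the first word after '>'.
            name = fields[0].decode() if fields else ""
            digest = hashlib.md5()
        elif name is not None:
            # Hash line by line so a whole chromosome never sits in memory.
            digest.update(raw.translate(None, _DROP).upper())
    if name is not None:
        yield name, digest.hexdigest()


def open_fasta(path):
    """Open a plain or gzip/bgzip-compressed FASTA file in binary mode."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def compare(expected, lines):
    """Return {name: (expected_m5, observed_m5)} for sequences that do not match.

    observed_m5 is None when the FASTA has no sequence with that name.
    """
    observed = dict(fasta_m5(lines))
    return {
        name: (m5, observed.get(name))
        for name, m5 in expected.items()
        if observed.get(name) != m5
    }
```

With `expected = {"chr13": "787e7eb2d9187bbc20334062332569d4"}` taken from the header, calling `compare(expected, open_fasta("candidate.fa.gz"))` (the file name is a placeholder) returns an empty dict when every listed sequence matches.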

ADD REPLY • link 14 months ago by WouterDeCoster 47k

score 2 · Answer 1 · 2024-01-19

2

14 months ago

Pierre Lindenbaum 165k

Sometimes the sequences are hosted at the EBI. For example, your first sequence, with MD5 checksum 6aef897c3d6ff0c78aff06ac189178dd, is available (as a plain sequence string, not FASTA) at:

https://www.ebi.ac.uk/ena/cram/md5/6aef897c3d6ff0c78aff06ac189178dd

See REF_PATH and REF_CACHE in the samtools manual; these environment variables control where samtools looks up reference sequences by MD5.
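
To try this lookup for every sequence in a header rather than one at a time, something like the following could be used. It is a rough sketch: `parse_sq` and `fetch_by_md5` are hypothetical helper names, the URL pattern is the one shown above, and the assumption that a missing checksum comes back as an HTTP error has not been verified against the ENA service.

```python
import urllib.error
import urllib.request

# URL pattern taken from the example link above.
ENA_MD5_URL = "https://www.ebi.ac.uk/ena/cram/md5/{}"


def parse_sq(header_lines):
    """Map sequence name -> M5 for every @SQ line that carries both tags."""
    m5 = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # Fields are TAG:VALUE pairs separated by tabs (or spaces when pasted).
        tags = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        if "SN" in tags and "M5" in tags:
            m5[tags["SN"]] = tags["M5"]
    return m5


def fetch_by_md5(md5, timeout=60):
    """Return the sequence bytes for a checksum, or None if the request fails.

    Assumption: an unknown checksum is reported as an HTTP error.
    """
    try:
        with urllib.request.urlopen(ENA_MD5_URL.format(md5), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None
```

Feeding `parse_sq` the output lines of `samtools view -H file.cram` and calling `fetch_by_md5` on each value shows which sequences are available this way and which are not.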

0

Aha, a good start, but it doesn't work for every chromosome. It seems the FASTA is "special", then. I found someone on Twitter to point me to the right one (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1KG_ONT_VIENNA/reference/1KG_ONT_VIENNA_hg38.fa.gz)

ADD REPLY • link 14 months ago by WouterDeCoster 47k