Question

reference genome hs37d5 with PhiX

6.2 years ago

SOHAIL ▴ 410

Hi everybody,

Can anyone point me to where I can download the hs37d5.fa genome merged with the phiX genome?

I have downloaded a few BAM files whose headers contain the line @SQ SN:PhiX LN:6386 alongside the other hs37d5.fa contigs. I would like to know which version of the reference sequence was used for these files.

Any help would be appreciated!

Regards

(Note: I am aware that the phiX genome is probably used as a control in Illumina sequencing.)

reference genome ngs • 4.9k views

ADD COMMENT • link updated 2.4 years ago by GenoMax 148k • written 6.2 years ago by SOHAIL ▴ 410

Perhaps I am wrong, but there is no single "the reference with PhiX". When I was at the Max Planck Inst. in Leipzig, we made our own by concatenating the human reference and some decoy sequences. I suggest you contact the center that produced the BAM files.

ADD REPLY • link 6.2 years ago by Gabriel R. ★ 2.9k

@ Gabriel R,

Yes, you are right! The BAM files are from the Max Planck Inst. in Leipzig. Can you share your reference, please?

ADD REPLY • link 6.2 years ago by SOHAIL ▴ 410

I am no longer there. I will ask my former supervisor; let's see if they can host it somewhere. In the meantime, here are the accessions for the reference: http://cdna.eva.mpg.de/neandertal/Chagyrskaya/bam/README

ADD REPLY • link 6.2 years ago by Gabriel R. ★ 2.9k

@ Gabriel R, thank you for the favor. Yes, I had already gone through the URL you gave: http://cdna.eva.mpg.de/neandertal/Chagyrskaya/bam/README

All the contig names and lengths given in the README agree with the BAM header, except for the phiX length. For example, the README gives the accession code for phiX as NC_001422.1 with length 5386,

but in the downloaded BAM files in the header section:

@SQ SN:phiX LN:6386

with a different length. That suggests a different phiX genome may have been used. Do you have any idea about that?

ADD REPLY • link 6.2 years ago by SOHAIL ▴ 410

You can find the sequence of the phiX genome that Illumina uses at this link.

ADD REPLY • link 2.4 years ago by GenoMax 148k

@genomax

Hi, the Illumina genome that you shared has the following info:

From the "genome.dict" file of the illumina phiX genome:

@SQ SN:phix LN:5386 UR:file:/illumina/scratch/iGenomes/PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa    M5:bb9dae7b38a25a45dae8e3179d7c4241

This is different from what I mentioned earlier...

edit: if we believe the figures are correct, that's a 1000 nt difference (6386-5386)
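
For anyone comparing headers the same way, here is a minimal Python sketch of that check. It assumes the header has already been dumped to text (for example with samtools view -H), and the function name parse_sq_lengths is only an illustration, not part of any tool. Real SAM headers are tab-delimited; splitting on any whitespace also handles the space-separated form quoted in this thread, though it would break on a UR path containing spaces.

```python
def parse_sq_lengths(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM header dump."""
    lengths = {}
    for line in header_text.splitlines():
        fields = line.split()  # tabs in real headers, spaces as quoted here
        if not fields or fields[0] != "@SQ":
            continue
        # Each remaining field is TAG:VALUE; split on the first colon only,
        # because values such as UR:file:/... contain colons themselves.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


# The two header lines quoted in this thread.
bam_header = "@SQ SN:phiX LN:6386"
illumina_dict = (
    "@SQ SN:phix LN:5386 "
    "UR:file:/illumina/scratch/iGenomes/PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa "
    "M5:bb9dae7b38a25a45dae8e3179d7c4241"
)
difference = parse_sq_lengths(bam_header)["phiX"] - parse_sq_lengths(illumina_dict)["phix"]
print(difference)  # 1000
```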

ADD REPLY • link 6.2 years ago by SOHAIL ▴ 410

That is the official Illumina phiX sequence. NCBI's version is the same length (Illumina's has a few SNPs compared to the NCBI reference). Unless you get clarification from the source of your BAM files, it would be difficult to explain the difference you see.

ADD REPLY • link 6.2 years ago by GenoMax 148k

GenoMax · Accepted Answer · 2018-12-07

If you want exactly the MPI-EVA version (which includes a circularised phiX and extended reference mtDNA), the construction is similar to the reference used by the 1000 Genomes Project (compare ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/), with minor changes. It is made as follows:

1 Download the individual chromosomes from the Ensembl FTP site (just like 1000g) ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

2a Download the newer version of the mitochondrion (NC_012920, just like 1000g) http://www.ncbi.nlm.nih.gov/nuccore/251831106

2b Copy the first 1000bp of the mitochondrion onto its end. The resulting sequence is named "MT".

3 Download the concatenated decoy sequences from 1000 Genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5cs.fa.gz

Also compare their READMEs: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.slides.pdf

4 Download the Human herpes virus (NC_007605, aka EBV) from NCBI, just like 1000g. The sequence is then named "NC_007605". http://www.ncbi.nlm.nih.gov/nuccore/NC_007605

5a Download phiX-174 reference (NC_001422). http://www.ncbi.nlm.nih.gov/nuccore/NC_001422

5b Copy the first 1000bp of phiX onto its end, name the result "phiX".

6 Create a reference (whole_genome.fa) with chrs 1-22, X, Y, extended NC_012920 MT, the non-chromosomal supercontigs, the NC_007605 EBV, the decoy sequences (hs37d5), extended phiX. The order is chosen to match 1000 Genomes (plus phiX), see their fai file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai
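
Steps 2b and 5b (copying the first 1000 bp onto the end) can be sketched in Python as below. This is only an illustration under stated assumptions, not the script MPI-EVA used: the function names extend_circular and format_fasta are made up here, the sequence is assumed to be already loaded as a plain string, and the 60-column line width is an arbitrary choice.

```python
def extend_circular(seq, overlap=1000):
    """Append the first `overlap` bases of a circular sequence to its end."""
    if overlap > len(seq):
        raise ValueError("overlap is longer than the sequence")
    return seq + seq[:overlap]


def format_fasta(name, seq, width=60):
    """Render one FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy example: a 6-base "circular genome" extended by 2 bases.
toy = "ACGTAC"
print(extend_circular(toy, overlap=2))  # ACGTACAC

# A 5386-base placeholder grows to 6386 bases with the default overlap.
print(len(extend_circular("A" * 5386)))  # 6386
```

The extended string would then be written out under the name "phiX" (or "MT") with something like format_fasta("phiX", extended) before concatenating the records in the order given in step 6.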

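Note that two sequences (MT, phiX) are circular and have been extended by 1000 bp to facilitate alignment, which is why the BAM header reports phiX with LN:6386 rather than 5386 (5386 + 1000). The correct incantation to wrap these alignments back to their true lengths is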

  bam-rewrap MT:16569 phiX:5386

or

  bam-rmdup -z MT:16569 -z phiX:5386
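
To make the coordinate arithmetic behind wrapping concrete, here is a small Python sketch. It is not how bam-rewrap or bam-rmdup are implemented and handles only a single 1-based position, whereas the real tools operate on whole alignments; the function name wrap_position is hypothetical.

```python
def wrap_position(pos, true_length):
    """Map a 1-based position on an extended circular reference
    back onto the original, unextended sequence."""
    if pos < 1:
        raise ValueError("positions are 1-based")
    return (pos - 1) % true_length + 1


# phiX: true length 5386, extended reference length 6386.
print(wrap_position(5386, 5386))  # 5386 (last base of the real genome)
print(wrap_position(5387, 5386))  # 1    (first base of the copied stretch)
print(wrap_position(6386, 5386))  # 1000 (last base of the copied stretch)
```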