reference genome hs37d5 with PhiX
6.2 years ago
SOHAIL ▴ 410

Hi everybody,

Can anyone tell me where to download the hs37d5.fa genome merged with the phiX genome?

I have downloaded a few BAM files whose headers contain the line @SQ SN:phiX LN:6386 alongside the other hs37d5.fa contigs. I am trying to find out which version of the reference sequence was used for these files.
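
For anyone checking their own files: a minimal sketch of pulling contig names and lengths out of a BAM header, assuming the header text has already been dumped to a string (for example with `samtools view -H`). The function name and the sample header below are illustrative only; real headers are tab-separated.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # After the record type, fields are tab-separated TAG:VALUE pairs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


# Illustrative header excerpt (assumption: not copied from a real file).
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:phiX\tLN:6386\n"
)
contigs = parse_sq_lines(example_header)
print(contigs["phiX"])
```

Comparing the resulting dictionary against the `.fai` index of a candidate reference is a quick way to spot contigs whose lengths disagree.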

Any help would be appreciated!

Regards

(Note: I am aware that the phiX genome is probably used as a control in Illumina sequencing.)

reference genome ngs • 4.8k views

Perhaps I am wrong, but there is no single "reference with PhiX". When I was at the Max Planck Institute in Leipzig, we made our own by concatenating the human reference and some decoy sequences. I suggest you contact the center that produced the BAM files.


@Gabriel R,

Yes, you are right! The BAM files are from the Max Planck Institute in Leipzig. Could you share your reference, please?


I am no longer there; I will ask my former supervisor and see if they can host it somewhere. In the meantime, here are the accessions for the reference: http://cdna.eva.mpg.de/neandertal/Chagyrskaya/bam/README


@Gabriel R, thank you for the favor. Yes, I had already gone through the URL you gave: http://cdna.eva.mpg.de/neandertal/Chagyrskaya/bam/README

All the contig names and lengths given in the README agree with the BAM files, except for the phiX length. In the README, the accession for phiX is given as NC_001422.1, with length 5386,

but the header of the downloaded BAM files has:

@SQ SN:phiX LN:6386

with a different length. That suggests a different phiX genome may have been used. Do you have any idea about that?


You can find the sequence of the phiX genome that Illumina uses at this link.


@genomax

Hi, the Illumina genome that you shared has the following info, from the "genome.dict" file of the Illumina phiX genome:

@SQ SN:phix LN:5386 UR:file:/illumina/scratch/iGenomes/PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa    M5:bb9dae7b38a25a45dae8e3179d7c4241

This differs from what I mentioned earlier...

Edit: if we believe the figures are correct, that's a 1000 nt difference (6386 - 5386).
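
The M5 field in a sequence dictionary is an MD5 checksum of the sequence, so it can be used to check whether two phiX FASTA files hold identical bases. A minimal sketch, assuming the SAM specification's convention (drop everything outside printable ASCII 33-126, uppercase, then MD5); the function name is hypothetical and the input is a toy string, not the real phiX sequence.

```python
import hashlib


def sam_m5(sequence):
    """MD5 of a sequence, normalised the way the SAM spec describes for M5."""
    # Strip whitespace/newlines (anything outside ASCII 33..126), then uppercase.
    cleaned = "".join(c for c in sequence if 33 <= ord(c) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Toy input: line breaks and case do not change the checksum.
print(sam_m5("acgt\nACGT\n") == sam_m5("ACGTACGT"))
```

Two sequences of different length, such as 5386 and 6386 bases, can never share a checksum, so this only helps when comparing candidates of the same length.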


That is the official Illumina phiX sequence. NCBI's version is the same length (Illumina's has a few SNPs compared to the NCBI reference). Unless you get clarification from the source of your BAM files, it would be difficult to explain the difference you see.

6.0 years ago
kelso ▴ 40

If you want exactly the MPI-EVA version (which includes a circularised phiX and an extended reference mtDNA), the construction is similar to the reference used by the 1000 Genomes Project (compare ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/), with minor changes. It is made as follows:

1 Download individual chrs from ensembl ftp (just like 1000g) ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

2a Download the newer version of the mitochondrion (NC_012920, just like 1000g) http://www.ncbi.nlm.nih.gov/nuccore/251831106

2b Copy the first 1000bp of the mitochondrion onto its end. The resulting sequence is named "MT".

3 Download the concatenated decoy sequences from 1000 Genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5cs.fa.gz

Also compare their READMEs: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.slides.pdf

4 Download the Human herpes virus (NC_007605, aka EBV) from NCBI, just like 1000g. The sequence is then named "NC_007605". http://www.ncbi.nlm.nih.gov/nuccore/NC_007605

5a Download phiX-174 reference (NC_001422). http://www.ncbi.nlm.nih.gov/nuccore/NC_001422

5b Copy the first 1000bp of phiX onto its end, name the result "phiX". This is why the BAM header lists phiX with LN:6386 (5386 + 1000).

6 Create a reference (whole_genome.fa) with chrs 1-22, X, Y, extended NC_012920 MT, the non-chromosomal supercontigs, the NC_007605 EBV, the decoy sequences (hs37d5), extended phiX. The order is chosen to match 1000 Genomes (plus phiX), see their fai file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai
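
Steps 2b and 5b can be sketched as below. This is an illustration of the idea only, not the script used at MPI-EVA; the function name and the toy sequence are assumptions.

```python
def extend_circular(seq, overlap=1000):
    """Append the first `overlap` bases of a circular sequence to its end."""
    if overlap > len(seq):
        raise ValueError("overlap is longer than the sequence")
    return seq + seq[:overlap]


# Toy stand-in for a circular genome (the real phiX, NC_001422, is 5386 bp).
toy = "ACGTTGCAAT"
extended = extend_circular(toy, overlap=3)
print(extended)  # ACGTTGCAATACG

# With the default 1000 bp overlap, a 5386 bp sequence becomes 6386 bp.
print(len(extend_circular("A" * 5386)))  # 6386
```

Reads that span the origin of the circle can then align contiguously across the junction instead of being split or soft-clipped.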

Note that two sequences (MT, phiX) are circular and have been extended to facilitate alignment. The correct incantation to wrap these alignments back to their true lengths is

  bam-rewrap MT:16569 phiX:5386

or

  bam-rmdup -z MT:16569 -z phiX:5386
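
To illustrate what wrapping means for coordinates, here is a minimal sketch that maps a 1-based position on the extended reference back onto the true circular length. It is not the implementation of bam-rewrap or bam-rmdup, and the function name is hypothetical; an alignment that spans the junction would also need to be split, which this sketch does not attempt.

```python
def wrap_position(pos, true_length):
    """Map a 1-based position on the extended reference back onto the circle."""
    return (pos - 1) % true_length + 1


print(wrap_position(5386, 5386))  # 5386 (last base of the true phiX)
print(wrap_position(5387, 5386))  # 1    (first base of the appended copy)
print(wrap_position(6386, 5386))  # 1000 (last base of the extension)
```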

This answer is referring to programs found in biohazard-tools at this link.


Thanks, @kelso, for the kind response.


Hey, I wanted to ask if your re-alignment was successful. What percentage of data could you retrieve?
