Converting Sam Files To Bam Files - Reproduce Results Nature Paper: Transcriptome Genetics Using Second Generation Sequencing In A Caucasian Population
3
7
Entering edit mode
12.8 years ago
Sander Timmer ▴ 710

I want to reproduce the results that people achieved in the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html

I downloaded their SAM files from the groups website: http://funpopgen.unige.ch/data/ceu60

I downloaded a reference fasta and fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/

The main problems seem to exist that I'm not able to convert these SAM files into proper "working" BAM files so that I can get BED files that is the input format for FluxCapacitor (http://flux.sammeth.net/). I tried using the following steps (as there is no "proper" header in the SAM files I've to do some additional steps):

  1. samtools view -bt human_b36_male.fa.gz.fai first.sam> first.bam
  2. samtools sort first.bam first.bam.sorted
  3. samtools index first.bam.sorted
  4. samtools index aln-sorted.bam

When I then run the following command to get BED files I end up with an empty bed file and errors:

  1. bamToBed -i reads.bam > reads.bed *this is using BEDtools Gives the following error:

    terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check

As the BAM to BED file conversation seems not to be working properly I decided to check if I can use samtools to get basepair abundances using samtools pileup. I run the following but get no output:

samtools pileup -cf human_b36_male.fa reads.bam

To make sure the downloaded SAM files are correctly I tried to load the original SAM files into IGV I don't see any reads aligning anywhere to the genome. I already reconstructed another genome set using the fasta files that are closest to the mentioned Ensembl release (54) they mention in the paper (fasta file from the 1000 Genomes FTP again - but maybe this is not correct, though cannot find any better suiting Fasta files that might represent the correct Ensembl release/genome build) but this doesn't help as still none of the reads seem to align to the reference genome (or at least no reads show up in the viewer).

Anyone with some tips about the issues I've generating BED files from the published SAM files. Also any comments about why the IGV isn't showing any reads might be helpful for me understanding what's going wrong.

UPDATE Put 2 SAM files into my public directory so that people don't have to download the full 50GB: http://www.ebi.ac.uk/~swtimmer/RNAseq/

UPDATE2

So now I can generate directly BED files from the SAM files but I'm still wondering which exact reference genome they used (fasta file) so that I can generate BAM files and just because I'm curious now.

bam sam samtools bedtools • 9.1k views
ADD COMMENT
2
Entering edit mode

Is it possible they aligned to contigs or something? If I google for AL662826.11, it appears to be something like this: http://www.ncbi.nlm.nih.gov/nuccore/20870241

Maybe you could post a single bam file somewhere for people to look at, because nobody wants to download 50 gigs.

ADD REPLY
1
Entering edit mode

@brentp, sorry should have mentioned that this is how I started: [swtimmer@ebi-001 RNAseq]$ samtools view -bS 22087.sam.gz | bedtools bamtobed > reads.bed -bash: bedtools: command not found [samopen] no @SQ lines in the header. [samread1] missing header? Abort!

But have to use "a" fasta reference file to get this working. Paper states they used NCBI36 but I think this is where things go wrong as I think I use somehow a different reference genome (the BAM files tells me that none of the reads are aligned to any of the annotations.

ADD REPLY
0
Entering edit mode

How about posting the SAM header (or at least a few lines)? Do the sequence names in the fasta file match those used in the SAM file?

ADD REPLY
0
Entering edit mode

you can shorten your commands to: samtools view -bS first.sam | bedtools bamtobed > reads.bed does that work for you?

ADD REPLY
0
Entering edit mode

Did you try 'samtools view aln.bam | less'? Check that everything looks okay, that you get the correct chromosome names (matching the fasta file). Also, check that the fasta file doesn't use too long lines, many tools use a fixed buffer for line length.

ADD REPLY
0
Entering edit mode

@Sean Davis I'm looking into this, I have the feeling that this is the source of my problem. When looking at my BAM files: samtools view 3125_8.sam.gz.bam | awk '{print $3}' | sort | uniq -c 27653784 *

So that is not what we expected to see....

First 2 lines of the SAM file: [swtimmer@ebi-001 RNAseq]$ more 31258.sam IL63125:8:58:1625:1479 67 clone::AL662826.11:1:145431:1 1261 0 37M * 0 247 AAAAGGAGTAGGCAGGAAAACAGTC AATTATGGATTC ?BBCBBBB@<BBBBCB@A@?B?B>@A@B@BABB?B@? MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:4 H1:i:0 IL6_3125:8:37:57:1851 131 clone::AL662826.11:1:145431:1 1458 0 37M * 0 262 GTGAA

ADD REPLY
0
Entering edit mode

Is that fai file gzipped? Do you need to unzip that first, or did you already?

ADD REPLY
0
Entering edit mode

@Madelaine, updated the question with a link to 2 SAM files: http://www.ebi.ac.uk/~swtimmer/RNAseq/

ADD REPLY
5
Entering edit mode
12.8 years ago
brentp 24k

I checked one of the files from their site. The problem is they have no header. Bamtools, used by bedtools expects a header.

You can just make a BED file your self via the following:

zless 1184_1.sam.gz |  awk 'BEGIN{OFS=FS="\t"}(! and($2, 4)){ print $3,$4 - 1, $4 + length($10), $0 }' > out.bed

adding columns as you need them. Currently it prints chrom, start, end for aligned reads.

They did align to contigs because the chromosome names look like clone::AL662826.11:1:145431:1

ADD COMMENT
1
Entering edit mode

If you can get a hold of the genome they used, you can use picard CreateSequenceDictionary to use as a header.

ADD REPLY
1
Entering edit mode

When using the awk script above, assure that you don't have any indels, otherwise these loci will be wrong. $10 is the read sequence and not the reference sequence, it maps to.

ADD REPLY
1
Entering edit mode

@brentp and @madelaine I just cannot really find out what specific genome build/contigs they used. I tried now several ncbi36 versions but they all seem to fail. Maybe I should just ask the author for a more specific description of the reference they used

ADD REPLY
0
Entering edit mode

Will try to run this to get BED files. Thanks, looks like a solution. But still I'd be interested to know if its possible to get proper working BAM files out of these SAM files.

ADD REPLY
0
Entering edit mode

right, or samtools -t genome.fasta

ADD REPLY
0
Entering edit mode

When using the awk script above, assure that you don't have any indels, otherwise these loci will be wrong. $10 is the read sequence and the reference sequence, it maps to.

ADD REPLY
0
Entering edit mode

@brentp -- samtools -t genome.fasta? I thought there might be a way to do this in samtools, but I couldn't remember. Are you missing a word in there?

ADD REPLY
0
Entering edit mode

Oh, hey, I just found it: samtools view -bt genome.fa.fai sample.sam

Much easier.

ADD REPLY
2
Entering edit mode
12.8 years ago
 zcat 1184_8.sam.gz | \
 perl -F'\t' -ane \
    '$strand=($F[1]&16)?"-":"+";\
     $length=1;\
     $tmp=$F[5];\
     $tmp =~ s/(\d+)[MD]/$length+=$1/eg;\
     print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";'\
 > 1184_8.bed
ADD COMMENT
3
Entering edit mode

But nevertheless.... to get a bam file, you'll need the header.

ADD REPLY
2
Entering edit mode

Can you make this a little uglier? Heh.

ADD REPLY
1
Entering edit mode

Sorry... first version didn't check the strand correctly. Now it should work correctly. :)

ADD REPLY
1
Entering edit mode

Well, it works, right? Of course you can map the reads again to get the header, or spend hours in creating the header. I just prefer the most painless way of creating a bed file out of the sam file.

Even, if it is ugly. ;)

I was just writing it down, without spending a lot of time with writing it nicely. BUT... I named the variables 'length' and 'strand', so it should be possible to understand it.... perhaps.

ADD REPLY
1
Entering edit mode
12.7 years ago
Line Skotte ▴ 10

I generated a header the following way:

cut -f3 1382_1.sam | uniq > tmp_refnames cut -d':' -f5 tmp_refnames > tmp_reflengths paste -d 'n' $(yes headerstart|head -201)>tmp_headerstart2 paste -d ':' tmp_headerstart2 tmp_refnames > tmp_refnames2 paste -d 'n' $(yes headerinsert|head -201)>tmp_headerinsert2 paste tmp_refnames2 tmp_headerinsert2 > tmp_refnames3 paste -d: tmp_refnames3 tmp_reflengths > headerfile

where headerstart is a textfile containing '@SQ SN' and headerinsert is a textfile containing 'LN'. Not very elegant, but gives you a header list with the refnames and reflengths. It turns out that they mapped to 201 different ref sequences.

Among them is e.g.

@SQ SN:chromosome:NCBI36:15:1:100338915:1 LN:100338915 @SQ SN:chromosome:NCBI36:Y:2709521:57443437:1 LN:57443437 @SQ SN:chromosome:NCBI36:6:1:170896992:1 LN:170896992 @SQ SN:chromosome:NCBI36:4:1:191263063:1 LN:191263063 @SQ SN:chromosome:NCBI36:5:1:180837866:1 LN:180837866

etc.

which i guess is the build36 but chopped up into chrs.

to add the created header to the sam files: cat headerfile 1382_1.sam > 1382_1h.sam

looking at the data: samtools view 1382_1h.sam -S -b >1382_1.bam samtools sort 1382_1.bam 1382_1sort samtools index 1382_1sort.bam

samtools pileup 1382_1sort.bam | less

looking with tview gives some problems if you wish to use the goto function due to the long refnames!!!

Hope this helps you with your problem.

Best, Line Skotte

ADD COMMENT
0
Entering edit mode

ups i forgot some newlines above:

cut -f3 13821.sam | uniq > tmprefnames cut -d':' -f5 tmprefnames > tmpreflengths paste -d '\n' $(yes headerstart|head -201)>tmpheaderstart2 paste -d ':' tmpheaderstart2 tmprefnames > tmprefnames2 paste -d '\n' $(yes headerinsert|head -201)>tmpheaderinsert2 paste tmprefnames2 tmpheaderinsert2 > tmprefnames3 paste -d: tmprefnames3 tmpreflengths > headerfile

ADD REPLY
0
Entering edit mode

cut -f3 13821.sam | uniq > tmprefnames

cut -d':' -f5 tmprefnames > tmpreflengths

paste -d '\n' $(yes headerstart|head -201)>tmp_headerstart2

paste -d ':' tmpheaderstart2 tmprefnames > tmp_refnames2

paste -d '\n' $(yes headerinsert|head -201)>tmp_headerinsert2

paste tmprefnames2 tmpheaderinsert2 > tmp_refnames3

paste -d: tmprefnames3 tmpreflengths > headerfile

ADD REPLY
0
Entering edit mode

Sorry about the missing newlines above, hope that you can put them in youself. Before every 'cut' and 'paste'.

ADD REPLY

Login before adding your answer.

Traffic: 2636 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6