I want to reproduce the results that people achieved in the following Nature paper: Transcriptome genetics using second generation sequencing in a Caucasian population http://www.nature.com/nature/journal/vaop/ncurrent/full/nature08903.html
I downloaded their SAM files from the groups website: http://funpopgen.unige.ch/data/ceu60
I downloaded a reference fasta and fai file from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/
The main problems seem to exist that I'm not able to convert these SAM files into proper "working" BAM files so that I can get BED files that is the input format for FluxCapacitor (http://flux.sammeth.net/). I tried using the following steps (as there is no "proper" header in the SAM files I've to do some additional steps):
- samtools view -bt human_b36_male.fa.gz.fai first.sam> first.bam
- samtools sort first.bam first.bam.sorted
- samtools index first.bam.sorted
- samtools index aln-sorted.bam
When I then run the following command to get BED files I end up with an empty bed file and errors:
bamToBed -i reads.bam > reads.bed *this is using BEDtools Gives the following error:
terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check
As the BAM to BED file conversation seems not to be working properly I decided to check if I can use samtools to get basepair abundances using samtools pileup. I run the following but get no output:
samtools pileup -cf human_b36_male.fa reads.bam
To make sure the downloaded SAM files are correctly I tried to load the original SAM files into IGV I don't see any reads aligning anywhere to the genome. I already reconstructed another genome set using the fasta files that are closest to the mentioned Ensembl release (54) they mention in the paper (fasta file from the 1000 Genomes FTP again - but maybe this is not correct, though cannot find any better suiting Fasta files that might represent the correct Ensembl release/genome build) but this doesn't help as still none of the reads seem to align to the reference genome (or at least no reads show up in the viewer).
Anyone with some tips about the issues I've generating BED files from the published SAM files. Also any comments about why the IGV isn't showing any reads might be helpful for me understanding what's going wrong.
UPDATE Put 2 SAM files into my public directory so that people don't have to download the full 50GB: http://www.ebi.ac.uk/~swtimmer/RNAseq/
UPDATE2
So now I can generate directly BED files from the SAM files but I'm still wondering which exact reference genome they used (fasta file) so that I can generate BAM files and just because I'm curious now.
Is it possible they aligned to contigs or something? If I google for AL662826.11, it appears to be something like this: http://www.ncbi.nlm.nih.gov/nuccore/20870241
Maybe you could post a single bam file somewhere for people to look at, because nobody wants to download 50 gigs.
@brentp, sorry should have mentioned that this is how I started: [swtimmer@ebi-001 RNAseq]$ samtools view -bS 22087.sam.gz | bedtools bamtobed > reads.bed -bash: bedtools: command not found [samopen] no @SQ lines in the header. [samread1] missing header? Abort!
But have to use "a" fasta reference file to get this working. Paper states they used NCBI36 but I think this is where things go wrong as I think I use somehow a different reference genome (the BAM files tells me that none of the reads are aligned to any of the annotations.
How about posting the SAM header (or at least a few lines)? Do the sequence names in the fasta file match those used in the SAM file?
you can shorten your commands to: samtools view -bS first.sam | bedtools bamtobed > reads.bed does that work for you?
Did you try 'samtools view aln.bam | less'? Check that everything looks okay, that you get the correct chromosome names (matching the fasta file). Also, check that the fasta file doesn't use too long lines, many tools use a fixed buffer for line length.
@Sean Davis I'm looking into this, I have the feeling that this is the source of my problem. When looking at my BAM files: samtools view 3125_8.sam.gz.bam | awk '{print $3}' | sort | uniq -c 27653784 *
So that is not what we expected to see....
First 2 lines of the SAM file: [swtimmer@ebi-001 RNAseq]$ more 31258.sam IL63125:8:58:1625:1479 67 clone::AL662826.11:1:145431:1 1261 0 37M * 0 247 AAAAGGAGTAGGCAGGAAAACAGTC AATTATGGATTC ?BBCBBBB@<BBBBCB@A@?B?B>@A@B@BABB?B@? MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:4 H1:i:0 IL6_3125:8:37:57:1851 131 clone::AL662826.11:1:145431:1 1458 0 37M * 0 262 GTGAA
Is that fai file gzipped? Do you need to unzip that first, or did you already?
@Madelaine, updated the question with a link to 2 SAM files: http://www.ebi.ac.uk/~swtimmer/RNAseq/