Issues building indexed .bam files with SAMtools
1
0
Entering edit mode
6.4 years ago
lleipnitz • 0

Hello all! I am fairly new to bioinformatics and I have been struggling to treat some NGS files I have just received. Here is the thing: I have 35 samples sequenced using the RAD-seq methodology and I have been trying to replicate the pipeline that the company which sequenced my samples used to treat them (we asked for a bioinformatics support). I do this because i) I want to learn how to use the softwares properly and ii) not all samples in the project belong to my thesis and so I want to process my files separately (the bioinformatics support treated the total of 95 samples in the same pipeline).

Briefly, their pipeline uses Velvet to assemble a genome (in my case, a group of contigs generated using the reads of the sample with the highest number of reads) and then align the reads of all individuals against the assembled genome (reference) using Bowtie. I succeeded in assembling the genome and aligning the samples against the reference, but then problems arise when running SAMtools (e.g. producing indexed .bam files). The idea is to process the samples correctly so I can generate .bcf files to call the SNP variants obtained for each individual.

Here are the commands I have used up to where it goes wrong:

Velvet 1.2.10:

velveth SampleOne_ref 19 -fastq.gz -short .SampleOne.fastq.gz (hash was defined at 19 because of lack of memory when using lower hash values)
velvetg SampleOne_ref -cov_cutoff 5 -mas_coverage 500 -min_contig_lgth 31 -exp_cov auto

Bowtie 2-2.3.4.1:

bowtie2-build SampleOne_ref.fa SampleOne_ref (building reference using SampleOne)

bowtie2-align-s --no-unal --sensitive-local -x SampleOne_ref.* -U SampleTwo.fastq -S SampleOne_ref_QuerySampleTwo.sam

SAMtools-1.8:

samtools dict SampleOne_ref.fa

samtools faidx SampleOne_ref.fai

samtools view -b SampleOne_ref_QuerySampleTwo.sam -o SampleOne_ref_QuerySampleTwo.bam

samtools sort -l 0 -n SampleOne_ref_QuerySampleTwo.bam -o SampleOne_ref_QuerySampleTwo.sorted.bam

samtools index SampleOne_ref_QuerySampleTwo.sorted.bam --> [E::hts_idx_push] Chromosome blocks not continuous
samtools index: failed to create index for "SampleOne_ref_QuerySampleTwo.sorted.bam"

And this is where I'm stuck. I did check the files to see if they were ok, and apparently, they are.

Any clues about where I went wrong?

Thank you so much in advance! Any help will be greatly appreciated!

snp Assembly genome alignment RAD-seq • 4.3k views
ADD COMMENT
2
Entering edit mode

Why are you using -l 0 in sort? And yeah, like h.mon says, -n sorts by name, which precludes indexing. Are you sure you want to sort by name?

ADD REPLY
0
Entering edit mode

I used "-l 0" because I thought I had to specify that the output should be uncompressed. And no, I am not sure I want to sort by name. Actually now I think I don't have to sort by name because, as h.mon said, to index the files they'd have to be sorted by position, not by name.

Thank you very much for answering!

ADD REPLY
2
Entering edit mode
6.4 years ago
h.mon 35k

You can only index bam files sorted by position, so if you sort them by name (samtools sort -n), you can't index it.

edit: I think assembling a mock reference using just one library isn't optimal: if you assemble the reference using all libraries, you will probably recover more loci at the end.

ADD COMMENT
0
Entering edit mode

Velvet doesn't work well if there are way too many reads, so he might not be better off combining a massive number of samples together.

ADD REPLY
0
Entering edit mode

I thought I should sort by name because then the reads would be in the same order in both the reference and sample aligned to it. So I probably have mistaken sorting by name for sorting by position. If I index both the reference and the .bam file, can they be used in samtools mpileup?

As for the reference using only the individual with most reads, I am just replicating the method used by the company that made the sequencing. I have no idea if making a reference using more than one sample would be better. I thought the way I did it was ok because the individuals represent species we think are closely related and recently originated, so I would not lose much data in the alignment (though approximately half of the reads of each sample do not align to the reference).

I will try sorting them by position! Thank you very much for your reply!

EDIT: I tried sorting before indexing (not using the -n option) and it worked! Also checked each .sam and .bam file using samtools "quickcheck" and no files returned errors, so I assume they are readable.

Tried running samtools "mpileup" but the following error returned: [mpileup] 2 samples in 2 input files <mpileup> Set max per-file depth to 4000 [E::sam_parse1] missing SAM header [W::sam_read1] Parse error at line 1

This is the command line used:

samtools mpileup -ug SampleOne_ref.fa.fai SampleOne_ref_QuerySampleTwo.sorted.bam -o SampleOne_ref_QuerySampleTwo_mpileup

The error is very literal, I know, but is it because I should convert the reference (.fa) into a .sam file or because the header is missing in the .bam file?

ADD REPLY
1
Entering edit mode

hello lleipnitz, U can try with the below command

samtools mpileup -d 250 -f reference.fa SampleOne_ref_QuerySampleTwo.sorted.bam > Output.mpileup.txt

Try hopes this works for you.

ADD REPLY
0
Entering edit mode

Hello mks002! I tried your command and it worked! Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 2736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6