Dear community,
I have huge BAM files (mapped using BBTools+coordinate-sorted via SAMTOOLS), which I used for variant calling. Now I would like to manually check specific positions/ variants in the mapped data using Tablet and/or IGV. Unfortunately, the BAM files are extremely large, rendering it impossible to download them from the HPC I am using and check them on my laptop.
My reference is a huge multi-fasta file containing thousands of transcripts.
To make the files more handy, I was aiming to extract only those portions of the BAM file, that correspond to my transcripts (~chromosomes) of interest.
I tried
samtools view -bh input.sorted.bam chr1 chr2 > output.bam
as suggested in a previous post (Subset Bam File According To List Of Contigs) but no reads were showing up in IGV or Tablet, when using either the complete fasta reference or a subset as genome input.
Also found this post: How To Split A Bam File By Chromosome. I donwloaded and installed bamtools, because the
bamtools split
option was suggested. Running bamtools split --help, I did not find any option to limit the output of the function to a specific subset of the reference contigs; splitting into all reference contigs will be an extreme amount of .bam subfiles.
After trying to find an option on my own, I would now be very thankful for any handy suggestions!
Thanks in advance,
Ella
What is the output of
samtools view -H input.sorted.bam
? My guess is that your chromosomes are simply not called chr1, chr2... but maybe 1,2, or so...because the command itself is fine.Thank you for the fast answer.
The output looks like this:
Is the problem the complexity of the long headers in my multi-fasta file?
This header says that there is a single chromosome/contig in the reference you aligned against, and this is called
Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3
. THere is no chr1, chr2...I´m sorry for my very unclear message, it contains thousands of lines with the format that I texted, each line looking like this:
"Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3" is the header of one of the fastas in my multi-fasta reference.
Sorry, I hope I explained it better now.
My guess is you may have created a headerless BAM after the subset. Be sure to include the header. Do you see the headers when you do
samtools view -H subset.bam
?Thank you for the fast answer.
I think I probably did a mistake in subsetting, as I did probably not use the whole header but only a part of the header. Using samtools view -H subset.bam produces a huge output, so it seems that subsetting didn´t work. I´ll try to subset using the whole header in single quotes and check again!
I´m not able to create a subset. :( I tried tons of versions of samtools view, for example the following:
On the one hand, I tried to use my subsetted fasta file as input, using the -T option (How to subset bamfile based on chromosome IDs):
On the other hand, I tried to use the fasta headers of interest in a text file or give all headers in quotation marks as suggested here: Subset Bam File According To List Of Contigs
And even to extract the bam corresponding to only one of the contigs I´m interested in; using single long headers in single quotation marks (e.g. 'Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3') via
samtools view -bh bamfile.sam-fixmate-sorted-deduped.bam 'Pp3c17_14230V3.4 pacid=32906262 locus=Pp3c17_14230 annot-version=v3.3' > test.bam
Either I get some error, or I get a test.bam, which, when I run
samtools view -H subset.bam
, still contains all the headers, as the source bam file.No clue what I am doing wrong...
Result file will still contain the original headers. Issue appears to be that you are unable to use those long names with spaces.
Okay, I understand. I was also not able to succeed with the BED file. I will now reduce the headers in my reference to the IDs (e.g. "Pp3c17_14230V3.4") and re-run mapping and calling with those. I hope I will finally succeed then...
You may be able to do this by using the tool
reformat.sh
from BBTools without remapping.Can you try:
This should truncate the reference names after the first space.
In future you can add the
trd=t
flag to yourbbmap.sh
command to shorten the header names. BBMap passed through long names as is as opposed to other aligners (most?) that truncate the names after first space.I think I even used
reformat.sh
earlier today, but don´t remember with what arguments. I now tried what you suggested, but got an error:In terms of adding
trd=t
to thebbmap.sh
, thank you for that information, cool! I started to re-run scripts now with short-header-fasta as reference anyways; do you think the one or other option (using fasta with shortened headers vs.bbmap.sh
withtrd=t
option) would have other advantages or disadvantages?No advantage. In case you wanted to retain the long headers you could reformat the BAM afterwards.
Not sure why you got the error above. Perhaps you are using a non-US locale.
It worked now, after re-mapping with short-headed FASTA! :) Thanks a lot again for your time.
Hi, just a little feedback: I tried reformat.sh as you suggested, as well, and it also fixed the problem :) Thanks again!
A better option may be to create small BAM's using the region you are interested in. You can do that by providing co-ordinates or a BED file containing regions of interest.
Thank you for the suggestion. I think I did something similar some years ago. Will try it! So hard to get back into the topic after a long period of only wetlab...