Entering edit mode
3.1 years ago
bpf24
•
0
Hi,
I assume I am doing something really thick but cannot for the life of me restrict my bam files to chromosome only fields and remove scaffolds. I have run samtools as suggested in other posts but every time I check the file, all the headers are still present with their reads. Below are my commands leading up to it from bowtie2 alignment
samtools view -b in.sam > out.bam
samtools sort in.bam -o out.bam
samtools index in.bam
samtools view -b in.bam 1 > out.bam
or
samtools view -b -o out.bam in.bam `seq 1 22`
or even creation of index.bed file from fa
awk '/^[0-9]*\t/ {prin^C("%s\t0\t%s\n",$1,$2);}' genome.fa.fai > out.bed
samtools view -L out.bed -o out.bam in.bam
Every time I view any of these files (samtools view -H out.bam) all the headers and reads are outputted.
Any help would be very appreciated!
Cheers,
Ben
what are you doing here ? whay are your producing out.bam twice and indexing in.bam at the end ?
Just haven't piped one into another; first is to convert sam to bam, second is to sort by co-ordinate as well as create an index. Without these steps, samtools view outputs an error when I try to grab chromosome related entries only
You don't need to do this. Firstly, it's pointless producing a compressed BAM when you're instantly decoding it again and outputting a new BAM. But more importantly, it's totally redundant anyway as sort can read in SAM directly.
The old idiom of SAM -> BAM and then BAM -> sorted BAM is something that should have died out many years ago, but is sadly oft repeated in bioinformatics memes.
the dictionary in the BAM/SAM will not be removed; All lines starting with @SQ will be kept
show us an example of read
Below is the output. I haven't included further lines of scaffold entries but you get the idea. So am i getting confused with samtools view -H, and the reads have only been preserved for chromosome 1-22 and not scaffold etc.? How do I check this?
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
your first problem is : your BAM is corrupted:
as I said above, those lines will never be removed even if you select a region. Those @SQ lines are here to help tools to check that you're working with compatible references.
I said
I don't want to see the header. I want to see a read validating what you said:
aaaahhhh apologies, I was mistaken with what I was looking at. Once I printed the reads using the command below I can see it has removed them. My mistake and thanks for your help!!!