I am now trying to run bam-readcount. My sample is human, but it takes almost 2-3 days to run one BAM file. Is that normal? The size of my file is 8,000,000 KB, and I am a little worried that the time it takes is too long.
Above you are running the tool on the entire genome, which is unnecessary and slow; you only need the positions where you have variants. Here is a suggestion, given a BAM file called `your.bam` that is indexed by samtools and a list of variants `your.vcf`:
## Use the variant file to create a BED file with the variants for which `bam-readcount` extracts information:
grep -v '^#' your.vcf \
| awk 'OFS="\t" {print $1, $2-1, $2+1}' \
| sort -k1,1 -k2,2n \
| bedtools merge -i - > your.bed
## Use samtools to pass only genomic positions to bam-readcount where variants are located:
samtools view -b -L your.bed your.bam > your_subset.bam
bam-readcount (...command...) your_subset.bam
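The VCF-to-BED step above can be sanity-checked on a toy example. The VCF below is fabricated for illustration, and `bedtools merge` is omitted since these intervals don't overlap:

```shell
# Tiny fabricated VCF to check the coordinate shift.
# VCF POS is 1-based; BED is 0-based half-open, so a variant at
# position P becomes the padded interval [P-1, P+1).
cat > toy.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 14539 . T C . PASS .
chr2 100 . G A . PASS .
EOF

grep -v '^#' toy.vcf \
  | awk 'OFS="\t" {print $1, $2-1, $2+1}' \
  | sort -k1,1 -k2,2n > toy.bed

cat toy.bed
```

Real VCFs are tab-separated; spaces are used here only for readability, and awk splits on any whitespace either way.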
Sorry to trouble you again, but I have run into an annoying problem. When I tried to run fpfilter from VarScan, I got results like:
171661 variants in input file
171466 had a bam-readcount result
47692 had reads1>=2
0 passed filters
171661 failed filters
195 failed because no readcounts were returned
425 failed minimim variant count < 4
34 failed minimum variant freq < 0.05
13936 failed minimum strandedness < 0.01
47716 failed minimum reference readpos < 0.1
171466 failed minimum variant readpos < 0.1
47716 failed minimum reference dist3 < 0.1
171466 failed minimum variant dist3 < 0.1
0 failed maximum reference MMQS > 100
0 failed maximum variant MMQS > 100
0 failed maximum MMQS diff (var - ref) > 50
29 failed maximum mapqual diff (ref - var) > 50
2209 failed minimim ref mapqual < 15
9102 failed minimim var mapqual < 15
1051 failed minimim ref basequal < 15
134 failed minimim var basequal < 15
0 failed maximum RL diff (ref - var) > 0.25
And when I run bam-readcount, I get a stream of warnings like:
Couldn't find the generated tag.
Couldn't find single-end mapping quality. Check to see if the SM tag is in BAM.
I have looked into these warnings, and found that they can be ignored as long as I don't use the single-end counts. But I am not sure whether these warnings are what is making my filter fail. Here is some of my bam-readcount output:
chr1 14539 T 4 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:4:18.25:31.50:0.00:4:0:0.00:0.02:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
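Each per-base entry in that output packs colon-separated statistics. The field order below (base, count, avg mapping quality, avg base quality, avg single-end mapq, plus-strand count, minus-strand count, avg position as fraction, ...) is assumed from the bam-readcount README, so verify it against your version; a quick awk split pulls out the values fpfilter looks at:

```shell
# Split one bam-readcount per-base entry (the T field from the output above).
# Field order is assumed from the bam-readcount README -- check your version.
field='T:4:18.25:31.50:0.00:4:0:0.00:0.02:0.00:0:0.00:0.00:0.00'
echo "$field" | awk -F':' '{
  printf "base=%s count=%s avg_mapq=%s avg_baseq=%s avg_pos_frac=%s\n",
         $1, $2, $3, $4, $8
}'
# base=T count=4 avg_mapq=18.25 avg_baseq=31.50 avg_pos_frac=0.00
```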
Could you please give me some advice on the failed filtering?
I haven't used that tool in a while and actually gave up on it, because I never found proper documentation on things like this fpfilter. The tool is no longer maintained and is pretty slow unless one invests time in custom parallelization. It served its purpose for the community, but today I would use a more recent, maintained tool.
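As a rough sketch of the custom parallelization alluded to above: fan out one job per chromosome with plain shell background jobs. `run_readcount` here is a hypothetical stand-in doing trivial work; in practice it would wrap the real per-region bam-readcount invocation:

```shell
# Per-chromosome fan-out with shell background jobs.
# run_readcount is a placeholder; replace its body with the real
# bam-readcount call restricted to the chromosome passed as $1.
run_readcount() {
  echo "processed $1" > "out.$1.txt"   # stand-in for the real command
}

for chrom in chr1 chr2 chrX; do
  run_readcount "$chrom" &   # launch one job per chromosome
done
wait                         # block until all jobs finish

cat out.chr1.txt out.chr2.txt out.chrX.txt > out.all.txt
```

GNU parallel achieves the same fan-out with job-count control, which matters when the chromosome list is long.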
Maybe I'd better try more recent tools too. Luckily, I have just found a review of this area, and I think it will help a lot. I tried Mutect2 before, but maybe because I only provided two BAM files (a tumor BAM and a normal BAM), the results did not seem to exclude germline SNPs. I now guess it would do better if I provided the gnomAD resource as well, so I am going to try that tool again with the gnomAD files (they are genuinely hard to download from my region). Anyway, thank you for your help! It does help a lot!
Consider using mosdepth (https://github.com/brentp/mosdepth) instead. It is multi-threaded. 8 GB is not that large a BAM file, so 2-3 days sounds excessive. Is bam-readcount a single-threaded application?

I am not sure whether bam-readcount is single-threaded, but the time it takes is even longer than 2-3 days. I am going to try mosdepth now. Thanks for your advice!

mosdepth and varscan are not compatible, don't even try. Split the file into chromosomes and use GNU parallel to parallelize. See the answer below.

Oh! Is that so? Then I'll try GNU parallel. Thanks a lot! Actually, I am a little worried: the output I got from bam-readcount is as large as 200,000,000 KB from an input of 8,000,000 KB. I am not sure if that is normal. The code I use is quite simple, as: