Hi everyone :)
For RNA-Seq analyses, I have a reference file containing multiple transcript sequences (it's a subset of the NCBI human hg38 transcriptome). I found that some of the transcripts are not covered by a single read (especially where there are several transcript variants) and would like to exclude them from the file. Is there a way to filter out those sequences of my FASTA that are covered by fewer than X reads?
I tried to search for an answer, but didn't find a helpful post yet.
Thanks a lot in advance and have a nice day :)
Seconding this. It will produce a table with the transcript name and the counts for that transcript. From there it is just a matter of parsing out the transcripts that have a zero count and removing them from the original FASTA, either with grep, samtools faidx, or similar approaches. Does that make sense to you? Alternatively, load the file into R and use the Biostrings package for the subsetting.
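For example, a rough sketch of that filtering on the command line, assuming salmon's quant.sf as the count table (transcript name in column 1, NumReads in column 5) and placeholder file names:

# keep transcripts with at least min reads (here min = 1), skipping the header line
awk -v min=1 'NR > 1 && $5 >= min {print $1}' quant.sf > keep_ids.txt
# extract those transcripts from the reference FASTA
# (-r/--region-file needs a reasonably recent samtools; otherwise: xargs samtools faidx transcripts.fa < keep_ids.txt)
samtools faidx transcripts.fa -r keep_ids.txt > transcripts.filtered.fa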
Thank you as well ATpoint, sounds like this was exactly what I was looking for :) I will try that!
Thanks for your fast answer.
I'm using BBMap for mapping. As I did not use the "ambig=" option, it should by default map multi-mapping reads to the first best site encountered.
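A sketch of what that looks like as a command (placeholder file names; ambig=best should correspond to that default):

# paired-end reads against the transcript reference; drop in2= for single-end data
bbmap.sh ref=transcripts.fa in=reads_R1.fq.gz in2=reads_R2.fq.gz out=mapped.sam ambig=best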
Great, thanks for the suggestion! I will have a closer look at salmon :)
Hi GenoMax, I was able to install and use salmon, as you suggested :) After a storage node crash at the computing center I am using, I have to re-install salmon. Unfortunately I keep running into one error after another and have not managed to set it up again yet. I think the cmake installation is causing the main problem... Is there another program or R package that simply gives me the number of reads mapped to each sequence of my reference file (using a BAM file as input)? Or is there a way to do it via samtools?
Thanks again for your help!
Consider using conda to re-install salmon. You can try samtools idxstats with a sorted/indexed BAM file.
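In case it helps, a rough sketch of both suggestions with placeholder file names:

# re-install salmon from bioconda (assumes conda is set up with the conda-forge and bioconda channels)
conda install -c conda-forge -c bioconda salmon

# per-reference read counts from a BAM file
samtools sort -o mapped.sorted.bam mapped.bam
samtools index mapped.sorted.bam
# idxstats columns: reference name, sequence length, mapped reads, unmapped reads
samtools idxstats mapped.sorted.bam > counts.txt
# transcript names with at least one mapped read
awk '$3 > 0 {print $1}' counts.txt > keep_ids.txt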