Removing uncovered transcripts from multi FASTA reference file
1
0
Entering edit mode
3.2 years ago
ella • 0

Hi everyone :)

for RNASeq analyses, I have a reference file, containing multiple transcript sequences (it´s a subset of the NCBI human hg38 transcriptome). I found, that some of the transcripts are not even covered by a single read (especially if there are several transcript variants) and would like to exclude them from the file. Is there a way how I could filter those sub-FASTAS of my FASTA, that are covered by less than X reads?

I tried to search for an answer, but didn´t find any helpful posting, yet.

Thanks a lot in advance and have a nice day :)

RNASeq NGS Mapping reference FASTA • 1.4k views
ADD COMMENT
1
Entering edit mode
3.2 years ago
GenoMax 147k

that some of the transcripts are not even covered by a single read

What did you use to align the data and how did you handle multi-mapping reads?

You should consider using a program like salmon (LINK) that will distribute reads across a set of transcripts. It is fast and is the appropriate way of handling data that came from a set of alternately spliced transcripts.

ADD COMMENT
1
Entering edit mode

Seconding this. It will produce a table with the transcript name and the counts for that transcript. From there it is just parsing out the transcripts that have a zero and then remove them from the original fasta file, either with grep, samtools faidx or similar approaches. Does that make sense to you? Or you load the file into R and use the Biostrings package for the subsetting.

ADD REPLY
0
Entering edit mode

Thank you as well ATpoint, sounds like this was exactly what I was looking for :) I will try that!

ADD REPLY
0
Entering edit mode

Thanks for your fast answer.

I´m using BBMap for mapping. As I did not use the "ambig=" option, it should by default map the multi-mapping reads to the first encountered best site.

Great, thanks for the suggestion! I will have a closer look to salmon :)

ADD REPLY
0
Entering edit mode

Hi GenoMax, I was able to install and use salmon, as suggested by you. :) After a storage node crash at the computing center I am using, I have to re-install salmon. Unfortunately error after error after error occurs and I did not yet manage to set it up again. I think the cmake installation is causing the main problem... Is there any other program or R package to simply give me the number of reads, mapped to each fasta of my reference file (using a BAM file as input)? Or is there a way to do it via samtools?

Thanks again for your help!

ADD REPLY
1
Entering edit mode

Consider using conda to re-install salmon.

to simply give me the number of reads, mapped to each fasta of my reference file (using a BAM file as input)?

You can try samtools idxstats with a sorted/indexed BAM file.

ADD REPLY

Login before adding your answer.

Traffic: 2268 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6