Salmon/Sailfish for repeat element quantification?
2
2
Entering edit mode
8.6 years ago
umn_bist ▴ 390

If, instead of providing a transcriptome fasta, I use a consensus repeat element fasta (from UCSC/Repbase), would I be able to (accurately) count repeat element expression levels? Is there something conceptually/technically different between counting transcript and repeat elements?

Sailfish https://www.cs.cmu.edu/~ckingsf/software/sailfish/

Salmon https://combine-lab.github.io/salmon/

RNA-Seq DE • 4.2k views
ADD COMMENT
1
Entering edit mode

Take a look at @Devon's answer here: Alignment and mapping

ADD REPLY
0
Entering edit mode

Thank you. Just to make sure that I am understanding, according to @Devon, Salmon can map reads to the reference (in my case, consensus repeat element) without having to worry about read variants confounding my quantification results. Then I will be able to quantify repeats using Salmon, which is a huge time saver.

I'm curious, though, when counting repeats, it seems multimapped reads must be carefully weighed/removed. Is this something I should be cautious of when using Salmon. At any rate, this sounds too good to be true. Thank you again @genomax2 for your help.

ADD REPLY
0
Entering edit mode

You should consider the difference between mapping/alignment carefully.
"Mapping" to repeat regions could lead to issues with multi-mapping (as an aside we should start calling reads "multi-aligners", if they are being aligned). I have not tried this myself (someone else can comment who has) but as these things go test first. Mapping and regular alignment with a sample or two before committing yourself. I do not think Sailfish was designed with repeats in consideration (there does not appear to be any reference to repeats in Sailfish paper).

ADD REPLY
1
Entering edit mode

I have a tool in development for this purpose (the github repo name isn't exactly obfuscated...). At the moment it doesn't do the Salmon/Kallisto/Sailfish/etc. style equivalence class construction followed by EM (just "unique" counts of alignments to repeat families/classes/etc.), but I'm hoping to add that in in the next couple weeks. I'll try to post again when/if I get that working correctly. We have a large group working on repeats so we have a pressing need for this internally as well.

ADD REPLY
1
Entering edit mode

Part way through writing the aforementioned tool a colleague mentioned that a program called TEtranscripts already exists and essentially does all of this. Indeed, it already does what I was planning and umn_bist seems to want. However, it's painfully slow, in part because it wants you to feed it all of your samples at once (because it runs DESeq on the resulting counts). I've made a fork of the original package and wrote a new tool for it called TEcount. It's like rsem/sailfish/etc. for repeat elements. So perform an alignment to the genome with STAR (or whatever else you like), allow a bunch of multimappers, and feed each resulting BAM file individually into TEcount. The output is 4 tables of counts that are compatible with DESeqDatasetFromHTseqCount() in R: one for genes, one for TE elements quantified by name, one for TE elements quantified by family, and finally a table of TE elements quantified by class. For me, analysing family-level changes tends to be the most fruitful, but your mileage may vary.

BTW, the program is still a bit slow (it is pure python, so that's what happens), but since you can run each sample in parallel you can still get on with things in a reasonable amount of time.

I'll make a PR against the original tetoolkit repository so hopefully this can get included in the official package.

ADD REPLY
0
Entering edit mode

Hi,

I'm trying to use you tool TEcount in my RNA-seq data using this code:

TEcount -b file.bam --GTF hg19UCSC.gtf --TE hg19repeakmasker.gtf --prefix outputfile

but I get this error:

chr1 hg19_rmsk exon 10001 10468 1504.000000 + . gene_id "(CCCTAA)n"; transcript_id "(CCCTAA)n"; TE GTF format error! There is no annotation at line 1. Error in building gene/TE index

I got GTF files from UCSC Table Browser and they look like this:

gema@Envy-G:~$ cat hg19UCSC.gtf | head -n 5

chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene exon 12646 12697 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";

gema@Envy-G:~$ cat hg19repeatmasker.gtf | head -n 5

chr1 hg19_rmsk exon 16777161 16777470 2147.000000 +. gene_id "AluSp"; transcript_id "AluSp"; chr1 hg19_rmsk exon 25165801 25166089 2626.000000 -. gene_id "AluY"; transcript_id "AluY"; chr1 hg19_rmsk exon 33553607 33554646 626.000000 +. gene_id "L2b"; transcript_id "L2b"; chr1 hg19_rmsk exon 50330064 50332153 12545.000000 +. gene_id "L1PA10"; transcript_id "L1PA10"; chr1 hg19_rmsk exon 58720068 58720973 8050.000000 -. gene_id "L1PA2"; transcript_id "L1PA2";

ADD REPLY
0
Entering edit mode

I solved it, the TE gtf file didn't contain info for class and family, I used the file TE hg19_rmsk_TE.gtf provided in testdata_GTF folder and worked perfectly.

ADD REPLY
0
Entering edit mode

You wouldn't be able to tell how accurate the counts are, unless you verify using an independent method (e.g. Northern Blot). Just a thought for the overall experimental design.

ADD REPLY
0
Entering edit mode
8.6 years ago
SES 8.6k

This is such a complex question that it is difficult to answer directly. Part of the challenge is the fact that it is an understudied area that is lacking specific tools or models to approach the problem. The tools you mention are definitely more appropriate for the question than traditional approaches that involve mapping followed by some gene-based model to determine expression. I would start with those you mention and compare to a simple k-mer based approach to quantify expression. Sorry I did not give more details, but I'm not familiar with the underlying assumptions of those programs.

Aside from the technical issues, the biology of transposon expression is much more complicated, so we need to take some caution with interpreting the results. With genes we often try to associate expression levels with a phenotype but that is quite hard to do with transposons. You might intuitively think that expression would be positively correlated with new insertions or copies, but that is not the case in most systems, and may even be inversely correlated. I'm not sure what you are specifically trying to address but the meaning of the results would be dependent on the study system, tissue type, and the type of transposon. It is hard to paint broad strokes with this type of data, but hopefully it will be easier in the future.

ADD COMMENT
0
Entering edit mode
8.6 years ago
thackl ★ 3.0k

In general I like the idea - and actually I used similar approaches myself to at least assess repeat copy numbers etc.. A major problem, however, is that the consensus repeat sequences often represent clusters with some sequences of only about 75%-85% identity. Transcriptome quantification methods (read or kmer based) are way to strict to cover this diversity. So you will only be able to map a fraction of the reads to the corresponding repeat sequences.

ADD COMMENT
0
Entering edit mode

The consensus issue is a good point, that is another factor that would complicate interpretations.

ADD REPLY

Login before adding your answer.

Traffic: 1801 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6