Proper Way To Map RNA-Seq Data Against A Single Gene (Or A Small Number Of Genes)
2
3
10.7 years ago
Jason ▴ 60

I have a large Illumina RNA-Seq dataset that I have already mapped to the reference genome with STAR and quantified. Now I want to look at the expression of GFP, which is not native to the species (this is a transgenic mouse).

I imagine the 'proper' way to do this is to create a new reference genome with the GFP gene added as an extra chromosome. But this would then require a lot of duplicated work, space, and time.

What I tried to do is create a new reference index containing only the GFP gene and align against that, but STAR creates a 1.5 GB index for this single gene, and what if I want to do this with more genes? This seems to be using STAR outside the type of work it was originally designed for. Or is this in fact the correct approach?

EDIT:

Am I missing anything obvious here, like using BLAST or BLAT (I don't have any experience with these older tools)? Thanks.

gene rnaseq mapping alignment • 8.7k views
0

Is GFP fused to something or is it being expressed by itself? You might just try bowtie2 or bwa, which should have smaller indexes and be fast enough for your purposes.

BTW, do you have the unmapped reads (this is an option for STAR)?

0

Expressed by itself. Does that make a difference? And no, I didn't save the unmapped reads from the original mapping.

0

Only in that, if it were fused to something else, you might get somewhat better results by using the sequence of the whole fusion construct as the reference. Otherwise, no, that doesn't matter too much. Too bad you didn't save the unmapped reads; that would have made life simple :)

0

Wouldn't that affect the alignment rate, so the counts from the native genes wouldn't be comparable to the GFP counts?

0

Hi,

I have a similar question. I have a TE FASTA file (generated with bedtools) that looks like this:

>Chr1:11896-11976(+)
CCCTTTCTTAGCAAATTGATCATCATCGCCATCATCACCATCATCATTATCATCATCATGATCAGTCGATAAATTTAGTC
>Chr1:16882-17009(-)
TTACACCCCATACCTTCCTAGTTTTATCTATGTACGTAGCAGCTTTTTAAAACGACCAAATTCTTAGCATTTCTCTATGGCTATAGGACAGTACGTTGTATAGAAAAGTTTAAATTGAAAAACAAAA
>Chr1:17023-18924(+)
TTAGGAAATACATTTTAAATAT...

How can I index this 'genome' with STAR?

I would like to map reads to it. The TEs are part of the original complete FASTA file, so maybe identifying them after mapping to the whole genome is a better way?

Cheers,
Mathieu

1

Please post things like this as new questions.

I would recommend that you do the following:

  1. Delete the TE fasta file, you don't need it.
  2. Align against the whole genome.
  3. Use the BED file that you used with BEDtools to subset the alignments according to whether they overlap one of your TEs.

Doing it that way will produce fewer false positives and a higher overall alignment rate.
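Step 3 can be sketched roughly as follows. This is only an illustration of the overlap test, assuming BED-style 0-based, half-open coordinates; the function names are made up for this example, and the intervals are the ones from the FASTA headers above. In practice a tool such as `bedtools intersect` or `samtools view -L` would do this directly on the BAM file.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed_intervals(lines):
    """Group BED intervals (0-based, half-open) by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom


def overlaps_te(by_chrom, chrom, start, end):
    """Return True if the alignment [start, end) overlaps any TE on chrom."""
    intervals = by_chrom.get(chrom, [])
    # Only intervals that start before the alignment ends can overlap it.
    idx = bisect_right(intervals, (end - 1, float("inf")))
    return any(te_end > start for _, te_end in intervals[:idx])


# Hypothetical usage with the three TEs shown in the question.
te_bed = load_bed_intervals([
    "Chr1\t11896\t11976",
    "Chr1\t16882\t17009",
    "Chr1\t17023\t18924",
])
print(overlaps_te(te_bed, "Chr1", 11950, 12050))
print(overlaps_te(te_bed, "Chr1", 12000, 12100))
```

The scan over candidate intervals is linear per query, which is fine for a sketch but not for millions of alignments.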

0

Why All The Capitals haha ;-) ?

1
10.7 years ago
seidel 11k

I've done this with bowtie to count GFP or the ERCC spike-in controls. A bowtie index of GFP and a few other genes came out to 4 MB.
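A minimal sketch of this approach, assuming bowtie2 (not the original bowtie) and single-end reads. The file names `gfp.fa`, `gfp_index`, `reads.fastq` and `gfp.sam` are placeholders, and the helper names are made up for illustration. It builds the two command lines and counts primary alignments per reference sequence from SAM text:

```python
import shlex
from collections import Counter


def bowtie2_commands(fasta, index_prefix, reads_fastq, out_sam, threads=4):
    """Return (build, align) argument lists for a small bowtie2 index."""
    build = ["bowtie2-build", fasta, index_prefix]
    align = [
        "bowtie2", "-p", str(threads),
        "--no-unal",          # do not write unaligned reads to the SAM file
        "-x", index_prefix,
        "-U", reads_fastq,    # single-end reads (assumption)
        "-S", out_sam,
    ]
    return build, align


def count_hits(sam_lines):
    """Count primary alignments per reference name in SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):        # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x900:
            continue
        counts[fields[2]] += 1
    return counts


build_cmd, align_cmd = bowtie2_commands("gfp.fa", "gfp_index", "reads.fastq", "gfp.sam")
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```

With bowtie2 installed, each list can be passed to `subprocess.run(cmd, check=True)`, and `count_hits(open("gfp.sam"))` then gives the per-gene read counts.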

0

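I didn't think the indices would be so much smaller, but I guess the Burrows-Wheeler transform of a small sequence is itself small (unlike STAR's uncompressed suffix array and its pre-indexing lookup table, whose size is set by a parameter rather than by the sequence length). Thanks!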
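For what it's worth, the STAR manual suggests lowering `--genomeSAindexNbases` for small genomes to min(14, log2(GenomeLength)/2 - 1), which is probably why the default settings give such a large index for a single gene. A small sketch of that formula, assuming a GFP sequence of about 720 bp (an assumption; use the real length) and rounding down to an integer (also my choice):

```python
import math


def suggested_sa_index_nbases(genome_length, default=14):
    """Suggested --genomeSAindexNbases: min(14, log2(length)/2 - 1), floored."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    scaled = int(math.log2(genome_length) / 2 - 1)
    # Keep the value at 1 or above for very short references (assumption).
    return max(1, min(default, scaled))


# 720 bp is an assumed GFP length; 3e9 bp stands in for a mammalian genome.
for length in (720, 100_000, 3_000_000_000):
    nbases = suggested_sa_index_nbases(length)
    print(f"{length} bp -> --genomeSAindexNbases {nbases}")
```

As I understand it, the lookup table grows with 4 to the power of this value, so dropping it from 14 to a single-digit number should shrink the index for a tiny reference considerably.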

0
9.1 years ago

My understanding is that STAR's genome generation step accepts multiple FASTA files. You can probably keep the host genome and GFP as separate FASTA files (with corresponding GTF annotation) and index them together. Afterwards, check whether STAR actually reports alignments to the GFP reference.
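A rough sketch of what that indexing command could look like. The paths are placeholders, the helper name is made up, and it assumes the host and GFP annotations have been concatenated into one GTF file, since the sketch passes a single file to `--sjdbGTFfile`:

```python
import shlex


def star_genome_generate(genome_dir, fasta_files, gtf_file=None,
                         threads=8, sjdb_overhang=None):
    """Build a STAR genomeGenerate command for one or more FASTA files."""
    cmd = [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        # Several FASTA files are passed as a space-separated list.
        "--genomeFastaFiles", *fasta_files,
    ]
    if gtf_file:
        cmd += ["--sjdbGTFfile", gtf_file]
    if sjdb_overhang is not None:
        cmd += ["--sjdbOverhang", str(sjdb_overhang)]
    return cmd


# Placeholder file names: host genome plus GFP as a separate FASTA.
cmd = star_genome_generate(
    "star_index_with_gfp",
    ["mouse_genome.fa", "gfp.fa"],
    gtf_file="mouse_plus_gfp.gtf",
    sjdb_overhang=99,
)
print(shlex.join(cmd))
```

The GFP sequence then behaves like an extra chromosome, so its reads go through the same alignment and counting as the native genes.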
