RNA-seq of 3'-UTRs
4.6 years ago
ntsopoul ▴ 60

Hi, I would like to align trimmed and filtered FASTQ files to the mm9 (Mus musculus) reference genome with the STAR aligner in order to analyze 3'-UTRs.

My downstream applications did not work, and I wonder whether I used the right genome and the right annotation file.

From ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/ I downloaded and used:

Mus_musculus.GRCm38.dna_sm.primary_assembly.fa (DNA, FASTA)
Mus_musculus.NCBIM37.67.gtf

Would it be better to use cDNA, i.e. Mus_musculus.NCBIM37.67.cdna.all.fa.gz (cDNA, FASTA)?

Thanks for helping

RNA-Seq Assembly rna-seq alignment • 1.5k views

What downstream application did you use, and what error did it produce? cDNA would also contain 5'-UTRs and coding regions. I think it would be better to extract the 3'-UTR regions from the genome using the GTF file and run salmon on them to count the reads mapping to these 3'-UTRs.

You can also try featureCounts and specify "three_prime_utr" as the feature type to count on the SAM/BAM files generated by STAR.
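
Before doing either, it can help to check that the GTF really contains rows labelled "three_prime_utr" and to see what those intervals look like. Below is a minimal sketch, not a definitive implementation: it assumes a standard 9-column, tab-separated GTF, and the toy records and gene IDs (GENE_A, GENE_B) are made up for illustration. Not every annotation file uses this feature label, so an empty result on a real file would suggest looking at the third column to see which labels are present.

```python
import re

# Pulls the gene_id value out of the GTF attribute column (column 9).
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def three_prime_utr_intervals(gtf_lines, feature="three_prime_utr"):
    """Yield (chrom, start0, end, gene_id, strand) for each row of the given feature type."""
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = GENE_ID.search(fields[8])
        gene_id = match.group(1) if match else "."
        # GTF coordinates are 1-based inclusive; BED-style output is 0-based half-open.
        yield (fields[0], int(fields[3]) - 1, int(fields[4]), gene_id, fields[6])


# Hypothetical toy records, only to show the expected input shape.
toy_gtf = [
    "#!toy header",
    "\t".join(["1", "toy", "exon", "100", "500", ".", "+", ".",
               'gene_id "GENE_A"; transcript_id "TX_A";']),
    "\t".join(["1", "toy", "three_prime_utr", "401", "500", ".", "+", ".",
               'gene_id "GENE_A"; transcript_id "TX_A";']),
    "\t".join(["2", "toy", "three_prime_utr", "1000", "1200", ".", "-", ".",
               'gene_id "GENE_B"; transcript_id "TX_B";']),
]

utr_rows = list(three_prime_utr_intervals(toy_gtf))
for row in utr_rows:
    print("\t".join(str(value) for value in row))
```

For a real file, the same function can be fed an open file handle, for example `three_prime_utr_intervals(open("Mus_musculus.NCBIM37.67.gtf"))`.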


Hi, and thanks for the quick answer! The downstream application was APAlyzer (https://bioconductor.org/packages/release/bioc/html/APAlyzer.html). It is a program that determines which polyadenylation site of a gene is used (many genes have several polyadenylation sites). It takes your BAM files and compares them against a list of known APA sites to find out which polyA site was used by which gene. However, I get back a data frame full of NA and 0. Since the pipeline worked before with BAM files that I downloaded from GEO, I wanted to repeat the same analysis with other files that I have in FASTQ format. So I trimmed and filtered with Trimmomatic, checked the quality, and aligned the reads with STAR. Up to this point it seems to work, but I was not sure whether I used the right genome for alignment. Can't I use the whole genome for alignment?


I think I found the problem. The list of known APA sites has the chromosome names in UCSC format, e.g. "chr1", while my alignment produced chromosome names in Ensembl format, e.g. "1". Is there anything else that I have to be aware of if I use one format or the other?
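
A mismatch like this can be confirmed by intersecting the two sets of chromosome names, and the primary chromosomes can be renamed with a simple rule. The sketch below is only an illustration under stated assumptions: the name lists are typed in by hand rather than read from a BAM header or the APA site table, the helper names are made up, and only mouse chromosomes 1-19, X, Y and MT are handled. Scaffolds and other non-primary sequences are named differently in the two conventions and would need a proper lookup table, so the function returns None for them instead of guessing.

```python
# Mouse primary chromosomes in Ensembl style (the mitochondrion is handled separately).
MOUSE_PRIMARY = {str(i) for i in range(1, 20)} | {"X", "Y"}


def ensembl_to_ucsc(name):
    """Map an Ensembl-style chromosome name to UCSC style; None if there is no simple rule."""
    if name.startswith("chr"):
        return name  # already UCSC style
    if name == "MT":
        return "chrM"  # the mitochondrial genome is named differently, not just prefixed
    if name in MOUSE_PRIMARY:
        return "chr" + name
    return None  # scaffolds and similar: needs an explicit mapping table


def shared_names(names_a, names_b):
    """Return the chromosome names that appear in both collections."""
    return set(names_a) & set(names_b)


# Hand-typed example names; in practice these would come from the BAM header
# and from the chromosome column of the known APA site list.
bam_names = ["1", "2", "X", "MT"]
apa_names = ["chr1", "chr2", "chrX", "chrM"]

before = shared_names(bam_names, apa_names)
converted = [ensembl_to_ucsc(name) for name in bam_names]
after = shared_names(converted, apa_names)

print("shared before renaming:", sorted(before))
print("shared after renaming: ", sorted(after))
```

An empty intersection before renaming is consistent with a result table full of NA and 0, because no read can overlap a site on a chromosome whose name never matches.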

Thanks!


You have already used the whole genome for alignment, but you used Mus_musculus.GRCm38.dna_sm.primary_assembly.fa, which is a soft-masked genome assembly. If there is no particular reason to use a masked assembly, you should use Mus_musculus.GRCm38.dna.primary_assembly.fa, which has no masking.

There are some nomenclature differences between the files on different websites, but everything else is pretty much the same. Also, please use ADD COMMENT/ADD REPLY to post updates.


Thanks a lot! Where do you think the masking could cause problems?


I was wrong above in suggesting to use the unmasked genome. See this.

STAR does not discriminate between soft-masked (lowercase) and unmasked genomes; I am not sure about other aligners. So it should not make any difference.
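
To make the distinction concrete: soft-masking only changes the letter case of repeat and low-complexity bases, so the sequence itself is intact, whereas hard-masking replaces those bases with N and does discard information. The sketch below is a toy illustration, with a made-up sequence and made-up helper names; it says nothing about how any particular aligner reads the file.

```python
def soft_masked_fraction(sequence):
    """Fraction of bases written in lowercase, i.e. soft-masked."""
    bases = [base for base in sequence if base.isalpha()]
    if not bases:
        return 0.0
    return sum(base.islower() for base in bases) / len(bases)


def hard_mask(sequence):
    """Replace every soft-masked (lowercase) base with N, as a hard-masked file would."""
    return "".join("N" if base.islower() else base for base in sequence)


# Made-up toy sequence: the lowercase run stands in for a masked repeat.
soft = "ACGTacgtacGGCC"
unmasked = soft.upper()    # same bases, case information dropped
hard = hard_mask(soft)     # masked bases are gone for good

print("soft-masked fraction:", round(soft_masked_fraction(soft), 3))
print("unmasked:   ", unmasked)
print("hard-masked:", hard)
```

Uppercasing the soft-masked sequence recovers the unmasked one exactly, which is why a case-insensitive tool sees no difference between the two files.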


ntsopoul, please stop using the answer field for discussions. Use ADD REPLY and ADD COMMENT instead; that keeps the thread logically organized.
