4.6 years ago
ntsopoul
Hi, I would like to align trimmed and filtered FASTQ files to the mm9 (Mus musculus) reference genome with the STAR aligner in order to analyze 3'-UTRs.
My downstream application did not work, and I wonder whether I used the right genome and the right annotation file.
I downloaded and used the following files from ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/
Mus_musculus.GRCm38.dna_sm.primary_assembly.fa (DNA, FASTA) and Mus_musculus.NCBIM37.67.gtf (annotation, GTF)
Would it be better to use cDNA instead, i.e. Mus_musculus.NCBIM37.67.cdna.all.fa.gz (cDNA, FASTA)?
Thanks for helping
What downstream application did you use, and what error did it produce? cDNA would also contain 5'-UTRs and coding regions. I think it would be better to extract the 3'-UTR regions from the genome using the GTF file and run salmon on them to count the reads mapping to these 3'-UTRs.
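As a rough illustration of the extraction step, here is a minimal Python sketch that pulls 3'-UTR intervals out of a GTF and converts them to BED-style coordinates. The function name `gtf_features_to_bed` is made up for this example, and it assumes the GTF actually contains rows whose feature column is "three_prime_utr". Older Ensembl GTFs may only list exon, CDS, start_codon and stop_codon rows; in that case this yields nothing and the UTRs would have to be derived from exon and CDS coordinates instead.

```python
def gtf_features_to_bed(gtf_lines, feature="three_prime_utr"):
    """Yield BED-like tuples (chrom, start, end, gene_id, score, strand)
    for every GTF row whose feature column equals `feature`.

    Assumption: the GTF uses a feature type named "three_prime_utr".
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        chrom, strand = cols[0], cols[6]
        start, end = int(cols[3]), int(cols[4])
        # Pull gene_id out of the attribute column; "." if it is missing.
        gene_id = "."
        for part in cols[8].split(";"):
            part = part.strip()
            if part.startswith("gene_id "):
                gene_id = part.split(" ", 1)[1].strip('"')
        # GTF is 1-based inclusive, BED is 0-based half-open.
        yield (chrom, start - 1, end, gene_id, ".", strand)
```

The resulting intervals could then be written to a BED file and used to pull sequences from the genome FASTA before indexing them with salmon.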
You can also try featureCounts and specify "three_prime_utr" as the feature type to count on the SAM/BAM files generated by STAR.
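For reference, the featureCounts call could be assembled roughly like this. It is only a sketch: the helper name `featurecounts_command` and the file names are placeholders, and it assumes the annotation contains "three_prime_utr" rows with a gene_id attribute. The helper only builds the argument list and does not run anything.

```python
def featurecounts_command(bam_files, gtf, out="counts.txt",
                          feature="three_prime_utr", group_by="gene_id"):
    """Build (but do not run) a featureCounts argument list.

    -a annotation file, -o output file,
    -t feature type to count, -g attribute used to group features.
    """
    return ["featureCounts", "-a", gtf, "-o", out,
            "-t", feature, "-g", group_by, *bam_files]


# Placeholder file names, for illustration only.
cmd = featurecounts_command(["sample1.bam"], "annotation.gtf")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once featureCounts is installed and the paths are real.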
Hi and thanks for the quick answer! The downstream application was APAlyzer (https://bioconductor.org/packages/release/bioc/html/APAlyzer.html). It is a program to determine the polyadenylation sites used by a gene (many genes have several polyadenylation sites). It takes your BAM files and compares them against a list of known APA sites to find out which polyA site was used by which gene. However, I get back a data frame full of NA and 0. The pipeline worked before with BAM files that I downloaded from GEO, so I wanted to repeat the same analysis with other files that I have in FASTQ format. I trimmed and filtered them with Trimmomatic, checked the quality and aligned the reads with STAR. Up to this point it seems to work, but I was not sure whether I used the right genome for alignment. Can't I use the whole genome for alignment?
I think I found the problem. The list of known APA sites has the chromosome names in UCSC format, e.g. "chr1", and my alignment produced chromosome names in Ensembl format, e.g. "1". Is there anything else I have to be aware of if I use one format or the other?
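To make the mismatch concrete, here is a minimal Python sketch of renaming Ensembl-style chromosome names to UCSC style in SAM text. The function names are made up for this example, and the mapping is an assumption that only covers the numbered chromosomes, X, Y and MT; unplaced scaffolds are named differently in the two schemes and are left untouched here, and names embedded in optional tags (e.g. SA) are not rewritten.

```python
def ensembl_to_ucsc(name):
    """Map "1" -> "chr1", "X" -> "chrX", "MT" -> "chrM".

    Anything else (scaffolds, "*", "=") is returned unchanged.
    """
    if name == "MT":
        return "chrM"
    if name.isdigit() or name in {"X", "Y"}:
        return "chr" + name
    return name


def rename_sam_line(line):
    """Rename reference names in one SAM header or alignment line."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@SQ"):
        # Header: the reference name lives in the SN: tag.
        fields = ["SN:" + ensembl_to_ucsc(f[3:]) if f.startswith("SN:") else f
                  for f in fields]
    elif not line.startswith("@"):
        fields[2] = ensembl_to_ucsc(fields[2])  # RNAME
        fields[6] = ensembl_to_ucsc(fields[6])  # RNEXT ("=" and "*" pass through)
    return "\t".join(fields)
```

Renaming the annotation instead, or aligning against a genome that already uses UCSC names, would avoid editing the alignments at all.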
Thanks!
You have already used the whole genome for alignment, but you used Mus_musculus.GRCm38.dna_sm.primary_assembly.fa, which is a soft-masked genome assembly. If there is no particular reason to use a masked assembly, you should use Mus_musculus.GRCm38.dna.primary_assembly.fa, which has no masking.
There are some nomenclature differences between the files offered by different websites, but everything else is pretty much the same. Also, please use ADD COMMENT/ADD REPLY to post updates.
Thanks a lot! Where do you think the masking could cause problems?
I was wrong above in suggesting to use the unmasked genome. See this.
STAR does not discriminate between soft-masked (lowercase) and unmasked genomes, so it should not make any difference. I am not sure about other aligners.
ntsopoul, please stop using the answer field for discussions. Use ADD REPLY and ADD COMMENT. That keeps the thread logically organized.