![Dear BioStars community,
I used STAR 2.7.5a to do alignment for my RNA-seq data, the script is as below:
genomegenerate:
STAR --runThreadN 6 --runMode genomeGenerate --genomeSAindexNbases 9 \
--genomeDir /home/xg60a/Ermin/03align_out/lsalit_STAR_genome \
--genomeFastaFiles /home/xg60a/Ermin/00ref/Lsali.fna \
--sjdbGTFfile /home/xg60a/Ermin/00ref/Lsali02.gtf \
--sjdbOverhang 149
align out:
STAR --runThreadN 5 --limitBAMsortRAM 1101674636 --genomeDir /home/xg60a/Ermin/03align_out/lsalit_STAR_genome \
--readFilesCommand zcat --readFilesIn /home/xg60a/Ermin/02clean_data/B1_1_val_1.fq.gz, /home/xg60a/Ermin/02clean_data/B1_2_val_2.fq.gz \
--outFileNamePrefix /home/xg60a/Ermin/03align_out/B1t_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 5 \
--quantMode TranscriptomeSAM GeneCounts
However, after doing alignment, there are 6249540(93.12%) uniquely mapped genes but the number of no_feature
genes is as many as 6438119 which even more than uniquely mapped genes. Can anybody tell me, what is the problem? Thanks a lot in advance!
Thanks a lot! Im sorry Im pretty new in RNA-seq analysis but could you explain more about how I can fix with the correct gtf(I download the reference genome(fastq anf gff) from NCBI and I have the raw RNA-seq data(fastq) from the sequencing company) or how I can know the library prep was flawed.
I also attached the reference I used. I tried to use genome references from different strains of this species(L.salivarius) but all generated the similar results. https://www.ncbi.nlm.nih.gov/genome/?term=Lactobacillus+salivarius+str.+Ren
Additionally, since the genome annotation file is gff format, I also used gffread to convert gff file to gtf file. After converting, the filesize dramatically changed from 1,356,913 to 587,242. I'm wondering if some info lost during conversion. Below is the script I used for conversion:
gffread .gff -T -o .gtf
I also tried different version of cufflink but all looked the same. Thanks in advance!
A mismatching gtf is most often the problem, but if they match, then the problem is probably in the library prep itself, and you can't fix that with software.
Since this is a bacteria, it's more amenable to eyeballing. Use IGV to see where the reads are falling compared to the gtf.
Thanks a lot. I‘ve tried to use IGV to visualize the bam file and am not sure whether there's something wrong with steps. Following are my steps: 1) downloaded fna and gff file from NCBI.
2) created genome file and build index
3) loaded sorted.gff by loading from file
**It showed like this. The ref genes look normal.
4) After loading my bam file, results showed like this.
I'm not sure how I can look into whether it's due to mismatching gtf based on the results I currently can obtain from IGV even after googleing this problem with no hints. But when I went through the annotaiton file(gtf), I found most of genes belong to CDS instead of exon. I'm wondering if this is the reason why the genes can't be mapped to reference annotation(STAR maps only exon?). Could you tell me more? Thank you so much!**
I also tried to use kallisto and only returned 97 lines of results which were very few since I were assuming there would be more than hundreds of lines.
Plus, after looking back fastqc results, it showed that the difference between A/T and G/C is larger than 10% and also there're a lot of overrepresented sequences which belong to L.gasseri after I blasted (the reference genome I used). ! enter image description here
Do you think those suggested that the problem is library prep itself? Really appreciate for your kind reply!!
Do you expect your organism to have transcripts with exons and introns? Bacteria generally don't. The IGV looks a little weird, generally bacterial read should not be aligning with massive gaps.
My bad, just revised and updated my previous comments. Yes, bacteria generally don't have introns. But as u can see in my previous comments, in the annotation file of this bacteria, most of genes belong to cds instead of exons. I'm wondering if this is the reason why the genes can't be mapped to reference annotation(STAR maps only exon?) And do you think the massive gaps could be due to incorrect library pre or any other reasons? Thank you!
You might have to play with the settings of featureCount to make sure that it is counting things aligning to CDS regions.
Thank you so much! The problem has been resolved!