Entering edit mode
5 months ago
Pegasus
▴
130
Hi all,
I know this issue has been discussed extensively, but I am still having trouble resolving it.
Here's the process I followed:
Bulk RNA-Seq Data for bacterial strain, with a reference-transcriptom avilable on NCBI:
- Reads were filtered including adapter removal.
- Reference Transcriptome: Downloaded from NCBI (file name: B_rna_from_genomic.fna).
- Index the reference
mkdir -p idx
REF= B_rna_from_genomic.fna
DX=salmon.idx
salmon index -t ${REF} -i idx/${IDX}
- Run salmon
salmon quant --index idx/salmon.idx -l A --validateMappings -1 pc1.filtered_R1.fq -2 pc1.filtered_R2.fq -o salmon-results/pc1
However, the mapping rate across all samples is approximately 12%:
Mapping Rate: 12.8398%
Is there any issue in the code? Any help would be greatly appreciated.
Have you checked a sampling of reads using
blast+
to see if there is any likelihood of contamination?Hi @GenoMax,
Thank you for your reply,
Yes, and there is no contamination, indeed using STAR vs the genome-reference resulted in 95% mapping rate across the samples. The reference-trsnscriptome associated with same strain was downloaded from NCBI. So I guess both genome and transcriptome are reliable.
Can you see where the reads are mapping in STAR? You can try running featureCounts to see how many aligned reads map to annotated transcripts.
My thoughts are most reads are originating from unannotated parts of the genome (unannotated because the transcriptome assembly is incomplete).
I run featurecount using STAR sorted Bam files as input,
As you can see in the screenshot (featureCount summury), the number of the Unassigned_Unmapped is 0, for all samples. Unassigned_NoFeatures: between 1m to 2.5 m reads
Strange… seems like the majority of reads are still aligning within annotated transcripts. Others may have further suggestions, but my remaining suggestions are as follows:
I extrcted the gene-to-transcript from the gtf file which looks like
While the headers in the transcriptome.fasta (rna_from_genomic.fna) looks like
(lcl|JAFJXZ010000010.1_rrna_JYU28_03025_3 [locus_tag=JYU28_03025] [db_xref=RFAM:RF00177] [product=16S ribosomal RNA] [location=complement(1764..4363)] [gbkey=rRNA]).
Both gtf and transcriptom were downloaded from same NCBI source.
Then how can I (Extract a reference transcriptome yourself from the genome FASTA and GTF, and use that.)
I also noticed that the size of the:
cds_from_genomic.fna = 23.8 MB
rna_from_genomic.fna =43 KB (which suppose to be the transcriptome file)
I belive there is something wrong in the transcriptome file!!
Thanks for you guidnece
You can extract the transcripts using a utility called
gffread
(part ofcufflinks
package) starting from the genome + GTF as shown in: Extracting transcript sequences with gene name (gffread)See update.
Technically not wrong. Looks like RNA features file only contains rRNA/tRNA etc
Use this file for
salmon
: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/589/695/GCF_017589695.1_ASM1758969v1/GCF_017589695.1_ASM1758969v1_cds_from_genomic.fna.gzThank you GenoMax it works perfectly with CDS_from_genomics. file
However, is it a common practice to use this file instead the actusal transcriptom.file?
Thanks again
Since this is a bacterial genome and there are no exon/introns to account for it should be fine to use the Coding (CDS) sequences. See https://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation/#CDS and https://www.uniprot.org/help/cds_protein_definition