cds_from_genomic.fna = 23.8 MB

Question

Low mapping rate using SALMON

0

Entering edit mode

4 months ago

Pegasus ▴ 120

Hi all,

I know this issue has been discussed extensively, but I am still having trouble resolving it.

Here's the process I followed:

Bulk RNA-Seq Data for bacterial strain, with a reference-transcriptom avilable on NCBI:

Reads were filtered including adapter removal.
Reference Transcriptome: Downloaded from NCBI (file name: B_rna_from_genomic.fna).
Index the reference

     mkdir -p idx
       REF= B_rna_from_genomic.fna
       DX=salmon.idx
       salmon index -t ${REF} -i idx/${IDX}

Run salmon

salmon quant --index idx/salmon.idx -l A --validateMappings -1 pc1.filtered_R1.fq  -2 pc1.filtered_R2.fq  -o salmon-results/pc1

However, the mapping rate across all samples is approximately 12%:

Mapping Rate: 12.8398%

Is there any issue in the code? Any help would be greatly appreciated.

RNA-SEQ • 849 views

ADD COMMENT • link updated 4 months ago by GenoMax 148k • written 4 months ago by Pegasus ▴ 120

1

Entering edit mode

Have you checked a sampling of reads using blast+ to see if there is any likelihood of contamination?

ADD REPLY • link 4 months ago by GenoMax 148k

0

Entering edit mode

Hi @GenoMax,

Thank you for your reply,

Yes, and there is no contamination, indeed using STAR vs the genome-reference resulted in 95% mapping rate across the samples. The reference-trsnscriptome associated with same strain was downloaded from NCBI. So I guess both genome and transcriptome are reliable.

ADD REPLY • link 4 months ago by Pegasus ▴ 120

1

Entering edit mode

Can you see where the reads are mapping in STAR? You can try running featureCounts to see how many aligned reads map to annotated transcripts.

My thoughts are most reads are originating from unannotated parts of the genome (unannotated because the transcriptome assembly is incomplete).

ADD REPLY • link 4 months ago by dsull ★ 7.0k

0

Entering edit mode

I run featurecount using STAR sorted Bam files as input,

As you can see in the screenshot (featureCount summury), the number of the Unassigned_Unmapped is 0, for all samples. Unassigned_NoFeatures: between 1m to 2.5 m reads

enter image description here

ADD REPLY • link 4 months ago by Pegasus ▴ 120

1

Entering edit mode

Strange… seems like the majority of reads are still aligning within annotated transcripts. Others may have further suggestions, but my remaining suggestions are as follows:

Try running R1 and R2 individually in single-end mode in salmon
Try redownloading the reference transcriptome (in case the file was truncated)
Extract a reference transcriptome yourself from the genome FASTA and GTF, and use that.

ADD REPLY • link 4 months ago by dsull ★ 7.0k

0

Entering edit mode

I extrcted the gene-to-transcript from the gtf file which looks like

JYU28_00015     unassigned_transcript_4229
JYU28_00020     unassigned_transcript_4230
JYU28_00025     unassigned_transcript_4231
JYU28_00030     unassigned_transcript_4232
JYU28_00035     unassigned_transcript_4233
JYU28_00040     unassigned_transcript_4234
JYU28_00045     unassigned_transcript_4235
JYU28_00050     unassigned_transcript_4236
JYU28_00055     unassigned_transcript_4237

While the headers in the transcriptome.fasta (rna_from_genomic.fna) looks like

(lcl|JAFJXZ010000010.1_rrna_JYU28_03025_3 [locus_tag=JYU28_03025] [db_xref=RFAM:RF00177] [product=16S ribosomal RNA] [location=complement(1764..4363)] [gbkey=rRNA]).

Both gtf and transcriptom were downloaded from same NCBI source.

Then how can I (Extract a reference transcriptome yourself from the genome FASTA and GTF, and use that.)

I also noticed that the size of the:

cds_from_genomic.fna = 23.8 MB

rna_from_genomic.fna =43 KB (which suppose to be the transcriptome file)

I belive there is something wrong in the transcriptome file!!

Thanks for you guidnece

ADD REPLY • link 4 months ago by Pegasus ▴ 120

1

Entering edit mode

You can extract the transcripts using a utility called gffread (part of cufflinks package) starting from the genome + GTF as shown in: Extracting transcript sequences with gene name (gffread)

See update.

ADD REPLY • link 4 months ago by GenoMax 148k

1

Entering edit mode

I belive there is something wrong in the transcriptome file!!

Technically not wrong. Looks like RNA features file only contains rRNA/tRNA etc

>lcl|NZ_JAFJXZ010000012.1_trna_JYU28_RS07340_30 [locus_tag=JYU28_RS07340] [product=tRNA-Ala] [location=916279..916354] [gbkey=tRNA]
>lcl|NZ_JAFJXZ010000012.1_trna_JYU28_RS07345_31 [locus_tag=JYU28_RS07345] [product=tRNA-Lys] [location=916367..916439] [gbkey=tRNA]
>lcl|NZ_JAFJXZ010000012.1_trna_JYU28_RS07545_32 [locus_tag=JYU28_RS07545] [product=tRNA-Leu] [location=955410..955494] [gbkey=tRNA]
>lcl|NZ_JAFJXZ010000012.1_ncrna_JYU28_RS07820_33 [gene=ssrS] [locus_tag=JYU28_RS07820] [db_xref=RFAM:RF00013] [product=6S RNA] [ncRNA_class=other] [location=1014083..1014277] [gbkey=ncRNA]
>lcl|NZ_JAFJXZ010000012.1_trna_JYU28_RS07975_34 [locus_tag=JYU28_RS07975] [product=tRNA-Leu] [location=1041619..1041703] [gbkey=tRNA]
>lcl|NZ_JAFJXZ010000012.1_rrna_JYU28_RS10195_35 [gene=rrf] [locus_tag=JYU28_RS10195] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=1522312..1522428] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000014.1_rrna_JYU28_RS10250_36 [locus_tag=JYU28_RS10250] [db_xref=RFAM:RF00177] [product=16S ribosomal RNA] [location=complement(<1..331)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000014.1_ncrna_JYU28_RS10280_37 [gene=ffs] [locus_tag=JYU28_RS10280] [db_xref=RFAM:RF01854] [product=signal recognition particle sRNA large type] [ncRNA_class=SRP_RNA] [location=complement(4382..4648)] [gbkey=ncRNA]
>lcl|NZ_JAFJXZ010000014.1_trna_JYU28_RS10535_38 [locus_tag=JYU28_RS10535] [product=tRNA-Ser] [location=complement(56497..56585)] [gbkey=tRNA]
>lcl|NZ_JAFJXZ010000014.1_rrna_JYU28_RS10565_39 [gene=rrf] [locus_tag=JYU28_RS10565] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=complement(63129..63245)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000014.1_rrna_JYU28_RS10570_40 [locus_tag=JYU28_RS10570] [db_xref=RFAM:RF02541] [product=23S ribosomal RNA] [location=complement(63346..>63837)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000015.1_rrna_JYU28_RS10810_41 [gene=rrf] [locus_tag=JYU28_RS10810] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=complement(65593..65709)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000015.1_rrna_JYU28_RS10815_42 [gene=rrf] [locus_tag=JYU28_RS10815] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=complement(65812..65928)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000015.1_rrna_JYU28_RS10820_43 [gene=rrf] [locus_tag=JYU28_RS10820] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=complement(66185..66301)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000015.1_rrna_JYU28_RS10825_44 [gene=rrf] [locus_tag=JYU28_RS10825] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=complement(66311..66428)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000015.1_rrna_JYU28_RS10830_45 [locus_tag=JYU28_RS10830] [db_xref=RFAM:RF02541] [product=23S ribosomal RNA] [location=complement(66580..>66692)] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000016.1_rrna_JYU28_RS10835_46 [locus_tag=JYU28_RS10835] [db_xref=RFAM:RF00177] [product=16S ribosomal RNA] [location=<1..111] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000016.1_rrna_JYU28_RS10840_47 [gene=rrf] [locus_tag=JYU28_RS10840] [db_xref=RFAM:RF00001] [product=5S ribosomal RNA] [location=1007..1123] [gbkey=rRNA]
>lcl|NZ_JAFJXZ010000017.1_rrna_JYU28_RS10845_48 [locus_tag=JYU28_RS10845] [db_xref=RFAM:RF00177] [product=16S ribosomal RNA] [location=complement(<1..64)] [gbkey=rRNA]

Use this file for salmon: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/589/695/GCF_017589695.1_ASM1758969v1/GCF_017589695.1_ASM1758969v1_cds_from_genomic.fna.gz

ADD REPLY • link 4 months ago by GenoMax 148k

0

Entering edit mode

Thank you GenoMax it works perfectly with CDS_from_genomics. file

However, is it a common practice to use this file instead the actusal transcriptom.file?

Thanks again

ADD REPLY • link 4 months ago by Pegasus ▴ 120

1

Entering edit mode

Since this is a bacterial genome and there are no exon/introns to account for it should be fine to use the Coding (CDS) sequences. See https://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation/#CDS and https://www.uniprot.org/help/cds_protein_definition

ADD REPLY • link 4 months ago by GenoMax 148k