Problem during read counting with DEXSeq
5.1 years ago
deepak18 ▴ 10

I am currently trying to analyze RNA-seq data using DEXSeq and am running into a problem at the read-counting step. The counting script produces a file in which only very few exonic bins have a count greater than 0 (around 20,000). The output looks like this:

ENSG00000000003:001 0
ENSG00000000003:002 0
ENSG00000000003:003 0
ENSG00000000003:004 0
ENSG00000000003:005 0
ENSG00000000003:006 0
ENSG00000000003:007 0
ENSG00000000003:008 0
ENSG00000000003:009 0
ENSG00000000003:010 0
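
A rough one-liner I used to check the number of non-zero bins (assuming the count file is named DEXSeq_C24_1.txt as in the commands below):

awk '$2 > 0' DEXSeq_C24_1.txt | wc -l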

The mapping was performed using STAR, and the resulting BAM files were sorted by coordinate. The following commands were used:

Indexing

STAR --runThreadN 12 --runMode genomeGenerate --sjdbGTFfile Homo_sapiens.GRCh38.98.gtf --genomeDir /home/erpl/star/indexing --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

Mapping

STAR --runThreadN 20 --genomeDir /home/erpl/RNA-seq_Alignment_tools/star/indexing --sjdbGTFfile /home/erpl/RNA-seq_Alignment_tools/star/indexing/Homo_sapiens.GRCh38.98.gtf --readFilesIn C24_1_1.fq C24_1_2.fq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/erpl/RNA_seq_Novogene/RNA_Sequencing_Novogene_Results/output_17.10.19/C24_1

For DEXSeq, the annotation first has to be flattened and the reads then counted, using the two Python scripts provided with the R package. This is the code I used for those two steps:

Counting

python /home/erpl/R/x86_64-pc-linux-gnu-library/3.6/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.98.gtf Homo_sapiens.GRCh38.98.gff
python /home/erpl/R/x86_64-pc-linux-gnu-library/3.6/DEXSeq/python_scripts/dexseq_count.py Homo_sapiens.GRCh38.98.gff -p yes -f bam C24_1Aligned.sortedByCoord.out.bam DEXSeq_C24_1.txt

So is there some problem with the raw data itself, or am I missing something here?

RNA-Seq DEXSeq
What are your alignment statistics with STAR?

Over 90% uniquely mapped reads with all samples:

  Number of input reads |   32604356
                  Average input read length |   300
                                UNIQUE READS:
               Uniquely mapped reads number |   30430485
                    Uniquely mapped reads % |   93.33%
                      Average mapped length |   297.43
                   Number of splices: Total |   38026407
        Number of splices: Annotated (sjdb) |   37697238
                   Number of splices: GT/AG |   37644236
                   Number of splices: GC/AG |   296935
                   Number of splices: AT/AC |   38377
           Number of splices: Non-canonical |   46859
                  Mismatch rate per base, % |   0.47%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.84
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.71
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1060773
         % of reads mapped to multiple loci |   3.25%
    Number of reads mapped to too many loci |   7889
         % of reads mapped to too many loci |   0.02%
                              UNMAPPED READS:
Number of reads unmapped: too many mismatches |   0
     % of reads unmapped: too many mismatches |   0.00%
          Number of reads unmapped: too short |   1041968
               % of reads unmapped: too short |   3.20%
              Number of reads unmapped: other |   63241
                   % of reads unmapped: other |   0.19%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

You have a really high alignment rate but really low counts, so something has most likely gone wrong at the counting step. Can you try another program such as HTSeq or featureCounts to compare the counts?
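
For example, a rough gene-level check with htseq-count might look something like this (BAM and GTF names taken from your commands above; adjust -s to match your library's strandedness):

htseq-count -f bam -r pos -s no C24_1Aligned.sortedByCoord.out.bam Homo_sapiens.GRCh38.98.gtf > HTSeq_C24_1.txt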

I repeated the quantification with Salmon. The index was built from a transcriptome FASTA file. I used the following commands:

Generating Transcriptome Fasta file

gffread -w Homo_sapiens_GRCh38.98.transcriptome.fa -g /home/erpl/RNA-seq_Alignment_tools/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa /home/erpl/RNA-seq_Alignment_tools/star/indexing/Homo_sapiens.GRCh38.98.gtf

Generating index and quantification

salmon index -t Homo_sapiens_GRCh38.98.transcriptome.fa -i transcripts_index

salmon quant -i transcripts_index -l OSR -1 /home/erpl/RNA_seq_Novogene/RNA_Sequencing_Novogene_Results/C24_1_1.fq -2 /home/erpl/RNA_seq_Novogene/RNA_Sequencing_Novogene_Results/C24_1_2.fq --validateMappings -o salmon_mapping_C24_1

Even after doing this, I am getting TPM values greater than 0 for only about 40,000 transcripts. The mapping statistics were:

[2019-10-31 18:36:45.510] [jointLog] [info] Computed 123,408 rich equivalence classes for further processing

[2019-10-31 18:36:45.510] [jointLog] [info] Counted 554,627 total reads in the equivalence classes 

[2019-10-31 18:36:45.706] [jointLog] [info] Number of mappings discarded because of alignment score : 3,500,247
[2019-10-31 18:36:45.706] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 1,161,212
[2019-10-31 18:36:45.706] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2019-10-31 18:36:45.706] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings : 523,255
[2019-10-31 18:36:45.725] [jointLog] [warning] Only 554627 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2019-10-31 18:36:45.725] [jointLog] [info] Mapping rate = 1.70108%

[2019-10-31 18:36:45.725] [jointLog] [info] finished quantifyLibrary()
[2019-10-31 18:36:45.725] [jointLog] [info] Starting optimizer
[2019-10-31 18:36:45.836] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2019-10-31 18:36:45.842] [jointLog] [info] iteration = 0 | max rel diff. = 240.632
[2019-10-31 18:36:46.211] [jointLog] [info] iteration = 100 | max rel diff. = 10.0656
[2019-10-31 18:36:46.573] [jointLog] [info] iteration = 200 | max rel diff. = 1.62938
[2019-10-31 18:36:46.947] [jointLog] [info] iteration = 300 | max rel diff. = 0.115172
[2019-10-31 18:36:47.309] [jointLog] [info] iteration = 400 | max rel diff. = 0.0475986
[2019-10-31 18:36:47.672] [jointLog] [info] iteration = 500 | max rel diff. = 0.213213
[2019-10-31 18:36:48.044] [jointLog] [info] iteration = 600 | max rel diff. = 0.140714
[2019-10-31 18:36:48.074] [jointLog] [info] iteration = 609 | max rel diff. = 0.00628406
[2019-10-31 18:36:48.080] [jointLog] [info] Finished optimizer
[2019-10-31 18:36:48.080] [jointLog] [info] writing output
Salmon uses different algorithms for alignment/quantification, but nonetheless your mapping rate there is extremely low, which indicates there is something going on with your library. Try viewing the BAM files from the previous alignment to see what is mapping and compare that to Salmon, or do as I suggested before and use HTSeq to count.
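
For instance, something along these lines (BAM name taken from your STAR command):

samtools flagstat C24_1Aligned.sortedByCoord.out.bam
samtools view C24_1Aligned.sortedByCoord.out.bam | head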

Something very strange is going on.

I was able to do it successfully with DEXSeq. The problem was the missing -r flag in the dexseq_count.py call. Thanks for the valuable suggestions.
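
For anyone hitting the same problem, a sketch of the corrected call (my reconstruction, not the exact command run): since the BAM is coordinate-sorted, dexseq_count.py needs -r pos, whereas the default -r name expects a name-sorted file:

python /home/erpl/R/x86_64-pc-linux-gnu-library/3.6/DEXSeq/python_scripts/dexseq_count.py -p yes -r pos -f bam Homo_sapiens.GRCh38.98.gff C24_1Aligned.sortedByCoord.out.bam DEXSeq_C24_1.txt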
