Question

Problem with extracting read counts from RNA seq file

0

Entering edit mode

6.5 years ago

popescuiofelia ▴ 10

Hi! I am trying to extract the reads from an RNA seq file after having trimmed it with Trimmomatic v38 and mapping it with bowtie2 with 95% overall allignment rate with this command in RSeQC:

$ infer_experiment.py -i trialoutput.sorted.bam -r Dictyannots.bed 
Reading reference gene model Dictyannots.bed ... Done
Loading SAM/BAM file ...  Finished
Total 0 usable reads were sampled
Unknown Data type

The genome I use for reference is found here dicty reference genome and the annotation files I tried look like this:

1) dicty_annots.bed

DDB0232428  1890  3287   DDB_G0267178   0      +    1890        3287        0,0,0   1   1397    0 
DDB0232428  3848  4855  DDB_G0267180    0   +   3848    4855    0,0,0   1   1007    0 
DDB0232428  5505  7769  DDB_G0267182    0   +   5505    7769    0,0,0   1   2264    0

2) Dictyannots.bed

1   1889    3287    gene:DDB_G0267178   .   +   ensembl CDS 0   transcript_id "transcript:EAL73826"; gene_id "gene:DDB_G0267178";
1   1889    3287    gene:DDB_G0267178   .   +   ensembl exon    .   transcript_id "transcript:EAL73826"; gene_id "gene:DDB_G0267178"; 
1   3847    4855    gene:DDB_G0267180   .   +   ensembl CDS 0   transcript_id "transcript:EAL73827"; gene_id "gene:DDB_G0267180";

The second annotation file was converted from gff3 to bed by cufflinks.

Do you have any suggestions why this might happen?

I have also tried to load the sorted bam file with either of the annotations in gtf format in subread's package featurecounts with this command:

featureCounts -a Dictyannot.gtf -t exon -g gene_id -o counts.txt trialoutput.sorted.bam

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.0

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S trialoutput.sorted.bam                         ||
||                                                                            ||
||             Output file : counts.txt                                       ||
||                 Summary : counts.txt.summary                               ||
||              Annotation : Dictyannot.gtf (GTF)                             ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Dictyannot.gtf ...                                    ||
||    Features : 30575                                                        ||
||    Meta-features : 13731                                                   ||
||    Chromosomes/contigs : 37                                                ||
||                                                                            ||
|| Process BAM file trialoutput.sorted.bam...                                 ||
||    Single-end reads are included.                                          ||
||    Assign reads to features...                                             ||
||    Total reads : 2484495                                                   ||
||    Successfully assigned reads : 0 (0.0%)                                  ||
||    Running time : 0.04 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
|| Summary of counting results can be found in file "counts.txt.summary"      ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

RNA-Seq RSeQC featureCounts software error • 3.3k views

ADD COMMENT • link updated 6.5 years ago by h.mon 35k • written 6.5 years ago by popescuiofelia ▴ 10

1

Entering edit mode

Looks like your BAM and GTF files individually does not have any issue. Sometimes this happens when the chromosome names are not matching between BAM and GTF files (Example, "chr12" or "12").

ADD REPLY • link 6.5 years ago by EagleEye 7.6k

0

Entering edit mode

This is the header from the sam file that was converted to bam:

    SRR1593424.2165755  0   DDB0216524|DDB_G0267364 414 1   50M *0  0   ATCGCTTAGTCGTGTCCGTCGTAACAACATCGCTAAAGAAATCTATGGTT  ?=?DDDDDHFHFH<?@@GHI@C8??EF@DDDDDB;?FFGCH@B=FBB9CG  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU

SRR1593424.1488972  0   DDB0216524|DDB_G0267364 429 1   50M *0  0   CCGTCGTAACAACATCGCTAAAGAAATCTATGGTTCTGAAGTACTCTTAC  BCCDFFFFHHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIFHIIIIIII  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU

I think the names of the chromosome are the same ("1"). Thank for your reply! Do you have any other suggestions?

ADD REPLY • link 6.5 years ago by popescuiofelia ▴ 10

1

Entering edit mode

Sorry other than that I could not see any problem. It is is hard to figure out when you do not get any errors. can you post few lines from your GTF file ?

ADD REPLY • link 6.5 years ago by EagleEye 7.6k

0

Entering edit mode

Hi! I have managed to find the problem, it was actually the chromosome names. However,now I face another problem as I am trying to map the sequence to the open reading frame reference file which looks like this:

>DDB0216524|DDB_G0267364 |DNA coding sequence|gene: DDB_G0267364_RTE on chromosome: 1 position 173832 to 174272
ATGGTCGTACCAGACTGGTTACAAACTGATGTTGCAATCAAACTCAACACTCAAACCGAT
ATCACCGAGTGTCGTGAAGCCATCACC
>DDB0216528|DDB_G0267372 |DNA coding sequence|gene: DDB_G0267372_RTE on chromosome: 1 position 193516 to 194159
ATGAAAATCACAAACAACACCACAAATATTAAGCAACACAAATGCCTACAAAAAATAAGC

When I load the reference ORF in IGV instead of chromosomes I can see gene names (as DDB0216524|DDB_G0267364). Do you have any idea how to get around this problem?

PS. In this case DDB0216524 is not the chromosome name.

Thank you for your help and time!

ADD REPLY • link updated 6.5 years ago by h.mon 35k • written 6.5 years ago by popescuiofelia ▴ 10

1

Entering edit mode

No, the chromosome names in that bam are not "1". Look up what each column of a sam file is. And that's not the header of the sam file either.

ADD REPLY • link 6.5 years ago by swbarnes2 14k

score 0 · Answer 1 · 2018-10-18

This is the header from the sam file that was converted to bam:

SRR1593424.2165755  0   DDB0216524|DDB_G0267364 414 1   50M *0  0   ATCGCTTAGTCGTGTCCGTCGTAACAACATCGCTAAAGAAATCTATGGTT  ?=?DDDDDHFHFH<?@@GHI@C8??EF@DDDDDB;?FFGCH@B=FBB9CG  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU
SRR1593424.1488972  0   DDB0216524|DDB_G0267364 429 1   50M *0  0   CCGTCGTAACAACATCGCTAAAGAAATCTATGGTTCTGAAGTACTCTTAC  BCCDFFFFHHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIFHIIIIIII  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU

This is not the header from the bam file, this is a snippet from the alignment section. This clearly shows your chromosome are not named 1, 2, 3, etc, they are named SRR1593424.2165755, SRR1593424.1488972, and so on. Whereas in your bed and gff3 files (currently in your post you call then both bed), chromosome names are DDB0232428 (from the gff3 file) and 1 (from the bed file).

edit: shame on me, as EagleEye and swbarnes2 pointed out, I swapped reference and read columns. However, my remark is still correct, as the chromosome names still don't match between annotation and fasta. In the bam file, chromosome reference is named DDB0216524|DDB_G0267364, and named DDB0232428 (dicty_annots.bed) or 1 (Dictyannots.bed).