Large majority of RNASeq reads mapping to intergenic regions - common issues excluded
0
0
Entering edit mode
3.9 years ago
npb27 • 0

Hi all,

I'm having an issue during alignment of metatranscriptome reads to my reference genome - most of my reads align to no annotated feature. I particularly hope to get as much info out of the small number of reads as possible as I'm working with a low coverage due to low organism abundance.

The data is from a mucosal metatranscriptome with a known parasite. Note the parasite strain doesn't necessarily match the reference genome, but it is definitely the same species

Procedure:

  • Reads came quality trimmed with adaptors removed from sequencing company
  • Removed rRNA reads using sortmerna (only a tiny portion of reads aligned)
  • Classified reads taxonomically using kraken2
  • Aligned reads classified as the parasite of interest to the reference genome using STAR
  • Quantified reads mapping to genes using STAR's --quantmod genecounts and also independently with htseq-count

Most of the reads from all my samples map to regions without annotated genes. As an example, for a single sample here is the output from STAR:

$star --runMode alignReads --runThreadN 11 --genomeDir $STAR_INDEX_DIR --readFilesIn $IN_FILE_1 $IN_FILE_2 --outFileNamePrefix  $outdir/$file/$file --chimSegmentMin 20 --quantMode GeneCounts --sjdbGTFfile $GFF --outSAMstrandField intronMotif --readFilesCommand zcat --sjdbGTFfeatureExon CDS

                             **Started job on |       Feb 17 11:59:18
                         Started mapping on |       Feb 17 11:59:34
                                Finished on |       Feb 17 11:59:47
   Mapping speed, Million of reads per hour |       80.34
                      Number of input reads |       290120
                  Average input read length |       300
                                UNIQUE READS:
               Uniquely mapped reads number |       236500
                    Uniquely mapped reads % |       81.52%
                      Average mapped length |       271.79
                   Number of splices: Total |       8
        Number of splices: Annotated (sjdb) |       0
                   Number of splices: GT/AG |       7
                   Number of splices: GC/AG |       1
                   Number of splices: AT/AC |       0
           Number of splices: Non-canonical |       0
                  Mismatch rate per base, % |       1.94%
                     Deletion rate per base |       0.00%
                    Deletion average length |       1.32
                    Insertion rate per base |       0.00%
                   Insertion average length |       1.05
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       6950
         % of reads mapped to multiple loci |       2.40%
    Number of reads mapped to too many loci |       0
         % of reads mapped to too many loci |       0.00%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 46670 % of reads unmapped: too short | 16.09% Number of reads unmapped: other | 0 % of reads unmapped: other | 0.00% CHIMERIC READS: Number of chimeric reads | 2439 % of chimeric reads | 0.84%**

Note 236500 reads uniquely mapped

Here is the output from htseq-count:

$ htseq-count -r name -t CDS A6Aligned.out.sort.sam $GFF > test

__no_feature    199731
__ambiguous     0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  14088

Therefore 84% of reads mapped to no-feature regions

I think I've ruled out obvious problems:

  • I think my data is stranded based on viewing alignment in a genome browser (artemis)
  • The chromosome names are definitely identical between my SAM and GFF files (checked with comm)
  • I think the coordinates in the GFF are correct for my reference genome (I checked by translating a few sample CDS from the genome fasta)

Any suggestions? My only conclusion so far is that the genome could be extremely poorly annotated and that most of the annotated genes are missed.

I also tried BLASTing a sample of reads against nr and most hit proteins from a closely related species

All help is much appreciated

Sample GFF lines: (I'm aware it's weirdly formatted, as it's not yet available from NCBI so I generated this version from an embl file sent from a collaborator, any suggestions for formatting a good GTF would be much appreciated)

NODE_2867.1     feature CDS     2066    2365    .       -       0       colour=7;gene_id=TGA_001047400;transcript_id=TGA_001047400.1;product=PH domain containing protein%2C putative;transl_table=1;translation=MSLSFKEGWMTKQGGLIKTWKRRYFVLTANSLIYYSKPGSKEKGRINIQDATPETAPDCKRQPAFKLNTPNRVFYIVTDNPNDTQDWISTIRYIKSRDA
NODE_2867.1     feature CDS     3932    5161    .       -       0       colour=10;gene_id=TGA_001047500;transcript_id=TGA_001047500.1;ortholog=TVAG_077490.1 TVAG_077490.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=Clan MA%2C family M48%2C Ste24 endopeptidase-like metallopeptidase;transl_table=1;translation=MFVPLFIALIFIETQFEVYLHHRQRQKVIESTEAPEVFREFYTADEFSAAREYEMDKSYFKMISIFYSGIIWIVMIPLIPRLWKLIKIQGEYKKSVIFAILFASLILSSKIPFRYYNTFVLEEKHGFNKSTIQLFVMDNVKIFAILVVYLIILVPLFTKIYKVAGKYFVPIGCTVYIIIVVITQLIFPTLILPLFTKLTPLQEGPIFDAVMKLSNETQFPVTEMYSADDSKRSSHQNAMVFGLFTKKIAIADTLLNVSTPDTIAAVVGHEIGHSKHHHIIKILGIEVAEALIVFALLYIVMNTDKVFKDLGFDEKPFIVGLVLFYYLYSPVDSLMQLPVNTCIRYFEFQADHYSASRGLPLDIALLKLAHDNKMAIEPDSLYFSFYHSHPTIPQRVKRIREIFAEIKKE
NODE_2867.1     feature CDS     5844    6728    .       +       0       colour=10;gene_id=TGA_001047600;transcript_id=TGA_001047600.1;ortholog=TGA_000389200.1 TGA_000389200.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=hypothetical protein%2C conserved;transl_table=1;translation=MNITDVANGTINFNYSDIKATFITDFVLASIFIIFGIIYTLLTIRQTCEEKGEVQGQFEAQIMYIAAVFCKSLGLFVTGIFLAFKTPTGSDDEFWNKYSVLPGGIPGYVTAIAYCFIFFSWCSVCLDSLEKDSVKFYARSKWVLLTLISGIVVLFLVMIICMQACKDYGTFHKIEASVAILRDLIIAFLFILYLHRIFKLFEEPCSGFSSPETKLLFICITLIVSLIMRPISIGIYTLWVTKYSEFSTSYLIIFLVEIVITEIIPIGSIGITRLSGYNQRAKQNEDVAQFLAIN
NODE_2867.1     feature CDS     7462    8934    .       +       0       colour=10;gene_id=TGA_001047700;transcript_id=TGA_001047700.1;ortholog=TVAG_224560.1 TVAG_224560.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=hypothetical protein%2C conserved;transl_table=1;translation=MNDYKFDSDLSDFAIEQRWILSKEPQFLEYDPQISVTDSSIPNFGIHFAEFLQEINQTIAAADGPYLNEVLYNLLQFCNKNLIKLIKEKINITFLIELLDQFSIQDIVSAIVSSQIQSNINILMKIIALLSQFKKIAEYISTTNFLHVIIPIFIECSPQEAYIFSVLIFTYIPLYSKIKDGLMGFTDVIDPVTSNLYHKNGIDWSLDFLHYMILFVPDINHENILNIINPILKKCIPKDAESTERFNWILNFCSTILTKNLESFKYMSTVIPFIFDNVPILFSQTSNLYDNVKISIIRFLTLCMISSRDENICNFALNLSVIEFRRLFNRETLTPKERHVLLKFLREISHINVNEFYVWFECEKDFANFINYVLNNLTNIEKILAYQIMSNLINSKYEKSMIELIMDNLNIISTGCDLFDSDNQPISKIFVDIIFNILTRYNNLVNSDLITILTDNDMCDSFTDSDLNDDEEYSTKINTICNIISNLVDQ

Sample SAM lines:

A00783:559:HNTM7DSXY:3:2159:13114:9455  339     NODE_2381.1     492     3       143M7S  =       377     -258    AACGAACGTAACACACTCGGTTCCCGCGAAGTTCAGACAGCCGTCCGTCTCGTTCTTCCAGGCGATCTTGCCAAGCACGCTGTTTCTGAAGGTGGCAAGGCCGTCGCTAAGTTCAACGCTGAGTAAACTATTTATCCATTTTTGTCCGAA       FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF       NH:i:2  HI:i:2  AS:i:285        nM:i:3
A00783:559:HNTM7DSXY:3:2159:13114:9455  163     NODE_2381.1     377     3       150M    =       492     267     TAAGCAGGTCCACCCAGATATCGGTATCTCCAACAAGGCTATGATGGTCATGAACTCCTTCGTTACAGATATCTTCGAGAGAATTGCCCAGGAAGGTGCTCGTCTCGTCGATATGAACGAACGTAACACACTCGGTTCCCGCGAAGTTCA       FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:FF       NH:i:2  HI:i:1  AS:i:286        nM:i:3
A00783:559:HNTM7DSXY:3:2159:13114:9455  419     NODE_2381.1     377     3       150M    =       492     258     TAAGCAGGTCCACCCAGATATCGGTATCTCCAACAAGGCTATGATGGTCATGAACTCCTTCGTTACAGATATCTTCGAGAGAATTGCCCAGGAAGGTGCTCGTCTCGTCGATATGAACGAACGTAACACACTCGGTTCCCGCGAAGTTCA       FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:FF       NH:i:2  HI:i:2  AS:i:285        nM:i:3
A00783:559:HNTM7DSXY:3:2159:13123:1204  83      NODE_36664.1    163     255     77M73S  =       58      -182    ATGACAACAGTCCACTCCTACACAAACGACCAGGTTGTCGCTGATACAATGCACAAGGATCTCCGCCGTGCCCGTGCTGCTGGCATGAACATCATCCCAACATCCACAGGTGCCGCTATCGCTCTTCCAAAGGTTTGCCACGGCCTTCCA       FFFFF:FF,:FFFFFFFFF,F,FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF       NH:i:1  HI:i:1  AS:i:219        nM:i:3

Happy to send bigger samples as needed

rna-seq alignment STAR htseq • 1.2k views
ADD COMMENT
0
Entering edit mode

You seem to have done your due diligence so that likely leaves your sample as a possible problem. You may have a difficult sample that did not produce good libraries (or may have DNA contamination, this you would have to ask expt people to check). If you are sure that was not the case then missing annotations would likely be the only cause left.

ADD REPLY
0
Entering edit mode

Many thanks for this. I thought that this may be the case, but definitely wanted to get a second opinion in case I'm missing something obvious.

DNA contamination could well be an issue because the samples were not DNAse treated (the amount of biomaterial was too low). However, almost all the reads seem to match coding genes from a related species so I would guess that this is (hopefully) minimal. I'll see if any significant improvement to the annotation can be made

ADD REPLY

Login before adding your answer.

Traffic: 1797 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6