fastqc report for degradome reads
0
0
Entering edit mode
6.7 years ago
Sam ▴ 150

Dear All

I have some problem with my degradome data, the company provided me two files, raw reads and mappable reads (trimmed reads without the adapter), this is the fastQC report of mappable reads for overrepresented sequences:

Sequence    Count   Percentage  Possible Source
AGTTCTACAGTCCGACGATCAGTTCTACAGTCCGACGATCAGTTCTACAGT 8897    0.29136760032893066 Illumina DpnII expression Sequencing Primer (95% over 21bp)
CAGAGTTCTACAGTCCGACGATCCAGAGTTCTACAGTCCGACGATCCAGAG 5043    0.16515306378091463 Illumina DpnII expression Sequencing Primer (100% over 23bp)
GCGACCCCAGGTCAGGCGGGACCACCCGCTGAGTTTAAGCATATCAATAAG 4503    0.14746861911668818 No Hit
GAGTTCTACAGTCCGACGATCGAGTTCTACAGTCCGACGATCGAGTTCTAC 4241    0.1388883885573783  Illumina Small RNA Adapter 1 (95% over 22bp)
AGAGTTCTACAGTCCGACGATCAGAGTTCTACAGTCCGACGATCAGAGTTC 3681    0.1205489644611435  Illumina Small RNA Adapter 1 (96% over 26bp)
CAGAGTTCTACAGTCCGACGATCCAGAGTTCTACAGTCCAACGATGGAATT 3234    0.10591017415575608 Illumina DpnII expression Sequencing Primer (100% over 23bp)

But the main issue is for sequence length, so the raw reads length is 51 bp, that mean RNA seq + adapter length = 51 bp. but in clean reads (mappable.fq.gz= trimmed reads without adapter ) again about 20 percent of sequence ( 500 000 ) has the SAME length which is 51 bp ( same length before trimming), how it could be possible?

Furthermore, in mappable files about 50 percent of reads could NOT match to target even with 2 allowed mismatch. The length of sequences is 10 to 51 bp

I think these data are not clean reads and there are some adapter remnants in sequences, is it right?

Thanks

fastqc adaptor • 4.1k views
ADD COMMENT
1
Entering edit mode

If you are worried about remnants of adapters/extraneous sequences then pass your data through a scan/trim program. I recommend bbduk.sh from BBMap suite.

That said, grab a sample of those reads that don't match the target and blast them at NCBI. Should immediately give you an idea of what you are looking at.

again about 20 percent of sequence ( 500 000 ) has the SAME length which is 51 bp ( same length before trimming), how it could be possible

For standard libraries that just means those read did not have any illumina adapters. In general, you should not have any adapter sequence present as long as the size of insert is longer than the length of sequencing.

Edit: I am not directly familiar with degradome and if it uses some specific modification then the above observation will not apply.

ADD REPLY
0
Entering edit mode

I rechecked one of the clean reads with bowtie with this command

  bowtie genome_index -f mappable.fasta  -v 2 --un mappable_not_match_genome.fasta
# reads processed: 4942822
# reads with at least one reported alignment: 2515790 (50.90%)
# reads that failed to align: 2427032 (49.10%)
Reported 2515790 alignments to 1 output stream(s)

These are some of the unaligned reads with -v 2 (most of unaligned seqs have 51 nucleotide lenght)

>HISEQ-MFG:552:HB0KLADXX:1:1101:1444:2142
ATTAGCAGCTACAACCTGCCTTTGCTAACAGAGAATATGGGAAAGGTTAGC
>HISEQ-MFG:552:HB0KLADXX:1:1101:2718:2049
AAGGGAAAGAAGAAAAAAAATGGAAGTGGATTCCTGGTTGTGTGACTGTTG
>HISEQ-MFG:552:HB0KLADXX:1:1101:3521:2098
TACAGTTCCCGACGATCCGACGATCAAGGGCTGTAACTATATTTATGCGTG
>HISEQ-MFG:552:HB0KLADXX:1:1101:3644:2098
CAAGAGTGTAGCAGACTCAAGGCTGCCTCAGCATGTAGAGAAGGGTAATGC
>HISEQ-MFG:552:HB0KLADXX:1:1101:4552:2102
AGAGTTCTACAGTTCGACGATCTACAGTCCGACGATCATATTGAACCTTGG
>HISEQ-MFG:552:HB0KLADXX:1:1101:4997:2205
AAAAAATGAACCAGATTGCAAGATTTTCCTGATGAGCTTAGCGCTGGAATT
>HISEQ-MFG:552:HB0KLADXX:1:1101:6095:2122
AGTCCGACGATCACAGAGTTCTACAGTCCGACGATCAGTCCGACGATCACA
>HISEQ-MFG:552:HB0KLADXX:1:1101:7806:2053
GCTGAGAATGTGGATGTTGAAGTAGCTGAACAAGAGGAAGATGGTGTGGAA
>HISEQ-MFG:552:HB0KLADXX:1:1101:11121:2058
AGAGTTCTACAGAAGATGCACTCTGAGGGATCGGAGATGAGCCTCGAAGGA
>HISEQ-MFG:552:HB0KLADXX:1:1101:11521:2165
GTTCTACAGTCCGACGATCCAGTCCGACGATCGTTCTACAGTCCGGTGGAA
>HISEQ-MFG:552:HB0KLADXX:1:1101:12264:2099
GTTCTACAGTCCGACGATCAATTTCTCGTACAAAGAAAGTAAAAATTGTGG
>HISEQ-MFG:552:HB0KLADXX:1:1101:14284:2125
AAAAGGAAGGAGAAGAAGAAGAAGAAGAAGAGAAGAAGAGATGTCGTGGCA
>HISEQ-MFG:552:HB0KLADXX:1:1101:16220:2214
AAATTTTTAGGTTTAACATTTCTCGGTTGTGCTAGCCTTCTAGGGCAAGAT
>HISEQ-MFG:552:HB0KLADXX:1:1101:16850:2155
CAATATGTCAGGAAGGAAAGAAACAGTTTTGGATCTGGCGAAGTTCGTGGA
>HISEQ-MFG:552:HB0KLADXX:1:1101:17060:2064
TTGAAGAAATTAGAGTGCTCAAAGCAAGCCTACGCTCTGGATACATTTGCA
>HISEQ-MFG:552:HB0KLADXX:1:1101:1284:2275
GACGATCGATCCAAAATGGAAGTGGATTCCTGGTCGAGGTACTGTTAATAG
>HISEQ-MFG:552:HB0KLADXX:1:1101:2240:2318
AGTTTGATCTTTGAAGTTTCACATTGTCTGTACACTTCTTTAGGCATGGAA
>HISEQ-MFG:552:HB0KLADXX:1:1101:2223:2331
CAGAGTTCTACAGTCCGACGATCCAGAGTTCTACAGTCCGACGATCCAGTG
>HISEQ-MFG:552:HB0KLADXX:1:1101:2545:2393
TCTACAGTCCGACGATCATCCAGAGTTCTACAGTCCGACGATCCAACAGTG
>HISEQ-MFG:552:HB0KLADXX:1:1101:2978:2446
ACATCCTATGACTGCTATACAATTTGAAGGAACAAGGAAGAGACGTTGGAA
ADD REPLY
0
Entering edit mode

I checked some of these and it looks like they are mapping to the grape genome. So unless you are not working with grape I am not sure why they did not align in first place.

ADD REPLY
0
Entering edit mode

actually the target genome is grape but the issue is low mapping rate , I rechecked for all 3 degradome libs and the average mapping rate with 2 mismatch is just 50 percent, I think the problem is this 51b reads because in unaligned fragment most of sequences have 51 base length , for instance in one these clean libs, there is about 1 M reads with 51 length but 800 000 of them could not match to genome with -v 2 !! strange point is a lot of these reads could not match with any organism in blast. it could be some kind of artifact ? I use same index for degradom and small RNA libs but the small RNA mapping rate is 80% (-v 1)

ADD REPLY
0
Entering edit mode

It makes more sense now. bowtie v1 is appropriate for small RNA because it does un-gapped alignments. If you are interested in allowing for splicing (or at least split-mapping) you may need to try a different aligner (that can also do gapped alignments, bowtie2, STAR, bbmap etc).

Those reads that don't match anything in blast could be some kind of experimental artifact/contamination.

ADD REPLY
0
Entering edit mode

these degradome reads is kind of small RNA (length distribution is between 10 to 51 bp)

ADD REPLY
0
Entering edit mode

Are you saying that you can't/don't want to allow gaps in alignments?

ADD REPLY
0
Entering edit mode

I just followed bowtie2 guide line " It is particularly good at aligning reads of about 50 up to 100s or 1,000s" if we allowed gap alignment for short reads , it not produce some misalignment?

ADD REPLY
0
Entering edit mode

bowtie v.1 and v.2 are different programs. You used bowtie v.1 for the stat above and that is good for small RNA.

You can try bowtie v.2 independently (you will need to recreate the indexes if I recall right) and see what you get. If you have reads that capture a splice (e.g. two parts going to two exons) then you need gapped alignments.

ADD REPLY
0
Entering edit mode

I lunched bowtie2 but no significant difference in results :

4942822 (100.00%) were unpaired; of these:
    2470651 (49.98%) aligned 0 times
    1304877 (26.40%) aligned exactly 1 time
    1167294 (23.62%) aligned >1 times
50.02% overall alignment rate
ADD REPLY
0
Entering edit mode

I would have expected bowtie2 to show more alignments. Did you use default parameters or change any? Are you willing to try a different aligner?

ADD REPLY
0
Entering edit mode

I used bowtie v.2 with default, but I think its due to contamination because I blast these unaligned reads in NBCI and some of these reads could not match with any organisms but the other could be matched but just in partial part for instance 30 bp of 51 bp aligned with the grape genome. I got a bit wired answer from the company, "it's expected that longer, degraded RNA fragments are generally harder to map to the genome than smaller fragments"

but the size of these degradome reads is 10 to 51 base, NOT so long, approximately as same as small RNA,

ADD REPLY
1
Entering edit mode

Scan/trim your original data set using the instructions I linked for bbduk.sh in thread above. Once that is done use bbmap.sh aligner with following options specific for small RNA alignments to see if you can get better mapping

bbmap.sh in=trimmed.fq.gz out=result.bam ref=path_to_your_genome.fa ambig=all vslow perfectmode maxsites=1000
ADD REPLY
0
Entering edit mode

should I use default adapter list with bbduk .sh?

ADD REPLY
0
Entering edit mode

If the experiment used one of the standard illumina/commercial kits then yes.

If you have paired-end data then trim both reads together. Modify bbmap.sh mapping command to include in1= in2= to account for the two files.

ADD REPLY
0
Entering edit mode

I lunched the bbduk.sh with this flags

./bbduk.sh in=mappable_trimmed.fq ref=adapters.fa ktrim=r k=15 mink=6 hdist=1 hdist2=0

results :

Input:                      4960986 reads       154714879 bases.
KTrimmed:                   2052393 reads (41.37%)  58431509 bases (37.77%)
Total Removed:              1828395 reads (36.86%)  58431509 bases (37.77%)
Result:                     3132591 reads (63.14%)  96283370 bases (62.23%)

lots of reads already removed, what do you think ?

Thanks

ADD REPLY
0
Entering edit mode

If those reads had predominantly adapter then that is what we wanted.

Is there any expected size range for degradome reads (e.g. miRNA expected to be 21-25 bp)? If so you could also enforce minlen= to drop reads that go below that length.

Now try the bbmap.sh mapping with options noted above.

ADD REPLY
0
Entering edit mode

worse result , I think it's due to perfectmode flag but with bowtie v.2 and bowtie v.1 mapping rate increased about 30 percent

Read 1 data:        pct reads   num reads   pct bases      num bases

    mapped:              22.4778%      704136    20.8520%       20077026
    unambiguous:         14.4562%      452855    16.4978%       15884629
    ambiguous:            8.0215%      251281     4.3542%        4192397
    low-Q discards:       0.3287%       10297     0.3635%         349988

    perfect best site:   22.4778%      704136    20.8520%       20077026
    semiperfect site:    22.4778%      704136    20.8520%       20077026

bowtie 2 mapping rate :

3132591 reads; of these:
  3132591 (100.00%) were unpaired; of these:
    760038 (24.26%) aligned 0 times
    1312111 (41.89%) aligned exactly 1 time
    1060442 (33.85%) aligned >1 times
75.74% overall alignment rate

bowtie v.1

reads processed: 3132591
# reads with at least one reported alignment: 2234691 (71.34%)
# reads that failed to align: 897900 (28.66%)
Reported 2234691 alignments to 1 output stream(s)

what is your opinion ?

ADD REPLY
0
Entering edit mode

Can you take out the perfect match mode from bbmap? That will allow split mapping though so keep that in mind.

Did you re-do bowtie v.1 and v.2 mapping with bbduk.sh trimmed data? If not then we can't compare the results the way you posted them above. You would want to re-align the trimmed data again.

BTW: How many samples do you have? Is this happening with just one or all samples? What is the next step in the analysis?

ADD REPLY
0
Entering edit mode

obviously I used same re-trimmed data for bowtie v.1 and 2 and mapping with bbduk.sh. I have 3 samples but this problem (low mapped rate )is common for all of them, now we confirmed low rate mapped is due to some artifact and adapter contamination, right ?

ADD REPLY
0
Entering edit mode

Would you mind re-doing the bbmap mapping removing the perfect match restriction?

We have addressed the adapter contamination but not explained the low mapping as yet. I have a feeling that you do have some artifacts (or chimeric reads) based on the blast I remember looking at for a small number of reads I had tried. There is definitely plant DNA in your reads that do not map based on blast.

ADD REPLY
0
Entering edit mode

without perfect mod flag 85% mapped but with great error rate

Read 1 data:        pct reads   num reads   pct bases      num bases

    mapped:              85.7951%     2687610    94.1883%       90687622
    unambiguous:         57.6539%     1806061    73.4055%       70677282
    ambiguous:           28.1412%      881549    20.7828%       20010340
    low-Q discards:       0.0896%        2808     0.0830%          79937

    perfect best site:   22.4704%      703906    20.8424%       20067721
    semiperfect site:    22.4836%      704319    20.8599%       20084609

    Match Rate:               NA           NA    39.6071%       84853722
    Error Rate:          73.4259%     1981805    60.3855%      129369045
    Sub Rate:            70.7170%     1908688     2.5421%        5446231
    Del Rate:             7.1905%      194076    57.6698%      123550917
    Ins Rate:             4.8398%      130629     0.1736%         371897
    N Rate:               0.3037%        8196     0.0074%          15772
ADD REPLY
1
Entering edit mode

Good. This is in line with what I expected to see. I would not worry about the error rate since that is not quite what it seems to mean. 57% unambig alignments is better than bowtie.

Can you load the alignments in IGV and check to see how the three aligners are doing visually? Do you see many more split alignments with bbmap and bowtie v.2 compared to bowtie v.1 and bbmap with perfect matches?

ADD REPLY
0
Entering edit mode

OK, I'II try to do that, but I have a problem with sending the post just 5 posts per 6hr, as you know I elaborate a lot on your nice website :)

ADD REPLY
0
Entering edit mode

That limit will increase as you participate more on the site over time. Just edit the last post and add more information to it if you are not able to create a new one.

ADD REPLY
0
Entering edit mode

here is the IGV visualization, upper case is bbmap out put.

https://ibb.co/c0D5Kx

ADD REPLY
0
Entering edit mode

In that small snapshot the alignments look reasonable. There are gapped alignments with bbmap as expected.

I am not sure what your downstream analysis pipeline looks like but you should now proceed with that (use both bbmap and bowtie mapped data in parallel) and see what you get.

ADD REPLY
0
Entering edit mode

as you know the issue was contaminated clean reads which provided by company and due to low mapping rate I tried to re-trim this clean reads, right now we confirm that we have a lot of artifact and adapter remanent, right ? (because company told me we are pretty sure there is not any kind of contamination in clean read libs !!!) This is another snapshot , it seems with bowtie2 we have better results in this case https://ibb.co/buVuex

ADD REPLY
0
Entering edit mode

right now we confirm that we have a lot of artifact and adapter remanent, right ? (because company told me we are pretty sure there is not any kind of contamination in clean read libs !!!)

Your data seems to have adapter contamination but that is not completely unusual. Depending how well the libraries turned out and/or the insert size you have. This is a characteristics of your sample/libraries (unless the company made the libraries and did not do that part well, even then they could have been ultimately limited by quality of your samples).

ADD REPLY
0
Entering edit mode

This is the company answer "Our lab is confident the mapping rate issue does not relate to any adaptor contamination, but with regard to this phenomenon you're observing, it's expected that longer, degraded RNA fragments are generally harder to map to the genome than smaller fragments." but after re-trimming just according to public available primer and adapter seq, mapping rate improve to over 70% BTW, Thanks in advance for your comments and help

ADD REPLY
0
Entering edit mode

I found some problem with bbmap , in bbmap output there are plenty amount of seqs with N which is mapped to genome but in bowtie2 output we don't have this problem. could you please guide me about that ? https://ibb.co/hsBte7 https://ibb.co/nsL8CS Thanks

ADD REPLY
0
Entering edit mode

can I ask your opinion about bowtie output ?

ADD REPLY

Login before adding your answer.

Traffic: 1863 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6