RNA-seq alignment rate is too low.
2
0
Entering edit mode
6.1 years ago
zhul09 • 0

Hello

Recently, I have read a paper (Tiedt, S., et al. (2017). RNA-seq identifies circulating miR-125a-5p, miR-125b-5p and miR-143-3p as potential biomarkers for acute Ischemic stroke. Circulation research, CIRCRESAHA-117). Some detail of this paper is followed: PMID: 28724745 DOI: 10.1161/CIRCRESAHA.117.311572 Pubmed GEO database: SRA: SRP133275

I wanted to get the expression matrix of miRNA after stroke in human circulating blood. I got these files (SRA format) from Pubmed GEO database. Trimmed them with Trimmomatic software, and used the Hisat2 software to align the reads to the genome. However, the alignment is too low as followed.

6644136 reads; of these:
  6644136 (100.00%) were unpaired; of these:
    6631500 (99.81%) aligned 0 times
    2981 (0.04%) aligned exactly 1 time
    9655 (0.15%) aligned >1 times
    0.19% overall alignment rate

Here is the shell script:

hisat2 -p 4 --dta -x ./indexes/genome_tran -U ./samples/ SRR6761159.fastq -S ./temp/ SRR6761159.sam

The indexes file is “genome_tran.[1-8].ht2”.

The alignment is too low. Does anyone have any suggestions on how to address this problem? Thank you.

rna-seq alignment assembly • 3.7k views
ADD COMMENT
2
Entering edit mode

If this is miRNA data then you should not be using HISAT2 for alignments. You would want ungapped alignments and bowtie v.1 would be more appropriate.

ADD REPLY
0
Entering edit mode

Thank you. I will try it now.

ADD REPLY
0
Entering edit mode

Hi, genomax. Thanks for your help. Last two days, I used the bowtie software, but it was still similar to the above result. Was the index file appropriate? It was downloaded from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/GRCh38_no_alt.zip Is there an index file specially for miRNA alignment? I tried the miRbase website, downloaded the hairpin.fa and mature.fa files, built them to *.ebwt format index files with bowtie-build. Then I repeated the alignment process again. It didn’t work either. I think this may be because these .fa format files are not human reference files.

ADD REPLY
0
Entering edit mode

Have you cleaned the data downloaded from SRA? It may still have adapter sequences in it. Sequence data from miRBase generally has U bases which have to be changed to T before you can do the alignments. Have you done that before creating the bowtie indexes?

ADD REPLY
0
Entering edit mode

Hi, genomax. Thanks for your help. Do you know how to get adapter sequences? I got some from the fastqc result report file, using trimmomatic to trim reads, aligning the trimmed reads to genome. Here is the report.

# reads processed: 5131255
# reads with at least one reported alignment: 198985 (3.88%)
# reads that failed to align: 4932270 (96.12%)
Reported 198985 alignments

The alignment rate increased, but it was still too low.

ADD REPLY
0
Entering edit mode

You know that a .gtf file is not a genome, right?

ADD REPLY
0
Entering edit mode

Thank you. That is a typo. I have corrected it. The indexes file is ht2 format file.

ADD REPLY
0
Entering edit mode
6631500 (99.81%) aligned 0 times

99 % aligned zero times (not at all aligned) ? How much sure are you about the data?

ADD REPLY
0
Entering edit mode

This result was given by hisat2. It seems that these reads were not mapped to the genome at all. I am not sure about the data. These files are downloaded from pubmed GEO database.

ADD REPLY
0
Entering edit mode

Which Hisat2 version are you using?

ADD REPLY
0
Entering edit mode

HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo) Operate system: Manjaro Linux 64 bit. This software works fine when I process another RNAseq data of mice with indexes files for mice.

ADD REPLY
0
Entering edit mode

That sounds really strange. Now I almost think there is a mixup with the samples. Could you try selecting a couple of random reads and blasting the first say 30-40 nucleotides (online tool here) just to make sure there is not a mixup with the samples?

ADD REPLY
0
Entering edit mode

Have you tried using a mir reference? i.e. a reference only containing the mirs sequences? You can use MiRbase download to get this data: http://www.mirbase.org/ftp.shtml

ADD REPLY
2
Entering edit mode
6.0 years ago
ATpoint 86k

Hi zhul09,

the adapter sequence you are looking for is TGGAATTCTCGGGTGCCAAGG. When you run fastqc before and after trimming, you'll see that the adapter content will be vanished. I quickly downloaded one of the files of this dataset (SRR6761198) and can confirm that aligning against the combined mature- and hairpin files from mirbase is problematic:

$ bowtie -S mirbase.fa SRR6761198-trimmed.fastq out.sam
# reads processed: 3790622
# reads with at least one reported alignment: 16997 (0.45%)
# reads that failed to align: 3773625 (99.55%)
Reported 16997 alignments

whereas against the full hg38:

$ bowtie --threads 16 -S hg38 SRR6761198-trimmed.fastq out.sam
# reads processed: 3790622
# reads with at least one reported alignment: 1632354 (43.06%)
# reads that failed to align: 2158268 (56.94%)
Reported 1632354 alignments

I only had a brief look at the methods, and this seems to be a plasma sample, so circulating small RNAs. Not my expertise at all, but maybe it is expected to have this alignment result, and the simple presence of miRs in this sample is already informative in terms of biomarkers, but this is something you have to know or discuss with your supervisor. I could imagine (just thinking aloud) that upon stroke the necrotised cells release brain-specific miRs which normally one would not find in the blood. So given that was true, finding some of them among the aligned reads from a blood plasma sample (even though it might only be like 100 reads or 0.01%) could already serve as a valuable information. You should see if within those reads that align to mirbase, you find the miRs back that the paper describes as potential biomarkers to increase your confidence. Hope this helps!

ADD COMMENT
0
Entering edit mode

Thank you, ATpoint! According to the sequence you gave, I got the similar result as yours. How do you get the adapter sequence? The reference file downloaded from mirBase website didn't specify its species, Can this file be used for all kinds of species ? I am confused about this.

ADD REPLY
0
Entering edit mode

I got the sequence by searching google. Alternatively, if you download bbmap, in its main directory there is a folder "resources" that contains a file adapter.fa with all common adapter sequences. As fir the mirBase, I actually did not select for human miRs, that is true. The file on the main page is probably the merge for all species, but this is only guessing, I never productively used miRbase as this is not my field. Probably getting the sequences only for human is better in this case.

ADD REPLY
0
Entering edit mode
6.1 years ago
rbagnall ★ 1.8k

I think your indexes file is incorrect. -x <hisat2-idx> The basename of the index for the reference genome.

You need to index the genome with hisat2-build command first. From the manual, hisat2-build builds a HISAT2 index from a set of DNA sequences.

ADD COMMENT
0
Entering edit mode

Thank you. That is a typo. The indexes file(ht2 format) was downloaded from "ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_tran.tar.gz". Since it was already a file in ht2 format, I didn't use hisat2-build to rebuild the index file.

ADD REPLY

Login before adding your answer.

Traffic: 2603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6