Dear colleagues.
As a beginner in bioinformatics, I am trying to reproduce a paper on the transcriptome of Helicobacter pylori.
The authors state that:
We primarily analysed H. pylori strain 26695 grown to midlogarithmic phase (ML2/1 libraries), or under acid stress at pH 5.2 (AS2/1) resembling the host environment.
1. First, I downloaded the respective experiments (ML2/1, etc.). Each experiment had 10 runs, so I ticked all of the boxes.
- fastp was run to perform QC. The initial state of the data is given below:
- Then the ends and the first 25 bases were trimmed, so the output was:
- Using bowtie2 and the genome sequence taken from
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000008525.1/
with file source: all, file type: genome sequences (FASTA), the following line was run:
bowtie2-build GCF_000827025.1_genomic.fna index_prefix
Then, using the several index_prefix files created, the alignment step was performed:
bowtie2 -x index_prefix -U trimmed_ml_minus.fq -S bowtie2_alignments_ML_minus.sam
51591 reads; of these:
  51591 (100.00%) were unpaired; of these:
    49613 (96.17%) aligned 0 times
    494 (0.96%) aligned exactly 1 time
    1484 (2.88%) aligned >1 times
3.83% overall alignment rate
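For reference, the overall rate in this report is simply the two "aligned" categories divided by the total number of reads. A minimal sketch of that arithmetic, using the counts from the report above (the variable names are only for illustration):

```shell
# Counts copied from the bowtie2 summary above.
total=51591
unique=494     # aligned exactly 1 time
multi=1484     # aligned >1 times

# Overall alignment rate = (unique + multi) / total, as a percentage.
rate=$(awk -v t="$total" -v u="$unique" -v m="$multi" \
  'BEGIN { printf "%.2f", 100 * (u + m) / t }')
echo "overall alignment rate: ${rate}%"
```

So only 1978 of the 51591 reads found any placement on the indexed sequence.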
Could someone explain what I am doing incorrectly?
Thanks in advance.
P.S.
Dear colleagues, thank you for your inputs.
Minimap increases the number of aligned reads to about 50% (HISAT2 yielded almost the same percentage of aligned reads as bowtie2). However, even after fastp trimming there is a big problem with polyA adapter content:
I tried to cut all trailing runs of A's of length >= 10 using cutadapt; however, after this procedure the QC report gives unsatisfactory (failure = x) results for per base sequence content and sequence length distribution.
What is the best trade-off in this scenario?
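To make the trimming step concrete, here is a minimal sketch in plain awk of what "cut trailing A's of length >= 10" does to each read. It is not cutadapt and is not meant to replace it; `trim_polya` and its minimum-length argument are hypothetical names used for illustration, and the input is assumed to be an uncompressed FASTQ with exactly 4 lines per record. Reads that end up shorter than the minimum length are dropped, which is one way to avoid leaving empty or very short reads behind.

```shell
# Hypothetical helper (illustration only, not part of cutadapt):
# reads a 4-line-per-record FASTQ on stdin, removes a trailing run of
# 10 or more A's from each read (and the matching quality characters),
# and drops reads shorter than min_len after trimming.
trim_polya() {
  min_len="${1:-20}"
  awk -v min="$min_len" '
    NR % 4 == 1 { head = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
      qual = $0
      # Ten A characters, the last one repeated: a run of >= 10 A at the 3-prime end.
      if (match(seq, /AAAAAAAAAA+$/)) {
        seq  = substr(seq, 1, RSTART - 1)
        qual = substr(qual, 1, RSTART - 1)
      }
      if (length(seq) >= min) print head "\n" seq "\n" plus "\n" qual
    }'
}

# Example usage (file names are placeholders):
# trim_polya 20 < trimmed_ml_minus.fq > trimmed_ml_minus.polyA.fq
```

Because each read loses a different number of bases, the read lengths are no longer uniform after this step, so a changed sequence length distribution is an expected consequence of the trimming rather than a sign that it went wrong.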
Did you see the following warning on the genome page you linked? Perhaps this is not the best genome to use for alignments.
As noted earlier, this data is ~15 years old at this point, i.e. from the dawn of NGS tech. It is not that great, as you have discovered. You have not said why you are specifically using this data, since there are many other recent H. pylori datasets that you will find at SRA (e.g. go to https://sra-explorer.info/ and search). It would be best to use one of those datasets unless you must use this one. In that case, move on with what you have (~50% alignments, though they may be crappy).
"Failures" on FastQC test criteria are not a signal that the data is unusable (caveat: in this specific case part of it seems to be true). You can move on with the rest of the analysis, keeping the above in mind.