Question

Helicobacter pylori low-rate alignment

0

10 days ago

Ariadna ▴ 20

Dear colleagues.

As a beginner in bioinformatics, I am trying to reproduce a paper dealing with the transcriptome of Helicobacter pylori.

Authors state that

We primarily analysed H. pylori strain 26695 grown to midlogarithmic phase (ML2/1 libraries), or under acid stress at pH 5.2 (AS2/1) resembling the host environment.

1. First, I downloaded the respective experiments (ML2/1, etc.). Each experiment had 10 runs, so I checked all of the boxes for the SRA downloads

Fastp was run to perform QC. The initial state of the data is given below:

Then the read ends and the first 25 bases were trimmed, so the output was

Using bowtie 2 and the genomic annotation taken from

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000008525.1/

with file source: all, file type: genome sequences (FASTA). The following line was run:

bowtie2-build GCF_000827025.1_genomic.fna index_prefix

Then, using the index_prefix files that were created, the alignment step was performed:

bowtie2 -x index_prefix -U trimmed_ml_minus.fq -S bowtie2_alignments_ML_minus.sam

51591 reads; of these:
  51591 (100.00%) were unpaired; of these:
    49613 (96.17%) aligned 0 times
    494 (0.96%) aligned exactly 1 time
    1484 (2.88%) aligned >1 times
3.83% overall alignment rate
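
As a cross-check on the summary above, here is a minimal Python sketch that recomputes the overall rate from the reported counts. The helper name `parse_bowtie2_summary` is made up for illustration and only handles the unpaired-read layout shown here; it is not part of bowtie2.

```python
import re

def parse_bowtie2_summary(text):
    """Extract read counts from a bowtie2 summary for unpaired reads."""
    total = int(re.search(r"^(\d+) reads; of these:", text, re.M).group(1))
    unaligned = int(re.search(r"(\d+) \([\d.]+%\) aligned 0 times", text).group(1))
    unique = int(re.search(r"(\d+) \([\d.]+%\) aligned exactly 1 time", text).group(1))
    multi = int(re.search(r"(\d+) \([\d.]+%\) aligned >1 times", text).group(1))
    return {"total": total, "unaligned": unaligned, "unique": unique, "multi": multi}

# Summary text copied from the bowtie2 run above
summary = """51591 reads; of these:
  51591 (100.00%) were unpaired; of these:
    49613 (96.17%) aligned 0 times
    494 (0.96%) aligned exactly 1 time
    1484 (2.88%) aligned >1 times
3.83% overall alignment rate"""

counts = parse_bowtie2_summary(summary)
aligned = counts["unique"] + counts["multi"]      # reads with at least one alignment
rate = 100.0 * aligned / counts["total"]          # overall alignment rate in percent
multi_share = 100.0 * counts["multi"] / aligned   # multi-mappers among aligned reads

print(f"aligned reads: {aligned}")
print(f"overall alignment rate: {rate:.2f}%")
print(f"multi-mapping share of aligned reads: {multi_share:.1f}%")
```

The overall rate is simply (unique + multi) / total, so 1978 of 51591 reads. Most of the few reads that do align map to more than one location.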

Could someone explain what I am doing incorrectly?

Thanks in advance.

P.S.

Dear colleagues, thank you for your inputs.

Minimap2 increases the share of aligned reads to about 50% (Hisat2 yielded almost the same percentage of aligned reads as bowtie2). However, even after fastp trimming there is a big problem with polyA adapter content: Post-fastp fastqc output

I tried to cut all trailing runs of A's of length >=10 using cutadapt; however, as a result of this procedure unsatisfactory (failure=x) outputs are given with respect to the per base sequence content and sequence length distribution.
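
To make the effect of that trimming step concrete, here is a small Python sketch of the same idea: removing a trailing run of 10 or more A's. It is not the cutadapt implementation, and the reads are made up for illustration. It shows why read lengths stop being uniform after trimming, and how a read consisting only of A's ends up with length zero.

```python
import re
from collections import Counter

# A run of 10 or more A's anchored at the 3' end of the read
POLY_A_TAIL = re.compile(r"A{10,}$")

def trim_poly_a(seq, qual):
    """Remove a trailing run of >=10 A's and the matching quality values."""
    m = POLY_A_TAIL.search(seq)
    if m is None:
        return seq, qual
    return seq[:m.start()], qual[:m.start()]

# Hypothetical reads (sequence, quality string); not taken from the real data
reads = [
    ("ACGTACGTACGT" + "A" * 15, "I" * 27),  # long poly-A tail, gets trimmed
    ("GGCTTAGCCATG" + "A" * 5, "I" * 17),   # tail shorter than 10, kept as is
    ("TTGACCGTAGGCTA", "I" * 14),           # no tail
    ("A" * 20, "I" * 20),                   # only A's, becomes empty
]

trimmed = [trim_poly_a(s, q) for s, q in reads]
lengths_before = Counter(len(s) for s, _ in reads)
lengths_after = Counter(len(s) for s, _ in trimmed)
empty_reads = sum(1 for s, _ in trimmed if len(s) == 0)

print("lengths before:", sorted(lengths_before.items()))
print("lengths after: ", sorted(lengths_after.items()))
print("empty reads after trimming:", empty_reads)
```

If I recall the FastQC documentation correctly, the sequence length distribution module warns whenever reads differ in length and fails when some reads have zero length. A failure after trimming may therefore just mean that empty reads were left in the file. cutadapt's `--minimum-length` option discards reads below a chosen length.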

What is the best trade off in this scenario?

RNA-seq transcriptomics • 424 views

ADD COMMENT • link updated 8 days ago by Ram 45k • written 10 days ago by Ariadna ▴ 20

1

using bowtie 2 and the genomic annotation taken from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000008525.1/

Did you see the following warning on the genome page you linked? Perhaps this is not the best genome to use for alignments.

Status: RefSeq GCF_000008525.1 is suppressed
This record was removed by RefSeq staff. Please contact info@ncbi.nlm.nih.gov for further details. Reason: many frameshifted proteins

ADD REPLY • link 9 days ago by GenoMax 150k

1

What is the best trade off in this scenario?

As noted earlier, this data is ~15 years old at this point, i.e. from the dawn of NGS tech. It is not that great, as you have discovered. You have not said why you are specifically using this data, since there are many other recent H. pylori datasets that you will find at SRA (e.g. go to https://sra-explorer.info/ and search). It would be best to use one of those datasets unless you must use this one. In that case, move on with what you have (~50% alignments, they may be crappy though).

as a result of this procedure unsatisfactory (failure=x) outputs are given with respect to the per base sequence content and sequence length distribution.

"Failures" on FastQC test criteria are not a signal that the data is non-usable (caveat: in this specific case part of it seems to be true). You can move on with the rest of the analysis keeping the above in mind.

ADD REPLY • link 9 days ago by GenoMax 150k

score 1 · Answer 1 · 2025-04-14

1

9 days ago

colindaven 7.4k

These are really, really old 454 data. They likely have a read length of about 400bp. The bowtie2 aligner is not suitable for these AFAIK.

Try Hisat2 or STAR for RNA-seq, or first minimap2 to at least get an alignment.

Or just get a more modern (2018+) RNA-seq dataset with Illumina data and run it with Hisat2.

There are also specific tools for bacterial RNA-seq like Rockhopper - https://github.com/btjaden/Rockhopper

ADD COMMENT • link 9 days ago by colindaven 7.4k

0

They likely have a read length of about 400bp

Looks like the data is ~130 bp.
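
One way to settle the read-length question is to tabulate lengths directly from the FASTQ file. Below is a minimal Python sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines). The three records are made up for illustration; in practice you would open the real file instead of the in-memory example.

```python
import io
import statistics

def read_lengths(handle):
    """Yield the sequence length of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record is the sequence
            yield len(line.rstrip("\n"))

# Hypothetical in-memory FASTQ; replace with open("your_reads.fq") for real data
fastq = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
    "@r3\nACGTACGTACGTAC\n+\nIIIIIIIIIIIIII\n"
)

lengths = list(read_lengths(fastq))
print("reads:", len(lengths))
print("min / median / max length:",
      min(lengths), statistics.median(lengths), max(lengths))
```

Running this on the untrimmed and the trimmed files shows how much sequence the trimming steps removed.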

ADD REPLY • link 9 days ago by GenoMax 150k