Question

SRA Toolkit and bowtie2: 0.00% overall alignment rate - what am I missing?

1

Entering edit mode

10.2 years ago

Bilal Akil ▴ 30

Hi guys. I'm just getting started with bioinformatics and this is my first post on Biostars. Thanks in advance for all your help.

I've been trying to successfully acquire sample reads and align them for quite some time now, but things still seem to be going wrong. I'm using NCBI's Sequence Read Archive (SRA) to find sample data. I've been downloading and formatting that data using their SRA Toolkit, with tools like prefetch and fastq-dump. Then I've tried to align them using bowtie2.

I'll list the exact commands I've used:

prefetch SRR390728
sra-stat -x --quick SRR390728

This is what the statistics look like:

<Run accession="SRR390728" spot_count="7178576" base_count="516857472" base_count_bio="516857472" cmp_base_count="49610016">
  <Size value="193606607" units="bytes"/>
  <AlignInfo>
    <Ref seqId="GPC_000000394.1" name="5" .../>
    <Ref seqId="GPC_000000395.1" name="Y" .../>
    <Ref seqId="NC_000001.9" name="1" .../>
    <Ref seqId="NC_000002.10" name="2" .../>
    ...
    <Ref seqId="NC_000022.9" name="22" .../>
    <Ref seqId="NC_000023.9" name="X" .../>
    <Ref seqId="NC_001807.4" name="M" .../>
  </AlignInfo>
  <QualityCount>
    <Quality value="4" count="1285575"/>
    <Quality value="5" count="6274952"/>
    ...
    <Quality value="30" count="1110"/>
    <Quality value="33" count="4722"/>
  </QualityCount>
  <Databases>
    <Database>
      <Table name="PRIMARY_ALIGNMENT">
        <Statistics source="meta">
          <Rows count="12979096"/>
          <Elements count="467247456"/>
        </Statistics>
      </Table>
      <Table name="REFERENCE">
        <Statistics source="meta">
          <Rows count="616097"/>
          <Elements count="3080436051"/>
        </Statistics>
      </Table>
      <Table name="SEQUENCE">
        <Statistics source="meta">
          <Rows count="7178576"/>
          <Elements count="516857472"/>
        </Statistics>
      </Table>
    </Database>
  </Databases>
</Run>

To me this means there is alignment data in there, particularly in table: PRIMARY_ALIGNMENT. I guess raw sequencing data is also in the SEQUENCE table. fastq-dump documentation states that the SEQUENCE table is used by default, so the --table option is omitted from the first call.

fastq-dump -O ../output SRR390728
fastq-dump --table PRIMARY_ALIGNMENT -O ../output --fasta SRR390728

This leaves us with a couple of files in the output directory:

SRR390728.fastq - a fastq file containing the raw reads from the SRA file; and
SRR390728.fasta - a fasta file containing reads from the SRA file's PRIMARY_ALIGNMENT table.

Next, I move onto using bowtie2 - first by indexing the aligned reads so we can use it as a reference:

bowtie2-build output/SRR390728.fasta output/bowtie/SRR390728

This outputs a whole bunch of files into the output/bowtie directory. Finally, I can try to align the raw reads:

bowtie2 -x output/bowtie/SRR390728 -U output/SRR390728.fastq -S output/SRR390728.sam

30 minutes later, I've got the following depressing output:

7178576 reads; of these:
  7178576 (100.00%) were unpaired; of these:
    7178576 (100.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.00% overall alignment rate

The output sam file looks like this (without the starting line numbers):

       1 @HD VN:1.0  SO:unsorted
       2 @SQ SN:SRR390728.1  LN:36
       3 @SQ SN:SRR390728.2  LN:36
       4 @SQ SN:SRR390728.3  LN:36
..
..
..
12979095 @SQ SN:SRR390728.12979094   LN:36
12979096 @SQ SN:SRR390728.12979095   LN:36
12979097 @SQ SN:SRR390728.12979096   LN:36
12979098 @PG ID:bowtie2  PN:bowtie2  VN:2.2.5    CL:".../bowtie2-2.2.5/bowtie2-align-s --wrapper basic-0 -x output/bowtie/SRR390728 -S output/SRR390728.sam -U output/SRR39         0728.fastq"
12979099 SRR390728.1 4   *   0   0   *   *   0   0   CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCGATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC    ;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&(             YT:Z:UU
12979100 SRR390728.2 4   *   0   0   *   *   0   0   AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGTTCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT    ;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262             YT:Z:UU
12979101 SRR390728.3 4   *   0   0   *   *   0   0   CCAGCCTGGCCAACAGAGTGTTACCCCGTTTTTACTTATTTATTATTATTATTTTGAGACAGAGCATTGGTC    -;;;8;;;;;;;,*;;';-4,44;,:&,1,4'./&1;;;;;;;669;;99;;;;;-;3;2;0;+;7442&2/   
..
..
..
20157672 SRR390728.7178574   4   *   0   0   *   *   0   0   AGAAAAAGGATGAATNNNNNNNNNNNNNNNANNNNNNNNNNNATNNCTTCNNNNGTNNANNNNNNNNNNNNT    ;;;;;;-;;;;;;*)%%%%%%%%%%%%%%%6%%%%%&&&&&&;4&&;;;;&&&&.;&&;&&&&&&&&         &&&&3    YT:Z:UU YF:Z:NS
20157673 SRR390728.7178575   4   *   0   0   *   *   0   0   AGTTTTAATTTTTNATATTTTACTTCATAGTCTTTTACACATTTTAAAATGACCTAAATTAACGACATATCA    ;;;;;;;8;;;;;%;&3;;,;&+;;1:)8+504&5/0776;;;16/10&1/.1;4.;;**4;0&7&&         6&*-&    YT:Z:UU
20157674 SRR390728.7178576   4   *   0   0   *   *   0   0   AATATCACAGCGANCGCTATAGATCGGAAGATCGGTATAGCGGTCGCTGTGATATTAGATCGGAAGAGCGTC    ;;;;;;1;;;;;1%;;;75;55:;:::%5720+,2/;;;;;;;;;;;;<;:4;;;;;:9;;;:;;,4         42&42    YT:Z:UU

Note that some of the pasted text (particularly surrounding the read qualities) might look messed up thanks to the online editor.

It seems that all of those reads have 4 as the value for their flag. The bowtie2 manual describes 4 as: "The read has no reported alignments".

Why not? What am I missing? Why was nothing aligned?

sra-toolkit alignment bowtie2 sra • 5.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30

Ram · Accepted Answer · 2015-05-21

4

Entering edit mode

10.2 years ago

Devon Ryan 105k

Try to rely as little as possible on SRA. The fastq files from fastq-dump are fine to use, but I would strongly encourage you to never use anything else.
This dataset is paired-end, you need to use the --split-files option.
Download the version of the human genome you would like to align against and build the bowtie2 indices with that. Alternatively, you can download pre-made indices from iGenomes (the files are huge though, it's often faster to just build things yourself).

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Devon Ryan 105k

0

Entering edit mode

Why do you recommend not relying on SRA? By first impression it looks like everything I need in terms of DNA samples can be found there.

I'm not sure which genome I'd like to align against. At the moment I'm just trying to figure out a good procedure for finding good DNA samples, downloading and formatting the data and getting alignment started.

Also, how did you figure that the dataset in the question was paired-ended? Was it somewhere in the output statistics?

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30

0

Entering edit mode

Like I said, the fastq files are fine, but trying to get reference sequences and such from SRA entries is likely to go badly, as you've already experienced. The reality is that typically people only include fastq files in their entries and even if they include more the question becomes whether you should really trust that the person uploading things did everything in a way that's useful for you.

You can see that the dataset is paired-end from looking at its entry on SRA's website. It's actually best to simply always specify --split-files.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Devon Ryan 105k

0

Entering edit mode

Worked after using the --split-files option as you suggested, however the results have led to a second question: Paired bowtie2: 91% overall alignment but 0.01% concordant - should I be worried?

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30