Hi guys. I'm just getting started with bioinformatics and this is my first post on Biostars. Thanks in advance for all your help.
I've been trying to successfully acquire sample reads and align them for quite some time now, but things still seem to be going wrong. I'm using NCBI's Sequence Read Archive (SRA) to find sample data. I've been downloading and formatting that data using their SRA Toolkit, with tools like prefetch
and fastq-dump
. Then I've tried to align them using bowtie2
.
I'll list the exact commands I've used:
prefetch SRR390728
sra-stat -x --quick SRR390728
This is what the statistics look like:
<Run accession="SRR390728" spot_count="7178576" base_count="516857472" base_count_bio="516857472" cmp_base_count="49610016">
<Size value="193606607" units="bytes"/>
<AlignInfo>
<Ref seqId="GPC_000000394.1" name="5" .../>
<Ref seqId="GPC_000000395.1" name="Y" .../>
<Ref seqId="NC_000001.9" name="1" .../>
<Ref seqId="NC_000002.10" name="2" .../>
...
<Ref seqId="NC_000022.9" name="22" .../>
<Ref seqId="NC_000023.9" name="X" .../>
<Ref seqId="NC_001807.4" name="M" .../>
</AlignInfo>
<QualityCount>
<Quality value="4" count="1285575"/>
<Quality value="5" count="6274952"/>
...
<Quality value="30" count="1110"/>
<Quality value="33" count="4722"/>
</QualityCount>
<Databases>
<Database>
<Table name="PRIMARY_ALIGNMENT">
<Statistics source="meta">
<Rows count="12979096"/>
<Elements count="467247456"/>
</Statistics>
</Table>
<Table name="REFERENCE">
<Statistics source="meta">
<Rows count="616097"/>
<Elements count="3080436051"/>
</Statistics>
</Table>
<Table name="SEQUENCE">
<Statistics source="meta">
<Rows count="7178576"/>
<Elements count="516857472"/>
</Statistics>
</Table>
</Database>
</Databases>
</Run>
To me this means there is alignment data in there, particularly in table: PRIMARY_ALIGNMENT
. I guess raw sequencing data is also in the SEQUENCE
table. fastq-dump
documentation states that the SEQUENCE
table is used by default, so the --table
option is omitted from the first call.
fastq-dump -O ../output SRR390728
fastq-dump --table PRIMARY_ALIGNMENT -O ../output --fasta SRR390728
This leaves us with a couple of files in the output
directory:
SRR390728.fastq
- afastq
file containing the raw reads from the SRA file; andSRR390728.fasta
- afasta
file containing reads from the SRA file'sPRIMARY_ALIGNMENT
table.
Next, I move onto using bowtie2
- first by indexing the aligned reads so we can use it as a reference:
bowtie2-build output/SRR390728.fasta output/bowtie/SRR390728
This outputs a whole bunch of files into the output/bowtie
directory. Finally, I can try to align the raw reads:
bowtie2 -x output/bowtie/SRR390728 -U output/SRR390728.fastq -S output/SRR390728.sam
30 minutes later, I've got the following depressing output:
7178576 reads; of these:
7178576 (100.00%) were unpaired; of these:
7178576 (100.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.00% overall alignment rate
The output sam
file looks like this (without the starting line numbers):
1 @HD VN:1.0 SO:unsorted
2 @SQ SN:SRR390728.1 LN:36
3 @SQ SN:SRR390728.2 LN:36
4 @SQ SN:SRR390728.3 LN:36
..
..
..
12979095 @SQ SN:SRR390728.12979094 LN:36
12979096 @SQ SN:SRR390728.12979095 LN:36
12979097 @SQ SN:SRR390728.12979096 LN:36
12979098 @PG ID:bowtie2 PN:bowtie2 VN:2.2.5 CL:".../bowtie2-2.2.5/bowtie2-align-s --wrapper basic-0 -x output/bowtie/SRR390728 -S output/SRR390728.sam -U output/SRR39 0728.fastq"
12979099 SRR390728.1 4 * 0 0 * * 0 0 CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCGATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC ;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&( YT:Z:UU
12979100 SRR390728.2 4 * 0 0 * * 0 0 AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGTTCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT ;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262 YT:Z:UU
12979101 SRR390728.3 4 * 0 0 * * 0 0 CCAGCCTGGCCAACAGAGTGTTACCCCGTTTTTACTTATTTATTATTATTATTTTGAGACAGAGCATTGGTC -;;;8;;;;;;;,*;;';-4,44;,:&,1,4'./&1;;;;;;;669;;99;;;;;-;3;2;0;+;7442&2/
..
..
..
20157672 SRR390728.7178574 4 * 0 0 * * 0 0 AGAAAAAGGATGAATNNNNNNNNNNNNNNNANNNNNNNNNNNATNNCTTCNNNNGTNNANNNNNNNNNNNNT ;;;;;;-;;;;;;*)%%%%%%%%%%%%%%%6%%%%%&&&&&&;4&&;;;;&&&&.;&&;&&&&&&&& &&&&3 YT:Z:UU YF:Z:NS
20157673 SRR390728.7178575 4 * 0 0 * * 0 0 AGTTTTAATTTTTNATATTTTACTTCATAGTCTTTTACACATTTTAAAATGACCTAAATTAACGACATATCA ;;;;;;;8;;;;;%;&3;;,;&+;;1:)8+504&5/0776;;;16/10&1/.1;4.;;**4;0&7&& 6&*-& YT:Z:UU
20157674 SRR390728.7178576 4 * 0 0 * * 0 0 AATATCACAGCGANCGCTATAGATCGGAAGATCGGTATAGCGGTCGCTGTGATATTAGATCGGAAGAGCGTC ;;;;;;1;;;;;1%;;;75;55:;:::%5720+,2/;;;;;;;;;;;;<;:4;;;;;:9;;;:;;,4 42&42 YT:Z:UU
Note that some of the pasted text (particularly surrounding the read qualities) might look messed up thanks to the online editor.
It seems that all of those reads have 4
as the value for their flag. The bowtie2 manual describes 4
as: "The read has no reported alignments".
Why not? What am I missing? Why was nothing aligned?
Why do you recommend not relying on SRA? By first impression it looks like everything I need in terms of DNA samples can be found there.
I'm not sure which genome I'd like to align against. At the moment I'm just trying to figure out a good procedure for finding good DNA samples, downloading and formatting the data and getting alignment started.
Also, how did you figure that the dataset in the question was paired-ended? Was it somewhere in the output statistics?
Like I said, the fastq files are fine, but trying to get reference sequences and such from SRA entries is likely to go badly, as you've already experienced. The reality is that typically people only include fastq files in their entries and even if they include more the question becomes whether you should really trust that the person uploading things did everything in a way that's useful for you.
You can see that the dataset is paired-end from looking at its entry on SRA's website. It's actually best to simply always specify
--split-files
.Worked after using the
--split-files
option as you suggested, however the results have led to a second question: Paired bowtie2: 91% overall alignment but 0.01% concordant - should I be worried?