Question

SRA to BAM

2

9.7 years ago

marina-orlova ▴ 90

Hi everyone

Can you please help me extract a SAM file from an SRA file?

I took the dataset from here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1208162

I downloaded the SRA file.

Then I ran sra-dump on it (the dataset page says the reads are already aligned).

But the result was a very small file (~50 Mb). When I tried to convert it to a sorted BAM file, I got:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[sam_header_line_parse] expected '@XY', got [@HD VN:1.3]
Hint: The header tags must be tab-separated.
[samopen] no @SQ lines in the header.

I tried to look at summary information of this SRA file here: http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR951914

I didn't see any information about alignment there.

I then tried to look at the alignment information with this command:

vdb-dump ./SRR951915.sra | grep "ALIGNMENT_COUNT"

and got an error:

vdb-dump.2.1.7 int: data bad version while constructing page map within virtual database module - VCursorCellData( col:PANEL at row #1074 ) failed
fastq-dump command led to an error: data bad version while constructing page map within virtual database module - failed SRR951914.sra

Any idea?

sam-dump sra • 8.8k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by marina-orlova ▴ 90

1

9.7 years ago

Evgeniia Golovina ★ 1.3k

Hi, Marina

You mean sam-dump, right?

It seems to me that your file is not a SAM file; it's just raw reads. See the "Reads" tab for your SRA file --> http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR951914

Let me try to look at this dataset.

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Evgeniia Golovina ★ 1.3k

0

Hi Evgeniia

Yes, you are right, it is raw reads. But fastq-dump also didn't work:

fastq-dump SRR951914.sra > 951914.fastq
2015-04-08T08:50:36 fastq-dump.2.1.7 err: data bad version while constructing page map within virtual database module - failed SRR951914.sra

Maybe the SRA archive is broken.

I took it from here: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP028%2FSRP028808/SRR951914/

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by marina-orlova ▴ 90

Ram · Accepted Answer · 2015-04-08

4

9.7 years ago

Evgeniia Golovina ★ 1.3k

Hi, yesterday I got a good output from fastq-dump.

Your command line is wrong. To run fastq-dump correctly, you need to know whether your reads are single or paired. In your case the reads are single, so the command line is:

./fastq-dump --split-spot SRR951914.sra

For paired reads:

./fastq-dump --split-files *file*.sra
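
As a rough sketch of the choice above, a small wrapper can pick the flag from the library layout. The function name `dump_cmd` and the `single`/`paired` labels are hypothetical, made up for this illustration; they are not part of the SRA Toolkit. The function only prints the command, so you can check it before running it on a real file.

```shell
#!/bin/sh
# Hypothetical helper: print the fastq-dump command for a given layout.
# "single" and "paired" are labels chosen for this sketch, not fastq-dump options.
dump_cmd() {
    layout=$1
    sra_file=$2
    case "$layout" in
        single) echo "./fastq-dump --split-spot $sra_file" ;;
        paired) echo "./fastq-dump --split-files $sra_file" ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

# Print the command for the single-read run from this thread.
dump_cmd single SRR951914.sra
```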

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Evgeniia Golovina ★ 1.3k

0

Thank you for your answer, it helped

ADD REPLY • link 9.7 years ago by marina-orlova ▴ 90

0

I read the description of the --split-spot parameter, but I still don't quite understand what it means. Could you explain --split-spot more clearly?

Read Splitting                     Sequence data may be used in raw form or split into individual reads  
 --split-spot                      Split spots into individual reads

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 8.8 years ago by Shicheng Guo ★ 9.6k

3

What does "spot" mean?

As I understand it, a spot contains biological information (the reads themselves) and technical information such as adapters, barcodes for multiplexing, etc. More about this here --> What Is A "Spot" In Sra Format

From SRA Handbook:

"The spot descriptor captures information that would allow the user of the SRA to interpret the sequencing data and differentiate between technical and application extents in the read. Reads that are mate pairs are concatenated into a single monolithic “spot” sequence." (http://www.ncbi.nlm.nih.gov/books/NBK54984/)

About fastq-dump options

Let's take an example: the SRR385952.sra sample (http://www.ncbi.nlm.nih.gov/sra/?term=SRR385952). You can see that the sample should contain forward and reverse sequences, each of length 101. These sequences are joined in the SRA file and need to be split. You can do this with either of the following:

1) --split-spot option: ./fastq-dump --split-spot SRR385952.sra
This gives you a single file, with the reverse read of each pair directly below the forward read for that pair.

2) --split-files option: ./fastq-dump --split-files SRR385952.sra
This outputs two FASTQ files: one for the forward reads and one for the reverse reads.
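
To make the idea of "splitting a spot" concrete, here is a toy sketch that needs no SRA Toolkit at all. It treats a spot as the forward and reverse reads stored as one string and cuts it at a known read length. The function `split_spot` and the 8-base sequence are invented for illustration; this mimics the concept only and is not how fastq-dump works internally.

```shell
#!/bin/sh
# Toy illustration only: a "spot" here is two reads stored as one string.
split_spot() {
    spot=$1
    read_len=$2
    # First read: bases 1..read_len.
    printf '%s\n' "$spot" | cut -c "1-$read_len"
    # Second read: everything after the first read.
    printf '%s\n' "$spot" | cut -c "$((read_len + 1))-"
}

# An 8-base toy spot made of two 4-base reads
# (the reads in SRR385952 are 101 bases each).
split_spot ACGTTTGA 4
```

With --split-spot both pieces stay in one output file, one below the other; with --split-files the first piece of every spot goes to one file and the second piece to another.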

You can find more info in this blog post --> https://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/

Hope this helps.

PS. There is another option, --split-3, which writes paired reads to two FASTQ files (corresponding records in the two files form a pair) and puts any reads without a mate into a third file.
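
Whichever paired option you use, a quick sanity check is that the two mate files hold the same number of records. A minimal sketch follows, assuming plain four-line FASTQ records with no wrapped sequence lines; the function names `fastq_records` and `mates_match` are hypothetical helpers, not part of any toolkit.

```shell
#!/bin/sh
# Count FASTQ records, assuming exactly four lines per record.
fastq_records() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Succeed only when both mate files hold the same number of records.
mates_match() {
    [ "$(fastq_records "$1")" -eq "$(fastq_records "$2")" ]
}
```

For example, after dumping SRR385952 into two files you could run `mates_match SRR385952_1.fastq SRR385952_2.fastq && echo "mate files agree"`.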

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 8.8 years ago by Evgeniia Golovina ★ 1.3k