Hi everyone
Can you please help me extract a SAM file from an SRA file?
I took the dataset from here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1208162
I downloaded the SRA file.
Then I ran sam-dump on it (the dataset page says that the reads are already aligned).
But the result was a very small file (~50 MB). When I tried to convert it to a sorted BAM file, I got:
[bam_header_read] EOF marker is absent. The input is probably truncated.
[sam_header_line_parse] expected '@XY', got [@HD VN:1.3]
Hint: The header tags must be tab-separated.
[samopen] no @SQ lines in the header.
I tried to look at summary information of this SRA file here: http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR951914
And I didn't see any alignment information there.
I then tried to check the alignment information with this command:
vdb-dump ./SRR951915.sra | grep "ALIGNMENT_COUNT"
and got an error:
vdb-dump.2.1.7 int: data bad version while constructing page map within virtual database module - VCursorCellData( col:PANEL at row #1074 ) failed
The fastq-dump command also failed with an error: data bad version while constructing page map within virtual database module - failed SRR951914.sra
Any idea?
Thank you for your answer, it helped
I read about the --split-spot parameter. However, I still don't quite understand what it means. Could you explain --split-spot more clearly? What does "spot" mean?
As I understand it, a spot contains biological information (the reads themselves) and technical information such as adapters, barcodes for multiplexing, etc. More about this here --> What Is A "Spot" In Sra Format
From SRA Handbook:
"The spot descriptor captures information that would allow the user of the SRA to interpret the sequencing data and differentiate between technical and application extents in the read. Reads that are mate pairs are concatenated into a single monolithic “spot” sequence." (http://www.ncbi.nlm.nih.gov/books/NBK54984/)
About fastq-dump options
Let's take an example: the SRR385952.sra sample (http://www.ncbi.nlm.nih.gov/sra/?term=SRR385952). You can see that the sample should contain forward and reverse reads, each 101 bases long. These sequences are joined in the SRA file and need to be split. You can do this with:
1) --split-spot option: ./fastq-dump --split-spot SRR385952.sra This gives you a single file, with the reverse read of each pair directly below the forward read of that pair.
2) --split-files option: ./fastq-dump --split-files SRR385952.sra This outputs two fastq files: one for the forward reads and one for the reverse reads.
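If it helps to picture what the two options do, here is a small Python toy. It is only a sketch of the idea, not how fastq-dump is implemented: the spot names, the 4-base reads (instead of 101) and the split_spot helper are all made up for illustration.

```python
# Toy illustration only: real SRA files are binary, and fastq-dump does the
# splitting for you. Names and sequences below are hypothetical.

def split_spot(spot, read_lengths):
    """Cut a concatenated spot sequence into its individual reads."""
    reads = []
    start = 0
    for length in read_lengths:
        reads.append(spot[start:start + length])
        start += length
    return reads

# Two made-up paired-end spots, shortened to 4 + 4 bases for readability
# (for SRR385952 this would be 101 + 101).
spots = {"spot1": "ACGTTTGA", "spot2": "GGCATACC"}
read_lengths = [4, 4]

interleaved = []            # like --split-spot: one stream, forward then reverse
forward, reverse = [], []   # like --split-files: one stream per read

for name, seq in spots.items():
    fwd, rev = split_spot(seq, read_lengths)
    interleaved += [(name, fwd), (name, rev)]
    forward.append((name, fwd))
    reverse.append((name, rev))

print("one file :", interleaved)
print("file _1  :", forward)
print("file _2  :", reverse)
```

Without either option, the whole 8-base string (202 bases in the real sample) would come out as one record per spot.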
You can find more info in this blog post --> https://nsaunders.wordpress.com/2011/12/22/sequencing-for-relics-from-the-sanger-era-part-1-getting-the-raw-data/
Hope this helps.
PS. There is another option, --split-3, which gives you a pair of fastq files in which corresponding records form a read pair. Any reads without a mate go into a third file.
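As a rough picture of how --split-3 sorts things, here is another Python toy. Again, this is a sketch with made-up spot names and sequences, and route_split3 is a hypothetical helper, not part of the SRA toolkit. Spots with two biological reads go to the _1/_2 pair of files, and a spot with only one read goes to a separate file.

```python
# Toy illustration only; the data and the helper name are hypothetical.

def route_split3(spots):
    """Sort spots into the three output streams described for --split-3."""
    file_1, file_2, unpaired = [], [], []
    for name, reads in spots:
        if len(reads) == 2:
            # Complete pair: same position in both paired files.
            file_1.append((name, reads[0]))
            file_2.append((name, reads[1]))
        elif len(reads) == 1:
            # Only one biological read present: goes to the third file.
            unpaired.append((name, reads[0]))
    return file_1, file_2, unpaired

spots = [
    ("spot1", ["ACGT", "TTGA"]),
    ("spot2", ["GGCA"]),          # mate missing
    ("spot3", ["CCAT", "AGGT"]),
]

file_1, file_2, unpaired = route_split3(spots)
print("file _1 :", file_1)
print("file _2 :", file_2)
print("unpaired:", unpaired)
```

The useful property is that record N in the first file always pairs with record N in the second, which is what most paired-end aligners expect.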