sra to fastq conversion - paired-end file won't split
7.9 years ago
rioualen ▴ 750

Hello,

I'm using the fastq-dump program from the SRA Toolkit suite to convert sra files to fastq files, with the --split-files parameter to get separate fastq files for paired-end data. However, one of my files won't split...

I'm working on this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41190

The first three samples split fine, but the last one doesn't:

@SRR400301.1 1204:1:1:1641:935 length=152
NATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR400301.1 1204:1:1:1641:935 length=152
########################################################################################################################################################
@SRR400301.2 1204:1:1:1708:951 length=152
NATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGGGCGGTACTGCGGCGCGGGGGGGNAGAGGGTAGATCTCGGGGGGGGGCGGGTGATTAAAAAAAAAAATCGGGGGG
+SRR400301.2 1204:1:1:1708:951 length=152
)333377777@@@@CCCC@@@CCC@C@@@C@@CC@C@@@C58998@@@@@C@@@@@@@@@CC@C@#######################################################################################
@SRR400301.3 1204:1:1:1765:941 length=152
NTGAAACATCTAAGTACCCCGAGGAAAAGAAATCAACCGAGATTCCCCCAGTAGCGGCGAGCGAACGGGGGGGAGCTTCGCCTTTCCCTCACGGTACTGGNTCACTATCGGTCAGTCAGGAGTATTTAGCCTTGGAGGATGCTCCCCCCATA
+SRR400301.3 1204:1:1:1765:941 length=152

All 4 samples are registered as single-end 36bp reads in GEO, but the data are clearly paired-end 2x76bp. The FastQC report for the last file shows nothing unusual: https://github.com/rioualen/gene-regulation/blob/master/GSM1010247.png

Any clue what I'm missing here?

Thank you

fastq sra fastq-dump sratoolkit • 2.5k views
ADD COMMENT
7.9 years ago
Satyajeet Khare ★ 1.6k

The dataset you are working on is a mix of single-end and paired-end samples. SRR400301 is probably not a paired-end sample. There is only one file corresponding to SRR400301 in ENA.

ADD COMMENT

Hi, actually they're all paired-end samples, but registered as single-end. The FastQC shows it clearly, but I don't get why this is oddly formatted...

FastQC image: https://github.com/rioualen/gene-regulation/blob/master/GSM1010247.png

ADD REPLY

I am not sure how FastQC could identify a paired-end sample. The reads here don't look like paired-end reads, though.

ADD REPLY

The shape of the quality graph is typical of paired-end reads: you can see the usual quality drop around positions 76 and 152. It is the same with the other 3 samples. Plus, the samples are registered as paired-end in ENA.

ADD REPLY

That graph could indicate that they're actually paired, but there's no guarantee of it... that said, 152bp reads are pretty unusual, and it would make more sense to me if they were actually 2x76bp reads that got decompressed incorrectly.
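If the 152bp reads really are two 76bp mates stored back to back, one possible workaround is to cut each record in half yourself. Below is a minimal Python sketch of that idea. It assumes every read is exactly 152bp, that mate 1 occupies the first 76 bases, and that mate 2 is stored as sequenced, none of which is confirmed for this dataset. The function names and the `/1`, `/2` suffixes are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header without '@', sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FASTQ records.

    Blank lines are skipped; a truncated final record is silently dropped.
    """
    stripped = (line.rstrip("\r\n") for line in lines if line.strip())
    # Pull four consecutive lines per record from the same iterator.
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header[1:], seq, qual


def split_record(header: str, seq: str, qual: str,
                 mate_len: int = 76) -> Tuple[Record, Record]:
    """Cut one concatenated read into two mates of mate_len bases each.

    ASSUMPTION: the read is mate 1 followed directly by mate 2.
    """
    if len(seq) != 2 * mate_len or len(qual) != 2 * mate_len:
        raise ValueError(f"{header}: expected {2 * mate_len} bases, got {len(seq)}")
    read_id = header.split()[0]  # drop the now-stale 'length=152' part
    mate1 = (f"{read_id}/1", seq[:mate_len], qual[:mate_len])
    mate2 = (f"{read_id}/2", seq[mate_len:], qual[mate_len:])
    return mate1, mate2


def format_record(record: Record) -> str:
    """Serialize a record back to 4-line FASTQ text."""
    name, seq, qual = record
    return f"@{name}\n{seq}\n+\n{qual}\n"


def split_fastq(in_path: str, out1_path: str, out2_path: str,
                mate_len: int = 76) -> None:
    """Write mate 1 and mate 2 of every read to two separate FASTQ files."""
    with open(in_path) as fin, open(out1_path, "w") as f1, open(out2_path, "w") as f2:
        for header, seq, qual in parse_fastq(fin):
            m1, m2 = split_record(header, seq, qual, mate_len)
            f1.write(format_record(m1))
            f2.write(format_record(m2))
```

Something like `split_fastq("SRR400301.fastq", "SRR400301_1.fastq", "SRR400301_2.fastq")` would then produce the two files (the file names here are placeholders). It would be worth checking the result by aligning a subset as pairs and looking at the insert size distribution before trusting it.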

I think the take-home message here is that .sra is a terrible way to store data if you want people to be able to use it in the future.

ADD REPLY

Yeah, SRA is a real pain to deal with. That said, I was hoping there would be a way to work around this problem because I am really interested in this dataset. I contacted the author, who said the person in charge of the files and their formatting no longer works there... Data published in these huge databases should be checked more thoroughly!

ADD REPLY
