SRA to fastq
8.1 years ago
vimlakany • 0

The command I used to convert the SRA file to FASTQ is:

    fastq-dump --split-3 ERR738423.sra

The SRA file above is single-end data. The SRA file is 2.2 GB, the FASTQ file produced by fastq-dump is 10.2 GB, and the FASTQ file in ENA is 7 GB. Why is there such a large difference in size?

RNA-Seq • 2.8k views
8.1 years ago
Satyajeet Khare ★ 1.6k

Hi,

ENA FASTQ files are smaller than those from GEO because on line 3 of each record the '+' character is not followed by the sequence identifier. For example, in GEO it is

@ENTIRE_SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+ENTIRE_SEQ_ID
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

But in ENA it is

@ENTIRE_SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

At least, that was the case with my datasets.
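To see how much of a file's size comes from the identifier repeated after '+', the bytes on every third line of each record can be summed. This is a minimal sketch on a made-up two-record file; the file name and read names are hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```shell
# Build a tiny two-record FASTQ with the identifier repeated on the '+' line
# (hypothetical sample data, not taken from ERR738423).
tmp=$(mktemp -d)
printf '%s\n' \
  '@read1 length=8' 'ACGTACGT' '+read1 length=8' 'IIIIIIII' \
  '@read2 length=8' 'TTGGCCAA' '+read2 length=8' 'IIIIIIII' \
  > "$tmp/repeated.fastq"

# Line 3 of every 4-line record is the '+' line. Everything after the '+'
# is the repeated identifier, so count its length minus one character.
extra=$(awk 'NR % 4 == 3 { n += length($0) - 1 } END { print n + 0 }' "$tmp/repeated.fastq")

# Total file size in bytes (tr strips the padding some wc versions add).
total=$(wc -c < "$tmp/repeated.fastq" | tr -d ' ')

echo "bytes from repeated identifiers: $extra of $total"
```

Running the same awk command on the real fastq-dump output would give the number of bytes that a bare '+' line would save.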


But this problem exists in the SRA-to-FASTQ conversion as well.


Yes. I mean it depends on the source of the SRA file. Can you 'head' both FASTQ files and check whether there is a difference in line 3?
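One way to do that check is to print only line 3 of each file. The sketch below uses two small stand-in files with hypothetical names (from_sra.fastq and from_ena.fastq); substitute the real paths.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for the fastq-dump output and the ENA download.
printf '%s\n' '@read1' 'ACGT' '+read1' 'IIII' > "$tmp/from_sra.fastq"
printf '%s\n' '@read1' 'ACGT' '+'      'IIII' > "$tmp/from_ena.fastq"

for f in "$tmp/from_sra.fastq" "$tmp/from_ena.fastq"; do
  # sed -n '3p' prints only the third line, i.e. the '+' line of the first record.
  printf '%s\t%s\n' "$(basename "$f")" "$(sed -n '3p' "$f")"
done
```

If one of the files is gzip-compressed (the thread does not say), it can be streamed with `gzip -dc file.fastq.gz | sed -n '3p'` instead of being unpacked first.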

8.1 years ago

Check the content:

  1. Number of lines (divide by 4 to get the number of sequences):

    grep -c '^' reads.fastq
    
  2. Sequence header line format:

    head -n 10 reads.fastq
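Note that `grep -c '^'` counts lines, not reads, so the result has to be divided by 4. Counting lines that start with '@' is not a safe shortcut, because a quality string can also begin with '@'. A minimal sketch on a hypothetical two-read file, assuming four lines per record:

```shell
tmp=$(mktemp -d)
# Hypothetical sample: two reads, the second with a quality string starting with '@'.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTGG' '+' '@@II' > "$tmp/reads.fastq"

lines=$(grep -c '^' "$tmp/reads.fastq")      # total number of lines
reads=$((lines / 4))                         # four lines per record
at_lines=$(grep -c '^@' "$tmp/reads.fastq")  # overcounts: includes the '@@II' quality line

echo "lines=$lines reads=$reads lines_starting_with_at=$at_lines"
```

Here the '@' count is 3 even though the file holds only 2 reads, which is why the line count divided by 4 is the more reliable figure.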
    

Sequence number: 168124864

Sequence header line:

@ERR738423.1 HWI-ST365_0182:2:1101:1134:2086#CGATGT length=50
AGTGTCTAAGGGCGCATGGTGGATGCCTTGGCATCGAGAGCCGATGAAGG
+ERR738423.1 HWI-ST365_0182:2:1101:1134:2086#CGATGT length=50
@@=D?DDD?DDF1C1FHIGE@GGHGEHHGEIC>B>FHH?AGC>AFHCHGG
@ERR738423.2 HWI-ST365_0182:2:1101:1152:2089#CGATGT length=50
CCGAACCCGGAAGCTAAGCCTGCCAGCGCCGATGATACTGCCCCTCCGGG
+ERR738423.2 HWI-ST365_0182:2:1101:1152:2089#CGATGT length=50
CCCFFFFFHHGHHJJIIIIJJJJJIJIJIJJGHIJJJJIIIJJFHHFFDD
@ERR738423.3 HWI-ST365_0182:2:1101:1095:2121#CGATGT length=50
TCAAGCACACCGCCGAAGCCGCGGCACATCCACCTTGTGGTGGGAGTGGG

Why should we check sequence number and sequence header line format?
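These two numbers already allow a rough estimate of what the repeated identifier costs. The sketch below assumes four lines per read and uses the 60 characters that follow '+' in the first record posted above; later reads have longer read numbers, so the true figure would be somewhat higher.

```shell
lines=168124864          # line count reported above
reads=$((lines / 4))     # assuming four lines per read
id_len=60                # characters after '+' in the first record shown above
extra=$((reads * id_len))

echo "reads=$reads approx_extra_bytes=$extra"
```

That comes to about 2.5 GB, the same order of magnitude as the gap between the 10.2 GB fastq-dump file and the 7 GB ENA file. Whether it accounts for the whole difference can only be confirmed by comparing the two files directly.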


I mean compare this information between the two files.
