Question

NCBI SRA Toolkit: how to change the FASTQ header format? I need the filter information

0

8.4 years ago

claude.loverdo • 0

I want to use data from NCBI. The classic FASTQ format is:

@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
GTTTGNACCATCTTGACAGACTTCAAAAATTGGCTGGGGTCTAAATTGTTCCCCAAGCTGCCCGGCCTCCCCTTCATCTCTTGTCAAGATCGGAAGAGCA
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CBCFF#2ACFHFHJJIJIJIJJJJJJIGIGHHIJJI??D?@FGIJJJJIIGFHGHIIIEIIIFHHFEEE2?>ABCACDDACCCAAC@>@AA8<22922?C
@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
TGACAAGAGATGAAGGGGAGGCCGGGCAGCTTGGGGAACAATTTAGACCCCAGCCAATTTTTGAAGTCTGTCAAGATGGTGCAAACAGATCGGAAGAGCG
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CCCFFFFFHGHHHGJIJJIJJIJEHHIIJJJGGHIJGIJJIJIJJJHHHGEDEFDEEEDCCDDDDCDDECDDDDDDDDDCCDACDDDDDDDDDBDBDDDB

which I can get either by downloading directly in FASTQ format from NCBI, or by using the SRA Toolkit with the command

fastq-dump --split-spot -X 5 -Z SRR2192406 > test.fastq

(the -X 5 is there only to avoid downloading everything while I figure out how to get the format I want).

The issue is that I need one more piece of information. In the Illumina FASTQ format, the first line of each record looks like this (the Wikipedia example):

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

The Y means the read is filtered (it did not pass the filter); N means it was not filtered. As far as I understand, this has to do with the proximity of the spots: spots that are too close together are flagged as filtered. I need to remove these reads, because for my application I want to reduce every source of noise.
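For reference, here is a minimal Python sketch of how the second word of such a header could be parsed. It assumes the `read:filtered:control:index` layout of the Wikipedia example above; the names `ReadFlags` and `parse_illumina_flags` are made up for illustration and are not part of any toolkit.

```python
from typing import NamedTuple, Optional


class ReadFlags(NamedTuple):
    """Fields of the second word of an Illumina-style header (assumed layout)."""
    read_number: int
    is_filtered: bool   # True when the flag is 'Y' (read did not pass the filter)
    control_number: int
    index: str


def parse_illumina_flags(header: str) -> Optional[ReadFlags]:
    """Return the flags from a header line, or None if they are absent."""
    parts = header.strip().split(" ", 1)
    if len(parts) != 2:
        return None  # no second word at all
    fields = parts[1].split(":")
    if len(fields) != 4 or fields[1] not in ("Y", "N"):
        return None  # second word does not follow the assumed layout
    try:
        return ReadFlags(int(fields[0]), fields[1] == "Y", int(fields[2]), fields[3])
    except ValueError:
        return None  # non-numeric read or control number
```

A header like the SRA one shown earlier (`@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100`) carries no such flags, so this function returns `None` for it.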

I am still quite new to the NCBI archive, but I hope the original format used for storage in this archive retains all the information.

My question is: how do I get this information? Is there any way to ask for a different FASTQ format? Or should I use a different route to access the NCBI data? Thanks!

sequencing next-gen • 3.6k views

2

You are looking for the second word of the description line as found in FASTQ files from Illumina instruments. I have never seen this in data downloaded from the Sequence Read Archive, neither in SRA files from NCBI nor in FASTQ files from EBI. I guess that it is removed during submission. See also the thread "Should I use reads with good quality but failed-vendor flag?"

8.4 years ago by piet ★ 1.9k

0

Thanks for your replies! The submitters didn't use the FASTQ format. I found that

illumina-dump -X 5 -x -Z SRR2192406

gives me the data I want. The format is a bit different and I still have to figure out how to process it, but at least I have the information I need.

8.4 years ago by claude.loverdo • 0

0

Actually, this gets closer to what I want:

fastq-dump -R pass --split-spot -X 5 -Z SRR2192406

It writes only the spots that pass the filter.

8.4 years ago by claude.loverdo • 0

score 1 · Answer 1 · 2016-06-24

Use the -F (or --origfmt) option with fastq-dump to get the original Illumina FASTQ headers.

Generally, reads that failed the Illumina software's filter (flagged Y in the header) should not be in the data people use or submit to SRA, and you can verify that once you dump them in the original format.

If the original submitters changed the header (from the standard Illumina one), then that is what you will get back with the -F option. SRA can only give you what it received from the submitters.
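If the dumped headers do turn out to carry the Illumina flags, a small script can drop the filtered reads. The following is only a sketch under stated assumptions: records are plain 4-line FASTQ with no blank lines, the flag is the second colon-separated field of the header's second word, and reads without a recognizable flag are kept. The function names `read_fastq`, `passes_filter`, and `keep_passing` are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records until the input is exhausted."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (assumes no blank lines between records)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def passes_filter(header: str) -> bool:
    """Return False only when the header carries an explicit 'Y' (filtered) flag."""
    parts = header.split(" ", 1)
    if len(parts) < 2:
        return True  # no second word: nothing to judge, keep the read
    fields = parts[1].split(":")
    return not (len(fields) >= 2 and fields[1] == "Y")


def keep_passing(handle: TextIO) -> Iterator[Record]:
    """Yield only the records whose header is not flagged as filtered."""
    return (rec for rec in read_fastq(handle) if passes_filter(rec[0]))


if __name__ == "__main__":
    # Tiny in-memory example with made-up reads.
    demo = "@r1 1:N:0:ATCACG\nACGT\n+\nIIII\n@r2 1:Y:0:ATCACG\nTTTT\n+\n####\n"
    for record in keep_passing(io.StringIO(demo)):
        print("\n".join(record))
```

Keeping reads that lack a flag is a deliberate choice here: if the submitters rewrote the headers, the script then passes everything through rather than silently discarding data.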