I want to use data from NCBI. The classic FASTQ format is:
@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
GTTTGNACCATCTTGACAGACTTCAAAAATTGGCTGGGGTCTAAATTGTTCCCCAAGCTGCCCGGCCTCCCCTTCATCTCTTGTCAAGATCGGAAGAGCA
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CBCFF#2ACFHFHJJIJIJIJJJJJJIGIGHHIJJI??D?@FGIJJJJIIGFHGHIIIEIIIFHHFEEE2?>ABCACDDACCCAAC@>@AA8<22922?C
@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
TGACAAGAGATGAAGGGGAGGCCGGGCAGCTTGGGGAACAATTTAGACCCCAGCCAATTTTTGAAGTCTGTCAAGATGGTGCAAACAGATCGGAAGAGCG
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CCCFFFFFHGHHHGJIJJIJJIJEHHIIJJJGGHIJGIJJIJIJJJHHHGEDEFDEEEDCCDDDDCDDECDDDDDDDDDCCDACDDDDDDDDDBDBDDDB
which I can get either by downloading it directly in FASTQ format from NCBI or by using the SRA Toolkit, with the command:
fastq-dump --split-spot -X 5 -Z SRR2192406 > test.fastq
(the -X 5 is there only to avoid downloading everything while I am trying to figure out how to get the format I want).
The issue is that I need one more piece of information. In the Illumina FASTQ format, the first line of each record looks like this (the Wikipedia example):
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
The flag is Y if the read is filtered (it did not pass the filter) and N otherwise. From what I understand, it has to do with the proximity of the spots (if some spots are too close together, they are flagged Y). I need to filter these reads out, because for my application I really want to reduce every source of noise.
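To make the layout of that header concrete, here is a minimal Python sketch that pulls the filter flag out of the second word of the line. It assumes the four colon-separated fields described in the Wikipedia article (read number, filter flag, control bits, index); the names `IlluminaComment` and `parse_illumina_comment` are made up for this illustration and do not come from any library.

```python
from typing import NamedTuple, Optional


class IlluminaComment(NamedTuple):
    """Second word of an Illumina header (hypothetical container)."""
    read_number: int    # 1 or 2 for paired-end data
    is_filtered: bool   # True when the flag is "Y", i.e. the read was filtered
    control_bits: int   # 0 when no control bits are set
    index: str          # index (barcode) sequence or sample number


def parse_illumina_comment(header: str) -> Optional[IlluminaComment]:
    """Parse 'read:flag:control:index'; return None if the word is absent."""
    words = header.rstrip("\n").split()
    if len(words) < 2:
        return None
    fields = words[1].split(":")
    # Assumption: exactly four fields, with Y or N in second position.
    if len(fields) != 4 or fields[1] not in ("Y", "N"):
        return None
    try:
        return IlluminaComment(
            int(fields[0]), fields[1] == "Y", int(fields[2]), fields[3]
        )
    except ValueError:
        # Non-numeric read number or control bits: not the expected layout.
        return None
```

On the Wikipedia example this reports a filtered read, while on the SRA-style header shown earlier it returns None, because the second word there is the original instrument read name and not the flag field.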
I am still quite new to the NCBI archive, but I hope the original format used in this archive retains all the information.
My question is: how can I get this information? Is there a way to request a different FASTQ format? Or should I use a different route to access the NCBI data? Thanks!
You are looking for the second word of the description line, as found in FASTQ files from Illumina instruments. I have never seen it in data downloaded from the Short Read Archive, neither in SRA files from NCBI nor in FASTQ files from EBI. I guess that it is removed during submission. See also "Should I use reads with good quality but failed-vendor flag?"
Thanks for your replies! The submitters didn't use the FASTQ format. I found that
illumina-dump -X 5 -x -Z SRR2192406
gives me the data I want. The format is a bit different and I still have to figure out how to handle it, but at least I have the information I need.
Actually, this gets closer to what I want:
fastq-dump -R pass --split-spot -X 5 -Z SRR2192406
It writes only the spots that pass the filter.
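One way to check what the filter removed is to count the records in the output with and without -R pass. The sketch below is only an illustration in Python: it assumes plain four-line FASTQ records (no wrapped sequence lines), and the helper names `read_fastq` and `count_records` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        # Read by position, since a quality line may itself start with "@".
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header.strip())
        yield header[1:].rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def count_records(lines: Iterable[str]) -> int:
    """Count the FASTQ records in an iterable of lines."""
    return sum(1 for _ in read_fastq(lines))
```

For example, with two hypothetical files written by the two commands, `count_records(open("all.fastq"))` and `count_records(open("pass.fastq"))` can be compared to see how many reads were dropped.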