ENA: Same ERR for multiple samples
3.5 years ago
atorreso ▴ 130

I was wondering why some projects stored at the EBI's ENA have the same run associated with multiple samples that have different metadata.

For example, in project PRJEB2057 (https://www.ebi.ac.uk/ena/browser/view/PRJEB2057), the run ERR023728 is associated with many samples, including SAMEA968215 (https://www.ebi.ac.uk/ena/browser/view/SAMEA968215), which was collected in 2006, and SAMEA968084 (https://www.ebi.ac.uk/ena/browser/view/SAMEA968084), which was collected in 1999.

Is it because they were multiplexed? That's the only thing I can think of!

ENA • 1.9k views
3.5 years ago
GenoMax 147k

This data is from the days of the Illumina GAII (~2011), so the yield per lane was probably much lower than we are used to nowadays. You can look at the 5 entries for ERR023728 (see if you can dump the reads out with the original Illumina headers), but my guess would be that this sample was run on 5 lanes (of either the same or different flow cells).


To expand on the comment above, it appears that multiple samples collected at different times were sequenced at one time. Thus multiple samples are associated with the same sequencing run ID.


That means that without a table mapping lanes to specific samples, these fastq files are unusable, as there is no way to tell which reads belong to which sample, right?


It is also possible that the samples were all pooled and sequenced together in that particular run. In that case, one would not be able to separate them at all.

A run in SRA is not really a lane, a barcoded section, etc. A run corresponds to data that was obtained from the sample(s). There is no requirement that all the data from the instrument be included.


There is only one record for this accession at NCBI. The reads are listed as being in the format 116x8x110. Dumping the first 10000 spots gives:

$ fastq-dump -X 10000 -F --split-spot ERR023728
Rejected 9905 READS because READLEN < 1
Read 10000 spots for ERR023728
Written 10000 spots for ERR023728

Some spots contain what appear to be index sequences, but most have none.

@IL11_4948:5:1:7940:1602
CGGCTACAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAGACAAATGAGAGATCAAAAAAAAATCAAAACGACAAATACCATAGCTGGTCGGTAA
+IL11_4948:5:1:7940:1602
BBBBBBBBBBB.BBB@BBBBB@@@BB@B@<BB@<@@@@@>66661&%%#%##&&&((%.8&+&-&5)0&(*$#$$#,3&&-1)).(&'%%*,(,&+,.))40-/1)&
@IL11_4948:5:1:7940:1602
CGCCTTTGG
+IL11_4948:5:1:7940:1602
&########
@IL11_4948:5:1:7940:1602
AGATGTTGTCAGTCTCACCGAGGAGGCTGCGAGACATAAACGAGATGTTGCGTCCTGCTGTCGCGGCAGCATCACAGAGGCTGGACACCGTACGGCTTGTCATGCGCT
+IL11_4948:5:1:7940:1602
(&%'&'()'''&*&%&%%%)(*%''+''%%(((%)'%&()'')'&'&'%'('''%+**''%)'%%%%%%'+&'+%%-%%%'&&'&%'''%'%'%%--,&-%&&&&%%%

You can get them as separate files by doing

$ fastq-dump -X 10000 -F --split-files ERR023728
Rejected 9905 READS because READLEN < 1
Read 10000 spots for ERR023728
Written 10000 spots for ERR023728

How to interpret the data is an interesting question. You can dump the entire dataset and see whether the lane number or instrument serial changes across it.
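A minimal Python sketch of that check, for anyone who wants to try it. It assumes the dump was made with `-F` so the deflines keep the original Illumina layout seen above (`instrument_run:lane:tile:x:y`), and that the FASTQ has exactly 4 lines per record. The function name `lane_counts` is made up for illustration.

```python
from collections import Counter


def lane_counts(fastq_lines):
    """Count records per (instrument_run, lane) pair in a FASTQ stream.

    Assumes 4-line FASTQ records and deflines shaped like
    '@IL11_4948:5:1:7940:1602' (instrument_run:lane:tile:x:y).
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # only the first line of each record is a defline
            continue
        fields = line.strip()[1:].split(":")  # drop the leading '@'
        if len(fields) < 2:
            continue  # defline does not follow the assumed layout
        counts[(fields[0], fields[1])] += 1
    return counts


# One record copied from the dump shown above.
example = [
    "@IL11_4948:5:1:7940:1602",
    "CGCCTTTGG",
    "+IL11_4948:5:1:7940:1602",
    "&########",
]
for (instrument_run, lane), n in sorted(lane_counts(example).items()):
    print(instrument_run, lane, n)
```

On a full dump you would pass an open file handle instead of the list (for example the `ERR023728.fastq` that fastq-dump should write by default). More than one key in the result would suggest that several lanes or runs are mixed in the accession.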


Thanks for your answers! I thought I would share the answer I also got from the folks at ENA:

You are correct, this represents multiplexed data. While we no longer accept submissions like this and require that read data be demultiplexed before submission, there are still some such projects in the archive.

I'm not aware that this data was ever published, and without any information regarding sample -> lane, it will be impossible to demultiplex, right?


without any information regarding sample -> lane, it will be impossible to demultiplex, right?

That is assuming samples were in individual lanes. You can test this by dumping the entire data out and checking to see if there are multiple lanes represented.

I am more concerned about this bit from EBI's response.

this represents multiplexed data

If these files contain data multiplexed using indexes, then the only way to demux would be to use the index sequence (see the example I posted above). The problem is that not all reads have an index sequence (not sure why that is the case), so you may only be able to demux the ones that do. Even then, you don't know which sample belongs to which index.
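As a rough way to see how many distinct index sequences are present at all, here is a small Python sketch that tallies the short reads in a `--split-spot` dump. It assumes 4-line FASTQ records and that any non-empty read of at most `max_len` bases is an index read. Both the `index_counts` name and the threshold of 10 are made up for illustration, not taken from the submission.

```python
from collections import Counter


def index_counts(fastq_lines, max_len=10):
    """Tally short reads (candidate index sequences) in a FASTQ stream.

    Assumes 4-line records; a read counts as a candidate index when it
    is non-empty and no longer than max_len bases (a guessed cutoff).
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the second line of each record
            continue
        seq = line.strip()
        if 0 < len(seq) <= max_len:
            counts[seq] += 1
    return counts


records = [
    # Index-like read copied from the dump shown earlier in the thread.
    "@IL11_4948:5:1:7940:1602", "CGCCTTTGG",
    "+IL11_4948:5:1:7940:1602", "&########",
    # Made-up long read, only here to show that it gets skipped.
    "@IL11_4948:5:1:7940:1602", "ACGT" * 25,
    "+IL11_4948:5:1:7940:1602", "#" * 100,
]
for seq, n in index_counts(records).most_common():
    print(seq, n)
```

If the tally shows a handful of sequences that dominate, those are the likely barcodes. Mapping each one to a sample would still need information from the submitters.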

It is a long shot, but you could write to the original data submitters and see if they can help solve the puzzle or, better yet, provide demuxed data.
