I was wondering why some projects stored at the EBI's ENA have the same run associated with multiple samples that have different metadata.
For example, in project PRJEB2057 (https://www.ebi.ac.uk/ena/browser/view/PRJEB2057), the run ERR023728 is associated with many samples, including SAMEA968215 (https://www.ebi.ac.uk/ena/browser/view/SAMEA968215), which was collected in 2006, and SAMEA968084 (https://www.ebi.ac.uk/ena/browser/view/SAMEA968084), which was collected in 1999.
Is it because they were multiplexed? That's the only thing I can think of!
To expand on the comment above, it appears that multiple samples collected at different times were sequenced together. Thus multiple samples are associated with the same sequencing run ID.
That means that without a table mapping each lane to specific samples, these fastq files are unusable, as there is no way to tell which reads belong to which sample, right?
It is also possible that the samples were all pooled and sequenced together in that particular run. In that case, one would not be able to separate them at all.
A run in SRA is not really a lane, a barcoded section, or anything similar. A run corresponds to data that was obtained from the sample(s). There is no requirement that all the data from the instrument be included.
There is only one record for this accession number at NCBI. We can see that the reads are listed as being in the format 116x8x110. Dumping 10000 reads out shows that some reads have what appear to be index sequences, while most have none.
You can get them as separate files by doing
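A minimal Python sketch of one way to do the split by hand, assuming each spot was dumped as a single 234-base record laid out as 116 + 8 + 110 (read 1, index, read 2), as the 116x8x110 listing suggests. The function names, the output file naming, and the assumption that every spot has the full length are mine, not something confirmed for this run.

```python
# Assumed spot layout taken from the "116x8x110" listing: read 1, index, read 2.
SEGMENTS = (116, 8, 110)


def read_fastq(handle):
    """Yield (header, sequence, quality) from a plain 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def split_spot(seq, qual, segments=SEGMENTS):
    """Cut one concatenated spot into (sequence, quality) pairs, one per segment."""
    if len(seq) != sum(segments) or len(qual) != len(seq):
        raise ValueError("spot length does not match the assumed layout")
    parts = []
    start = 0
    for length in segments:
        parts.append((seq[start:start + length], qual[start:start + length]))
        start += length
    return parts


def split_file(in_path, out_prefix, segments=SEGMENTS):
    """Write one FASTQ per segment: <out_prefix>_1.fastq, <out_prefix>_2.fastq, ..."""
    outs = [open(f"{out_prefix}_{i + 1}.fastq", "w") for i in range(len(segments))]
    try:
        with open(in_path) as fin:
            for header, seq, qual in read_fastq(fin):
                for out, (s, q) in zip(outs, split_spot(seq, qual, segments)):
                    out.write(f"{header}\n{s}\n+\n{q}\n")
    finally:
        for out in outs:
            out.close()
```

With this layout, file 2 would hold the 8-base segment that looks like an index for some reads. If spots in the real dump vary in length, `split_spot` raises an error rather than guessing where the boundaries are.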
How to interpret the data is an interesting question. You can dump the entire dataset and see whether the lane number or instrument serial changes across it.
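As a rough sketch of that check in Python: tally the instrument and lane fields from the read names. This assumes the original Illumina read names survived in the FASTQ headers, in either the older `instrument:lane:tile:x:y` form or the newer `instrument:run:flowcell:lane:tile:x:y` form. Whether this run kept them is something I have not verified, and the function names are only illustrative.

```python
from collections import Counter


def instrument_and_lane(header):
    """Return (instrument, lane) parsed from a FASTQ header, or None if not found."""
    for token in header.lstrip("@").split():
        # Drop a trailing "#index/pair" suffix before splitting on colons.
        fields = token.split("#")[0].split(":")
        if len(fields) == 5:   # older style: instrument:lane:tile:x:y
            return fields[0], fields[1]
        if len(fields) >= 7:   # newer style: instrument:run:flowcell:lane:tile:x:y
            return fields[0], fields[3]
    return None


def tally_lanes(headers):
    """Count reads per (instrument, lane); unparseable headers are counted under None."""
    counts = Counter()
    for header in headers:
        counts[instrument_and_lane(header)] += 1
    return counts
```

If the tally comes back with a single (instrument, lane) pair, the samples were not simply spread over separate lanes. Several pairs would at least be consistent with one sample per lane, though it would not prove it.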
Thanks for your answers! I thought I would share the answer I got from the folks at ENA:
I'm not aware that this data was ever published, and without any information mapping samples to lanes, it will be impossible to demultiplex, right?
That is assuming samples were in individual lanes. You can test this by dumping the entire data out and checking to see if there are multiple lanes represented.
I am more concerned about this bit from EBI's response.
If these files contain data multiplexed using indexes, then the only way to demux would be to use the index sequence (example I posted above). The problem is that not all reads have an index sequence (not sure why that is the case), so you may only be able to demux the ones that do. Even then, you don't know which sample belongs to which index. It is a long shot, but you could write to the original data submitters and see if they can help solve the puzzle or, better yet, provide demuxed data.
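To get a feel for how many reads could be binned at all, here is a small Python sketch that tallies the index segments once they have been extracted. The rule for what counts as a usable index (exactly 8 bases, only A/C/G/T) is my own hypothetical heuristic, since it is not clear what the reads without an index actually look like in this run.

```python
from collections import Counter

INDEX_LEN = 8                  # assumed from the 116x8x110 layout
UNDETERMINED = "undetermined"  # label for reads without a usable index


def is_called_index(index, length=INDEX_LEN):
    """Hypothetical rule: a usable index has the full length and no ambiguous bases."""
    return len(index) == length and set(index.upper()) <= set("ACGT")


def tally_indexes(index_seqs):
    """Count reads per index sequence, pooling unusable ones under 'undetermined'."""
    counts = Counter()
    for index in index_seqs:
        key = index.upper() if is_called_index(index) else UNDETERMINED
        counts[key] += 1
    return counts
```

A handful of dominant index sequences would suggest real barcodes, and `counts.most_common()` lists them by frequency. Even so, this only groups reads by index; it says nothing about which sample each index belongs to, so the submitters' mapping would still be needed.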