Some of this information can be obtained from SRA using EntrezDirect:
$ esearch -db bioproject -query PRJEB3246 | elink -target biosample | esummary | xtract -pattern DocumentSummary -element Identifiers,Paragraph
BioSample: SAMEA1531955; SRA: ERS179577 We have generated paired-end sequence data covering the genome of an CEPH female individual to a sequence depth of more than 200-fold using the Illumina HiSeq 2000. This individual is a member of the population samples described in the PhaseI and PhaseII HapMap Projects and is from the CEPH/UTAH pedigree 1463 (abbreviation: CEPH). The DNA identifier for this individual is NA12878. We obtained the DNA sample NA12878 from The Coriell Institute for Medical Research. Starting with 1ug of DNA, and following random fragmentation, we generated a PCR-Free sequencing library with a median insert size of ~300 bp. 100 base sequence reads were generated from both ends of these templates using the Illumina HiSeq 2000. We carried out purity-filtering (PF) to remove mixed reads, where two or more different template molecules are close enough on the surface of the flow-cell to form a mixed or overlapping cluster. No other filtering of the data has been carried out prior to submission. We have also submitted equivalent sequence data for the father and one of the son's of family 1463 (NA12877 and NA12882).
BioSample: SAMEA1531956; SRA: ERS179576 We have generated paired-end sequence data covering the genome of an CEPH male individual to a sequence depth of more than 200-fold using the Illumina HiSeq 2000. This individual is a member of the population samples described in the PhaseI and PhaseII HapMap Projects and is from the CEPH/UTAH pedigree 1463 (abbreviation: CEPH). The DNA identifier for this individual is NA12877. We obtained the DNA sample NA12877 from The Coriell Institute for Medical Research. Starting with 1ug of DNA, and following random fragmentation, we generated a PCR-Free sequencing library with a median insert size of ~300 bp. 100 base sequence reads were generated from both ends of these templates using the Illumina HiSeq 2000. We carried out purity-filtering (PF) to remove mixed reads, where two or more different template molecules are close enough on the surface of the flow-cell to form a mixed or overlapping cluster. No other filtering of the data has been carried out prior to submission. We have also submitted equivalent sequence data for the mother and one of the son's of family 1463 (NA12878 and NA12882).
You can get additional information about the samples from the SRA run info. I am showing only part of the output below to save space.
$ esearch -db sra -query ERS179576 | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR174310,2013-01-03 11:29:28,2013-01-03 11:26:41,207579467,41931052334,207579467,202,27214,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/ERR000/174/ERR174310/ERR174310.sralite.1,ERX150456,CT8624,WGS,RANDOM,GENOMIC,PAIRED,300,0,ILLUMINA,Illumina HiSeq 2000,ERP001775,PRJEB3246,,204921,ERS179576,SAMEA1531956,simple,9606,Homo sapiens,SAMEA1531956,,,,,,,no,,,,,ILLUMINA,ERA166477,,public,EF49CD9FE603F1797081A925E13116D7,1A46A45B2B17FCEC3F94B9222BF1E754
For the second sample:
$ esearch -db sra -query ERS179577 | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR174324,2013-01-03 13:14:21,2013-01-03 13:13:12,223571196,45161381592,223571196,202,28183,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/ERR000/174/ERR174324/ERR174324.sralite.1,ERX150470,CT8595,WGS,RANDOM,GENOMIC,PAIRED,300,0,ILLUMINA,Illumina HiSeq 2000,ERP001775,PRJEB3246,,204921,ERS179577,SAMEA1531955,simple,9606,Homo sapiens,SAMEA1531955,,,,,,,no,,,,,ILLUMINA,ERA166477,,public,9F957F28050A45993205BA2069486360,007F48E27E5E8611582D3B9F888EC498
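The runinfo rows above are wide, so it can help to pull out just a few columns by header name. The following is a minimal sketch, not part of EntrezDirect: `runinfo_cols` is a hypothetical helper name, and it assumes no field contains an embedded comma (which holds for the rows shown above but may not hold for every record).

```shell
# Print selected runinfo columns, looked up by header name rather than position.
# Hypothetical helper; assumes fields never contain embedded commas.
# Column names that are not in the header are printed as NA.
runinfo_cols() {
    awk -F',' -v want="$1" '
        NR == 1 {
            # Split the requested names and index the header columns.
            n = split(want, names, ",")
            for (i = 1; i <= NF; i++) col[$i] = i
        }
        NF > 1 {
            line = ""
            for (j = 1; j <= n; j++) {
                value = (names[j] in col) ? $(col[names[j]]) : "NA"
                line = line (j > 1 ? "\t" : "") value
            }
            print line
        }
    '
}

# Usage (needs EntrezDirect and network access):
#   esearch -db sra -query ERS179576 | efetch -format runinfo | runinfo_cols Run,BioSample,spots,bases
```

The header line is passed through as well, so the output stays self-describing.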
The data does not necessarily contain adapter sequence. These libraries were likely of very good quality (they were made by Illumina) and likely have little or no adapter. They were likely made using TruSeq. You can also use BBTools to try to identify adapter sequence; see the thread "How to figure out adapter sequence for Illumina reads?"
Thank you, I tried to find out the adapter information using Illumina documentation and was having a hard time. BBTools worked like a charm.
If you open https://www.ebi.ac.uk/ena/browser/view/PRJEB3246?show=reads and follow the "Sample Accession" links, you can see which individual each sample comes from. SAMEA1531956 is NA12877 and SAMEA1531955 is NA12878; those are the two individuals I see in PRJEB3246. You can probably parse that out of the XML file or query their API to automate such operations.
Thank you for the reply. From GenoMax's reply and your comments I understand that the files correspond to 2 individuals (NA12878 and NA12877). I am curious to know why the fastq files for a single individual have been split into 14 and 18 separate experiments. Is it because of the depth, and if we were to use them for analysis, how should they be used?
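One way to automate the lookup is the ENA Portal API "filereport" endpoint, which returns a tab-separated table of runs for an accession. This is a rough sketch under assumptions: the helper names `ena_filereport_url` and `runs_per_sample` are made up for illustration, and the field names requested in the URL (`run_accession`, `sample_accession`, `sample_alias`) should be checked against the current ENA documentation before relying on them.

```shell
# Build an ENA Portal API filereport URL for a project accession.
# The field list is an assumption; adjust it to the fields you need.
ena_filereport_url() {
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,sample_accession,sample_alias&format=tsv\n' "$1"
}

# Count runs per sample from a TSV report on stdin.
# The sample_accession column is located by header name, not by position.
runs_per_sample() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "sample_accession") c = i; next }
        c { n[$c]++ }
        END { for (s in n) print s "\t" n[s] }
    ' | sort
}

# Usage (needs curl and network access):
#   curl -s "$(ena_filereport_url PRJEB3246)" | runs_per_sample
```

For this project, the report should list runs for the two BioSamples mentioned above, SAMEA1531955 and SAMEA1531956.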
Yes, that is common for whole-genome experiments, or larger experiments in general, that require more depth than a single lane can provide. Nothing to worry about; you can just cat the individual files per R1 and R2 together.
Thank you for the clarification.
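For reference, the concatenation suggested above can be sketched as follows. The function name `merge_runs` and the output prefix are hypothetical, and the sketch assumes ENA-style file names (`<run>_1.fastq.gz` and `<run>_2.fastq.gz`) in a directory that holds the runs of one sample only. Gzip files can be concatenated directly with cat, since a series of gzip members is itself a valid gzip file.

```shell
# Concatenate per-run FASTQ files into one R1 and one R2 file for a sample.
# $1: directory with the runs of ONE sample (assumed ENA-style names)
# $2: output prefix (hypothetical, chosen for illustration)
merge_runs() {
    dir=$1
    prefix=$2
    # Globs expand in sorted order, so R1 and R2 are joined in the same
    # run order and read pairs stay in sync.
    cat "$dir"/*_1.fastq.gz > "${prefix}_R1.fastq.gz"
    cat "$dir"/*_2.fastq.gz > "${prefix}_R2.fastq.gz"
}

# Usage:
#   merge_runs NA12877_runs NA12877
```

Keep the runs of NA12877 and NA12878 in separate directories so that reads from the two individuals are not mixed.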