Hi everyone, I have what seems to be a conceptual (or just strange) problem with some FASTQ files I have just received.
I had assumed that when receiving FASTQ files from a paired-end NGS analysis of the small-RNA fraction, there would be some kind of identifier in both files of each sample's pair, indicating which one is the forward read and which one is the reverse.
I have these data:
/160712_700470R_0449_BHVHH7BCXX/ which has inside:
7005_S4_L001_R1_001.fastq.gz 7182_S26_L002_R1_001.fastq.gz 7006_S29_L002_R1_001.fastq.gz 7183_S27_L002_R1_001.fastq.gz 7007_S43_L002_R1_001.fastq.gz 7184_S17_L001_R1_001.fastq.gz 7008_S30_L002_R1_001.fastq.gz 7185_S25_L002_R1_001.fastq.gz 7087_S8_L001_R1_001.fastq.gz 7190_S12_L001_R1_001.fastq.gz 7088_S3_L001_R1_001.fastq.gz 7191_S16_L001_R1_001.fastq.gz 7089_S28_L002_R1_001.fastq.gz 7192_S14_L001_R1_001.fastq.gz 7090_S9_L001_R1_001.fastq.gz 7193_S19_L001_R1_001.fastq.gz 7139_S32_L002_R1_001.fastq.gz 7194_S22_L001_R1_001.fastq.gz 7140_S31_L002_R1_001.fastq.gz 7195_S5_L001_R1_001.fastq.gz 7141_S45_L002_R1_001.fastq.gz 7196_S1_L001_R1_001.fastq.gz 7144_S47_L002_R1_001.fastq.gz 7197_S23_L001_R1_001.fastq.gz 7145_S15_L001_R1_001.fastq.gz 7219_S39_L002_R1_001.fastq.gz 7146_S20_L001_R1_001.fastq.gz 7220_S34_L002_R1_001.fastq.gz 7147_S13_L001_R1_001.fastq.gz 7221_S41_L002_R1_001.fastq.gz 7151_S10_L001_R1_001.fastq.gz 7222_S40_L002_R1_001.fastq.gz 7152_S42_L002_R1_001.fastq.gz 7236_S21_L001_R1_001.fastq.gz 7153_S35_L002_R1_001.fastq.gz 7237_S6_L001_R1_001.fastq.gz 7154_S11_L001_R1_001.fastq.gz 7238_S24_L001_R1_001.fastq.gz 7160_S18_L001_R1_001.fastq.gz 7239_S2_L001_R1_001.fastq.gz 7178_S44_L002_R1_001.fastq.gz 7242_S7_L001_R1_001.fastq.gz 7179_S46_L002_R1_001.fastq.gz 7243_S33_L002_R1_001.fastq.gz 7180_S36_L002_R1_001.fastq.gz 7247_S37_L002_R1_001.fastq.gz 7181_S48_L002_R1_001.fastq.gz 7248_S38_L002_R1_001.fastq.gz
and
/160713_700470R_0450_BHVKV5BCXX/ which has inside:
7005_S4_L001_R1_001.fastq.gz 7182_S26_L002_R1_001.fastq.gz 7006_S29_L002_R1_001.fastq.gz 7183_S27_L002_R1_001.fastq.gz 7007_S43_L002_R1_001.fastq.gz 7184_S17_L001_R1_001.fastq.gz 7008_S30_L002_R1_001.fastq.gz 7185_S25_L002_R1_001.fastq.gz 7087_S8_L001_R1_001.fastq.gz 7190_S12_L001_R1_001.fastq.gz 7088_S3_L001_R1_001.fastq.gz 7191_S16_L001_R1_001.fastq.gz 7089_S28_L002_R1_001.fastq.gz 7192_S14_L001_R1_001.fastq.gz 7090_S9_L001_R1_001.fastq.gz 7193_S19_L001_R1_001.fastq.gz 7139_S32_L002_R1_001.fastq.gz 7194_S22_L001_R1_001.fastq.gz 7140_S31_L002_R1_001.fastq.gz 7195_S5_L001_R1_001.fastq.gz 7141_S45_L002_R1_001.fastq.gz 7196_S1_L001_R1_001.fastq.gz 7144_S47_L002_R1_001.fastq.gz 7197_S23_L001_R1_001.fastq.gz 7145_S15_L001_R1_001.fastq.gz 7219_S39_L002_R1_001.fastq.gz 7146_S20_L001_R1_001.fastq.gz 7220_S34_L002_R1_001.fastq.gz 7147_S13_L001_R1_001.fastq.gz 7221_S41_L002_R1_001.fastq.gz 7151_S10_L001_R1_001.fastq.gz 7222_S40_L002_R1_001.fastq.gz 7152_S42_L002_R1_001.fastq.gz 7236_S21_L001_R1_001.fastq.gz 7153_S35_L002_R1_001.fastq.gz 7237_S6_L001_R1_001.fastq.gz 7154_S11_L001_R1_001.fastq.gz 7238_S24_L001_R1_001.fastq.gz 7160_S18_L001_R1_001.fastq.gz 7239_S2_L001_R1_001.fastq.gz 7178_S44_L002_R1_001.fastq.gz 7242_S7_L001_R1_001.fastq.gz 7179_S46_L002_R1_001.fastq.gz 7243_S33_L002_R1_001.fastq.gz 7180_S36_L002_R1_001.fastq.gz 7247_S37_L002_R1_001.fastq.gz 7181_S48_L002_R1_001.fastq.gz 7248_S38_L002_R1_001.fastq.gz
As you can all see, both directories contain the same number of FASTQ files, one per sample, and the files have identical names. In theory, I should assume that for each sample one of the two files holds the forward reads and the other holds the reverse reads.
When I look inside the two files of a pair, their contents are different, as you can see:
/160712_700470R_0449_BHVHH7BCXX$ zcat 7005_S4_L001_R1_001.fastq.gz | head
@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA
NGCGCCGCGGCTGGACGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTC
+
#<<DDIIIIIIHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@700470R:449:HVHH7BCXX:1:1107:1465:1915 1:N:0:TGACCA
NGCGACCTCAGATCAGACGAAGATCGGAAGAGCACACGTCTGAACTCCAGT
+
#<<DDHIHHHIFEHIHHIIIIIIHIHIIGIIIIIGIIHIIIIHIIIGIIIG
@700470R:449:HVHH7BCXX:1:1107:1971:1937 1:N:0:TGACCA
/160713_700470R_0450_BHVKV5BCXX$ zcat 7005_S4_L001_R1_001.fastq.gz | head
@700470R:450:HVKV5BCXX:1:1101:1664:1955 1:N:0:TGACCA
NTTGGTCCCCTTCAACCAGCTGTAGATCGGAAGAGCACACGTCTGAACTCC
+
#<<DDHIHIHIIIIIIHIIIHIIIIIIIIIIIIIIIIIHIIIIIIHIIIII
@700470R:450:HVKV5BCXX:1:1101:1940:1935 1:N:0:TGACCA
NGGAATGTAAAGAAGTATGTACAGATCGGAAGAGCACACGTCTGAACTCCA
+
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@700470R:450:HVKV5BCXX:1:1101:2588:1943 1:N:0:TGACCA
NCGTACCGTGAGTAATAATGCGAGATCGGAAGAGCACACGTCTGAACTCCA
I was expecting to find some kind of guide in the header of each read: the final section of each header (1:N:0:TGACCA) should start with a number telling whether this is a forward read (1) or a reverse read (2). Surprisingly, there's a 1 in both of them. So I kinda freaked out...
Debating about this with my pillow, I reached two possible conclusions:
1) The lab that provided the data should tell me which file is the forward one and which is the reverse one (I sent them an email about this issue but I haven't received any answer yet).
2) It doesn't actually matter which one is labelled 1 (forward) and which one is labelled 2 (reverse), so presumably I could arbitrarily assign a 1 or a 2 to each file of a pair and proceed to further analysis. But this second theory seems like a very silly solution to me.
So, any help with this? I can't start analyzing my samples until I solve this problem...
Normally (at least in my case) it should be in the file name, for example machine_lan.1.fastq and machine_lan.2.fastq. Alternatively, both reads are in one file, but each read is distinguished by a 1 for forward and a 2 for reverse. (Is this from the Illumina platform?)
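For reference, here is a minimal Python sketch of how the read number could be pulled out of a header like the ones posted in the question. It assumes the Illumina Casava 1.8+ layout, where the part after the space is `<read>:<is filtered>:<control number>:<index>`. The function name `parse_illumina_header` is made up for illustration and is not part of any library.

```python
def parse_illumina_header(header):
    """Split a Casava 1.8+ style FASTQ header line into its parts.

    Assumed layout (hypothetical helper, not a library function):
    @<read id> <read>:<is filtered>:<control number>:<index>
    """
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % header)
    # The read ID and the comment are separated by the first space.
    read_id, _, comment = header[1:].strip().partition(" ")
    fields = comment.split(":")
    if len(fields) < 4:
        raise ValueError("unexpected header comment: %r" % comment)
    return {
        "read_id": read_id,
        "read_number": int(fields[0]),   # 1 = read 1, 2 = read 2
        "filtered": fields[1] == "Y",    # Y means the read failed the filter
        "control": fields[2],
        "index": ":".join(fields[3:]),
    }


# First header from the 160712 run shown in the question.
parsed = parse_illumina_header(
    "@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA"
)
print(parsed["read_number"], parsed["index"])  # prints: 1 TGACCA
```

If every header in both files reports read number 1, neither file holds read 2 of a pair.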
If it's paired-end, then each _R1_ file should have an associated _R2_ file.
If not, some files are missing.
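This check can be sketched in a few lines of Python. The helper name `find_missing_mates` is hypothetical, and it assumes the Illumina-style naming shown in the question, where the mate of a `_R1_` file has the same name with `_R2_` instead.

```python
def find_missing_mates(filenames):
    """Return the _R1_ files that have no matching _R2_ file in the list."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        if "_R1_" in name:
            # The expected mate differs only in the read tag.
            mate = name.replace("_R1_", "_R2_", 1)
            if mate not in names:
                missing.append(name)
    return missing


# Two file names taken from the listing in the question.
listing = [
    "7005_S4_L001_R1_001.fastq.gz",
    "7006_S29_L002_R1_001.fastq.gz",
]
print(find_missing_mates(listing))
```

For a directory on disk, the list could come from `os.listdir(path)`. If every `_R1_` file is reported as lacking a mate, as it would be for the listing above, the run was most likely single-end rather than missing half of its files.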
As Pierre and Medhat said, paired-end reads will be named R1 and R2. As you have mentioned, these are from small RNA, so paired-end data is not needed. It seems the same samples have been sequenced in multiple runs. Regarding the analysis, you can just concatenate the files of the same sample and proceed.
Oooook, thanks a lot to everyone. Yeah... I'm used to doing paired-end RNA-seq and I'm new to the microRNA world, so I assumed things wrong...
So yes, they are single-end reads, now I see it XD, but every sample was run twice. Now I'm not sure whether to concatenate both runs of each sample into a single FASTQ file, to analyse the two runs separately and compare them, or to do both :S
If it is the same sample run multiple times you can concatenate the files (unless one of the replicates was deemed not suitable and the pool was re-run for that reason).
If the rerun was done because the first run did not yield enough reads, then you can concatenate and do the analysis. If the two runs are replicates, then analyze them individually.
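If concatenation turns out to be the right choice, here is a minimal Python sketch of it. It relies on the fact that gzip files can be joined byte for byte, because a sequence of gzip members is itself a valid gzip stream. The function names `concatenate_runs` and `count_reads` are made up for illustration, and the sketch assumes every run directory contains identically named files, as in the listings in the question.

```python
import gzip
import shutil
from pathlib import Path


def concatenate_runs(run_dirs, out_dir):
    """Join same-named .fastq.gz files from several run directories.

    Raises FileNotFoundError if a file from the first run is absent
    from one of the other runs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = []
    for src in sorted(Path(run_dirs[0]).glob("*.fastq.gz")):
        target = out_dir / src.name
        with open(target, "wb") as out:
            for run in run_dirs:
                # Append the raw compressed bytes of each run in order.
                with open(Path(run) / src.name, "rb") as part:
                    shutil.copyfileobj(part, out)
        merged.append(target)
    return merged


def count_reads(path):
    """Count the reads in a gzipped FASTQ file (4 lines per read)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Example call with the two directories from the question
# ("merged" is a hypothetical output directory):
# concatenate_runs(
#     ["160712_700470R_0449_BHVHH7BCXX", "160713_700470R_0450_BHVKV5BCXX"],
#     "merged",
# )
```

A quick sanity check is that `count_reads` on a merged file equals the sum of the counts of its two inputs.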