Problem working out which FASTQ from paired-end RNA-seq is forward and which is reverse
8.3 years ago
Emilio Marmol ▴ 180

Hi everyone, I have what seems to be a conceptual, or at least strange, problem with some FASTQ files I have just received.

I assumed that when receiving FASTQ files from a paired-end NGS analysis of the small-RNA fraction, there would be some kind of identifier in both FASTQ files of each sample's pair, indicating which one is the forward and which one is the reverse.

I have these data:

/160712_700470R_0449_BHVHH7BCXX/ which has inside:

7005_S4_L001_R1_001.fastq.gz 7182_S26_L002_R1_001.fastq.gz 7006_S29_L002_R1_001.fastq.gz 7183_S27_L002_R1_001.fastq.gz 7007_S43_L002_R1_001.fastq.gz 7184_S17_L001_R1_001.fastq.gz 7008_S30_L002_R1_001.fastq.gz 7185_S25_L002_R1_001.fastq.gz 7087_S8_L001_R1_001.fastq.gz 7190_S12_L001_R1_001.fastq.gz 7088_S3_L001_R1_001.fastq.gz 7191_S16_L001_R1_001.fastq.gz 7089_S28_L002_R1_001.fastq.gz 7192_S14_L001_R1_001.fastq.gz 7090_S9_L001_R1_001.fastq.gz 7193_S19_L001_R1_001.fastq.gz 7139_S32_L002_R1_001.fastq.gz 7194_S22_L001_R1_001.fastq.gz 7140_S31_L002_R1_001.fastq.gz 7195_S5_L001_R1_001.fastq.gz 7141_S45_L002_R1_001.fastq.gz 7196_S1_L001_R1_001.fastq.gz 7144_S47_L002_R1_001.fastq.gz 7197_S23_L001_R1_001.fastq.gz 7145_S15_L001_R1_001.fastq.gz 7219_S39_L002_R1_001.fastq.gz 7146_S20_L001_R1_001.fastq.gz 7220_S34_L002_R1_001.fastq.gz 7147_S13_L001_R1_001.fastq.gz 7221_S41_L002_R1_001.fastq.gz 7151_S10_L001_R1_001.fastq.gz 7222_S40_L002_R1_001.fastq.gz 7152_S42_L002_R1_001.fastq.gz 7236_S21_L001_R1_001.fastq.gz 7153_S35_L002_R1_001.fastq.gz 7237_S6_L001_R1_001.fastq.gz 7154_S11_L001_R1_001.fastq.gz 7238_S24_L001_R1_001.fastq.gz 7160_S18_L001_R1_001.fastq.gz 7239_S2_L001_R1_001.fastq.gz 7178_S44_L002_R1_001.fastq.gz 7242_S7_L001_R1_001.fastq.gz 7179_S46_L002_R1_001.fastq.gz 7243_S33_L002_R1_001.fastq.gz 7180_S36_L002_R1_001.fastq.gz 7247_S37_L002_R1_001.fastq.gz 7181_S48_L002_R1_001.fastq.gz 7248_S38_L002_R1_001.fastq.gz

and

/160713_700470R_0450_BHVKV5BCXX/ which has inside:

7005_S4_L001_R1_001.fastq.gz 7182_S26_L002_R1_001.fastq.gz 7006_S29_L002_R1_001.fastq.gz 7183_S27_L002_R1_001.fastq.gz 7007_S43_L002_R1_001.fastq.gz 7184_S17_L001_R1_001.fastq.gz 7008_S30_L002_R1_001.fastq.gz 7185_S25_L002_R1_001.fastq.gz 7087_S8_L001_R1_001.fastq.gz 7190_S12_L001_R1_001.fastq.gz 7088_S3_L001_R1_001.fastq.gz 7191_S16_L001_R1_001.fastq.gz 7089_S28_L002_R1_001.fastq.gz 7192_S14_L001_R1_001.fastq.gz 7090_S9_L001_R1_001.fastq.gz 7193_S19_L001_R1_001.fastq.gz 7139_S32_L002_R1_001.fastq.gz 7194_S22_L001_R1_001.fastq.gz 7140_S31_L002_R1_001.fastq.gz 7195_S5_L001_R1_001.fastq.gz 7141_S45_L002_R1_001.fastq.gz 7196_S1_L001_R1_001.fastq.gz 7144_S47_L002_R1_001.fastq.gz 7197_S23_L001_R1_001.fastq.gz 7145_S15_L001_R1_001.fastq.gz 7219_S39_L002_R1_001.fastq.gz 7146_S20_L001_R1_001.fastq.gz 7220_S34_L002_R1_001.fastq.gz 7147_S13_L001_R1_001.fastq.gz 7221_S41_L002_R1_001.fastq.gz 7151_S10_L001_R1_001.fastq.gz 7222_S40_L002_R1_001.fastq.gz 7152_S42_L002_R1_001.fastq.gz 7236_S21_L001_R1_001.fastq.gz 7153_S35_L002_R1_001.fastq.gz 7237_S6_L001_R1_001.fastq.gz 7154_S11_L001_R1_001.fastq.gz 7238_S24_L001_R1_001.fastq.gz 7160_S18_L001_R1_001.fastq.gz 7239_S2_L001_R1_001.fastq.gz 7178_S44_L002_R1_001.fastq.gz 7242_S7_L001_R1_001.fastq.gz 7179_S46_L002_R1_001.fastq.gz 7243_S33_L002_R1_001.fastq.gz 7180_S36_L002_R1_001.fastq.gz 7247_S37_L002_R1_001.fastq.gz 7181_S48_L002_R1_001.fastq.gz 7248_S38_L002_R1_001.fastq.gz

As you can see, both directories contain the same number of FASTQ files, one per sample, labelled with the same names. So I assumed that, for each sample, one of the two files is the forward FASTQ and the other is the reverse.

Taking a look inside each pair of fastq, they appear to be different, as you can see:

/160712_700470R_0449_BHVHH7BCXX$ zcat 7005_S4_L001_R1_001.fastq.gz | head

@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA
NGCGCCGCGGCTGGACGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTC
+
#<<DDIIIIIIHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@700470R:449:HVHH7BCXX:1:1107:1465:1915 1:N:0:TGACCA
NGCGACCTCAGATCAGACGAAGATCGGAAGAGCACACGTCTGAACTCCAGT
+
#<<DDHIHHHIFEHIHHIIIIIIHIHIIGIIIIIGIIHIIIIHIIIGIIIG
@700470R:449:HVHH7BCXX:1:1107:1971:1937 1:N:0:TGACCA

/160713_700470R_0450_BHVKV5BCXX$ zcat 7005_S4_L001_R1_001.fastq.gz | head

@700470R:450:HVKV5BCXX:1:1101:1664:1955 1:N:0:TGACCA
NTTGGTCCCCTTCAACCAGCTGTAGATCGGAAGAGCACACGTCTGAACTCC
+
#<<DDHIHIHIIIIIIHIIIHIIIIIIIIIIIIIIIIIHIIIIIIHIIIII
@700470R:450:HVKV5BCXX:1:1101:1940:1935 1:N:0:TGACCA
NGGAATGTAAAGAAGTATGTACAGATCGGAAGAGCACACGTCTGAACTCCA
+
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@700470R:450:HVKV5BCXX:1:1101:2588:1943 1:N:0:TGACCA
NCGTACCGTGAGTAATAATGCGAGATCGGAAGAGCACACGTCTGAACTCCA

I was expecting to see some kind of guide in each read's header: the final section of the header (1:N:0:TGACCA) should identify whether the read is a forward read (1) or a reverse read (2). Surprisingly, there's a 1 in both files. So I kinda freaked out...
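
For reference, the header fields can be pulled apart with a short script. This is only a minimal sketch, assuming the headers follow the Illumina Casava 1.8+ layout (`instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); the names `ReadHeader` and `parse_header` are hypothetical and not part of any library.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell: str
    lane: int
    read_number: int  # 1 = first read (R1), 2 = second read (R2)
    index: str


def parse_header(line: str) -> ReadHeader:
    """Parse one FASTQ header line (assumed Casava 1.8+ layout)."""
    name, _, comment = line.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane, _tile, _x, _y = name.split(":")
    read, _filtered, _control, index = comment.split(":")
    return ReadHeader(instrument, int(run), flowcell, int(lane), int(read), index)


# First header line from each of the two directories shown above.
first = parse_header("@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA")
second = parse_header("@700470R:450:HVKV5BCXX:1:1101:1664:1955 1:N:0:TGACCA")

print(first.read_number, second.read_number)
print(first.flowcell, second.flowcell)
```

Under that assumption, the field right after the space is the read number, while the run number and flow cell ID sit in the second and third colon-separated fields of the read name.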

Debating about this with my pillow, I reached two possible conclusions:

1) The lab that provided the data should tell me which file is the forward and which is the reverse (I sent them an email about this issue but I haven't received an answer yet).

2) There's no real difference or importance in labelling the forward with a 1 and the reverse with a 2, and presumably I could arbitrarily assign a 1 or a 2 to each of the paired files and proceed to further analysis. But this second theory seems to me a very silly solution.

So, any help with this? I can't start analyzing my samples until I solve this problem...

RNA-Seq next-gen fastq forward reverse • 2.7k views

Normally (at least in my case) it should be in the file name, for example machine_lan.1.fastq and machine_lan.2.fastq, or both in one file with each read distinguished by 1 for forward and 2 for reverse. (Is this the Illumina platform?)


If it's paired-end, then each _R1_ file should have an associated _R2_ file:

7006_S29_L002_R1_001.fastq.gz
7006_S29_L002_R2_001.fastq.gz

If not, some files are missing.
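
That check can be scripted over a directory listing. The following is a rough sketch, assuming the file names follow the `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz` pattern seen in the question; `PATTERN` and `find_unpaired` are hypothetical names made up for this illustration.

```python
import re

# Assumed naming scheme: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
PATTERN = re.compile(r"^(?P<prefix>.+_S\d+_L\d{3})_R(?P<read>[12])_001\.fastq\.gz$")


def find_unpaired(filenames):
    """Return the R1 files that have no matching R2 file in the same list."""
    reads_by_prefix = {}
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            reads_by_prefix.setdefault(match.group("prefix"), set()).add(match.group("read"))
    # A prefix that only ever appears with read "1" has no mate file.
    return sorted(
        f"{prefix}_R1_001.fastq.gz"
        for prefix, reads in reads_by_prefix.items()
        if reads == {"1"}
    )


# Two of the file names listed in the question.
files = ["7005_S4_L001_R1_001.fastq.gz", "7006_S29_L002_R1_001.fastq.gz"]
print(find_unpaired(files))
```

If every R1 file in a directory comes back as unpaired, the data were most likely delivered as single-end reads, or the R2 files were never delivered.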


As Pierre and Medhat said, paired-end reads would be named R1 and R2. As you mentioned, these are from small RNA, so paired-end data is not needed; it seems the same samples were run in multiple runs. For the analysis, you can just concatenate the files of each sample and proceed.


Oooook, thanks a lot to everyone. Yeah... I'm used to doing paired-end RNA-seq and I'm new to the microRNA world, so I assumed things wrong...

So yes, they are single-end reads, now I see it XD, but they ran all the samples twice. Now I'm not sure whether to concatenate both runs for each sample into a single FASTQ file, analyze the two runs separately to compare them, or just do both :S


If it is the same sample run multiple times you can concatenate the files (unless one of the replicates was deemed not suitable and the pool was re-run for that reason).
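
A minimal sketch of the concatenation, assuming both runs are plain gzip-compressed FASTQ files with four lines per record. Gzip files can be appended byte for byte because a multi-member gzip stream is still valid. `concatenate_runs` and `count_reads` are hypothetical helper names, and the output file name in the usage comment is made up.

```python
import gzip
import shutil


def concatenate_runs(run_files, merged_path):
    """Append gzip files byte for byte into one multi-member gzip file."""
    with open(merged_path, "wb") as out:
        for path in run_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(fastq_gz):
    """Count FASTQ records, assuming exactly four lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Hypothetical usage with the two run directories from the question:
# concatenate_runs(
#     ["160712_700470R_0449_BHVHH7BCXX/7005_S4_L001_R1_001.fastq.gz",
#      "160713_700470R_0450_BHVKV5BCXX/7005_S4_L001_R1_001.fastq.gz"],
#     "7005_S4_L001_merged.fastq.gz",
# )
```

Comparing `count_reads` on the merged file with the sum over the two inputs is a cheap sanity check that nothing was truncated.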


If the rerun was due to insufficient read depth, then you can concatenate and do the analysis. If they are replicates, then analyze them individually.

8.3 years ago

Either the samples weren't sequenced paired-end (it rarely makes sense to do so for smallRNAseq unless you're doing single-cell sequencing) or they forgot to deliver those files. The former is more likely. You might ask them why they ran the samples on a second flow cell, i.e., was there a problem with the first run that you should know about or was it just for depth?
