Hi All,
I downloaded data from the SRA archive and used fastq-dump to convert it into FASTQ files:
fastq-dump --outdir $OUTPUT -I --split-files $INPUT/SRR00000.sra
The FASTQ file headers were something like this:
@SRR1101035.5.1 5 length=100
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+SRR1101035.5.1 5 length=100
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG1@?1?DG########(0(7<;FCGHC;=#--5A5?#############################
@SRR1101035.6.1 6 length=100
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.6.1 6 length=100
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################
I then stripped the extra information from the FASTQ headers with a command suggested here earlier:
cat your_original_fasta_file | paste - - - - | awk -v OFS="\t" ' {print $1,$4,"+",$8}' | tr "\t" "\n" > new_fasta_file
The results were:
@SRR1101035.5.1
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG1@?1?DG########(0(7<;FCGHC;=#--5A5?#############################
@SRR1101035.6.1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################
In the output above, the header @SRR1101035.5.1 has the form @samplename.readid.mate, where the last field is the mate number (1 or 2) in paired-end mode.
However, when I use BWA to map the first and second mates, it still fails with this error:
paired reads have different names: "SRR1101036.5.1", "SRR1101036.5.2"
QUESTION:
I want to edit the FASTQ files so that, in future, any tool that is sensitive to read-pair information can process the data without problems.
As with other conventional files, one possible way is to edit the header information in the FASTQ file,
for example by converting: @SRR1101035.5.1 >>> @SRR1101035.5#/1
Could you please suggest how I can do this on the Linux command line or by any other means? Note that I want to preserve the sample and read IDs in the header.
Thanks for the suggestion, Antonio!
I tried the --split-3 option on the command line:
fastq-dump --gzip --outdir $OUTPUT -I --split-3 $INPUT/SRR1101035.sra
The output files contain the same number of spots as with the previous command, and no third file was generated, so I believe there are no reads without a mate.
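One way to double-check the spot counts is to count the records in each mate file. The following is only a sketch: it assumes unwrapped 4-line FASTQ records, and the helper name count_fastq_records is made up for illustration.

```shell
# Hypothetical helper: count FASTQ records read from standard input.
# Assumes every record is exactly 4 lines (header, sequence, "+", quality).
# A non-integer result means the line count is not a multiple of 4.
count_fastq_records() {
  awk 'END { print NR / 4 }'
}
```

For the gzipped output this could be run as gzip -dc $OUTPUT/SRR1101035_1.fastq.gz | count_fastq_records, and again for the _2 file (file names assumed here); the two numbers should match.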
But the problem persists. Could you please suggest a one-step command that converts a record like this:
@SRR1101035.6.1 6 length=100
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.6.1 6 length=100
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################
Expected Results:
@SRR1101035.6#/1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################
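The conversion being asked for could be sketched roughly as follows. This is not a tested recipe for every SRA run: it assumes unwrapped 4-line FASTQ records whose header starts with @samplename.readid.mate, and the function name rename_fastq_headers is made up for illustration. Lines are selected by position (NR % 4) rather than by a leading "@", because quality strings can also start with "@".

```shell
# Hypothetical helper: rewrite FASTQ headers from @sample.read.mate to
# @sample.read#/mate and reduce the "+" line to a bare "+".
# Reads FASTQ on standard input, writes FASTQ on standard output.
# Assumes exactly 4 lines per record.
rename_fastq_headers() {
  awk '
    NR % 4 == 1 {
      name = $1                           # drop "6 length=100" and similar extras
      if (match(name, /\.[0-9]+$/)) {     # trailing ".1" or ".2" is the mate number
        mate = substr(name, RSTART + 1)
        name = substr(name, 1, RSTART - 1) "#/" mate
      }
      print name
      next
    }
    NR % 4 == 3 { print "+"; next }       # shorten the repeated header on the plus line
    { print }                             # sequence and quality lines pass through unchanged
  '
}
```

It would be applied once per mate file, for example rename_fastq_headers < SRR1101035_1.fastq > SRR1101035_1.renamed.fastq (file names assumed here). After this, both mates of a pair share the same name up to the /1 or /2 suffix.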
-- Thanks