If there are reads that are present in only one of the two files, I would like to remove them from that file and store their IDs somewhere.
Linearize your two fastq files with awk, add a new first column holding a common "key" (here the read name before the "/"), and sort on that key:
# the field separator for sort is a tab; $'\t' is bash syntax for a literal tab
gunzip -c file1.fastq.gz |\
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |\
awk '{i=index($1,"/"); printf("%s\t%s\n",substr($1,1,i-1),$0);}' |\
sort -k1,1 -t $'\t' > sorted1.txt
#same for file2.fastq.gz
(..) > sorted2.txt
Join both files with Unix join (again with a tab as the separator):
join -t $'\t' -1 1 -2 1 sorted1.txt sorted2.txt
Then recreate the two fastq files with cut and awk.
Following up on Pierre's post: once I had the joined, sorted file, I recreated my two mate-pair files in the following way:
awk '{print $2"\n"$3"\n"$4"\n"$5}' joined_sorted_file > mate1_sort
awk '{print $6"\n"$7"\n"$8"\n"$9}' joined_sorted_file > mate2_sort
I am not sure if this is the most efficient way to do so but at least it works :)
Hi,
have a look at this thread: Selecting Random Pairs From Fastq?
Ilia
Hi, the idea was to use the sorting approach described in the thread above: convert the fastq to one line per read (tabular), then remove the '#/1' suffix in the seq1 library and the '#/2' suffix in the seq2 library. After that, sort each file on the first column (the read ID). After sorting, add the '#/1' and '#/2' back to the end of the first column and convert the tabular fastq back to the original four-lines-per-read format.
Ilia
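The steps described above could look roughly like the sketch below for the first library. This is only an illustration under some assumptions: the file names are made up, the reads are strict four-line records, the ID suffix is exactly '#/1', and paste is used for the linearization instead of the awk one-liner from the other answer.

```shell
#!/usr/bin/env bash
# Sketch only: demo_unsorted_1.fastq / demo_sorted_1.fastq are hypothetical names.
export LC_ALL=C          # keep the sort order byte-wise and reproducible
tab=$'\t'

# Tiny example input with the two reads in the "wrong" order.
printf '@r2#/1\nTTTT\n+\nIIII\n@r1#/1\nACGT\n+\nIIII\n' > demo_unsorted_1.fastq

# 1. paste - - - - puts every four lines on one tab-separated line
# 2. strip the '#/1' suffix from the ID column
# 3. sort on the ID column
# 4. add the suffix back and print one field per line again
paste - - - - < demo_unsorted_1.fastq |
  awk -F'\t' -v OFS='\t' '{ sub(/#\/1$/, "", $1); print }' |
  sort -t "$tab" -k1,1 |
  awk -F'\t' '{ print $1 "#/1"; print $2; print $3; print $4 }' > demo_sorted_1.fastq
```

The second library would be handled the same way with '#/2' in place of '#/1'. If both files contain exactly the same set of IDs, the two sorted files should then be in matching order.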
Actually, I do not want to select a random subset. I want to check whether the sequences are in the same order and, if not, reorder them so that the first read in the first file is the mate of the first read in the second file. Of course, I could do this within BioC, but if the fastq files are large, it takes quite a while to process them.
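For the "are they already in the same order?" part, a quick check along these lines might be enough before doing any sorting. It is a sketch with made-up file names, and it assumes four-line records and IDs that end in "/1" and "/2" as in the answer above.

```shell
#!/usr/bin/env bash
# Sketch only: demo_1.fastq / demo_2.fastq are hypothetical example files.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n' > demo_1.fastq
printf '@r2/2\nAAAA\n+\nIIII\n@r1/2\nCCCC\n+\nIIII\n' > demo_2.fastq

# Print the ID (header line with everything from "/" on removed).
# Using NR % 4 == 1 rather than matching "@" avoids quality lines that start with "@".
ids() { awk 'NR % 4 == 1 { sub(/\/.*$/, ""); print }' "$1"; }

ids demo_1.fastq > demo_ids1.txt
ids demo_2.fastq > demo_ids2.txt

# cmp -s prints nothing and exits 0 only if the two ID lists are identical.
if cmp -s demo_ids1.txt demo_ids2.txt; then
    echo "read order: same"
else
    echo "read order: different"
fi
```

This only reads the header lines, so it should be considerably cheaper than sorting; the sort and join steps are only needed when the check reports a difference.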
Thanks a lot for your help! The code worked perfectly after I changed the following options: I just used "sort -k1,1" and "join -1 1 -2 1 sorted1.txt sorted2.txt" (so always without the "-t" option). Could you maybe also give me a hint on how to recreate the two fastq files with cut and awk? Unfortunately, I am not yet very fluent with Linux shell commands. One additional question: what would happen if I have an ID in one file that is not present in the other file?
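On the last question: a plain join only prints keys that occur in both files, so a read whose ID is missing from the other file is silently dropped from the joined output. To keep a record of those IDs, join's -v option can list the unpaired lines. A small sketch, with hypothetical demo files standing in for sorted1.txt and sorted2.txt (key, tab, rest of the read):

```shell
#!/usr/bin/env bash
# Sketch only: the demo_* files are hypothetical stand-ins for sorted1.txt / sorted2.txt.
export LC_ALL=C          # sort and join must agree on the collation order
tab=$'\t'
printf '@r1\tA\n@r2\tB\n@r3\tC\n' > demo_sorted1.txt
printf '@r2\tX\n@r3\tY\n@r4\tZ\n' > demo_sorted2.txt

# -v 1: lines of file 1 without a partner in file 2; -o 1.1 prints only the key.
join -t "$tab" -v 1 -o 1.1 demo_sorted1.txt demo_sorted2.txt > demo_only_in_1.txt
# -v 2: the same for file 2.
join -t "$tab" -v 2 -o 2.1 demo_sorted1.txt demo_sorted2.txt > demo_only_in_2.txt
# Plain join: only keys present in both files survive.
join -t "$tab" demo_sorted1.txt demo_sorted2.txt > demo_paired.txt
```

Here demo_only_in_1.txt and demo_only_in_2.txt hold the IDs of the reads without a mate, and demo_paired.txt contains only complete pairs, which can then be split back into the two fastq files.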