A company performed paired-end sequencing of exome libraries for us. When I requested the read data in FASTQ format, they mailed me a hard drive with _export.txt files. The alignments were done against UCSC hg18, and I want to realign the samples myself. The files for one sample are:
s_6_1_export.txt
s_6_2_export.txt
Interestingly, these files do not have the same number of lines: s_6_1 has 67,003,423, while s_6_2 has 144 fewer. They are also not ordered so that paired reads fall on the same line in both files.
My first move was to sort both by the cluster coordinates (columns 5 and 6) so that the paired reads would be on the same line in each file.
sort -t $'\t' -k 5n,5 -k 6n,6 -S 40G s_6_1_export.txt > s_6_1_sorted.txt &
sort -t $'\t' -k 5n,5 -k 6n,6 -S 40G s_6_2_export.txt > s_6_2_sorted.txt &
However, I do not think this will work unless all of the extra reads in the s_6_1 file sort to the end of the resulting file. I was then planning to use CASAVA to convert each of the sorted export.txt files to FASTQ with the commands:
CASAVA -a Export2Fastq -e s_6_1_sorted.txt -o s_6_1.fq --purityFilter=YES
CASAVA -a Export2Fastq -e s_6_2_sorted.txt -o s_6_2.fq --purityFilter=YES
From the resulting FASTQ files I was going to use BWA to do a paired-end alignment to UCSC hg19, followed by downstream analysis of the resulting BAM file.
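The plan above relies on mates landing on the same line after sorting, which the 144 unmatched reads would break. One possible way around this is a merge join over the two sorted files that skips any read without a partner. The sketch below is only an illustration under several assumptions: tile, x and y (columns 4, 5 and 6) are integers, both files were sorted numerically on all three columns (for example by adding `-k 4n,4` in front of the existing sort keys), and all reads come from a single lane. The function names `sort_key` and `paired_lines` are made up for this example.

```python
def sort_key(line):
    """Return (tile, x, y) as integers from one tab-separated export line."""
    fields = line.rstrip("\n").split("\t")
    # Columns 4, 5 and 6 (1-based) are assumed to hold tile, x and y.
    return int(fields[3]), int(fields[4]), int(fields[5])


def paired_lines(lines1, lines2):
    """Yield (line1, line2) for every key found in both sorted inputs.

    Both inputs must already be sorted by sort_key. A line whose key
    appears in only one input is skipped instead of shifting the pairing.
    """
    it1, it2 = iter(lines1), iter(lines2)
    a, b = next(it1, None), next(it2, None)
    while a is not None and b is not None:
        ka, kb = sort_key(a), sort_key(b)
        if ka == kb:
            yield a, b
            a, b = next(it1, None), next(it2, None)
        elif ka < kb:
            a = next(it1, None)  # read 1 has no mate, skip it
        else:
            b = next(it2, None)  # read 2 has no mate, skip it
```

Because only one line from each file is held at a time, memory use stays small regardless of file size. The matched pairs could then be written to two new export files in the same order before running the conversion step.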
My issue is that I'm baffled as to how to get from the _export.txt files given to me, back to fastq such that the paired nature of the read data is preserved. Could anyone offer guidance or suggestions here?
--Colin
Istvan, thank you for your reply.
The specification for the file format can be found in the CASAVA user guide (http://biowulf.nih.gov/apps/CASAVAUG15011196B.pdf)
The relevant first few columns of an ELAND _export.txt file are as follows: Machine, Run Number, Lane, Tile, X Coordinate of cluster, Y Coordinate of cluster, Index sequence, Read number (1 for single reads; 1 or 2 for paired ends or multiplexed single reads)
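To make the column layout concrete, here is a minimal Python sketch that splits one export line into the eight fields listed above and derives an identifier shared by both mates. It assumes the file is tab-separated (as the `sort -t $'\t'` commands suggest); the names `ExportHeader`, `parse_header` and `cluster_id` are hypothetical and not part of CASAVA.

```python
from typing import NamedTuple


class ExportHeader(NamedTuple):
    """First eight columns of an export line, kept as strings."""
    machine: str
    run: str
    lane: str
    tile: str
    x: str
    y: str
    index: str
    read_number: str


def parse_header(line):
    """Split a tab-separated export line and name its first eight fields."""
    fields = line.rstrip("\n").split("\t")
    return ExportHeader(*fields[:8])


def cluster_id(header):
    """Return the fields that identify a cluster: machine through y.

    The index and read number are left out, so both mates of a pair
    should produce the same tuple.
    """
    return tuple(header[:6])
```

Note that the identifier includes lane and tile as well as x and y, on the assumption that cluster coordinates are only unique within a tile.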
A couple (paired) example lines are:
Paired reads should have the same cluster coordinates as I understand it.
I wrote a quick Python script that parses the files, builds a set of (x, y) coordinate tuples from each, and compares the two sets. It's not an elegant solution, as it eats 21 GB of RAM, but it finishes in less than half an hour, so I have time for coffee.
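A comparison along these lines could look roughly like the sketch below. It is not the script described above, just an illustration of the idea, and it differs in one respect: the key includes the tile (column 4) in addition to x and y, on the assumption that coordinates can repeat across tiles. The function names `key_counts` and `compare` are invented for this example.

```python
from collections import Counter


def key_counts(lines):
    """Count how often each (tile, x, y) key occurs in an export file."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Columns 4, 5 and 6 (1-based): tile, x, y.
        counts[(fields[3], fields[4], fields[5])] += 1
    return counts


def compare(lines1, lines2):
    """Report keys unique to either file and keys duplicated within a file."""
    c1, c2 = key_counts(lines1), key_counts(lines2)
    return {
        "only_in_1": c1.keys() - c2.keys(),
        "only_in_2": c2.keys() - c1.keys(),
        "dup_in_1": {k for k, n in c1.items() if n > 1},
        "dup_in_2": {k for k, n in c2.items() if n > 1},
    }
```

Both functions accept any iterable of lines, so two open file handles can be passed directly to `compare`. Like the original script, this holds every key in memory, so a similar memory footprint should be expected on files of this size.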
The log is as follows:
I'm going to investigate the differences and see about handling the coordinate duplicates within each file and the symmetric differences between the files. From what I've read, _export.txt was meant as an internal format. I'm less than pleased that it was returned to me as a copy of the "source data" for an analysis we contracted out to a company.
--Colin
I guess I grossly misunderestimated the memory requirements ;-) Glad that it seems to work out, though.