I have paired-end RNA-Seq data - read1 and read2 - stored in the same fastq file.
I'd like to align the reads using tophat.
Do I have to separate the data into two different files before running tophat?
I have paired-end RNA-Seq data - read1 and read2 - stored in the same fastq file.
I'd like to align the reads using tophat.
Do I have to separate the data into two different files before running tophat?
If you have any pattern in the read name ( like 1:N:####
or /1
or _1
etc) you could use the fastq-grep to match the pattern to extract the R1 and R2 into two separate files.
Or a simple Awk patter match will do. something like:
zcat fastq.gz | paste - - - - | awk '{ if $1 ~ /< R1 pattern here>/ { print $1"\n"$2"\n"$3"\n"$4" }' | gzip > Read_1.fastq.gz
zcat fastq.gz | paste - - - - | awk '{ if $1 ~ /< R2 pattern here>/ { print $1"\n"$2"\n"$3"\n"$4" }' | gzip > Read_2.fastq.gz
But they need to be kept in order, if they are not.
hi,
a bit late, but it can be useful for others. I just did some corrections to the good suggestion from Goutham Atla, it's faster than a lot of scripts:
zcat my_jgi_interleaved_file.fastq.gz | paste - - - - | awk '$2~ /1:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > my_jgi_read1.fastq
zcat my_jgi_interleaved_file.fastq.gz | paste - - - - | awk '$2~ /2:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > my_jgi_read1.fastq
If you want to run it on a ton of files in the same directory and have your outputs with the name of your original jgi file included with read1 and read2, and if your files are already unzipped (if not just change the first .fastq for fastq.gz and use zcat instead of cat:
for f in ./*.fastq ; do cat "$f" | paste - - - - | awk '$2~ /1:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > "$f._read1.fastq"; done
for f in ./*.fastq ; do cat "$f" | paste - - - - | awk '$2~ /2:N/ {print $1,$2"\n"$3"\n"$4"\n"$5}' > "$f._read2.fastq"; done
I don't see anything about processing interleaved files in Tophat's manual; it only mentions paired reads in two files. But you can de-interleave the file very fast with my reformat tool (written in Java):
reformat.sh in=reads.fq out1=r1.fq out2=r2.fq
Looks like classpath is correct but had problem running it:
$ bbmap/reformat.sh -Xmx2g in=1.fastq.gz out=1_1.fastq out2=1_2.fastq
java -ea -Xmx2g -cp /lustre/groups/lorainelab/data/d/fastq/bbmap/current/ jgi.ReformatReads -Xmx2g in=1.fastq.gz out=1_1.fastq out2=1_2.fastq
Exception in thread "main" java.lang.UnsupportedClassVersionError: jgi/ReformatReads : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: jgi.ReformatReads. Program will exit.
Ann,
That means you have a very old version of Java installed. You can either install (or request that your sysadmin install) Java 7, or use the latest Java 6-compiled version of BBTools. I apologize for the inconvenience.
-Brian
Follow-up:
I ended up writing a script to do this - before Geek_y's post. (Sorry, I haven't tried Geek_y's solution. It looks simpler & easier.)
Code is: splitPairs.py
https://bitbucket.org/lorainelab/genomes_src
Warning: splitPairs.py
has test coverage, but other scripts may not.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
If you mean to separate an interleaved fastq ((2n-1)-th read to one file; (2n)-th to another):
If tophat2 support streaming, you can do something like the following without creating temporary files (bash only):