Hi,
I have been analyzing a small RNA-seq dataset and ran into a small problem with my FASTA/FASTQ files. After trimming and collapsing, I wanted to keep only reads that are 22 nt long and have a guanine at the 5' end. These are the commands I used to filter the reads:
cat input_wt3_trimmed_collapsed_1_2.fq | paste - - | awk 'length($4) >= 22 && length($4) <=22' | sed 's/\t/\n/g' > input_wt3_trimmed_collapsed_2.fq
awk '$2 ~ /^G/' elution_wt1_trimmed_collapsed_1_2.fq > elution_wt1_trimmed_collapsed_1_2_22Gs_2.fq
However, these commands converted my FASTA/FASTQ files from the two-line format into a one-line format, with the header and the sequence on the same line. Here is an example:
before:
>1-1763
TACCCGTATAAGTTTCTGCTGAG
>2-1550
TGAGATCGTTCAGTACGGCAA
after:
>73-969 GAGATCGGGCGGGAAGTGGTAT
>89-940 GTTTCCGGCTCACGTCCTCTGA
>90-938 GCGTGTAAGTTCGGCGGCGTGA
I would really appreciate it if you could suggest a better way of doing this. When I try to map these reads with STAR, the file is not recognised as a compatible format. I guess I need to convert the final file back into a two-line FASTA file such as:
>73-969
GAGATCGGGCGGGAAGTGGTAT
>89-940
GTTTCCGGCTCACGTCCTCTGA
>90-938
GCGTGTAAGTTCGGCGGCGTGA
What could be the best way to fix this problem?
best
Ahmet
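The conversion described in the post (one-line records back to two-line FASTA) could be sketched with awk as below. This is a minimal sketch, not a definitive fix: it assumes each line holds exactly a header and a sequence separated by whitespace (a tab, if the line came from paste), and the file names flattened.fa and two_line.fa are hypothetical.

```shell
# Hypothetical sample of flattened records: header and sequence on one line.
printf '%s\n' \
  '>73-969 GAGATCGGGCGGGAAGTGGTAT' \
  '>89-940 GTTTCCGGCTCACGTCCTCTGA' > flattened.fa

# awk splits on any whitespace by default, so $1 is the header and $2 the sequence.
# Print each on its own line to restore the two-line FASTA layout.
awk '{ print $1; print $2 }' flattened.fa > two_line.fa
```

If the headers themselves contained spaces, the field numbers would need adjusting; the headers shown in the post do not.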
Please reformat the examples; everything is on just one line. Also, the length test can simply be length($4) == 22, and in awk you can also test for a G at the 5' end.
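As a rough illustration of combining the length test and the 5' G test in a single awk call, here is a sketch that works directly on a two-line FASTA file, so the records never get flattened. It assumes the input really is FASTA with each sequence on a single line (not FASTQ, and not wrapped); the file names are hypothetical, and the last two sample records are made up for the example.

```shell
# Hypothetical two-line FASTA input; the last two records are invented test cases.
printf '%s\n' \
  '>1-1763' 'TACCCGTATAAGTTTCTGCTGAG' \
  '>73-969' 'GAGATCGGGCGGGAAGTGGTAT' \
  '>x-1'    'GACCCGTATAAGTTTCTGCTGAG' \
  '>x-2'    'AAGATCGGGCGGGAAGTGGTAT' > collapsed.fa

# Remember each header line, then print header + sequence only when the
# sequence is exactly 22 nt long and starts with G.
awk '/^>/ { header = $0; next }
     length($0) == 22 && /^G/ { print header; print $0 }' collapsed.fa > collapsed_22G.fa
```

For a four-line FASTQ file the same idea would apply, but the record would have to be tracked over four lines instead of two.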
I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.
From your code in the OP, I understand that you are parsing a FASTQ file and that your output is also a FASTQ file. But the examples you provided are neither FASTQ nor FASTA. Could you please post one or a few records from the FASTQ file?