How to extract the first and last N bases from a read in a fastq file?
1
0
Entering edit mode
5.6 years ago
lovegenetics ▴ 70

how can I extract the first and last N bases from a read in a fastq file?

I have used the following command to extract the last 1000 bases of a read from a fastq file but I'd also like to incorportate the first 1000 bases to the command as well:

$$  grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | grep -o '.{1000}$'

Also, how can I use the new command for the first and last N bases on a perl script as I have >450 reads in a fastq file?

Many thanks,

Any help will be appreciated.

fastq • 2.8k views
ADD COMMENT
0
Entering edit mode

If you want to use perl (or python), I would suggest parsing the file 'properly' with the Bio module. Extracting this information will be fairly trivial, and much more robust.

ADD REPLY
1
Entering edit mode
5.6 years ago
GenoMax 147k

Here is one way (using part of your own solution) :

grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | cut -c 1-1000

OR

grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | sed 's/.//1001g'

Not sure why you are referring to perl in your question since there is no perl involved.

Hint: For things like this search stackoverflow for solutions.

ADD COMMENT
0
Entering edit mode

Thank you @genomax.

I was wondering if there's a syntax I could use to extract both the 1st and last 1kb of sequence from each read?

ADD REPLY
0
Entering edit mode

How about this?

zcat MyReads.fastq.gz | head
@1:NM_014620:16:182
GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@2:NM_014620:1094:172
ATGAAAAAAATTCACGTTAGCACGGTGAACCCCAATTATAACGGAGGGGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@3:NM_022658:294:172
TGTACGGGCCCGGCGGCTCGGCGCCCGGCTTCCAGCACGCTTCGCACCAC

bases=10 ;
zcat MyReads.fastq.gz | awk -v len="$bases" -F "" '{if (NR % 4 == 2) {for (i=1; i<=NF; i++) {if (i <= len || (NF-i) < len) {printf $(i)}}; printf "\n"}}' | head
GTTCCGAGCGCTAACCATTGC
ATGAAAAAAAAACGGAGGGGA
TGTACGGGCCCTTCGCACCAC
ACTAGATGTAAATGAAGAAAG
GCGTGAGGAGGGCAGCGCTGC
GGAAGGACTCTGGGGCAAGGA
CTACCTATAGATGCGGTTTTG
AAGGGACTGTGGAGGCGTCTC
ATAACATCCATCTGTTTTGTT
CCAACCTGCCCCCGGGGAGCC

In this example, I am just taking the first and last 10 bases. To retrieve 1000bp, change the value of bases before the awk command

ADD REPLY

Login before adding your answer.

Traffic: 2229 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6