fastq file on the basis of read length
10.4 years ago
Varun Gupta ★ 1.3k

Hello Everyone,

I have a fastq file and I want to extract only the reads that are longer than 25 bp, writing them to a new fastq file. How can I do this? Here are the first 100 lines of my fastq file:

@SRR1024131.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1911:1053 length=100
AGGGCAAGTATGAAGAAGTAGAATATT
+SRR1024131.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1911:1053 length=100
DDFHHFHHGGGHGGIFHIJIIDIIJJI
@SRR1024131.2 DBRHHJN1:259:D0PM7ACXX:1:1101:2522:1198 length=100
GGCTCAACTTTCGATGGT
+SRR1024131.2 DBRHHJN1:259:D0PM7ACXX:1:1101:2522:1198 length=100
FFFGHHHHJJJJJGFIJF
@SRR1024131.3 DBRHHJN1:259:D0PM7ACXX:1:1101:3117:1165 length=100
ACATTTTTGAGTGCTTACTACAGT
+SRR1024131.3 DBRHHJN1:259:D0PM7ACXX:1:1101:3117:1165 length=100
FFFHHHHHHHIHEHHFGHFHHGII
@SRR1024131.4 DBRHHJN1:259:D0PM7ACXX:1:1101:3474:1075 length=100
TAGTACTTAGCAAAGAGTGA
+SRR1024131.4 DBRHHJN1:259:D0PM7ACXX:1:1101:3474:1075 length=100
DDDFHDFHIAGHIGHG@33A
@SRR1024131.5 DBRHHJN1:259:D0PM7ACXX:1:1101:3952:1099 length=100
TGAGAACTGAATTCCATAGGCTGT
+SRR1024131.5 DBRHHJN1:259:D0PM7ACXX:1:1101:3952:1099 length=100
EFFHGHHHHJIJJJJJIBFHEHIG
@SRR1024131.9 DBRHHJN1:259:D0PM7ACXX:1:1101:5277:1092 length=100
GCGGCGGCGTTATTCCCATGACCCGCCGG
+SRR1024131.9 DBRHHJN1:259:D0PM7ACXX:1:1101:5277:1092 length=100
FDDDHHDHI@B>=B>?@BD>ACCCBC@BB
@SRR1024131.11 DBRHHJN1:259:D0PM7ACXX:1:1101:6019:1101 length=100
AGTAGATTTGTATGGATTT
+SRR1024131.11 DBRHHJN1:259:D0PM7ACXX:1:1101:6019:1101 length=100
DDDHHFFHIGHAGHEFIII
@SRR1024131.14 DBRHHJN1:259:D0PM7ACXX:1:1101:8423:1248 length=100
AGTCGGTGATGGGAGTCTCT
+SRR1024131.14 DBRHHJN1:259:D0PM7ACXX:1:1101:8423:1248 length=100
FFFHHHFHIJIIJIJBHIJJ
@SRR1024131.15 DBRHHJN1:259:D0PM7ACXX:1:1101:9484:1233 length=100
TGCTGGGTCACACCTGAAGCT
+SRR1024131.15 DBRHHJN1:259:D0PM7ACXX:1:1101:9484:1233 length=100
FFFHHGHFHIJHHHJJHHIJJ
@SRR1024131.16 DBRHHJN1:259:D0PM7ACXX:1:1101:9807:1100 length=100
ACTATTCCAGCGAGAGTTAACATAAATTCCAAT
+SRR1024131.16 DBRHHJN1:259:D0PM7ACXX:1:1101:9807:1100 length=100
FFFHHHHHJJIJJJJIHHGHIJJGJJJJIIJJI
@SRR1024131.17 DBRHHJN1:259:D0PM7ACXX:1:1101:10857:1034 length=100
TAATCATTTTAATTGTACAGTTCAGTAATGT
+SRR1024131.17 DBRHHJN1:259:D0PM7ACXX:1:1101:10857:1034 length=100
B?3CDFBFFFFFIIF:EFHAHIC?FE+ABHH
@SRR1024131.19 DBRHHJN1:259:D0PM7ACXX:1:1101:13257:1082 length=100
ATGTGTTTGTAGGTTGTTTGTTGTCTTTA
+SRR1024131.19 DBRHHJN1:259:D0PM7ACXX:1:1101:13257:1082 length=100
DFFHHHHHJFHHIHHJFGHIJJIFIIIIG
@SRR1024131.20 DBRHHJN1:259:D0PM7ACXX:1:1101:14103:1161 length=100
TGAGGTAGTAGGTTGTATAGTT
+SRR1024131.20 DBRHHJN1:259:D0PM7ACXX:1:1101:14103:1161 length=100
FFEHFCFHFGHGEFHC<HHIED
@SRR1024131.21 DBRHHJN1:259:D0PM7ACXX:1:1101:16005:1093 length=100
TTCTCTCTCTCTGTGTGTGCGTGTGTGTGTGT
+SRR1024131.21 DBRHHJN1:259:D0PM7ACXX:1:1101:16005:1093 length=100
DDFGHGGFJGIJFIBCBAFHHCGGFDCFGFED
@SRR1024131.24 DBRHHJN1:259:D0PM7ACXX:1:1101:17113:1023 length=100
TCCCTGAGACCCTAACTTGTGA
+SRR1024131.24 DBRHHJN1:259:D0PM7ACXX:1:1101:17113:1023 length=100
FFFHHHHHJJJIIJJIJJJJJJ
@SRR1024131.26 DBRHHJN1:259:D0PM7ACXX:1:1101:18596:1025 length=100
TGAGGTAGGAGGTTGTATAGTTAT
+SRR1024131.26 DBRHHJN1:259:D0PM7ACXX:1:1101:18596:1025 length=100
DDDDDACDEEEE:AF3CE@A9ABE
@SRR1024131.27 DBRHHJN1:259:D0PM7ACXX:1:1101:19286:1068 length=100
TCCCTGAGACCCTAACTTGTGA
+SRR1024131.27 DBRHHJN1:259:D0PM7ACXX:1:1101:19286:1068 length=100
DDDFHHHHIIIGG;CEGIEHHG
@SRR1024131.28 DBRHHJN1:259:D0PM7ACXX:1:1101:20016:1230 length=100
CAAATAATTACAGTTAT
+SRR1024131.28 DBRHHJN1:259:D0PM7ACXX:1:1101:20016:1230 length=100
DFFGBFBHG@HGHHGFA
@SRR1024131.29 DBRHHJN1:259:D0PM7ACXX:1:1101:20465:1216 length=100
GTTACGCTCGCCTTGGCCGT
+SRR1024131.29 DBRHHJN1:259:D0PM7ACXX:1:1101:20465:1216 length=100
FFFGHHHHJJJJGGHIFHGD
@SRR1024131.30 DBRHHJN1:259:D0PM7ACXX:1:1101:20573:1152 length=100
AGAAGGAACTTTTACAACTGTGTGGTTTT
+SRR1024131.30 DBRHHJN1:259:D0PM7ACXX:1:1101:20573:1152 length=100
DDBDBB+AFHGE>@<C<?:AA@HEE:)?F
@SRR1024131.32 DBRHHJN1:259:D0PM7ACXX:1:1101:21322:1217 length=100
ATTACTGAAGAAAAGTTTACCT
+SRR1024131.32 DBRHHJN1:259:D0PM7ACXX:1:1101:21322:1217 length=100
AADHHHHB<:EEF;C22A22AC
@SRR1024131.35 DBRHHJN1:259:D0PM7ACXX:1:1101:4318:1259 length=100
AAAAGCATTCATCAGCCCAA
+SRR1024131.35 DBRHHJN1:259:D0PM7ACXX:1:1101:4318:1259 length=100
FFFGHGHHJGCIJFGGIJII
@SRR1024131.36 DBRHHJN1:259:D0PM7ACXX:1:1101:4391:1407 length=100
CTGGACTCTTACTGCGTTTCATACATCT
+SRR1024131.36 DBRHHJN1:259:D0PM7ACXX:1:1101:4391:1407 length=100
FFFH?HHHIGGIGIII<FBEHIIIEIGE
@SRR1024131.39 DBRHHJN1:259:D0PM7ACXX:1:1101:6327:1406 length=100
AAGTACGCACGGCCGGTACAGTGAAG
+SRR1024131.39 DBRHHJN1:259:D0PM7ACXX:1:1101:6327:1406 length=100
FFFHGHHHIJIGIIII0?FHGHIJGH
@SRR1024131.43 DBRHHJN1:259:D0PM7ACXX:1:1101:7579:1334 length=100
TGTGTATAAATGTATTT
+SRR1024131.43 DBRHHJN1:259:D0PM7ACXX:1:1101:7579:1334 length=100
FFFHHHGHJJJJHGIJJ

Any help!!

Regards
Varun

fastq • 6.2k views

You may find a suitable answer faster by simply searching because a variety of similar questions have been asked before, e.g., Filtering Fastq Sequences Based On Lengths

10.4 years ago

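Linearize each record with paste, filter with awk, then convert the tabs back to newlines. Set awk's field separator to a tab, because the header lines contain spaces and the default whitespace splitting would make $2 part of the header rather than the sequence:

gunzip -c *.fastq.gz |\
paste - - - - |\
awk -F '\t' 'length($2) > 25' |\
tr "\t" "\n"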
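
If you would rather script it, the same read-four-lines-and-test-the-sequence idea can be sketched in Python. This is only a minimal sketch, not part of the original answer: it assumes every record is exactly four lines (no wrapped sequences), and the function names and the default cutoff of 26 (i.e. longer than 25 bp) are illustrative choices.

```python
import gzip
from itertools import islice


def filter_fastq_by_length(lines, min_length=26):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            # End of input (or a truncated trailing record): stop.
            break
        # The sequence is the second line of each record.
        if len(record[1].rstrip("\r\n")) >= min_length:
            yield record


def filter_fastq_file(in_path, out_path, min_length=26):
    """Hypothetical helper: filter one gzipped FASTQ file into another."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for record in filter_fastq_by_length(src, min_length):
            dst.writelines(record)
```

For example, filter_fastq_file("reads.fastq.gz", "reads.gt25.fastq.gz") would write only the reads longer than 25 bp; both file names here are placeholders.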

Thanks Pierre

5.8 years ago
A. Domingues ★ 2.7k

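I believe you could also use seqtk to do this. The -L option drops sequences shorter than the given length, so use 26 to keep only reads longer than 25 bp. seqtk writes uncompressed output, so pipe it through gzip if you want a .gz file:

seqtk seq -L 26 yourseqs.fastq.gz | gzip > cleanseqs.fastq.gz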
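
Whichever tool you use, it is worth checking the result. As a rough sketch (again assuming strict four-line records; the function name is made up for illustration), you can tally the read lengths in the filtered file and confirm that nothing at or below 25 bp remains:

```python
from collections import Counter


def read_length_counts(lines):
    """Count sequence lengths in a FASTQ stream with 4 lines per record."""
    counts = Counter()
    for i, line in enumerate(lines):
        # Line index 1 within each group of four is the sequence.
        if i % 4 == 1:
            counts[len(line.rstrip("\r\n"))] += 1
    return counts
```

Passing an open file handle works, e.g. read_length_counts(gzip.open("cleanseqs.fastq.gz", "rt")) after importing gzip; min() of the returned counter then gives the shortest read length present.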