Question

extracting reads from FASTA/FASTQ file

11 weeks ago

bleven • 0

Hello, I have specific genes for which I want to pull reads from a FASTA or FASTQ file (I am not sure which format would be better). How would I do that? Thank you!

RNA-seq RNA • 418 views

updated 11 weeks ago by GenoMax 148k • written 11 weeks ago by bleven • 0

Is there a reason you want to do this? There are a couple of ways to proceed. One way is to build an alignment index from your gene sequences, align your FASTQ reads to that index, and keep all the reads that align. Another way is to map all the reads to an alignment index for the genome from which the genes were derived, and then keep all the reads that overlap the gene coordinates (easy using samtools). Other than that, the only other solution consistent with the way you worded your question would be a direct pattern match between your FASTQ reads and the gene sequence, which seems exceedingly clunky and is probably not what you're really after. What are you trying to achieve?
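To make the second option concrete, here is a minimal Python sketch of the "keep reads that overlap the gene coordinates" step. It is only an illustration of the overlap test, not a replacement for samtools: the `Alignment` record, the function names, and the example coordinates are all hypothetical, and coordinates are assumed to be 0-based and half-open.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (one mapped read)."""
    read_id: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(aln: Alignment, chrom: str, gene_start: int, gene_end: int) -> bool:
    """True if the alignment shares at least one base with the gene interval."""
    return aln.chrom == chrom and aln.start < gene_end and aln.end > gene_start


def reads_overlapping_gene(
    alignments: Iterable[Alignment], chrom: str, gene_start: int, gene_end: int
) -> List[str]:
    """Return the IDs of reads whose alignment overlaps the gene interval."""
    return [
        aln.read_id
        for aln in alignments
        if overlaps(aln, chrom, gene_start, gene_end)
    ]


if __name__ == "__main__":
    # Made-up alignments and gene coordinates, for illustration only.
    example = [
        Alignment("r1", "chr1", 100, 150),
        Alignment("r2", "chr1", 190, 240),
        Alignment("r3", "chr1", 300, 350),
        Alignment("r4", "chr2", 200, 250),
    ]
    print(reads_overlapping_gene(example, "chr1", 200, 300))
```

In practice the alignments would come from a BAM file produced by an aligner, and the gene interval from an annotation file; the read IDs returned here could then be used to pull the matching records out of the original FASTQ.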

11 weeks ago by seidel 11k

score 0 · Answer 1 · 2024-10-10

Use bbduk.sh from the BBMap suite in filter mode. A guide is available here: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/ It works with any kind of reads, and you can write the output as FASTA if you wish.

That said, if the original data came from a whole genome, filtering against a reduced representation (e.g. only your genes of interest) always carries the risk that some reads get pulled in by chance. If you have a reference available, aligning to the complete genome and then extracting reads (as already suggested) would be the cleanest way to do this. bbmap.sh, the aligner from the same suite, can help with that.
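For intuition about what k-mer based filtering does, here is a small Python sketch. It is not bbduk's implementation, only a toy version of the general idea: collect every k-mer from the gene sequences (both strands) and keep any read that shares at least one k-mer with that set. All function names and the choice of `k` are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A/C/G/T only; other symbols pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k, upper-cased; empty if seq is shorter than k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference_kmers(genes: Iterable[str], k: int) -> Set[str]:
    """Collect k-mers from every gene on both strands."""
    ref: Set[str] = set()
    for gene in genes:
        ref |= kmers(gene, k)
        ref |= kmers(reverse_complement(gene), k)
    return ref


def filter_reads(
    reads: Iterable[Tuple[str, str]], ref_kmers: Set[str], k: int
) -> List[Tuple[str, str]]:
    """Keep (read_id, sequence) pairs that share at least one k-mer with the reference."""
    return [
        (read_id, seq)
        for read_id, seq in reads
        if not kmers(seq, k).isdisjoint(ref_kmers)
    ]
```

Because a single shared k-mer is enough to keep a read here, a short `k` or a repetitive gene makes chance matches more likely, which is the concern raised above about filtering against a reduced reference.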