Hello, I have specific genes for which I want to pull reads from FASTA or FASTQ files (I am not sure which would be better). How would I do that? Thank you!
Use bbduk.sh from the BBMap suite in filter mode. Guide available here: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/ It will work with any kind of reads, and you can output the data as FASTA if you wish.
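To make the idea behind filtering against a gene reference concrete, here is a minimal Python sketch of k-mer based matching: a read is kept if it shares at least one k-mer with the gene sequence or its reverse complement. This is only an illustration of the general technique, not bbduk.sh's actual implementation; the function names, the record layout `(name, sequence, quality)`, and the choice of `k` are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Set, Tuple

# Translation table for complementing DNA bases (assumes plain A/C/G/T input).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq, upper-cased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(gene_seqs: Iterable[str], k: int) -> Set[str]:
    """Collect all k-mers from the genes and their reverse complements."""
    index: Set[str] = set()
    for gene in gene_seqs:
        for strand in (gene, reverse_complement(gene)):
            index.update(kmers(strand, k))
    return index


def read_matches(read_seq: str, index: Set[str], k: int) -> bool:
    """True if the read shares at least one k-mer with the index."""
    return any(kmer in index for kmer in kmers(read_seq, k))


def filter_reads(
    records: Iterable[Tuple[str, str, str]], index: Set[str], k: int
) -> Iterator[Tuple[str, str, str]]:
    """Yield only the (name, sequence, quality) records that match the index."""
    for name, seq, qual in records:
        if read_matches(seq, index, k):
            yield name, seq, qual
```

A short `k` makes spurious hits more likely, because a read from elsewhere in the genome only needs one shared k-mer to be kept; a longer `k` is stricter but less tolerant of sequencing errors.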
That said, if the original data came from a whole genome, filtering against a reduced representation (e.g. only your genes of interest) always carries the risk that some reads get pulled in by chance. If you have a reference available, then aligning to the complete genome and then extracting reads (as suggested already) would be the cleanest way to do this. bbmap.sh, the aligner, can help with that.
Is there a reason you want to do this? A couple of ways to proceed. One way is to use your gene sequence as an alignment index, and then align your fastq reads to the index. Keep all the reads that match the index. Another way is to map all the reads to the alignment index for the genome from which the gene was derived, and then keep all the reads that overlap with the gene coordinates (easy using samtools). Other than that, the only other solution consistent with the way you worded your question would be to use a direct pattern match between your fastq reads and the gene sequence - which seems exceedingly clunky and probably not what you're really after. What are you trying to achieve?
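As a rough illustration of the second approach (align to the whole genome, then keep reads overlapping the gene coordinates), the following Python sketch checks plain-text SAM alignment lines against a gene interval. In practice samtools does this for you; the sketch only shows the overlap logic. The function names are hypothetical, the gene interval is assumed to be 1-based and inclusive, and the input is assumed to be SAM alignment lines with header lines already removed.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
_REF_CONSUMING = set("MDN=X")


def reference_span(pos: int, cigar: str) -> tuple:
    """Return the 1-based inclusive (start, end) covered on the reference."""
    length = sum(
        int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING
    )
    return pos, pos + length - 1


def overlaps_gene(sam_line: str, chrom: str, start: int, end: int) -> bool:
    """True if a mapped SAM record overlaps chrom:start-end (1-based, inclusive)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]
    pos = int(fields[3])
    cigar = fields[5]
    # Skip unmapped reads (flag bit 0x4), other chromosomes, and missing CIGARs.
    if flag & 0x4 or rname != chrom or cigar == "*":
        return False
    aln_start, aln_end = reference_span(pos, cigar)
    return aln_start <= end and aln_end >= start
```

The names of the reads that pass `overlaps_gene` can then be used to pull the corresponding records from the original FASTQ file.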