I have multiple txt files containing headlists of multiple gene IDs. I have another file containing all the coding sequences along with the gene IDs. I am trying to loop over each file and extract the sequence for the IDs. For some reason, it is not working.
An ID file, of which there are ~500:
Gene.1070__Species1@61356__g.1070__m.1070 Gene.93__Species2@2041__g.93__m.93
The coding sequence file contains thousands of sequences, including:
Gene.1070__Species1@61356__g.1070__m.1070 ACATTC... Gene.93__Species2@2041__g.93__m.93 GGGG...
I have tried:
for file in *.txt; do
seqkit grep -f "$file" cds.cds -o "$file.fa"
done
This returns empty files. I have successfully used seqkit in the past and I cant figure out why this time it does not work.
I have tried to use the perl script subset_fasta.pl but I get an error and this returns empty files also. I have checked and the gene names in the headlist match the gene names in the coding sequences file. Can anybody help me figure out what is going wrong?
I have also tried:
seqtk subseq in.fq name.lst > out.fq but I cant get this to work on a single file.
hi cats123,
Welcome to the forum.
Please use the formatting bar (especially the
code
option) to present your post better. I've done it for you this time.Thank you!
Please paste several lines of the text file, and make sure you're using the latest version of seqkit.
Tips:
>
symbol is not part of the FASTA head.Please post what's inside both the files.
There is no error in the loop. Try looking at your input files cds.cds and txt files. Check if the headers are matching. Make sure that IDs in ID files do not have ">" before IDs. In addition, from your code, your output would have .txt.fa extension. Try this:
if echo is okay, remove echo. You can also take a line from echo and see if it works fine.
You can avoid loop by using parallel:
Remove dry-run if dry-run is okay.