Extracting fasta fsequences based on ID headlists
1
0
Entering edit mode
2.5 years ago
cats123 • 0

I have multiple txt files containing headlists of multiple gene IDs. I have another file containing all the coding sequences along with the gene IDs. I am trying to loop over each file and extract the sequence for the IDs. For some reason, it is not working.

An ID file, of which there are ~500:

Gene.1070__Species1@61356__g.1070__m.1070 Gene.93__Species2@2041__g.93__m.93

The coding sequence file contains thousands of sequences, including:

Gene.1070__Species1@61356__g.1070__m.1070 ACATTC... Gene.93__Species2@2041__g.93__m.93 GGGG...

I have tried:

for file in *.txt; do
   seqkit grep -f "$file" cds.cds -o "$file.fa"
done

This returns empty files. I have successfully used seqkit in the past and I cant figure out why this time it does not work.

I have tried to use the perl script subset_fasta.pl but I get an error and this returns empty files also. I have checked and the gene names in the headlist match the gene names in the coding sequences file. Can anybody help me figure out what is going wrong?

I have also tried:

seqtk subseq in.fq name.lst > out.fq but I cant get this to work on a single file.

seqkit fasta grep script • 1.5k views
ADD COMMENT
1
Entering edit mode

hi cats123,

Welcome to the forum.

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Please paste several lines of the text file, and make sure you're using the latest version of seqkit.

Tips: > symbol is not part of the FASTA head.

ADD REPLY
0
Entering edit mode

Please post what's inside both the files.

There is no error in the loop. Try looking at your input files cds.cds and txt files. Check if the headers are matching. Make sure that IDs in ID files do not have ">" before IDs. In addition, from your code, your output would have .txt.fa extension. Try this:

for file in *.txt; do echo seqkit grep -f $file cds.cds -o ${file/.txt/}.fa; done

if echo is okay, remove echo. You can also take a line from echo and see if it works fine.

You can avoid loop by using parallel:

$ parallel --plus --dry-run seqkit grep -f {} cds.cds -o {.}.fa ::: *.txt

Remove dry-run if dry-run is okay.

ADD REPLY
0
Entering edit mode
2.5 years ago

What you are doing in your script there is to loop over all files with extension .txt .

What you likely want to do is to loop over all those files but then use the content of each of those files as input to seqkit.

Is there a single ID or multiple in each of those files? If only one you can expand your loop to get for each file the first line for instance (which thus contains the single ID). If you have multiple IDs in each file you need to add a loop that will go over each of the entries in that file and pass that on to seqkit

ADD COMMENT
0
Entering edit mode

There are multiple IDs in each file. How can I do a loop like that? A loop in a loop?

ADD REPLY
0
Entering edit mode

yes exactly. something like this: (pseudo-code)

for file in *.txt
  for ID in $( cat "$file" )
    seqkit ...
  done
done
ADD REPLY
0
Entering edit mode

for file in *.fa_headlist.txt; do for line in $( cat "$file" ); do seqkit grep -f "$line" cds.cds -o "$file.fa"; done; done

This doesnt run and says no such file regarding each ID. Do you know what is wrong with this code? Thanks for all your help.

ADD REPLY
1
Entering edit mode

add echo statements in your loop to see what is in each variable. (eg. echo $file or echo $line )

ADD REPLY

Login before adding your answer.

Traffic: 2831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6