perl code to extract sequences from multi-line fasta works on all test files but not on research files
2
0
Entering edit mode
9.7 years ago
kdiaz17 • 0

I have headers from a BLAST output file that I would like to create a subset database from for use in HMMer, so I pulled some code from a class I had to create a subset fasta file from the main fasta by giving a perl script a list of headers and pulling sequences with matching headers from that fasta and creating a subset fasta with the matching sequences. It works on the class test files, but not on my own research files. I've tried different scripts with the same end and all have the same effect. Even if I just use one sequence that I know matches as the input fasta, the output is always empty once I run any of the scripts. I'm not sure what's wrong- my files appear to be formatted just like the test files (multi-line wrapping, > included in the headers file, not tab delimited, etc), and I can't tell if it's my input fasta or the headers file or what. It's not the different extensions either- it worked on other files. Anyone know what's wrong? It has to be something in my files themselves.

For reference my own files:

A couple of the scripts I've been using:

The test files:

perl fasta • 5.4k views
ADD COMMENT
0
Entering edit mode

Some of your links are dead. Empty or wrong files might be one of the problems.

ADD REPLY
0
Entering edit mode

Just checked them, they all work (just one wasn't, and it was a Dropbox parsing error) and permissions are there. These are also only copies of what I'm working off of, I only uploaded them to Dropbox for examples. They actual files are stored on my computer.

ADD REPLY
6
Entering edit mode
9.7 years ago
venu 7.1k

Perl is a very nice language. But when there are simple tools available for this kind of tasks, why to go for perl.

Try using faSomeRecords

Here you can find it.

$ chmod +x faSomeRecords

$ ./faSomeRecords input.faa ids.txt output.faa

Working properly with your files.

ADD COMMENT
0
Entering edit mode

I downloaded it and ran it, but all output files are empty (even my test files this time) even with downloading slightly different versions of it. Which is a shame, because it looks like it would clear this up right away. Thank you though.

ADD REPLY
3
Entering edit mode

Don't download different versions. Download the same. Remove '>' symbol from ids file. Then it will work.

ADD REPLY
0
Entering edit mode

It worked, thank you VERY much sir! Can't believe I didn't try removing the ">" before. But regardless, now I can actually get some serious work done this weekend. Thank you again for introducing me to this valuable tool, I appreciate it very much!

ADD REPLY
2
Entering edit mode
9.7 years ago

There is a program for this purpose called FilterByName, in the BBTools package. Usage:

filterbyname.sh in=sequences.fasta names=names.txt include=t out=filtered.fasta

It's very flexible - the sequences can be in fasta, fastq, scarf, sam, bam, and compressed in various formats. The names can be just about anything:

A name or list of literal names, comma-delimited, such as names=chr1,chr7,chr18

A file or list of files containing names (one name per line) such as names=names.txt

A list of sequence files, such as names=blacklist.fasta,junk.fastq,unmapped.sam

...or any combination of the above.

The include flag sets the mode. If include=t, then only sequences with matching names will be retained. If include=f, then only sequences with nonmatching names will be retained (so anything with a name matching your name list will be discarded).

ADD COMMENT
0
Entering edit mode

I did know about this tool. It looks really nice. I tried to run it and it runs, but I do not understand where is the output. This is what I obtain:

Input is being processed as unpaired
Time:               0.228 seconds.
Reads Processed:    62933     276.42k reads/sec
Bases Processed:    18378339     80.72m bases/sec
Reads Processed:    29
Bases Processed:    14320
ADD REPLY
0
Entering edit mode

Ah, sorry! I forgot to add the out= flag. I've fixed the command above.

ADD REPLY

Login before adding your answer.

Traffic: 2264 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6