Hi everybody,
I faced with a crazy problem during extraction of desired sequence. I want to extract some sequences within a fasta sequence file based on their header name, the list name contain 86563 header names, one per line. I used tools for this end, but the number of output derived from all tools were different and also not the same with the number of list name (86563). In detail, I used the awk command
awk '
NR==1{printf $0"\t";next}
{printf /^>/ ? "\n"$0"\t" : $0}
' test.fa | \
awk -F"\t" '
BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}
{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}
'
that returned 87812 sequence.
with the python script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Extract sequences from a fasta file if their name is in a 'wanted' file.
Wanted file contains one sequence name per line.
Usage:
%program <input_file> <wanted_file> <output_file>"""
import sys
import re
try:
from Bio import SeqIO
except:
print "This program requires the Biopython library"
sys.exit(0)
try:
fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file
except:
print __doc__
sys.exit(0)
wanted = set()
with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
for seq in fasta_sequences:
name = seq.id
if name in wanted and len(seq.seq.tostring()) > 0:
wanted.remove(name) # Output only the first appearance for a name
SeqIO.write([seq], f, "fasta")
I got 60432 sequence.
And with the another script:
#!/bin/bash
#filterbyname in=<infile> out=<outfile>
function usage(){
echo "
Written by Brian Bushnell
Last modified May 28, 2015
Description: Filters reads by name.
Usage: filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>
in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Other parameters and their defaults:
include=f Set to 'true' to include the filtered names rather than excluding them.
substring=f Allow one name to be a substring of the other, rather than a full match.
f: No substring matching.
t: Bidirectional substring matching.
header: Allow input read headers to be substrings of names in list.
name: Allow names in list to be substrings of input read headers.
casesensitive=t (case) Match case also.
ow=t (overwrite) Overwrites files that already exist.
app=f (append) Append to files that already exist.
zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f (interleaved) Determines whether INPUT file is considered interleaved.
names= A list of strings or files. The files can have one name per line, or
be a standard read file (fasta, fastq, or sam).
minlen=0 Do not output reads shorter than this.
Java Parameters:
-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
To read from stdin, set 'in=stdin'. The format should be specified with an extension, like 'in=stdin.fq.gz'
To write to stdout, set 'out=stdout'. The format should be specified with an extension, like 'out=stdout.fasta'
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
cd "$(dirname "$DIR")"
DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null
#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
z="-Xmx800m"
EA="-ea"
set=0
if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
usage
exit
fi
calcXmx () {
source "$DIR""/calcmem.sh"
parseXmx "$@"
if [[ $set == 1 ]]; then
return
fi
freeRam 800m 84
z="-Xmx${RAM}m"
z2="-Xms${RAM}m"
}
calcXmx "$@"
function filterbyname() {
#module unload oracle-jdk
#module unload samtools
#module load oracle-jdk/1.7_64bit
#module load pigz
#module load samtools
local CMD="java $EA $z -cp $CP driver.FilterReadsByName $@"
echo $CMD >&2
eval $CMD
}
filterbyname "$@"
with the above script, I got 89718 sequences.
Could you please help me why there is a such differences and also why none of scripts didn't give me the right number of sequences, which equal with the number of header name? Thanks so much in advance for your help and solving the issue
Could you post output of
sort -u IDs.txt | wc -l
andhead IDs.txt
andgrep -c "^>" your.fasta
andgrep "^>" your.fasta | sort -u | wc -l
?Thanks. Yes, of course,
The output of
sort -u IDs.txt | wc -l
was60497
the output of
grep -c "^>" file.fasta
was91578
and the output ofgrep "^>" file.fasta | sort -u | wc -l
was64142
.I'm so grateful for your helpful suggestion.
Well, these numbers probably explain the different outputs, e.g. one approach may fetch all seqs that have e.g. ">Contig1" header while another may only fetch the first seq with ">Contig1" header..
You're right, I also have for example contig1100, contig1101; however, the number of sequences that I got using the above scripts were not the same with the header name list! Could you please let me know what's your suggested approach to get the sequences of interest in this situation?
Are all the sequences with identical headers the same in your fasta file? If not.. you're out of luck.
How could any algorithm know which one of these you want?
If the names don't matter, you could parse your fasta into "header - tab" format:
Sort by colum 1, and then sort your ID list also and:
And then finally parse the output back to fasta format. This would work assuming that if you have e.g. two Contig1 in your id list, there are also exactly two Contig1 headers in your fasta file (and same rule must apply to all other redundant IDs/headers). Probably not the case, but who knows..
Actually, there is not identical header in my fasta file, there may be similar header, for example I have contig1, contig11, contig1100, contig1101, contig1102, which all begin with 1.
If there were no identical headers in your fasta file, then
grep -c "^>" file.fasta
(how many headers in file.fasta) andgrep "^>" file.fasta | sort -u | wc -l
(how many unique headers in file.fasta) would have returned the same number.grep "^>" file.fasta | sort | uniq -c | sort -k1,1gr | head
would show some of the most redundant headers and their counts..That's right, but when I visually see contig header in excel file, there is only for example one contig1. may be the all contigs names that begin with 1 (say contig11, contig1101, contig1102,...) considered identical by scripts and your command, is it right in your view?
No. Contig1 and Contig11 would not be considered identical by uniq and neither would contig1101 and contig1102. Just check the last command I posted. Also, I highly recommend that you stop using excel.
Hi my friend. Thanks for all your comments. that's completely right, there is redundant sequences based on the output of
grep "^>" file.fasta | sort | uniq -c | sort -k1,1gr | head
, which was:please accept my deep apologize, I'm just student and always learning. Actually, I got the fasta file from my friend, this fasta file is combination of different assemblies at various k-mer that subjected to cd-hit-est tool. In each (single) assembly, the header was named "contig", which could cause problem, and then all assemblies merged together. To solve such a problems, the contigs in each assembly should named with a specific name, which isn't the same with contigs name in other assembly, am I right?
Thanks
No problem. I as much as guessed that merging of assemblies was behind this. Your solution is also correct. Good luck.
Thanks so much my friend.