error in extracting one chromosome from multiple fasta files to a single fasta file
1
0
Entering edit mode
3.5 years ago
evafinegan • 0

Hello All,

I have multiple fasta files. I want to make output files for each chromosome from all the fasta files. For example, output file all_ch1.fasta will have ch1 sequences from all the fasta files and so on. I tried:

samtools faidx *fasta.gz ch1 > all_ch1.fasta

But I am getting this error:

[W::fai_get_val] Reference sample2.fasta.gz not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in sample2.fasta.gz

I checked sample2.fasta.gz file but it is not empty. Thank you for any help!

fasta • 3.1k views
ADD COMMENT
0
Entering edit mode

if this format is correct, ch1 is sample1_rhg1.0ch1 in sample1.fasta.gz and ch1 is sample2_rhg1.0ch1 in sample2.fasta.gz, try:

with seqkit (dry-run)

$ seqkit seq *.gz | seqkit split -di --id-regexp ".*0(.*)" 

New files would be in stdin.split directory. File names would be stdin.id_ch1.fasta, stdin.id_ch2.fasta for each ch and each fasta sequence name will be exactly as it is in each ch fasta. For eg. >sample1_rhg1.0ch1 and >sample2_rhg1.0ch1 for ch1. Download seqkit from https://bioinf.shenwei.me/seqkit/download/. Remove d from -di once you are okay with dry-run output.

without seqkit (assuming that sequences are flattened and all files have equal number ch entities)

$ zgrep "^>" sample1.fasta.gz | awk -F "\.[0-9]+" '{print $2}' | sort -Vu | while read line; do zgrep -A 1 -h $line *.gz > $line.fasta ; done

Files would be named ch1.fasta, ch2.fasta etc.

ADD REPLY
0
Entering edit mode
3.5 years ago

your running a command line like

samtools faidx *fasta.gz ch1 > all_ch1.fasta

so if you have multiple fasta.gz file in the directory (like sample1.fasta.gz sample2.fasta.gz sample3.fasta.gz ), the command above exapnds to

samtools faidx sample1.fasta.gz sample2.fasta.gz sample3.fasta.gz  ch1 > all_ch1.fasta

and so samtools search for sample2.fasta.gz in sample1.fasta.gz

so , may be , you want a loop like

for F in  *fasta.gz ; do samtools faidx $F ch1 ; done > all_ch1.fasta
ADD COMMENT
0
Entering edit mode

Thank you, Pierre! I tried and it gives an error:

[W::fai_get_val] Reference ch1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in ch1
[W::fai_get_val] Reference ch1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in ch1
[W::fai_get_val] Reference ch1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in ch1

The chromosome names in each of the fasta files are like: sample1_rhg1.0ch1 sample2_rhg1.0ch1 sample3_rhg1.0ch1

How can I mention that in the loop so that it recognizes ch1 in each of the fasta file? Thank you!

ADD REPLY
0
Entering edit mode

there is no ch1 in your fasta file.

check this with

grep  -F -w ch1 *.fai

just look at the fai files to find the correct name.

ADD REPLY
0
Entering edit mode

Yes, I checked that the

ch1 is sample1_rhg1.0ch1 in sample1.fasta.gz and
ch1 is sample2_rhg1.0ch1 in sample2.fasta.gz

How can I specify to read ch1 with different prefixes for each of the sample fasta file using the loop. Thank you!

ADD REPLY
0
Entering edit mode

for S in 1 2 ; do samtools faidx "sample${S}.fasta.gz" "sample${S}_rhg1.0ch1" ; done > all_ch1.fasta

ADD REPLY
0
Entering edit mode

Thank you! All the fasta files do not have the same prefix as "sample". So, based on your suggestion, I tried

for S in *fasta.gz ; do samtools faidx "${S}" "${S}ch1" ; done > all_ch1.fasta

But I got this error that it looks for sample1_rhg1.0.fasta.gzch01 instead of sample1_rhg1.0ch01

ADD REPLY
1
Entering edit mode

why would it look for sample1_rhg1.0.fasta.gzch01 when your file name is sample1.fasta.gz (looking at "${S}ch1" part of the code) ? From your code, it should be looking for sample1.fasta.gzch1, not sample1_rhg1.0.fasta.gzch01. Try following code: (from Pierre and mine)

$ for S in 1 2 3; do echo "sample${S}.fasta.gz" "sample${S}_rhg1.0ch1" ; done 

sample1.fasta.gz sample1_rhg1.0ch1
sample2.fasta.gz sample2_rhg1.0ch1
sample3.fasta.gz sample3_rhg1.0ch1

$ for S in *fasta.gz ; do  echo $S ${S/.fasta.gz/}_rhg1.0ch1 ; done  

sample1.fasta.gz sample1_rhg1.0ch1
sample2.fasta.gz sample2_rhg1.0ch1
sample3.fasta.gz sample3_rhg1.0ch1
ADD REPLY

Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6