Concatenate the chromosome files from hsa19 into a single FASTA file
10.3 years ago
Kato ▴ 60

Hi, I need to concatenate the chromosome files into a single FASTA file but I have a lot of files with strange names.

  • The typical chr*.fa
  • Random chr*_gl0000**_random.fa
  • ChrUn chrUn_gl0000**.fa

Note that * means any number

What is the correct order to concatenate these files with cat? I am working on making the reference for mapping small RNA.

Thanks in advance

reference Mapping miRNA • 8.4k views

Those are sequences that cannot be assigned to a specific chromosome; see here.

Question:
What is ChrUn?

Response:
ChrUn contains clone contigs that can't be confidently placed on a specific chromosome. For the chr_N_random and chrUn_random files, we essentially just concatenate together all the contigs into short pseudo-chromosomes. The coordinates of these are fairly arbitrary, although the relative positions of the coordinates are good within a contig. You can find more information about the data organization and format on the Data Organization and Format page.

I do not think the order of the sequences in the FASTA file matters (correct me if I am wrong).

What I did:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
cat *fa.gz > UCSC_hg19_genome.fa.gz
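
Concatenating the compressed files directly works because a gzip file may consist of several compressed members back to back, and decompression reads them as one stream. A minimal sketch that checks this on toy data; the temporary directory, the two tiny files, and the name combined.fa.gz are made up for illustration and are not the UCSC downloads:

```shell
# Toy stand-ins for two per-chromosome downloads (illustrative only).
workdir=$(mktemp -d)
cd "$workdir"
printf '>chr1\nACGT\n' | gzip -c > chr1.fa.gz
printf '>chr2\nTTGA\n' | gzip -c > chr2.fa.gz

# Concatenate the compressed files without decompressing them first.
cat chr*.fa.gz > combined.fa.gz

# Decompress the combined file and count the FASTA header lines.
n_headers=$(gzip -dc combined.fa.gz | grep -c '^>')
echo "sequences in combined file: $n_headers"
```

Some downstream tools expect an uncompressed FASTA, in which case the combined file can be decompressed once with gzip -d.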

Also, if this is small RNA, you will get a lot of multi-mapped reads if you include all the sequences. If you plan to remove those reads, it is better to use only the known chromosomes and maybe chrUn. But the latter normally contains ribosomal genes, which have other copies in the genome as well. So think about the mapping and annotation strategy you want to follow, and decide this beforehand.

8.9 years ago
Louis ▴ 160

cat * > genome.fa loses the numerical ordering (the shell expands the glob in lexicographical order instead), so you get: 1, 10-19, 2, 20-22, 3-9, M, other, X, Y. Some tools want the proper ordering (1-22, X, Y, M, other), which is doable in one line of bash:

echo "$(ls chr*.fa | sort -V | grep -vP 'chr[^XY\d]'; ls chr*.fa | sort -V | grep -vP 'chr[\dXY]')" | xargs cat > genome.fa
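
The same ordering can also be spelled out explicitly, which may be easier to read and to adapt. This is only a sketch under assumptions: the toy files created in a temporary directory stand in for the real per-chromosome files, and unlike the one-liner it puts the chrN_random files after chrM, together with chrUn, rather than next to their chromosome.

```shell
# Toy data standing in for the per-chromosome files (illustrative only).
workdir=$(mktemp -d)
cd "$workdir"
for name in chr1 chr2 chr10 chrX chrY chrM chr1_gl000191_random chrUn_gl000211; do
    printf '>%s\nACGT\n' "$name" > "$name.fa"
done

# Primary chromosomes in the order many tools expect: 1-22, X, Y, M.
primary=()
for c in {1..22} X Y M; do
    if [ -f "chr$c.fa" ]; then
        primary+=("chr$c.fa")
    fi
done

# Everything else (chrN_random, chrUn), version-sorted.
others=()
for f in $(ls chr*.fa | sort -V); do
    case "$f" in
        chr[0-9].fa|chr[0-9][0-9].fa|chrX.fa|chrY.fa|chrM.fa) ;;  # already listed
        *) others+=("$f") ;;
    esac
done

cat "${primary[@]}" "${others[@]}" > genome.fa

# Show the resulting sequence order.
grep '^>' genome.fa
```

The output name genome.fa does not start with chr, so it is not picked up by the chr*.fa glob if the commands are run again in the same directory.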
10.3 years ago
cat chr*.fa chr*_gl0000**_random.fa chrUn_gl0000**.fa > output.fasta

* means "anything", so it should work.

But keep it even simpler: cat *.fa > output.fasta


Take care: cat chr*.fa chr*_gl0000**_random.fa ... > output.fasta will put duplicates in the final output, because chr*.fa already matches every file, so the chr*_random and chrUn_* files are matched by more than one glob and are written twice.
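
A quick way to see the problem, and to check any concatenated reference for it, is to look for repeated header lines. The sketch below uses three toy files in a temporary directory and simplified glob patterns (chr*_random.fa, chrUn_*.fa); these are assumptions for illustration, not the exact patterns from the answer above.

```shell
# Toy data: one regular, one random, and one chrUn file (illustrative only).
workdir=$(mktemp -d)
cd "$workdir"
for name in chr1 chr1_gl000191_random chrUn_gl000211; do
    printf '>%s\nACGT\n' "$name" > "$name.fa"
done

# Overlapping globs: chr*.fa already matches every file.
cat chr*.fa chr*_random.fa chrUn_*.fa > overlapping.fasta

# A single glob writes each file exactly once.
cat chr*.fa > single.fasta

# List the header lines that occur more than once in each output.
dups_overlapping=$(grep '^>' overlapping.fasta | sort | uniq -d)
dups_single=$(grep '^>' single.fasta | sort | uniq -d)

echo "duplicated with overlapping globs:"
echo "$dups_overlapping"
echo "duplicated with a single glob: ${dups_single:-none}"
```

Running the same grep, sort, and uniq -d pipeline on a real reference before building an index should print nothing if every sequence name is unique.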
