Question

Concatenate the chromosome files from hsa19 into a single FASTA file

2

10.3 years ago

Kato ▴ 60

Hi, I need to concatenate the chromosome files into a single FASTA file, but I have a lot of files with strange names:

Typical: chr*.fa
Random: chr*_gl0000**_random.fa
ChrUn: chrUn_gl0000**.fa

Note that * means any number

What is the correct order to concatenate these files with cat? I'm working on building the reference for mapping small RNA.

Thanks in advance

reference Mapping miRNA • 8.4k views

updated 3.0 years ago by Ram 44k • written 10.3 years ago by Kato ▴ 60

2

Those are sequences that cannot be assigned to a specific chromosome. See here.

Question:
What is ChrUn?

Response:
ChrUn contains clone contigs that can't be confidently placed on a specific chromosome. For the chr_N_random and chrUn_random files, we essentially just concatenate together all the contigs into short pseudo-chromosomes. The coordinates of these are fairly arbitrary, although the relative positions of the coordinates are good within a contig. You can find more information about the data organization and format on the Data Organization and Format page.

I do not think the order of the sequences in the FASTA file matters (correct me if I am wrong).

What I did:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
cat *fa.gz > UCSC_hg19_genome.fa.gz
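
Concatenating the .gz files directly works because a gzip file may hold several compressed members back to back, and gzip decompresses them in sequence. A minimal sketch that checks this on two tiny made-up files (the names chrA/chrB and the sequences are placeholders, not UCSC data):

```shell
# Build two tiny gzipped FASTA files with made-up content.
tmp=$(mktemp -d)
printf '>chrA\nACGT\n' | gzip -c > "$tmp/chrA.fa.gz"
printf '>chrB\nTTGA\n' | gzip -c > "$tmp/chrB.fa.gz"

# Concatenate the compressed files as-is, the same way as the command above.
cat "$tmp"/chr*.fa.gz > "$tmp/genome.fa.gz"

# Check the archive and count the FASTA headers after decompression.
gzip -t "$tmp/genome.fa.gz"
n_headers=$(gzip -dc "$tmp/genome.fa.gz" | grep -c '^>')
echo "$n_headers"
rm -rf "$tmp"
```

If a downstream tool appears to read only the first chromosome from such a file, decompressing it and concatenating the plain .fa files is a safe fallback.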

updated 3.1 years ago by Ram 44k • written 10.3 years ago by Ming Tommy Tang ★ 4.5k

0

Also, if this is small RNA, you will get a lot of multi-mapped reads if you include all the chromosome files. If you plan to remove those reads, it is better to use only the known chromosomes and maybe chrUn. But the latter normally contains ribosomal genes, which have several copies in the genome as well. So you need to think about the mapping and annotation strategy you want to follow and decide this beforehand.
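
One way to build such a reduced reference is to name the wanted chromosomes explicitly instead of relying on a glob. This is only a sketch: `concat_primary` is a hypothetical helper, and including chrM among the "known" chromosomes is an assumption you may want to change.

```shell
# concat_primary DIR: hypothetical helper. Writes chr1-22, X, Y and M from DIR
# to stdout in that order, skipping any file that is not present.
concat_primary() {
    for c in $(seq 1 22) X Y M; do
        if [ -f "$1/chr$c.fa" ]; then
            cat "$1/chr$c.fa"
        fi
    done
}

# Demo on tiny made-up sequences in a scratch directory.
demo=$(mktemp -d)
for n in 1 2 10 X M Un_gl000211 1_gl000191_random; do
    printf '>chr%s\nACGT\n' "$n" > "$demo/chr$n.fa"
done
primary_headers=$(concat_primary "$demo" | grep '^>' | tr '\n' ' ')
echo "$primary_headers"
rm -rf "$demo"
```

On real data the call would look like `concat_primary . > primary.fa`, run from the directory holding the chr*.fa files.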

updated 3.0 years ago by Ram 44k • written 10.3 years ago by Lorena Pantano ▴ 380

Ram · Answer 1 · 2016-02-09

cat * > genome.fa will lose the numerical ordering (it uses lexicographical ordering instead), so you get: 1, 10-19, 2, 20-22, 3-9, M, other, X, Y. Some tools want the proper ordering (1-22, X, Y, M, other), which is doable in one line of bash:

echo "$(ls chr*.fa | sort -V | grep -vP 'chr[^X|Y|\d]'; ls chr*.fa | sort -V | grep -vP 'chr[\d|X|Y]')" | xargs cat > genome.fa
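
Because the patterns above are not anchored, files such as chr1_gl000191_random.fa fall into the first group and end up next to their chromosome. If you want strictly 1-22, X, Y, M first and every other file afterwards, a more explicit variant could look like the sketch below. `ordered_fasta_list` is a hypothetical helper, and it assumes GNU-style `sort -V`.

```shell
# ordered_fasta_list DIR: hypothetical helper. Prints the chr*.fa file names in
# DIR in the order 1-22, X, Y, M, then all remaining files, each exactly once.
ordered_fasta_list() {
    all=$(cd "$1" && ls chr*.fa | sort -V)
    # Numbered chromosomes only; anchoring excludes chr1_gl000191_random.fa etc.
    printf '%s\n' "$all" | grep -E '^chr[0-9]+\.fa$' || true
    for c in X Y M; do
        printf '%s\n' "$all" | grep -x "chr$c\.fa" || true
    done
    # Everything else (random, chrUn, haplotype files) in version-sort order.
    printf '%s\n' "$all" | grep -vE '^chr([0-9]+|X|Y|M)\.fa$' || true
}

# Demo on empty placeholder files in a scratch directory.
demo=$(mktemp -d)
for n in 1 2 10 X Y M Un_gl000211 1_gl000191_random; do
    : > "$demo/chr$n.fa"
done
order=$(ordered_fasta_list "$demo" | tr '\n' ' ')
echo "$order"
rm -rf "$demo"
```

From inside the directory with the real files, the list can be fed to cat with `ordered_fasta_list . | xargs cat > genome.fa`.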

Ram · Answer 2 · 2014-09-18

0

10.3 years ago

mickael.leclercq ▴ 30

cat chr*.fa chr*_gl0000**_random.fa chrUn_gl0000**.fa > output.fasta

* means "anything", so it should work.

But you can keep it even simpler: cat *.fa > output.fasta

updated 3.0 years ago by Ram 44k • written 10.3 years ago by mickael.leclercq ▴ 30

1

Take care that cat chr*.fa chr*_gl0000**_random.fa ... > output.fasta will put duplicates in the final output as files chrUn*_gl0000.blabla.fa are matched by multiple globs.

updated 3.0 years ago by Ram 44k • written 10.3 years ago by dariober 15k