I have a large text file with FASTA sequences (basically text) of multiple genes. I would like to split the file into multiple files, one per gene, with each file named after its gene.
The structure of the file looks like this:
file1.txt
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
I want two files with the outputs as:
PDGFRB|ENST00000522466.1.txt
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
and DGAT2|ENST00000604935.5.txt
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
I tried this. It splits the file but does not save the pieces into separate files named after the genes. It also gives the error 'ambiguous redirect'.
#!/bin/bash
IFS=">" read -r -d '' -a my_array < file1.txt
for element in "${my_array[@]}";
do
gene_name=$(echo "$element" | awk '{print $1}')
gene_name=$(echo "$gene_name" | cut -d $'\n' -f 1)
echo "$gene_name"
echo $"element" > $gene_name.txt
done
edited Nov 19, 2024 at 12:53 by Ed Morton
asked Nov 19, 2024 at 11:00 by user23441879
3 Answers
Using any awk:
$ awk -F'>' 'NF>1{ close(out); out=$2".txt" } { print > out }' file1.txt
$ head *\|*
==> DGAT2|ENST00000604935.5.txt <==
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
==> PDGFRB|ENST00000522466.1.txt <==
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
Did you consider awk for this task?
awk -F'\n' -v RS='>' '
FNR > 1 {
outFile = $1 ".txt";
printf("%s", RS $0) > outFile;
close(outFile);
}
' file1.txt
The idea is to consume the input file using > as the record separator (instead of the linefeed character). Each record will then contain the header (stripped of its leading >) in the first line and the whole sequence in the remaining lines. That makes the processing quite straightforward.
Now, the very first record is expected to be empty (or to contain comments), so you skip it using the condition FNR > 1.
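To see what awk actually hands you with that record separator, here is a small sketch that only prints the first line of each record. The sample headers (geneA|tx1, geneB|tx2) and the scratch directory are made up for illustration; they are not from the question.

```shell
#!/bin/bash
# Work in a throwaway directory with a tiny, made-up two-record FASTA file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>geneA|tx1\nACGT\nTTGA\n>geneB|tx2\nGGCC\n' > sample.fa

# RS='>' makes each FASTA entry one record; FS='\n' makes each line a field,
# so $1 is the header with its leading '>' already stripped.
records=$(awk -F'\n' -v RS='>' '{ printf "record %d header: [%s]\n", NR, $1 }' sample.fa)
printf '%s\n' "$records"
```

Record 1 is whatever precedes the first >, which is why the answer's script skips it with FNR > 1.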
ASIDE
Not that it is wrong, but do you really want to keep the | in the filenames?
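If you would rather not keep it, one possible variation (an assumption on my part, not something the question asked for) is to replace every character outside A-Za-z0-9._- with an underscore when building the file name, while leaving the header line inside the file untouched. The sample headers below are made up.

```shell
#!/bin/bash
# Work in a throwaway directory with a tiny, made-up two-record FASTA file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>geneA|tx1\nACGT\nTTGA\n>geneB|tx2\nGGCC\n' > file1.txt

awk -F'\n' -v RS='>' '
FNR > 1 {
    name = $1
    # Keep only portable filename characters; everything else becomes "_".
    gsub(/[^A-Za-z0-9._-]/, "_", name)
    outFile = name ".txt"
    # Write the record back with its leading ">" restored.
    printf("%s", RS $0) > outFile
    close(outFile)
}
' file1.txt

ls
```

With this sample the files come out as geneA_tx1.txt and geneB_tx2.txt. Note that two headers differing only in the replaced characters would map to the same name, and the later record would overwrite the earlier one.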
awk is the better way to do this, but if you're going to try using a while/read loop, you probably want to structure it like:
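while IFS= read -r line; do f=${line#>};
# a header line changes when its leading > is stripped: send stdout to that record's file
if ! test "$f" = "$line"; then exec > "$f.txt"; fi;
printf '%s\n' "$line";
done < input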
Note that if you do that in an interactive terminal, you'll want to either run it in a subshell or follow up with exec > /dev/tty or similar.
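As a rough sketch of the subshell option, assuming bash and a made-up sample file: wrapping the loop in parentheses confines the exec redirection to the subshell, so the calling shell's stdout is left alone.

```shell
#!/bin/bash
# Work in a throwaway directory with a tiny, made-up two-record FASTA file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>geneA|tx1\nACGT\nTTGA\n>geneB|tx2\nGGCC\n' > file1.txt

(
    while IFS= read -r line; do
        f=${line#>}
        # Header line: from here on, stdout of this subshell goes to a new file.
        if [ "$f" != "$line" ]; then exec > "$f.txt"; fi
        printf '%s\n' "$line"
    done < file1.txt
)

# Outside the parentheses stdout is unchanged, so this still reaches the terminal.
echo "done splitting"
```

Any lines that appear before the first header are printed to the original stdout rather than to a file.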
[...] | symbol in your file names. There are many good reasons for sticking with the portable filename character set and restricting your names to use A-Za-z0-9._- Personally, I also recommend against uppercase letters, since they will burn you when you go to a case-insensitive filesystem. Putting a pipe symbol in your filenames is just asking for trouble. – William Pursell Commented Nov 19, 2024 at 13:34
$"element" should be "$element"; with that fix your code works for me; and while not a problem in this case I'd opt to wrap the target file in double quotes, i.e., "$gene_name.txt", just to be safe; invoking 4 subshells for each pass through the loop isn't very efficient, and while there are a few ways to address this in bash, I'd opt for one of the awk solutions. – markp-fuso Commented Nov 19, 2024 at 14:17
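For reference, a minimal sketch of the question's loop with the quoting fixes from the comment above applied, assuming bash. Replacing the awk/cut calls with a parameter expansion and skipping the empty first element are my own variations, not something the comment spells out, and the sample headers are made up.

```shell
#!/bin/bash
# Work in a throwaway directory with a tiny, made-up two-record FASTA file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>geneA|tx1\nACGT\nTTGA\n>geneB|tx2\nGGCC\n' > file1.txt

# read returns non-zero when it hits end of input, but the array is still filled.
IFS='>' read -r -d '' -a my_array < file1.txt || true

for element in "${my_array[@]}"; do
    # Whatever precedes the first '>' is empty here, so skip it.
    [ -n "$element" ] || continue
    # Header = everything up to the first newline; no subshells needed.
    gene_name=${element%%$'\n'*}
    echo "$gene_name"
    # Restore the '>' removed by the split, and quote the target file name.
    printf '>%s' "$element" > "$gene_name.txt"
done
```

Quoting "$gene_name.txt" matters because an unquoted name that expands to more than one word is what makes bash report 'ambiguous redirect'.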