We have a large (5-6 million row) tab-delimited file, and the quality metrics that influence downstream filtering are recorded per line in an antiquated way.
Each row is a separate entity with one column of interest consisting of a very large string (think 200-10000 characters), e.g. "..,,,^A.tcCC$t*+2At".
I'm trying to get counts of specific cases, but I don't want to modify the string. It looks like gsub requires a replacement after the match, so I don't know whether it is appropriate here or whether sed could be used instead to get around the replacement.
One example of this count problem is wanting to print the count of any letter not preceded by "+num". I used gsub and left the replacement empty, but this would still count the letters preceded by ^ or $, which I'm trying to avoid:
gsub(/[+-][0-9]+[ACGTNacgtn]+/, "", reads);
Is gsub appropriate for this case, or should I try a different approach?
1 Answer
If you just want to count the occurrences, perhaps something like this would work?
If the column of interest is the first column ($1):
awk '{
    split($1, a, "")
    for (i = 3; i <= length(a); i++) {
        if ((a[(i - 2)] !~ /[+-]/ && a[(i - 1)] !~ /[0-9]/) && (a[i] ~ /[ACGTNacgtn]+/)) {
            cnt++
        }
    }
    print "line", NR, "count = " cnt
    cnt = 0
}' file
More info:
awk '{
    split($1, a, "")    # split $1 into an array with one character per element
    for (i = 3; i <= length(a); i++) {    # walk the array, starting at the 3rd character
        # count a[i] if the character two back is not + or -,
        # the previous character is not a digit, and a[i] is a letter
        if ((a[(i - 2)] !~ /[+-]/ && a[(i - 1)] !~ /[0-9]/) && (a[i] ~ /[ACGTNacgtn]+/)) {
            cnt++    # increase the count by 1
        }
    }
    print "line", NR, "count = " cnt    # print the count for each line
    cnt = 0    # reset the count before moving to the next line
}' file
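Splitting on an empty separator and calling length() on an array are common awk extensions rather than something every awk is guaranteed to support. If that is a concern, the same test can probably be written with substr() instead. This is only a sketch: the wrapper name count_letters_substr is made up for illustration, and it reads the string from its first argument instead of a file.

```shell
# count_letters_substr: hypothetical helper name used only for this sketch.
count_letters_substr() {
  printf '%s\n' "$1" | awk '{
    s = $1
    n = length(s)          # length of the string, not of an array
    cnt = 0
    for (i = 3; i <= n; i++) {
      # same three-character test as above, reading characters with substr()
      if (substr(s, i - 2, 1) !~ /[+-]/ &&
          substr(s, i - 1, 1) !~ /[0-9]/ &&
          substr(s, i, 1) ~ /[ACGTNacgtn]/)
        cnt++
    }
    print "line", NR, "count = " cnt
  }'
}

count_letters_substr '..,,,^A.tcCC$t*+2At'
```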
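With your example string "..,,,^A.tcCC$t*+2At" you get a count of 7: the A after "+2" is skipped, but the t that follows it and the letters after ^ and $ are still counted. Is this the expected output? Would this approach work for your use case?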
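On the original question about gsub: it returns the number of substitutions it made and accepts a target variable, so one option is to run it on a scratch copy and leave the field itself untouched. The following is a minimal sketch under that assumption. The helper name count_non_indel_letters and the variable tmp are made up for illustration, and the first regex is the one from the question.

```shell
# count_non_indel_letters: hypothetical helper name, not from the original post.
count_non_indel_letters() {
  printf '%s\n' "$1" | awk '{
    tmp = $1                                    # scratch copy; $1 itself is never modified
    gsub(/[+-][0-9]+[ACGTNacgtn]+/, "", tmp)    # drop "+num" / "-num" letter runs from the copy only
    n = gsub(/[ACGTNacgtn]/, "&", tmp)          # "&" puts each match back; n is the number of matches
    print n
  }'
}

count_non_indel_letters '..,,,^A.tcCC$t*+2At'
```

For the example string this prints 6, because only "+2At" is removed before counting. The letters after ^ and $ are still included; they could be stripped from the copy with one more gsub before the final count. Note that [ACGTNacgtn]+ consumes every letter after the number, which may remove too much if the number is meant as a length.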
Posted by: admin. Please credit the source when reposting: http://www.yc00.com/questions/1744129767a4559785.html
gsub() function you're using - you definitely need to show your efforts so far! And it would help to give a (small!) sample input and expected output, to help us understand what you're trying to achieve. – Toby Speight Commented Mar 27 at 7:43