linux - Sed/awk for character match not replacement - Stack Overflow

We have a large (5-6 million row) tab-delimited file, and the quality metrics that influence downstream filtering are recorded per line in an antiquated way.

Each row is a separate entity with one column of interest consisting of a very large string (think 200-10000 characters), e.g. "..,,,^A.tcCC$t*+2At".

I'm trying to get counts of specific cases, but I don't want to modify the string. It looks like gsub requires a replacement after the match so I don't know if this is appropriate to use or if sed is possible to use instead to get around replacement.

One example of this count problem is wanting to print the counts of any letter not preceded by "+num": I used gsub and left the replace section empty, but this would still count the letters preceded by ^ or $, which I'm trying to avoid.

gsub(/[+-][0-9]+[ACGTNacgtn]+/, "", reads);

Is gsub appropriate for this case or should I try a different approach?

asked Mar 26 at 19:31 by Kat Newcomer, edited Mar 27 at 7:46 by Toby Speight
  • 1 It is unclear what the desired end result is, please clarify. If you could also add a few example lines that could help, they don't have to be real. – Uberhumus Commented Mar 26 at 19:53
  • 1 In addition to @Uberhumus's request, please provide the expected results (written out by hand if your code doesn't yet produce them) and your attempts (copy/paste please, no screenshots). Read "minimal reproducible example" for how to post. – ticktalk Commented Mar 26 at 21:59
  • 1 It's not even clear which language's gsub() function you're using - you definitely need to show your efforts so far! And it would help to give a (small!) sample input and expected output, to help us understand what you're trying to achieve. – Toby Speight Commented Mar 27 at 7:43
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Bot Commented Mar 30 at 15:34

1 Answer

If you just want to count the occurrences, perhaps something like this would work?

If the column of interest is the first column ($1):

awk '{
    split($1, a, "")
    for (i = 3; i <= length(a); i++) {
        if ((a[(i - 2)] !~ /[+-]/ && a[(i - 1)] !~ /[0-9]/) && (a[i] ~ /[ACGTNacgtn]+/)) {
            cnt++
        }
    }
    print "line", NR, "count = " cnt
    cnt = 0
}' file

More info:

awk '{
    split($1, a, "")  # split $1 into an array with one character per entry
    for (i = 3; i <= length(a); i++) {  # walk the array, starting at the 3rd character
        # count a[i] if the character two positions back is not + or -,
        # the character one position back is not a digit, and a[i] is a base letter
        if ((a[(i - 2)] !~ /[+-]/ && a[(i - 1)] !~ /[0-9]/) && (a[i] ~ /[ACGTNacgtn]+/)) {
            cnt++  # increase the count by 1
        }
    }
    print "line", NR, "count = " cnt  # print the count for each line
    cnt = 0  # reset the count before moving to the next line
}' file

With your example string "..,,,^A.tcCC$t*+2At" you get a count of 7. Is this the expected output? Would this approach work for your use-case?
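
On the gsub question itself: in awk, gsub returns the number of substitutions it made, so running it on a copy of the field gives you a count while the original field stays untouched. Below is a minimal sketch along those lines, not a tested solution for your data. It assumes the string follows samtools-pileup-style conventions (the question does not confirm this): "^" is followed by one character that is not a base, and the number after "+" or "-" is the length of the letter run that belongs to it. It also assumes the column of interest is the first tab-separated field. The function name count_bases is made up for the example.

```shell
# Hypothetical helper (the name is made up for this sketch).
# Reads lines on stdin and prints one letter count per line for column 1.
count_bases() {
    awk -F'\t' '{
        s = $1                  # work on a copy; $1 itself is never modified
        gsub(/\^./, "", s)      # assumption: drop "^" plus the character right after it
        # Drop each indel: the sign, the length digits, then exactly that many characters
        while (match(s, /[+-][0-9]+/)) {
            n = substr(s, RSTART + 1, RLENGTH - 1) + 0
            s = substr(s, 1, RSTART - 1) substr(s, RSTART + RLENGTH + n)
        }
        # gsub returns how many matches it replaced, which is the count we want
        print gsub(/[ACGTNacgtn]/, "", s)
    }'
}

printf '%s\n' '..,,,^A.tcCC$t*+2At' | count_bases   # prints 5
```

For the example string this prints 5 (t, c, C, C, t): the A after "^" and the "At" after "+2" are skipped. "$" is left alone because, under the same assumption, it follows the letter it belongs to. If you do want to skip the character after "$" as well, the first pattern could become /[$^]./ instead.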
