linux - Sed/awk for character match not replacement - Stack Overflow

We have a large (5-6 million row) tab-delimited file, and the quality metrics that influence downstream filtering are recorded per line in an antiquated way.

Each row is a separate entity with one column of interest consisting of a very large string (think 200-10000 characters), e.g. "..,,,^A.tcCC$t*+2At".

I'm trying to get counts of specific cases, but I don't want to modify the string. It looks like gsub requires a replacement after the match so I don't know if this is appropriate to use or if sed is possible to use instead to get around replacement.

One example of this count problem is wanting to print the counts of any letter not preceded by "+num": I used gsub and left the replace section empty, but this would still count the letters preceded by ^ or $, which I'm trying to avoid.

gsub(/[+-][0-9]+[ACGTNacgtn]+/, "", reads);

Is gsub appropriate for this case or should I try a different approach?

Asked Mar 26 at 19:31 by Kat Newcomer; edited Mar 27 at 7:46 by Toby Speight

1 It is unclear what the desired end result is, please clarify. If you could also add a few example lines that could help, they don't have to be real. – Uberhumus Commented Mar 26 at 19:53
1 In addition to @Uberhumus's request, please provide the expected results of the process (work them out by hand if your code doesn't quite do what you need) and your attempts so far (copy/paste please, no screenshots). Also read "minimal reproducible example" on how to post. – ticktalk Commented Mar 26 at 21:59
1 It's not even clear which language's gsub() function you're using - you definitely need to show your efforts so far! And it would help to give a (small!) sample input and expected output, to help us understand what you're trying to achieve. – Toby Speight Commented Mar 27 at 7:43
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Bot Commented Mar 30 at 15:34


1 Answer


If you just want to count the occurrences, perhaps something like this would work?

If the column of interest is the first column ($1):

awk '{
    split($1, a, "")
    for (i = 3; i <= length(a); i++) {
        if ((a[(i - 2)] !~ /[+-]/ && a[(i - 1)] !~ /[0-9]/) && (a[i] ~ /[ACGTNacgtn]+/)) {
            cnt++
        }
    }
    print "line", NR, "count = " cnt
    cnt = 0
}' file

More info:

awk '{
    split($1, a, "")  # split $1 into an array with one character per entry
    for (i = 3; i <= length(a); i++) {  # walk the array from the 3rd character, so a letter in position 1 or 2 is never tested
        # count a[i] only if the character two back is not + or -, the character just before is not a digit, and a[i] itself is a letter
        if ((a[(i - 2)] !~ /[+-]/ && a[(i - 1)] !~ /[0-9]/) && (a[i] ~ /[ACGTNacgtn]+/)) {
            cnt++  # increase the count by 1
        }
    }
    print "line", NR, "count = " cnt  # print the count for each line
    cnt = 0  # reset count before moving to the next line
}' file

With your example string "..,,,^A.tcCC$t*+2At" you get a count of 7. Is this the expected output? Would this approach work for your use-case?
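On the gsub part of the question: gsub returns the number of substitutions it made, and you can run it on a copy of the field, so the original column is never modified. Below is a rough sketch of that idea, not a definitive solution. It assumes (my guess, the question does not confirm it) that "^" is always followed by exactly one character that should not be counted, that "$" is only a marker, and that the number after "+" or "-" says how many letters belong to it. The function name count_bases and its column argument are made up for illustration.

```shell
# Hypothetical helper: count base letters in one column of a tab-delimited file.
# Usage: count_bases [column] < file   (column defaults to 1)
count_bases() {
    awk -F'\t' -v col="${1:-1}" '
    {
        s = $col                 # work on a copy, the field itself is never changed
        gsub(/\^./, "", s)       # assumption: drop "^" plus the one character after it
        gsub(/\$/, "", s)        # assumption: "$" is only a marker, drop it

        # Drop "+N" or "-N" together with exactly N characters after the number.
        out = ""
        while (match(s, /[+-][0-9]+/)) {
            n = substr(s, RSTART + 1, RLENGTH - 1) + 0   # the number after the sign
            out = out substr(s, 1, RSTART - 1)           # keep everything before the sign
            s = substr(s, RSTART + RLENGTH + n)          # skip sign, number and N characters
        }
        s = out s

        cnt = gsub(/[ACGTNacgtn]/, "", s)   # gsub returns how many letters it matched
        print "line", NR, "count = " cnt
    }'
}

# Example with the string from the question:
printf '%s\n' '..,,,^A.tcCC$t*+2At' | count_bases 1
# prints: line 1 count = 5
```

Under those assumptions the example string gives 5 rather than 7, because the A after "^" and both letters of "+2At" are left out. It only uses gsub, match and substr, so it does not depend on splitting with an empty separator, which is a common extension rather than something every awk guarantees.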
