linux - Replace all instances of character in portion of string in bash - Stack Overflow


I need to replace all instances of a character (period in my case) in 1+ portions/segments/ranges of a string. I'm using Bash on Linux. Ideally the solution is in Bash, but if it's either not possible or terribly complex I can call any app commonly found on Linux (sed, Python, etc).

Example:

Starting String: "<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark> ..." .

Needed transformation: Replace all periods "." between <mark> and </mark> with the string "<wbr />" .

Desired Result: "<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>" .

EDITS:

The starting string will never contain <mark> or </mark> within a set of them (i.e., the range markers are never nested).

I'm asking for help with some built-in Bash capability to perform this. The obvious mechanism is to find <mark> and </mark>, and then perform substitution on the content between them. I know Bash can do offset finding (in an indirect way), and substitution. But can the substitution be performed on a subset of the string?

For the comments regarding parsing this as XML: I did not say this is XML so you should not assume it. Ultimately it's irrelevant to my question; the range markers can be anything.

Here's something I got working. It's not pure Bash, but it's simple.

while $(echo "${my_str}" | grep -E '<mark>[^.]*\.[^<]*</mark>' >/dev/null 2>&1) ; do
    my_str=$(echo "${my_str}" | sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g')
done

Asked Mar 24 at 17:02 by codesniffer; edited Mar 26 at 11:43.
  • 2 Do not parse XML with regex. Use an XML parser. – Léa Gris Commented Mar 24 at 18:26
  • 2 Post valid XML in your question. – Cyrus Commented Mar 24 at 18:34
  • 1 This quick hack (which absolutely will not work for general XML strings) may help to get you started on a pure Bash solution: tmp=$string; newstr=; while [[ $tmp == *'<mark>'*'</mark>'* ]]; do tmp2=${tmp#*<mark>*</mark>}; tmp3=${tmp%"$tmp2"}; tmp=$tmp2; tmp4=${tmp3%%<mark>*</mark>}; tmp5=${tmp3#"$tmp4"}; tmp5=${tmp5//./'<wbr />'}; tmp5="<begin>${tmp5#<mark>}"; tmp5="${tmp5%</mark>}</end>"; newstr+=$tmp4$tmp5; done; newstr+=$tmp; printf '%s\n' "$newstr" – pjh Commented Mar 24 at 19:25
  • @Shawn - good observation! I changed the tags mid-edit and missed some. I've corrected the Desired Result. – codesniffer Commented Mar 24 at 20:33
  • 1 you could start by replacing the while $(echo ... | grep ...); do with while grep -q -E '<mark>[^.]*\.[^<]*</mark>' <<< "${my_str}"; do to eliminate two subshell calls on each pass through the loop; the $(echo ... | sed ...) could be replaced with $(sed ... <<< "${my_str}") to eliminate another subshell, while this last subshell could be replaced with some creative parameter substitutions; though I'd look into how to compare ${my_str} to a regex and how that populates the BASH_REMATCH[] array, then the BASH_REMATCH[] results can be used to formulate the parameter substitution – markp-fuso Commented Mar 24 at 21:33
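
For reference, the here-string suggestion in the comment above might look like the sketch below. It keeps the grep and sed patterns from the question unchanged and uses the question's sample input (without the trailing " ..."); the `pattern` variable is introduced here only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: the question's grep/sed loop, with here-strings instead of echo pipes.
my_str='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

# Same extended regex as in the question (variable name is illustrative).
pattern='<mark>[^.]*\.[^<]*</mark>'

# Each pass replaces the first remaining period in every <mark> section.
while grep -q -E "$pattern" <<< "$my_str"; do
    my_str=$(sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g' <<< "$my_str")
done

printf '%s\n' "$my_str"
```

This still starts one sed process per pass, so it reduces the overhead of the original loop rather than removing it.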

5 Answers

Setup:

string='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

One bash solution:

regex='(<mark>[^<]*</mark>)'           # assumes no "<" between "<mark>" and "</mark>" tags
unset prev_string                      # used to test for a change to 'string'

# while we have a match and a change has been made to 'string' ...

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    # typeset -p BASH_REMATCH          # uncomment to see contents of the BASH_REMATCH[] array

    prev_string="${string}"

    # use nested parameter substitutions to make replacement

    string="${string/${BASH_REMATCH[1]}/${BASH_REMATCH[1]//\./<wbr \/>}}"
done

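NOTE: "${prev_string}" != "${string}" was added as a quick hack to ensure we don't go into an infinite loop when no modifications are made to string (e.g., no periods between the tags). A side effect is that the loop stops at the first such section, so any sections after it are left unprocessed.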

A variation on the above which adds a few CPU cycles while making the parameter substitutions easier to read and understand:

regex='(<mark>[^<]*</mark>)'
unset prev_string

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    old="${BASH_REMATCH[1]}"           # copy the match; makes follow-on commands a bit cleaner
    new="${old//\./<wbr \/>}"          # replace all periods with "<wbr />"

    prev_string="${string}"
    string="${string/${old}/${new}}"   # update "string" by replacing "${old}" with "${new}"
done

These both generate:

$ typeset -p string
declare -- string="<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>"
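
If some <mark> sections may contain no periods, the change-detection test above ends the loop at the first such section. One possible alternative, sketched here rather than taken from the answer, walks the string left to right using only parameter expansion. It assumes sections are never nested; the function name `mark_dots_to_wbr` and its local variables are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: replace "." with "<wbr />" only inside <mark>...</mark> sections,
# walking the string left to right with parameter expansion.
# Assumes sections are never nested. The function name is hypothetical.
mark_dots_to_wbr() {
    local rest=$1 result='' before inside
    local repl='<wbr />'
    while [[ $rest == *'<mark>'*'</mark>'* ]]; do
        before=${rest%%'<mark>'*}       # text before the first <mark>
        rest=${rest#"$before"}          # rest now starts at that <mark>
        inside=${rest%%'</mark>'*}      # "<mark>" plus the section content
        rest=${rest#"$inside"}          # rest now starts at "</mark>"
        result+=$before${inside//./$repl}
    done
    printf '%s\n' "$result$rest"        # append whatever follows the last section
}

mark_dots_to_wbr '<mark>a.b</mark> x. <mark>nodots</mark> y. <mark>c.d</mark> z.'
```

Each pass consumes one section, so the loop ends when no complete section remains, whether or not any period was replaced.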

Feed Perl from stdin or append a file name:

perl -pe 's%(<mark>.*?</mark>)% $1 =~ s|\.|<wbr />|gr %eg'

Output:

<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>

Source: https://unix.stackexchange/a/152623/74329

This is probably very inefficient, but it uses only a single regex to search and replace, with no loop needed. I am no expert in shell scripts, so I will not provide one, but this should work inside a Perl call.

Try matching:

([^.]+|\G)\.(?=(?:(?!<mark>).)+<\/mark>)

and replacing with:

$1<wbr />

See: regex101


Explanation

MATCH:

  1. Match all .:
  • ( ... ): Capture to group 1 either
    • [^.]+: anything but a dot
    • |\G: or the end of the last match
  • \.: then match a dot
  2. Ensure the dot is inside <mark> ... </mark> tags:
  • (?= ... ): Look ahead and assert
    • (?: ... )+: that you match anything
      • (?!<mark>).: but it cannot be <mark>.
    • <\/mark>: Find </mark>, ensuring that you must be inside the tag

REPLACE:

  • $1: Keep the first group (everything before a dot, but inside tag)
  • <wbr />: and replace the dots with <wbr />

Using any awk in any shell on all Unix boxes:

$ awk '
BEGIN {
    FS = OFS = "</mark>"
}
{
    for (i = 1; i <= NF; i++) {
        if ( match($i, /<mark>.*/) ) {
            tgt = substr($i, RSTART, RLENGTH)
            gsub(/\./, "<wbr />", tgt)
            $i = substr($i, 1, RSTART - 1) tgt
        }
    }
    print
}
' file
<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>

This Shellcheck-clean pure Bash code updates the value of the variable my_str:

tmp=$my_str
my_str=
while [[ $tmp =~ ^(.*)(\<mark\>.*\</mark\>)(.*)$ ]]; do
    tmp=${BASH_REMATCH[1]}
    my_str=${BASH_REMATCH[2]//./<wbr />}${BASH_REMATCH[3]}${my_str}
done
my_str=${tmp}${my_str}
  • The code makes no assumptions about characters between <mark> and </mark>. (E.g. < is OK.)
  • <mark>...</mark> substrings are processed right-to-left within the input string to work around the fact that matching of regular expressions in Bash is always greedy.
  • See mklement0's excellent answer to How do I use a regex in a shell script? for information about regular expressions in Bash.
  • See Substituting part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of the expansion mechanism (${var//old/new}) used in ${BASH_REMATCH[2]//./<wbr />}.
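
As a usage sketch, the same right-to-left loop can be wrapped in a function and tried on input that has a < inside a section. The function name `replace_dots_in_marks` is hypothetical, and the regex is stored in a variable here (a common Bash practice) instead of being escaped inline as in the code above.

```shell
#!/usr/bin/env bash
# Sketch: the right-to-left loop wrapped in a function (hypothetical name).
# The regex is kept in a variable so that "<" and ">" need no escaping.
replace_dots_in_marks() {
    local tmp=$1 out=''
    local re='^(.*)(<mark>.*</mark>)(.*)$'
    while [[ $tmp =~ $re ]]; do
        tmp=${BASH_REMATCH[1]}          # everything before the last section
        # Prepend the converted section and the text that followed it.
        out=${BASH_REMATCH[2]//./<wbr />}${BASH_REMATCH[3]}${out}
    done
    printf '%s\n' "${tmp}${out}"
}

# A "<" inside the first section does not confuse the matching.
replace_dots_in_marks '<mark>a<b.c</mark> x. <mark>d.e</mark>'
```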
