regex - How does C determine whether a character is lower case (islower or isupper)? - Stack Overflow

I was looking into GNU tr in bash on Debian Linux. The regex engine appears to have a [:lower:] and [:upper:] shorthand. The regex matches on "lowercase" and "uppercase" letters. The definition of these is not trivial: is Ñ an uppercase letter? (Examples here.)

It seems to map to an "islower" function which is defined in the C language, somehow.

https://en.cppreference/w/c/string/byte/islower

http://web.archive./web/20120308171350/https://cplusplus/reference/clibrary/cctype/islower/

Notice that what is considered a letter may depend on the locale being used; in the default C locale, a lowercase letter is any of: a b c d e f g h i j k l m n o p q r s t u v w x y z.

For a detailed chart on what the different ctype functions return for each character of the standard ASCII character set, see the reference for the <cctype> header.

https://github/coreutils/coreutils/blob/1f0bf8d7c4b7131c6a8762de02ea01affef4db65/src/tr.c#L392

I can't find where islower is defined; perhaps it lives within a specific C implementation (e.g. gcc).

It also appears to depend on the "locale". Is that decided at compile time or at run time? https://docs.oracle/cd/E19253-01/817-2521/overview-1002/index.html

asked Nov 16, 2024 at 15:53 by Atomic Tripod
  • That is highly dependent on implementation. Traditionally it was commonly done with an array, one element for each character in the full alphabet (so including control and non-printable characters, i.e. with 256 elements). Each element was a bit-mask, where a specific bit being set meant that the character was an upper-case character. For common 8-bit encodings it might still be handled that way. – Some programmer dude Commented Nov 16, 2024 at 15:58
  • That's helpful, but tr has some definition of islower for any character I give it. How does it determine it? Is there an example implementation I could look at? – Atomic Tripod Commented Nov 16, 2024 at 15:59
  • Remember that all GNU tools are open source, which means that the source is available to read. It's part of GNU coreutils, whose source is available from this GitHub repository. – Some programmer dude Commented Nov 16, 2024 at 16:05
  • And if it turns out that it's using the standard C isupper and islower, then the source for those is available as well. – Some programmer dude Commented Nov 16, 2024 at 16:12
  • There are files that contain the information for each locale installed on a system. See sourceware./glibc/wiki/Locales for an introduction. – Shawn Commented Nov 16, 2024 at 16:29

1 Answer

Which characters count as lower case letters in each locale is commonly fixed before compile time.

localeconv() [formatting of numeric quantities] exposes some locale attributes at run time, but not the determination of lower case.


The locale may be changed at run time with char *setlocale(int category, const char *locale);

At program startup, the equivalent of setlocale(LC_ALL, "C"); is executed.

At least 2 locales are defined:

  1. "C": A minimal C environment. This is defined in the spec with 'a' - 'z', and nothing else, as lower case letters.

  2. "": Implementation's native environment.

Some implementations allow for dozens of different locales. Some have only the minimal 2, and those 2 might use the same determination of lower case letters, so there is no functional difference between them.

Thus the behavior of islower() can change during a program's run.


Soapbox: C's locale is an initial attempt to localize code to various country/culture standards. Yet it is cumbersome, inadequate and incurs troubles with multi-threading. Proceed with caution.
