GNU Regex POSIX character classes

2021-01-12


Character classes are a feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.

A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of [:, a keyword denoting the class, and :]. The table below lists the character classes defined by the POSIX standard.

Class	Meaning
`[:alnum:]`	Alphanumeric characters; in the C locale, this is the same as `[0-9A-Za-z]`
`[:alpha:]`	Alphabetic characters (`[:lower:]` and `[:upper:]`); in the C locale, this is the same as `[A-Za-z]`
`[:blank:]`	Space and TAB characters
`[:cntrl:]`	Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL)
`[:digit:]`	Numeric characters. Digits: 0 1 2 3 4 5 6 7 8 9
`[:graph:]`	Characters that are both printable and visible (a space is printable but not visible, whereas an 'a' is both)
`[:lower:]`	Lowercase alphabetic characters
`[:upper:]`	Uppercase alphabetic characters
`[:print:]`	Printable characters (characters that are not control characters)
`[:punct:]`	Punctuation characters (characters that are not letters, digits, control characters, or space characters)
`[:space:]`	Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab)
`[:xdigit:]`	Characters that are hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

Notes:
Class	Note
`[:punct:]`	In the 'C' locale and ASCII character encoding, these are ```! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~```
`[:graph:]`	The union of `[:alnum:]` and `[:punct:]`
`[:print:]`	The union of `[:alnum:]`, `[:punct:]`, and the space character
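As a quick check of the tables above, the following sketch combines several classes inside bracket expressions. It assumes a `grep` that supports the `-o` option (GNU and BSD grep do) and pins the locale to C so the results are reproducible; the sample string and variable names are made up for illustration.

```shell
# Sample input (hypothetical); LC_ALL=C pins the locale for reproducible results.
line='Hello, World! 0x1F'

# Delete every punctuation character from the line.
no_punct=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[[:punct:]]//g')

# Two classes in one bracket expression: keep only uppercase letters and digits.
# grep -o prints each match on its own line; tr joins them back together.
upper_digits=$(printf '%s\n' "$line" | LC_ALL=C grep -o '[[:upper:][:digit:]]' | tr -d '\n')

printf '%s\n' "$no_punct"      # Hello World 0x1F
printf '%s\n' "$upper_digits"  # HW01F
```

Note that each class keeps its own `[:` and `:]` delimiters, and the whole list is still wrapped in one outer pair of brackets.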

Before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
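The comparison can be tried from the shell. This is a minimal sketch using `grep` rather than awk, with a made-up four-line sample. In the C locale the explicit range and the class select the same lines; in a locale whose alphabet includes other letters, only the class would be expected to pick them up.

```shell
# Hypothetical sample: four lines, two of which contain a letter or digit.
sample='abc123
---
foo_bar
   '

# grep -c counts matching lines. Note the double brackets: the outer pair is
# the bracket expression, the inner [:alnum:] is the character class.
alnum_lines=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[[:alnum:]]')

# The pre-POSIX spelling; identical to the class in the C locale.
range_lines=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[A-Za-z0-9]')

printf '%s\n' "$alnum_lines"   # 2
printf '%s\n' "$range_lines"   # 2
```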

Some utilities that match regular expressions provide a nonstandard [:ascii:] character class; awk does not. However, you can simulate such a construct using [\x00-\x7F]. This matches all values numerically between zero and 127, which is the defined range of the ASCII character set. Use a complemented character list ([^\x00-\x7F]) to match any single-byte characters that are not in the ASCII range.

Regexp Operators

GNU software that deals with regular expressions provides a number of additional regexp operators. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores ('_'):

Operator	Description	Note
`\s`	Matches any space character as defined by the current locale. Think of it as shorthand for `[[:space:]]`	gawk
`\S`	Matches any character that is not a space, as defined by the current locale. Think of it as shorthand for `[^[:space:]]`	gawk
`\w`	Matches any word-constituent character; that is, any letter, digit, or underscore. Think of it as shorthand for `[[:alnum:]_]`	gawk
`\W`	Matches any character that is not word-constituent. Think of it as shorthand for `[^[:alnum:]_]`	gawk
`\<`	Matches the empty string at the beginning of a word. For example, /\<away/ matches 'away' but not 'stowaway'	gawk
`\>`	Matches the empty string at the end of a word. For example, /stow\>/ matches 'stow' but not 'stowaway'	gawk
`\y`	Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, \yballs?\y matches either 'ball' or 'balls', as a separate word.	gawk
`\b`	Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, \brat\b matches the separate word 'rat'.	grep
`\B`	Matches the empty string that occurs between two word-constituent characters. For example, c\Brat\Be matches 'crate', but dirty \Brat doesn't match 'dirty rat'. \B is essentially the opposite of \y.	gawk
`\B`	Matches the empty string at any position that is not a word boundary; essentially the opposite of \b.	grep
`` \` ``	Matches the empty string at the beginning of a buffer (string), that is, the beginning of the whole input.	gawk, grep
`\'`	Matches the empty string at the end of a buffer (string), that is, the end of the whole input.	gawk, grep

Because ^ and $ always work in terms of the beginning and end of strings in awk, the \` and \' operators don't add any new capabilities for awk. They are provided for compatibility with other GNU software.

In other GNU software, the word-boundary operator is \b. However, that conflicts with the awk language’s definition of \b as backspace, so gawk uses a different letter. An alternative method would have been to require two backslashes in the GNU operators, but this was deemed too confusing. The current method of using \y for the GNU \b appears to be the lesser of two evils.
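The word-boundary examples from the table can be reproduced with GNU grep, which spells the boundary operator `\b`. This sketch assumes GNU grep (the backslash operators are GNU extensions, not POSIX); the sample lines and variable names are illustrative only.

```shell
# Hypothetical sample lines based on the examples in the table above.
words='away
stowaway
stow
crate
dirty rat'

# \< matches only at the start of a word: 'away' but not 'stowaway'.
word_start=$(printf '%s\n' "$words" | grep '\<away')

# \> matches only at the end of a word: 'stow' but not 'stowaway'.
word_end=$(printf '%s\n' "$words" | grep 'stow\>')

# \b matches at either edge of a word: 'rat' as a separate word.
whole_word=$(printf '%s\n' "$words" | grep '\brat\b')

# \B matches only inside a word: the 'rat' in 'crate'.
inside_word=$(printf '%s\n' "$words" | grep '\Brat\B')

printf '%s\n' "$word_start"   # away
printf '%s\n' "$word_end"     # stow
printf '%s\n' "$whole_word"   # dirty rat
printf '%s\n' "$inside_word"  # crate
```

In gawk, the patterns using `\b` would be written with `\y` instead, for the reason given above.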
