Regular Expressions Tutorial - part 1 - Basics of Regular Expressions

What is a Regular Expression


A regular expression (regex or regexp, sometimes called a rational expression) is a sequence of characters that defines a search pattern (a mask, if you like) for text strings. Regular expressions are used in search engines, in the search-and-replace dialogs of word processors and text editors, in text-processing utilities such as sed and AWK, and in lexical analysis. Many programming languages provide regex capabilities, either built in or via libraries.

Patterns


The phrase regular expressions, and consequently, regexes, is often used to mean the specific, standard textual syntax for representing patterns for matching text. Each character in a regular expression (that is, each character in the string describing its pattern) is either:

  • a metacharacter, which has a special meaning, or

  • a regular character, which has a literal meaning.

Every regular expression is therefore built from metacharacters, regular characters, or both.

For example, in the regex a. the a is a literal character which matches just 'a', and . is a metacharacter that matches any character except a newline. Therefore, this regex matches, for example, the text strings 'a ', 'ax', or 'a0'.
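The a. example can be tried out directly. The following is a minimal sketch, assuming Python's standard re module (the tutorial itself is not tied to any language); the helper name matches_a_dot is hypothetical and exists only for illustration.

```python
import re

# Pattern from the text: a literal 'a' followed by any character except a newline.
PATTERN = re.compile(r"a.")

def matches_a_dot(text):
    """Return True if the whole string matches the regex a. (hypothetical helper)."""
    return PATTERN.fullmatch(text) is not None

# The first three samples match; "a\n" and "a" do not.
for sample in ["a ", "ax", "a0", "a\n", "a"]:
    print(repr(sample), matches_a_dot(sample))
```

The last two samples fail for different reasons: in "a\n" the dot refuses to match the newline, and in "a" there is no second character for the dot to match at all.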

Simple regular expressions


The simplest regular expression is an ordinary letter, e.g. r. When a text is searched with this regular expression, the search simply looks for the letter "r". By default, as is usual in Unix, matching is case-sensitive. However, most utilities let you turn case sensitivity off.

Since even in the simplest cases a person usually seeks a word rather than a single letter, regular expressions can be chained (concatenated). The regular expression think actually represents a chain of five elementary single-letter regular expressions. The result is the behavior you would expect: the search looks for the word "think".
Simple word search is the most primitive but also the most common application of regular expressions.
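A plain word search and the effect of case sensitivity can be sketched as follows. This assumes Python's re module, where the re.IGNORECASE flag is the switch that turns case sensitivity off; other tools use their own options. The sample sentence is made up for the example.

```python
import re

text = "I think, therefore I am. Think twice."

# Five literal characters chained together: t, h, i, n, k.
case_sensitive = re.findall(r"think", text)

# Assumption: re.IGNORECASE is how Python's re module disables case sensitivity.
case_insensitive = re.findall(r"think", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['think']
print(case_insensitive)  # ['think', 'Think']
```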

How to find metacharacters in text


Metacharacters include \, ^, $, ., [, ], |, (, ), ?, *, +, {, } and more.

You may already have wondered: "What if I need to find a metacharacter, such as a dot?" More generally: how do you switch off the special meaning of a character? The general answer is the backslash. It is customary in Unix that putting a backslash in front of a special character disables its special behavior (and in some cases it does the opposite and gives an ordinary character a special meaning, as you will see later).

For example, to look for strings containing a.b, we have to use the regular expression a\.b. Another example: the regular expression \.\.\. looks for three consecutive dots in the text.
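The difference between an escaped and an unescaped dot can be checked with a short sketch, again assuming Python's re module. The variable names loose, strict and three_dots are chosen only for this example; re.escape is a standard-library function that adds the backslashes for you.

```python
import re

# Unescaped dot: any character (except a newline) may sit between a and b.
loose = re.compile(r"a.b")
# Escaped dot: only a literal '.' may sit between a and b.
strict = re.compile(r"a\.b")
# Three literal dots in a row.
three_dots = re.compile(r"\.\.\.")

print(bool(loose.search("axb")))           # True
print(bool(strict.search("axb")))          # False
print(bool(strict.search("a.b")))          # True
print(bool(three_dots.search("wait...")))  # True

# re.escape produces the escaped form of a literal string.
print(re.escape("a.b"))
```

Using raw strings (the r prefix) keeps the backslashes from being interpreted by Python itself before the regex engine sees them.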

Regex bracket expression


  • [ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c".

  • [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. Character ranges follow the order of the ASCII encoding, so [a-z] matches any lowercase English letter from a to z. Adding uppercase letters is easy: [a-zA-Z].

  • [^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.

  • - The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-] or [-abc]. Note that POSIX bracket expressions do not allow backslash escapes, although many other regex flavors do. The ] character can be included in a bracket expression if it is the first (after the ^, if present) character: []abc].

Square brackets [] form a special environment. Inside them, . represents an ordinary dot, and the special meaning of the other two special characters can be suppressed simply by where they are placed. The caret represents negation only if it comes first, and the dash serves as a range separator only if it has a character on both sides. For example, [.^az-] matches exactly one of the characters ".", "^", "-", "a" or "z".

If one of the allowed characters is a right square bracket, put it right after the opening bracket. For example, the regular expression [][] matches a left or a right bracket. If you wrote the characters inside the outer brackets in reverse order, [[]], the meaning would change radically: it would be interpreted as [[] immediately followed by ]. That would match only the string "[]".
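The bracket-expression rules above can be explored with a small sketch. It assumes Python's re module, whose bracket syntax agrees with the rules described here for these particular patterns (unlike POSIX, Python also accepts backslash escapes inside brackets, which this sketch does not use). The helper matched_chars is hypothetical.

```python
import re

def matched_chars(pattern, text):
    """Return every character of text matched by the bracket expression (hypothetical helper)."""
    return re.findall(pattern, text)

# A set mixed with a range.
print(matched_chars(r"[abcx-z]", "abcdwxyz"))  # ['a', 'b', 'c', 'x', 'y', 'z']
# A negated range: everything that is not a lowercase letter.
print(matched_chars(r"[^a-z]", "ab1C-d"))      # ['1', 'C', '-']
# Dot, caret (not first) and dash (last) are all literal here.
print(matched_chars(r"[.^az-]", "a.b^y-z"))    # ['a', '.', '^', '-', 'z']
# A ] placed right after the opening bracket is literal.
print(matched_chars(r"[][]", "x[y]z"))         # ['[', ']']
```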

Anchoring to the beginning and end of string


  • ^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

  • $ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

Examples:


  • .at matches any three-character string ending with "at", including "hat", "cat", and "bat".
  • [hc]at matches "hat" and "cat".
  • [^b]at matches all strings matched by .at except "bat".
  • [^hc]at matches all strings matched by .at other than "hat" and "cat".
  • ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
  • [hc]at$ matches "hat" and "cat", but only at the end of the string or line.
  • \[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
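The anchoring examples can be verified with a short sketch, assuming Python's re module. The word list and the helper found_in are made up for illustration; re.MULTILINE is assumed here as Python's way of getting the line-based behavior of ^ and $ that line-oriented tools have by default.

```python
import re

words = ["hat", "cat", "bat", "that", "hats", "[a]"]

def found_in(pattern):
    """Return the sample words in which the pattern is found (hypothetical helper)."""
    return [w for w in words if re.search(pattern, w)]

print(found_in(r"[hc]at"))   # found anywhere: ['hat', 'cat', 'that', 'hats']
print(found_in(r"^[hc]at"))  # only at the start: ['hat', 'cat', 'hats']
print(found_in(r"[hc]at$"))  # only at the end: ['hat', 'cat', 'that']
print(found_in(r"\[.\]"))    # escaped brackets are literal: ['[a]']

# Without re.MULTILINE, ^ and $ anchor to the whole string;
# with it, they anchor to every line.
lines = "hat\ncat\nbat"
print(re.findall(r"^[hc]at$", lines))                      # []
print(re.findall(r"^[hc]at$", lines, flags=re.MULTILINE))  # ['hat', 'cat']
```

Note that without an anchor the pattern [hc]at is also found inside longer words such as "that" and "hats", which is exactly what ^ and $ are meant to rule out.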

Summary


Regular expression   Match
\                    Escapes the special meaning of a metacharacter
^                    Start of string or line
$                    End of string or line
.                    Any single character except a newline
[ ]                  One character from the set inside the brackets
[abc]                A single character that is a, b, or c
[^abc]               Negated set: a single character that is not a, b, or c
[a-z]                A single lowercase letter from a to z
[A-Z]                A single uppercase letter from A to Z
[0-7]                A single digit from 0 to 7