Regular Expressions - Cheat Sheet

Deleted user #6
September 01, 2010, Software engineer

You can write regular expressions in several ways, e.g. /regex/ and %r{regex}.
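As a small illustration (the path string and variable names are made up for this sketch), both literal syntaxes describe the same pattern. %r{...} tends to be handy when the pattern itself contains slashes:

```ruby
# Both literals describe the same pattern; only the delimiters differ.
slash_form   = /\/users\/\d+/   # forward slashes must be escaped inside /.../
percent_form = %r{/users/\d+}   # no escaping of / needed inside %r{...}

path = "/users/42/edit"         # hypothetical example input

slash_result   = path[slash_form]     # => "/users/42"
percent_result = path[percent_form]   # => "/users/42"
```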

Remember that it is always a good idea to match a regex visually first.

Characters

Literal Characters

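Most characters match themselves literally. The following characters have a special meaning and must be escaped with a backslash to be matched literally:

[ ] \ ^ $ . | ? * + ( )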

Character Classes

[ae]            matches a or e, e.g. gr[ae]y matches grey or gray, but NOT graay or graey
[0-9]           matches a SINGLE digit in the range from 0 to 9
[0-9a-fA-F]     hexadecimal digit
[^x]            a leading ^ negates the character class, e.g. q[^x] matches qu in question, but NOT Iraq,
                since there is no character after the q for the negated character class to match
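The character classes above can be tried out in Ruby roughly like this. This is a minimal sketch; the variable names and the "x1F" input are chosen only for illustration:

```ruby
# [ae] accepts exactly one character, either a or e.
words = %w[grey gray graay graey]
gray_matches = words.grep(/gr[ae]y/)      # => ["grey", "gray"]

# A range-based class matches one hexadecimal digit at a time.
hex_digits = "x1F".scan(/[0-9a-fA-F]/)    # => ["1", "F"]

# A negated class still has to consume one character.
negated_hit  = "question"[/q[^x]/]        # => "qu"
negated_miss = "Iraq"[/q[^x]/]            # => nil (nothing follows the q)
```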

Shorthand Characters

  \d   matches a single character that is a digit
  \w   matches a word character (alphanumeric characters plus underscore)
  \s   matches white space character (includes tabs and line breaks)
  \t   matches tab character
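A short sketch of the shorthands in Ruby, using a made-up input line:

```ruby
line = "id_7\t42 apples"   # hypothetical input containing a tab and a space

digits    = line.scan(/\d+/)    # => ["7", "42"]
words     = line.scan(/\w+/)    # => ["id_7", "42", "apples"]  (underscore counts as a word character)
fields    = line.split(/\s+/)   # => ["id_7", "42", "apples"]  (tab and space are both white space)
tab_index = (line =~ /\t/)      # => 4
```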

Non-Printable Characters

  \xFF      matches the character with the hexadecimal code FF
  \uFFFF    matches the unicode character with the code point FFFF, e.g. \u20AC matches €
  .         matches any character, usually except line breaks (like [^\n] on Unix, [^\r\n] on Windows)
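For illustration, here is how these escapes behave in Ruby, where . does not match a newline by default. The strings are made up for this sketch:

```ruby
# \u20AC is the euro sign; the string uses the same escape to stay encoding-safe.
euro_index = ("5 \u20AC" =~ /\u20AC/)   # => 2

# \x41 is the hexadecimal code of "A".
letter_a = "A"[/\x41/]                  # => "A"

# Without further modifiers, the dot does not cross a line break in Ruby.
dot_across_newline = "a\nb"[/a.b/]      # => nil
dot_same_line      = "a-b"[/a.b/]       # => "a-b"
```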

Anchors

^     matches the start of a line
$     matches the end of a line
\A    matches the start of a string
\z    matches the end of a string

\b    matches a word boundary. A word boundary is a position
      between a character that can be matched by \w and a character
      that cannot be matched by \w.
      also matches at the start and/or end of the string if the first
      and/or last characters in the string are word characters.
\B    matches at every position where \b cannot match.
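The difference between line anchors and string anchors is easiest to see on a multi-line string. A minimal Ruby sketch with invented sample text:

```ruby
text = "first line\nsecond line"

line_starts  = text.scan(/^\w+/)    # => ["first", "second"]  (^ matches after every line break)
string_start = text.scan(/\A\w+/)   # => ["first"]            (\A matches only at the very beginning)
line_ends    = text.scan(/\w+$/)    # => ["line", "line"]
string_end   = text.scan(/\w+\z/)   # => ["line"]

# \b requires a word boundary, \B requires that there is none.
sentence    = "a cat scattered"
whole_word  = (sentence =~ /\bcat\b/)   # => 2 (the standalone word)
inside_word = (sentence =~ /\Bcat\B/)   # => 7 (the cat inside "scattered")
```

In Ruby, validating whole strings with ^ and $ is a common pitfall, since they also match around embedded line breaks; \A and \z are usually what is meant.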

Alternation

cat|dog will match cat in "About cats and dogs", if RegEx is applied again, it will match dog
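In Ruby, "applying the regex again" is what String#scan does. A small sketch:

```ruby
text = "About cats and dogs"

first_match = text[/cat|dog/]       # => "cat"  (the leftmost match wins)
all_matches = text.scan(/cat|dog/)  # => ["cat", "dog"]
```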

Quantifiers

?      zero or one time, e.g. colou?r matches colour or color
*      zero or more times
+      one or more times
{n,m}  between n and m times (inclusive); write it without a space after the comma
{n}    exactly n times

Examples

  • <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.
  • Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999.
  • Use \b[1-9][0-9]{2,4}\b to match a number between 100 and 99999.
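The examples above can be checked in Ruby along these lines. The input strings are invented for this sketch:

```ruby
spellings = "color colour".scan(/colou?r/)   # => ["color", "colour"]

tag = /<[A-Za-z][A-Za-z0-9]*>/
valid_tag   = "<h1>"[tag]                    # => "<h1>"
invalid_tag = "<1>"[tag]                     # => nil (a tag name must start with a letter)

# Exactly four digits, the first one not being 0.
four_digits = "Order 1234 and 12345 and 0999".scan(/\b[1-9][0-9]{3}\b/)
# => ["1234"]

# Three to five digits, the first one not being 0.
three_to_five = "99 100 99999 100000".scan(/\b[1-9][0-9]{2,4}\b/)
# => ["100", "99999"]
```

The \b anchors are what keep the pattern from matching just a part of a longer number such as 12345.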

Modes: Greedy, Lazy and Possessive

Example string: This test is a <EM>first</EM> test string.

  • greedy (default): * and + match as much as they can and backtrack only when the rest of the regex cannot match otherwise, i.e. the .* in /.*test/ will first consume the whole example string and then give characters back until test can match. The overall match is: This test is a <EM>first</EM> test

  • lazy (ungreedy): specified by adding a question mark to the quantifier. *? and +? match as little as possible, i.e. /.*?test/ will match This test.

  • possessive: specified by adding a plus sign to the quantifier. Reads like "greedy without backtracking": *+ and ++ match as much as they can and never give anything back, so the overall match fails if the rest of the regex cannot match afterwards, i.e. /\d++/ matches 333 whereas /\d++3/ does not match 333 at all. (A lazy /\d+?/ would only match 3.)
    Use it with caution. Mostly you'll want to use it for small expressions, e.g. for nested sub-regexes.
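A minimal Ruby sketch comparing the three modes on the example string. The <.+> patterns are an additional illustration that is not part of the list above:

```ruby
text = "This test is a <EM>first</EM> test string."

greedy = text[/.*test/]    # => "This test is a <EM>first</EM> test"
lazy   = text[/.*?test/]   # => "This test"

# The same difference with + inside the tags:
greedy_tag = text[/<.+>/]  # => "<EM>first</EM>"
lazy_tag   = text[/<.+?>/] # => "<EM>"

# Possessive: \d++ takes all digits and never gives one back.
possessive_ok   = "333"[/\d++/]    # => "333"
possessive_fail = "333"[/\d++3/]   # => nil (no digit is left for the final 3)
lazy_digit      = "333"[/\d+?/]    # => "3"
```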

For more details have a look at the card on quantifier modes.

Look-around

Look-arounds let you match depending on context. You can look behind or look ahead, each in a positive or a negative way. The look-around itself will not be part of the match.

  • Positive lookahead: /foo(?=bar)/ matches the foo in foobar but not in this food is bad
  • Negative lookahead: /otto(?!normal)/ matches the otto in ottomotor but not in ottonormalverbraucher
  • Positive lookbehind: /(?<=ma)kandra/ matches the kandra in makandra but not in kandra
  • Negative lookbehind: /(?<!foo)bar/ matches the bar in moo bar but not in foobar
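The four look-arounds can be checked in Ruby roughly like this. It is a sketch with illustrative test strings; note that the returned match never includes the text that was only looked at:

```ruby
positive_ahead_hit  = "foobar"[/foo(?=bar)/]                     # => "foo"
positive_ahead_miss = "this food is bad"[/foo(?=bar)/]           # => nil

negative_ahead_hit  = "ottomotor"[/otto(?!normal)/]              # => "otto"
negative_ahead_miss = "ottonormalverbraucher"[/otto(?!normal)/]  # => nil

positive_behind_hit  = "makandra"[/(?<=ma)kandra/]               # => "kandra"
positive_behind_miss = "kandra"[/(?<=ma)kandra/]                 # => nil

negative_behind_hit  = "moo bar"[/(?<!foo)bar/]                  # => "bar"
negative_behind_miss = "foobar"[/(?<!foo)bar/]                   # => nil
```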

Modifiers in Ruby

Add modifiers after the final slash, e.g. /Regex/im, or at the beginning of the regex, e.g. /(?i)regex/.

  • i: case insensitivity

  • m: make the .-character also match newlines. Note that this meaning of m is specific to Ruby: in JavaScript and Perl, m instead changes how ^ and $ behave, and the s modifier makes . match newlines.

  • o: evaluate string interpolation only once (e.g. /foo#{Counter.value}/)

  • x: ignore whitespace (and comments) inside the regex. Allows for definitions like this:

    /
      <
      (3)+    # repeating part
     \ you    # need to escape this space!
    /x
    

    Any whitespace you could have in regular regexes is eliminated before matching (/( ?= foo) bar/x is the same as /(?=foo)bar/). Hence to match spaces, you need to escape them.

    x can have surprising side effects around quantifiers: /foo +/x was observed to match foo and foofoo, as if the engine actually used /(?:foo)+/ for matching. Furthermore, /I sign in as ?/x was observed to match What do you expect with a match result of "" (as if the internal regex were /(?:Isigninas)?/). Apparently the engine eliminated whitespace from left to right and turned the resulting substrings into non-capturing groups before applying quantifiers. (This was observed in Ruby when this card was written and was not checked for other languages; verify the behavior with your own Ruby version before relying on it.)
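A short Ruby sketch of the i, m and x modifiers. The variable names are made up, and the x example reuses the definition shown above:

```ruby
# i: case insensitivity, as a trailing modifier or inline.
trailing_i = ("RegEx" =~ /regex/i)      # => 0
inline_i   = ("RegEx" =~ /(?i)regex/)   # => 0
no_flag    = ("RegEx" =~ /regex/)       # => nil

# m: in Ruby, lets the dot match newlines.
without_m = "a\nb"[/a.b/]    # => nil
with_m    = "a\nb"[/a.b/m]   # => "a\nb"

# x: whitespace and comments inside the pattern are ignored.
heart = /
  <
  (3)+    # repeating part
  \ you   # need to escape this space!
/x

heart_hit  = "<33 you"[heart]   # => "<33 you"
heart_miss = "<3you"[heart]     # => nil (the escaped space is required)
```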

Commonly used patterns

Posted to makandra dev (2010-09-01 16:44)