Regexes

Syntax

  • Regex("[regex]")
  • r"[regex]"
  • match(needle, haystack)
  • matchall(needle, haystack)
  • eachmatch(needle, haystack)
  • ismatch(needle, haystack)

Parameters

Parameter   Details
needle      the Regex to look for in the haystack
haystack    the text in which to look for the needle

Capture groups

The substrings captured by capture groups are accessible from RegexMatch objects using indexing notation.

For instance, the following regex parses North American phone numbers written in (555)-555-5555 format:

julia> phone = r"\((\d{3})\)-(\d{3})-(\d{4})"

Suppose we wish to extract the phone numbers from a text:

julia> text = """
       My phone number is (555)-505-1000.
       Her phone number is (555)-999-9999.
       """
"My phone number is (555)-505-1000.\nHer phone number is (555)-999-9999.\n"

Using the matchall function, we can get an array of the matched substrings:

julia> matchall(phone, text)
2-element Array{SubString{String},1}:
 "(555)-505-1000"
 "(555)-999-9999"

But suppose we want to access the area codes (the first three digits, enclosed in parentheses). Then we can use the eachmatch iterator:

julia> for m in eachmatch(phone, text)
           println("Matched $(m.match) with area code $(m[1])")
       end
Matched (555)-505-1000 with area code 555
Matched (555)-999-9999 with area code 555

Note here that we use m[1] because the area code is the first capture group in our regular expression. We can get all three components of the phone number as a tuple using a function:

julia> splitmatch(m) = m[1], m[2], m[3]
splitmatch (generic function with 1 method)

Then we can apply such a function to a particular RegexMatch:

julia> splitmatch(match(phone, text))
("555","505","1000")

Or we could map it across each match:

julia> map(splitmatch, eachmatch(phone, text))
2-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("555","505","1000")
 ("555","999","9999")
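Capture groups can also be given names, and every captured substring is available through the captures field of a RegexMatch. The following is a minimal sketch, assuming a recent Julia version; the group names area, exchange and line are illustrative and not part of the original example.

```julia
# Same phone regex as in this section, but with named capture groups.
# The names (area, exchange, line) are hypothetical, chosen for illustration.
phone = r"\((?<area>\d{3})\)-(?<exchange>\d{3})-(?<line>\d{4})"
text = "My phone number is (555)-505-1000.\nHer phone number is (555)-999-9999.\n"

m = match(phone, text)

# A named group can be indexed by name or by position.
area_by_name = m[:area]
area_by_index = m[1]

# `captures` holds all captured substrings of this match, in order.
parts = m.captures

println("area code: ", area_by_name)
println("all parts: ", parts)
```

Named groups tend to be easier to maintain than positional indices when a regex grows, since adding a group does not shift the names of the others.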

Finding matches

There are four primary functions for working with regular expressions, all of which take arguments in needle, haystack order. The terms "needle" and "haystack" come from the English idiom "finding a needle in a haystack". In the context of regexes, the regex is the needle and the text is the haystack.

The match function can be used to find the first match in a string:

julia> match(r"(cat|dog)s?", "my cats are dogs")
RegexMatch("cats", 1="cat")

The matchall function can be used to find all matches of a regular expression in a string:

julia> matchall(r"(cat|dog)s?", "The cat jumped over the dogs.")
2-element Array{SubString{String},1}:
 "cat" 
 "dogs"

The ismatch function returns a boolean indicating whether a match was found inside the string:

julia> ismatch(r"(cat|dog)s?", "My pigs")
false

julia> ismatch(r"(cat|dog)s?", "My cats")
true

The eachmatch function returns an iterator over RegexMatch objects, suitable for use with for loops:

julia> for m in eachmatch(r"(cat|dog)s?", "My cats and my dog")
           println("Matched $(m.match) at index $(m.offset)")
       end
Matched cats at index 4
Matched dog at index 16
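The transcripts in this topic were recorded on an older Julia release. In Julia 1.0 and later, ismatch and matchall are no longer available: occursin takes the place of ismatch, and the matched substrings can be collected from eachmatch instead of calling matchall. The sketch below assumes Julia 1.0 or newer and reuses the regex from this section.

```julia
pattern = r"(cat|dog)s?"

# occursin replaces ismatch; the regex is still the first argument.
has_pet = occursin(pattern, "My cats")
no_pet = occursin(pattern, "My pigs")

# Collecting m.match from eachmatch replaces matchall.
all_matches = [m.match for m in eachmatch(pattern, "The cat jumped over the dogs.")]

# match returns nothing when there is no match, so check before indexing.
m = match(pattern, "My pigs")
found = m === nothing ? "no match" : m.match

println(has_pet, " ", no_pet)
println(all_matches)
println(found)
```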

Regex literals

Julia supports regular expressions[1], using the PCRE library as the regex implementation. Regexes are like a mini-language within a language. Because most languages and many text editors provide some regex support, general documentation and examples of regex usage are outside the scope of this example.

It is possible to construct a Regex from a string using the constructor:

julia> Regex("(cat|dog)s?")

But for convenience and easier escaping, the @r_str string macro can be used instead:

julia> r"(cat|dog)s?"
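To illustrate the "easier escaping" point, the sketch below builds the same pattern both ways and compares the resulting pattern strings. It also shows one case where the constructor remains useful: assembling a pattern at run time. The variable names are illustrative only.

```julia
# With the constructor, the pattern is an ordinary string literal,
# so every backslash has to be doubled.
from_string = Regex("\\((\\d{3})\\)")

# With the r"..." macro, backslashes reach the regex engine as written.
from_macro = r"\((\d{3})\)"

# Both forms produce the same pattern text.
same_pattern = from_string.pattern == from_macro.pattern
println(same_pattern)

# The constructor is handy when part of the pattern is only known at run time.
animal = "cat"                       # hypothetical run-time input
dynamic = Regex("($animal|dog)s?")
first_match = match(dynamic, "my cats are dogs").match
println(first_match)
```

Interpolating text into a pattern this way treats it as regex syntax, so input containing characters such as ( or . would need to be escaped first.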

1: Technically, Julia supports regexes, which are distinct from and more powerful than what are called regular expressions in language theory. In practice, the term "regular expression" is frequently used to refer to regexes as well.



2016-09-11
Julia Language Pedia