String Normalization

Syntax

  • normalize_string(s::String, ...)

Parameters

Parameter        Details
casefold=true    Fold the string to a canonical case, as defined by the Unicode standard.
stripmark=true   Strip diacritical marks (i.e. accents) from characters in the input string.
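
As a minimal sketch, the two keywords can be tried separately and together. The compatibility shim at the top is an addition, not part of the original page: it assumes that on Julia 1.0 and later the function is available as Unicode.normalize in the Unicode standard library, and aliases it to the name used here. The variable names are illustrative only.

```julia
# Assumption: on Julia 1.0+, normalize_string is provided as Unicode.normalize.
if VERSION >= v"1.0"
    using Unicode
    const normalize_string = Unicode.normalize
end

# Each keyword on its own, then both combined.
folded   = normalize_string("JuLiA", casefold=true)    # case differences removed
stripped = normalize_string("JúLiA", stripmark=true)   # accent removed, case kept
both     = normalize_string("JúLiA", casefold=true, stripmark=true)

println(folded)    # julia
println(stripped)  # JuLiA
println(both)      # julia
```

The keywords are independent, so passing only one of them leaves the other property of the string untouched.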

Case-Insensitive String Comparison

Strings can be compared with the == operator in Julia, but this is sensitive to differences in case. For instance, "Hello" and "hello" are considered different strings.

julia> "Hello" == "Hello"
true

julia> "Hello" == "hello"
false

To compare strings in a case-insensitive manner, normalize the strings by case-folding them first. For example,

equals_ignore_case(s, t) =
    normalize_string(s, casefold=true) == normalize_string(t, casefold=true)

This approach also handles non-ASCII Unicode correctly:

julia> equals_ignore_case("Hello", "hello")
true

julia> equals_ignore_case("Weierstraß", "WEIERSTRASS")
true

Note that in German, the uppercase form of the ß character is SS.
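
The following sketch illustrates why case folding is used here instead of simply lowercasing both strings. The version shim is an assumption for running on Julia 1.0 and later (where the function is named Unicode.normalize), and the variable names are illustrative.

```julia
# Assumption: on Julia 1.0+, normalize_string is provided as Unicode.normalize.
if VERSION >= v"1.0"
    using Unicode
    const normalize_string = Unicode.normalize
end

s, t = "Weierstraß", "WEIERSTRASS"

# Lowercasing works character by character, so ß is kept while SS becomes ss.
naive = lowercase(s) == lowercase(t)

# Case folding maps ß to ss, so both strings fold to the same text.
folded = normalize_string(s, casefold=true) == normalize_string(t, casefold=true)

println(naive)   # false
println(folded)  # true
```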

Diacritic-Insensitive String Comparison

Sometimes, one wants strings like "resume" and "résumé" to compare equal. That is, graphemes should be treated as equal when they share a base glyph and differ only in the marks added to it. Such a comparison can be accomplished by stripping diacritical marks before comparing.

equals_ignore_mark(s, t) =
    normalize_string(s, stripmark=true) == normalize_string(t, stripmark=true)

With this definition, the example above compares equal. It also works when the base characters themselves are non-ASCII, such as Greek letters carrying combining marks.

julia> equals_ignore_mark("resume", "résumé")
true

julia> equals_ignore_mark("αβγ", "ὰβ̂γ̆")
true
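
The two comparisons can also be combined. The sketch below is not from the original page: equals_ignore_case_and_mark is a hypothetical helper name, and the version shim assumes that Julia 1.0 and later expose the function as Unicode.normalize.

```julia
# Assumption: on Julia 1.0+, normalize_string is provided as Unicode.normalize.
if VERSION >= v"1.0"
    using Unicode
    const normalize_string = Unicode.normalize
end

# Hypothetical helper: ignore both case and diacritical marks when comparing.
equals_ignore_case_and_mark(s, t) =
    normalize_string(s, casefold=true, stripmark=true) ==
    normalize_string(t, casefold=true, stripmark=true)

println(equals_ignore_case_and_mark("Résumé", "RESUME"))   # true
println(equals_ignore_case_and_mark("resume", "resumes"))  # false
```

Passing both keywords in a single call normalizes each string once, which avoids chaining two separate normalization passes.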


2016-10-25
Julia Language Pedia