Regexes Got Good: The History And Future Of Regular Expressions In JavaScript
Steven Levithan
2024-08-20T15:00:00+00:00
2024-10-15T23:05:45+00:00

Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as I’ll show) for being difficult to write and understand.

This is especially true in JavaScript-land, where regexes languished for many years, underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.

In this article, I’ll recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).

The History of Regular Expressions in JavaScript

ECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.

But that was then.

Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?

Let’s take a look at them.

Don’t worry if it’s hard to understand what some of the following features mean; we’ll look more closely at several of the key features afterward.

- ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (`/[/]/`).
- ES6/ES2015 added two new regex flags: `y` (`sticky`), which made it easier to use regexes in parsers, and `u` (`unicode`), which added several significant Unicode-related improvements along with strict errors. It also added the `RegExp.prototype.flags` getter, support for subclassing `RegExp`, and the ability to copy a regex while changing its flags.
- ES2018 was the edition that finally made JavaScript regexes pretty good. It added the `s` (`dotAll`) flag, lookbehind, named capture, and Unicode properties (via `\p{...}` and `\P{...}`, which require ES6’s flag `u`). All of these are extremely useful features, as we’ll see.
- ES2020 added the string method `matchAll`, which we’ll also see more of shortly.
- ES2022 added flag `d` (`hasIndices`), which provides start and end indices for matched substrings.
- And finally, ES2024 added flag `v` (`unicodeSets`) as an upgrade to ES6’s flag `u`. The `v` flag adds a set of multicharacter “properties of strings” to `\p{...}`, multicharacter elements within character classes via `\p{...}` and `\q{...}`, nested character classes, set subtraction `[A--B]` and intersection `[A&&B]`, and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets `[^...]`.

As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag `v`, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.

Each edition from ES2019 to ES2023 also added more Unicode properties that can be used via `\p{...}` and `\P{...}`. And to be a completionist, ES2021 added string method `replaceAll`, although, when given a regex, the only difference from ES3’s `replace` is that it throws if not using flag `g`.

Aside: What Makes a Regex Flavor Good?

With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key aspects:

- Performance. This is an important aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8’s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control, a major limitation that makes ReDoS vulnerability more common.
- Support for advanced features that handle common or important use cases. Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter “properties of strings,” set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.
- Ability to write readable and maintainable patterns. Here, native JavaScript has long been the worst of the major flavors since it lacks the `x` (“extended”) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.

So, it’s a bit of a mixed bag.

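To make the features discussed above a little more concrete, here is a minimal sketch that exercises several of them in one script. The sample strings and the names used (`price`, `years`, `supportsFlagV`, and so on) are made up for illustration and do not come from any library. Since flag `v` is the newest addition, the sketch feature-detects it rather than assuming it is available.

```javascript
// ES2018: lookbehind (of any length in JavaScript) plus named capture.
const price = /(?<=\$\s*)(?<dollars>\d+)\.(?<cents>\d{2})/;
const { dollars, cents } = 'Total: $ 42.50'.match(price).groups;

// ES2018: Unicode properties via \p{...}, which require flag u (or v).
const startsUppercase = /^\p{Lu}/u.test('\u00C9dith'); // "Édith"

// ES2018: flag s (dotAll) lets the dot match line breaks as well.
const dotAcrossLines = /a.b/s.test('a\nb');

// ES2020 matchAll with ES2022 flag d: iterate over all matches and
// read the start and end indices of a named group.
const years = [];
for (const match of 'ES2018 and ES2024'.matchAll(/ES(?<year>\d{4})/dg)) {
  const [start, end] = match.indices.groups.year;
  years.push({ year: match.groups.year, start, end });
}

// ES2021 replaceAll: a regex argument without flag g throws a TypeError.
let replaceAllThrew = false;
try {
  'a-b-c'.replaceAll(/-/, '+');
} catch (err) {
  replaceAllThrew = err instanceof TypeError;
}

// ES2024 flag v: hypothetical helper that checks whether the engine accepts it.
function supportsFlagV() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

// Set subtraction with flag v: any decimal digit except ASCII 0-9.
let nonAsciiDigit = null;
if (supportsFlagV()) {
  const re = new RegExp(String.raw`^[\p{Decimal_Number}--[0-9]]+$`, 'v');
  nonAsciiDigit = {
    arabicIndic: re.test('\u0663'), // ARABIC-INDIC DIGIT THREE
    ascii: re.test('3'),
  };
}

console.log({
  dollars,
  cents,
  startsUppercase,
  dotAcrossLines,
  years,
  replaceAllThrew,
  nonAsciiDigit,
});
```

The `v` regex is built with the `RegExp` constructor on purpose: an unsupported flag in a regex literal is reported when the script is parsed, so the constructor keeps the rest of the sketch runnable on engines that predate ES2024.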