Regular Expressions

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec and test methods of the RegExp object, and with the match, replace, search, and split methods of the String object. This chapter describes JavaScript regular expressions.

Creating a regular expression

You construct a regular expression in one of two ways:

Using a regular expression literal, which consists of a pattern enclosed between slashes, as in the following example:

var re = /ab+c/;

Regular expression literals cause the regular expression to be compiled when the script is loaded. Use this form when the regular expression remains constant, for better performance.

Or calling the constructor function of the RegExp object, as in the following example:

var re = new RegExp("ab+c");

Using the constructor function causes the regular expression to be compiled at run time. Use the constructor function when you know in advance that the pattern will change, or when you do not know the pattern in advance and must obtain it from another source, such as user input.
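
As a rough illustration of the two approaches, the sketch below builds the same pattern both ways and then assembles a pattern from a string that is only known at run time. The variable names (literalRe, constructedRe, userText, dynamicRe) are made up for this example.

```javascript
// The same pattern written as a literal and built with the constructor.
var literalRe = /ab+c/;
var constructedRe = new RegExp("ab+c");

console.log(literalRe.source === constructedRe.source); // true
console.log(literalRe.test("xabbbcx"));                 // true
console.log(constructedRe.test("xabbbcx"));             // true

// A pattern fragment that is only known at run time
// (userText stands in for hypothetical user input).
var userText = "b+c";
var dynamicRe = new RegExp("a" + userText);
console.log(dynamicRe.test("abbc")); // true
```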

Writing a regular expression pattern

A regular expression pattern is composed of simple characters, such as /abc/, or a combination of simple and special characters, such as /ab*c/ or /Chapter (\d+)\.\d*/. The last example includes parentheses, which are used as a memory device. The match made with this part of the pattern is remembered for later use, as described in Using parenthesized substring matches.

Using simple patterns

Simple patterns are made of characters for which you want to find a direct match. For example, the pattern /abc/ matches character combinations in strings only when exactly the characters 'abc' occur together and in that order. Such a match would succeed in the strings "Hi, do you know your abc's?" and "The latest airplane designs evolved from slabcraft". In both cases the match is with the substring 'abc'. There is no match in the string 'Grab crab' because, although it contains the substring 'ab c', it does not contain the exact substring 'abc'.
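
The three strings discussed above can be checked directly. This is only a small sketch using the test method; the variable name simpleRe is chosen for illustration.

```javascript
var simpleRe = /abc/;

// Both strings contain the exact substring 'abc'.
console.log(simpleRe.test("Hi, do you know your abc's?"));                        // true
console.log(simpleRe.test("The latest airplane designs evolved from slabcraft")); // true

// 'Grab crab' contains 'ab c' (with a space), not 'abc'.
console.log(simpleRe.test("Grab crab")); // false
```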

Using special characters

When the search requires something more than a direct match, such as finding one or more consecutive b's, or finding white space, the pattern must include special characters. For example, the pattern /ab*c/ matches any character combination in which a single 'a' is followed by zero or more 'b's (* means 0 or more occurrences of the preceding item) and then immediately by a 'c'. In the string "cbbabbbbcdebc," the pattern matches the substring 'abbbbc'.
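
A short sketch of the /ab*c/ example, using exec to see which substring is matched and where. The names starRe and starResult are illustrative only.

```javascript
var starRe = /ab*c/;
var starResult = starRe.exec("cbbabbbbcdebc");

console.log(starResult[0]);    // "abbbbc"
console.log(starResult.index); // 3

// Zero 'b's is allowed, but the trailing 'c' is required.
console.log(starRe.test("ac")); // true
console.log(starRe.test("ab")); // false
```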

The following table provides a complete list and description of the special characters that can be used in regular expressions.

Special characters in regular expressions.
Character Meaning
\

Matches according to the following rules:

A backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally. For example, a 'b' without a preceding '\' generally matches lowercase 'b's wherever they occur. The sequence '\b', however, does not match any character by itself; it forms the special word boundary character.

A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally. For example, the pattern /a*/ relies on the special character '*' to match 0 or more consecutive a's. By contrast, the pattern /a\*/ forces the '*' to be interpreted literally, so that only strings like 'a*' match.

Do not forget to escape \ itself when using the RegExp("pattern") notation, because \ is also the escape character in strings.

^ Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

For example, /^A/ does not match the 'A' in "an A", but does match the 'A' in "An E".

The '^' has a different meaning when it appears as the first character in a character set pattern. See complemented character sets for details and examples.
$

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

For example, /t$/ does not match the 't' in "eater", but does match it in "eat".

*

Matches the preceding expression 0 or more times. Equivalent to {0,}.

For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted".

+

Matches the preceding expression 1 or more times. Equivalent to {1,}.

For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy", but nothing in "cndy".

? Matches the preceding expression 0 or 1 time. Equivalent to {0,1}.

For example, /e?le?/ matches the 'el' in "angel", the 'le' in "angle", and also the 'l' in "oslo".

If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the fewest possible characters), as opposed to the default, which is greedy (matching as many characters as possible). For example, applying /\d+/ to "123abc" matches "123". But applying /\d+?/ to that same string matches only the "1".

Also used in lookahead assertions, as described in the x(?=y) and x(?!y) entries of this table.
 
.

(The decimal point) matches any single character except the newline character.

For example, /.n/ matches 'an' and 'on' in "nay, an apple is on the tree", but not 'nay'.

(x)

Matches 'x' and remembers the match, as the following example shows. These parentheses are called capturing parentheses.

The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and remember the first two words in the string "foo bar foo bar". The \1 and \2 in the pattern match the string's last two words. Note that \1, \2, \n are used in the matching part of the regular expression. In the replacement part, the syntax $1, $2, $n must be used, for example: 'bar foo'.replace( /(...) (...)/, '$2 $1' ).

(?:x) Matches 'x' but does not remember the match. These parentheses are called non-capturing parentheses, and let you define subexpressions for regular expression operators to work with. Consider the sample expression /(?:foo){1,2}/. If the expression were /foo{1,2}/, the {1,2} characters would apply only to the last 'o' in 'foo'. With the non-capturing parentheses, the {1,2} applies to the entire word 'foo'.
x(?=y)

Matches 'x' only if 'x' is followed by 'y'. This is called a lookahead.

For example, /Jack(?=Sprat)/ matches 'Jack' only if it is followed by 'Sprat'. /Jack(?=Sprat|Frost)/ matches 'Jack' only if it is followed by 'Sprat' or 'Frost'. However, neither 'Sprat' nor 'Frost' is part of the match results.

x(?!y)

Matches 'x' only if 'x' is not followed by 'y'. This is called a negated lookahead.

For example, /\d+(?!\.)/ matches a number only if it is not followed by a decimal point. The regular expression /\d+(?!\.)/.exec("3.141") matches '141' but not '3.141'.

x|y

Matches either 'x' or 'y'.

For example, /green|red/ matches 'green' in "green apple" and 'red' in "red apple."

{n} Matches exactly n occurrences of the preceding expression. n must be a positive integer.

For example, /a{2}/ doesn't match the 'a' in "candy," but it does match all of the a's in "caandy," and the first two a's in "caaandy."
{n,m}

Where n and m are non-negative integers and n <= m. Matches at least n and at most m occurrences of the preceding expression. When m is omitted, it's treated as ∞.

For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it.

[xyz] Character set. This pattern type matches any one of the characters in the brackets, including escape sequences. Special characters like the dot (.) and asterisk (*) are not special inside a character set, so they don't need to be escaped. You can specify a range of characters by using a hyphen, as the following examples illustrate.

The pattern [a-d], which performs the same match as [abcd], matches the 'b' in "brisket" and the 'c' in "city". The patterns /[a-z.]+/ and /[\w.]+/ match the entire string "test.i.ng".
[^xyz]

A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. You can specify a range of characters by using a hyphen. Everything that works in the normal character set also works here.

For example, [^abc] is the same as [^a-c]. They initially match 'r' in "brisket" and 'h' in "chop."

[\b] Matches a backspace (U+0008). You need to use square brackets if you want to match a literal backspace character. (Not to be confused with \b.)
\b

Matches a word boundary. A word boundary matches the position where a word character is not followed or preceded by another word character. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero. (Not to be confused with [\b].)

Examples:
/\bm/ matches the 'm' in "moon" ;
/oo\b/ does not match the 'oo' in "moon", because 'oo' is followed by 'n' which is a word character;
/oon\b/ matches the 'oon' in "moon", because 'oon' is the end of the string, thus not followed by a word character;
/\w\b\w/ will never match anything, because a word character can never be followed by both a non-word and a word character.

Note: JavaScript's regular expression engine defines a specific set of characters to be "word" characters. Any character not in that set is considered a word break. This set of characters is fairly limited: it consists solely of the Roman alphabet in both upper- and lower-case, decimal digits, and the underscore character. Accented characters, such as "é" or "ü" are, unfortunately, treated as word breaks.

\B

Matches a non-word boundary. This matches a position where the previous and next character are of the same type: Either both must be words, or both must be non-words. The beginning and end of a string are considered non-words.

For example, /\B../ matches 'oo' in "noonday", and /y\B./ matches 'ye' in "possibly yesterday."

\cX

Where X is a character ranging from A to Z. Matches a control character in a string.

For example, /\cM/ matches control-M (U+000D) in a string.

\d

Matches a digit character. Equivalent to [0-9].

For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number."

\D

Matches any non-digit character. Equivalent to [^0-9].

For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number."

\f Matches a form feed (U+000C).
\n Matches a line feed (U+000A).
\r Matches a carriage return (U+000D).
\s

Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].

For example, /\s\w*/ matches ' bar' in "foo bar."

\S

Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].

For example, /\S\w*/ matches 'foo' in "foo bar."

\t Matches a tab (U+0009).
\v Matches a vertical tab (U+000B).
\w

Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_].

For example, /\w/ matches 'a' in "apple," '5' in "$5.28," and '3' in "3D."

\W

Matches any non-word character. Equivalent to [^A-Za-z0-9_].

For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%."

\n

Where n is a positive integer, a back reference to the last substring matching the nth parenthetical in the regular expression (counting left parentheses).

For example, /apple(,)\sorange\1/ matches 'apple, orange,' in "apple, orange, cherry, peach."

\0 Matches a NULL (U+0000) character. Do not follow this with another digit, because \0<digits> is an octal escape sequence.
\xhh Matches the character with the code hh (two hexadecimal digits).
\uhhhh Matches the character with the code hhhh (four hexadecimal digits).
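
Several entries from the table above can be tried out in a few lines. The sketch below only reuses patterns and strings already given in the table; the variable names are chosen for illustration.

```javascript
// Greedy versus non-greedy quantifiers.
var greedy = "123abc".match(/\d+/)[0];  // "123"
var lazy = "123abc".match(/\d+?/)[0];   // "1"

// Lookahead: the text inside (?=...) must follow, but is not part of the match.
var lookahead = /Jack(?=Sprat|Frost)/.exec("JackFrost")[0]; // "Jack"
var noLookahead = /Jack(?=Sprat|Frost)/.test("JackBlack");  // false

// Negated lookahead: digits that are not followed by a decimal point.
var negated = /\d+(?!\.)/.exec("3.141")[0]; // "141"

// Back references use \1, \2 inside the pattern and $1, $2 in a replacement.
var backRef = /(foo) (bar) \1 \2/.test("foo bar foo bar"); // true
var swapped = "bar foo".replace(/(...) (...)/, "$2 $1");   // "foo bar"

// Non-capturing parentheses apply {1,2} to the whole word 'foo'.
var grouped = /(?:foo){1,2}/.exec("foofoo")[0]; // "foofoo"
var ungrouped = /foo{1,2}/.exec("foofoo")[0];   // "foo"

// In a string passed to RegExp, the backslash itself must be escaped.
var boundary = new RegExp("\\bm").test("moon"); // true

console.log(greedy, lazy, lookahead, noLookahead, negated);
console.log(backRef, swapped, grouped, ungrouped, boundary);
```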

Escaping user input to be treated as a literal string within a regular expression can be accomplished by simple replacement:

function escapeRegExp(string){
  return string.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
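
To see why the escaping matters, the sketch below repeats the function so that it runs on its own and compares an escaped pattern with an unescaped one. The sample input "1+1=2" and the variable names are assumptions made for this example.

```javascript
function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical user input that happens to contain a special character.
var userInput = "1+1=2";
var escaped = escapeRegExp(userInput);
console.log(escaped); // 1\+1=2

// With escaping, the '+' is matched literally.
var literalSearch = new RegExp(escaped);
console.log(literalSearch.test("1+1=2")); // true
console.log(literalSearch.test("11=2"));  // false

// Without escaping, the '+' acts as a quantifier on the first '1'.
var unescapedSearch = new RegExp(userInput);
console.log(unescapedSearch.test("11=2"));  // true
console.log(unescapedSearch.test("1+1=2")); // false
```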

Using parentheses

Parentheses around any part of the regular expression pattern cause that part of the matched substring to be remembered. Once remembered, the substring can be recalled for other use, as described in Using Parenthesized Substring Matches.

For example, the pattern /Chapter (\d+)\.\d*/ illustrates additional escaped and special characters and indicates that part of the pattern should be remembered. It matches precisely the characters 'Chapter ' followed by one or more numeric characters (\d means any numeric character and + means 1 or more times), followed by a decimal point (which in itself is a special character; preceding the decimal point with \ means the pattern must look for the literal character '.'), followed by any numeric character 0 or more times (\d means numeric character, * means 0 or more times). In addition, parentheses are used to remember the first matched numeric characters.

This pattern is found in "Open Chapter 4.3, paragraph 6" and '4' is remembered. The pattern is not found in "Chapter 3 and 4", because that string does not have a period after the '3'.

To match a substring without causing the matched part to be remembered, within the parentheses preface the pattern with ?:. For example, (?:\d+) matches one or more numeric characters but does not remember the matched characters.
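
The behaviour described in this section can be checked with exec. This is a minimal sketch; the variable names are illustrative.

```javascript
var chapterRe = /Chapter (\d+)\.\d*/;

var chapterMatch = chapterRe.exec("Open Chapter 4.3, paragraph 6");
console.log(chapterMatch[0]); // "Chapter 4.3"
console.log(chapterMatch[1]); // "4" (the remembered digits)

// No period after the '3', so the pattern is not found.
var noMatch = chapterRe.exec("Chapter 3 and 4");
console.log(noMatch); // null

// With (?:...) the digits are matched but not remembered,
// so the result array holds only the full match.
var nonCapturing = /Chapter (?:\d+)\.\d*/.exec("Open Chapter 4.3, paragraph 6");
console.log(nonCapturing.length); // 1
```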

Working with regular expressions

Regular expressions are used with the RegExp methods test and exec and with the String methods match, replace, search, and split. These methods are explained in detail in the JavaScript reference.

Methods that use regular expressions
Method Description
exec A RegExp method that executes a search for a match in a string. It returns an array of information.
test A RegExp method that tests for a match in a string. It returns true or false.
match A String method that executes a search for a match in a string. It returns an array of information or null on a mismatch.
search A String method that tests for a match in a string. It returns the index of the match, or -1 if the search fails.
replace A String method that executes a search for a match in a string, and replaces the matched substring with a replacement substring.
split A String method that uses a regular expression or a fixed string to break a string into an array of substrings.

When you want to know whether a pattern is found in a string, use the test or search method; for more information (but slower execution) use the exec or match methods. If you use exec or match and if the match succeeds, these methods return an array and update properties of the associated regular expression object and also of the predefined regular expression object, RegExp. If the match fails, the exec method returns null (which coerces to false).
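
The difference between the yes/no style methods and the array-returning ones might look like this in practice. The sketch reuses the pattern and string from this section without the g flag; the variable names are illustrative.

```javascript
var sampleText = "cdbbdbsbz";
var sampleRe = /d(b+)d/;

// test and search only report whether (or where) a match exists.
var wasFound = sampleRe.test(sampleText);   // true
var foundAt = sampleText.search(sampleRe);  // 1
var missingAt = sampleText.search(/xyz/);   // -1

// exec and match return an array with details, or null on failure.
var details = sampleRe.exec(sampleText);    // ["dbbd", "bb"], index 1
var execFailure = /xyz/.exec(sampleText);   // null
var matchFailure = sampleText.match(/xyz/); // null

console.log(wasFound, foundAt, missingAt);
console.log(details[0], details[1], details.index);
console.log(execFailure, matchFailure);
```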

In the following example, the script uses the exec method to find a match in a string.

var myRe = /d(b+)d/g;
var myArray = myRe.exec("cdbbdbsbz");

If you do not need to access the properties of the regular expression, an alternative way of creating myArray is with this script:

var myArray = /d(b+)d/g.exec("cdbbdbsbz");

If you want to construct the regular expression from a string, yet another alternative is this script:

var myRe = new RegExp("d(b+)d", "g");
var myArray = myRe.exec("cdbbdbsbz");

With these scripts, the match succeeds and returns the array and updates the properties shown in the following table.

Results of regular expression execution.
Object Property or index Description In this example
myArray (the array itself) The matched string and all remembered substrings. ["dbbd", "bb"]
myArray index The 0-based index of the match in the input string. 1
myArray input The original string. "cdbbdbsbz"
myArray [0] The last matched characters. "dbbd"
myRe lastIndex The index at which to start the next match. (This property is set only if the regular expression uses the g option, described in Advanced Searching With Flags.) 5
myRe source The text of the pattern. Updated at the time that the regular expression is created, not executed. "d(b+)d"

As shown in the second form of this example, you can use a regular expression created with an object initializer without assigning it to a variable. If you do, however, every occurrence is a new regular expression. For this reason, if you use this form without assigning it to a variable, you cannot subsequently access the properties of that regular expression. For example, assume you have this script:

var myRe = /d(b+)d/g;
var myArray = myRe.exec("cdbbdbsbz");
console.log("The value of lastIndex is " + myRe.lastIndex);

// "The value of lastIndex is 5"

However, if you have this script:

var myArray = /d(b+)d/g.exec("cdbbdbsbz");
console.log("The value of lastIndex is " + /d(b+)d/g.lastIndex);

// "The value of lastIndex is 0"

The occurrences of /d(b+)d/g in the two statements are different regular expression objects and hence have different values for their lastIndex property. If you need to access the properties of a regular expression created with an object initializer, you should first assign it to a variable.
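
Keeping the regular expression in a variable also makes it possible to call exec repeatedly and walk through every match, because lastIndex is carried over between calls. The following is a sketch under that assumption; the input string with two matches is invented for this example.

```javascript
var myRe = /d(b+)d/g;
var text = "cdbbdbsbz dbd"; // hypothetical input containing two matches
var found = [];
var match;

// Each successful exec call resumes the search at myRe.lastIndex.
while ((match = myRe.exec(text)) !== null) {
  found.push(match[0] + " at " + match.index + ", next search starts at " + myRe.lastIndex);
}

console.log(found.join("\n"));
// dbbd at 1, next search starts at 5
// dbd at 10, next search starts at 13

// After a failed match, lastIndex is reset to 0.
console.log(myRe.lastIndex); // 0
```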

Using parenthesized substring matches

Including parentheses in a regular expression pattern causes the corresponding submatch to be remembered. For example, /a(b)c/ matches the characters 'abc' and remembers 'b'. To recall these parenthesized substring matches, use the Array elements [1], ..., [n].

The number of possible parenthesized substrings is unlimited. The returned array holds all that were found. The following examples illustrate how to use parenthesized substring matches.

The following script uses the replace() method to switch the words in the string. For the replacement text, the script uses the $1 and $2 in the replacement to denote the first and second parenthesized substring matches.

var re = /(\w+)\s(\w+)/;
var str = "John Smith";
var newstr = str.replace(re, "$2, $1");
console.log(newstr);

This prints "Smith, John".
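
The same remembered substrings are also available as array elements when exec is used, and as arguments when replace is given a function instead of a replacement string. A small sketch, with illustrative variable names:

```javascript
var nameRe = /(\w+)\s(\w+)/;

// Array elements [1] and [2] hold the parenthesized substring matches.
var parts = nameRe.exec("John Smith");
console.log(parts[0]); // "John Smith"
console.log(parts[1]); // "John"
console.log(parts[2]); // "Smith"

// A replacement function receives the full match followed by each group.
var shouted = "John Smith".replace(nameRe, function (whole, first, last) {
  return last.toUpperCase() + ", " + first;
});
console.log(shouted); // "SMITH, John"
```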

Advanced searching with flags

Regular expressions have four optional flags that allow for global, case-insensitive, multi-line, and sticky searching. These flags can be used separately or together in any order, and are included as part of the regular expression.

Regular expression flags
Flag Description
g Global search.
i Case-insensitive search.
m Multi-line search.
y Perform a "sticky" search that matches starting at the current position in the target string. See sticky

To include a flag with the regular expression, use this syntax:

var re = /pattern/flags;

or

var re = new RegExp("pattern", "flags");

Note that the flags are an integral part of a regular expression. They cannot be added or removed later.

For example, re = /\w+\s/g creates a regular expression that looks for one or more characters followed by a space, and it looks for this combination throughout the string.

var re = /\w+\s/g;
var str = "fee fi fo fum";
var myArray = str.match(re);
console.log(myArray);

This displays ["fee ", "fi ", "fo "]. In this example, you could replace the line:

var re = /\w+\s/g;

with:

var re = new RegExp("\\w+\\s", "g");

and get the same result.

The m flag is used to specify that a multiline input string should be treated as multiple lines. If the m flag is used, ^ and $ match at the start or end of any line within the input string instead of the start or end of the entire string.
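
The effect of the m flag is easiest to see side by side. The sketch below uses an invented three-line string, and also shows one way to obtain a differently flagged expression by building a new RegExp from the source of an existing one, since flags cannot be changed afterwards.

```javascript
var lines = "first line\nsecond line\nthird line";

// Without m, ^ matches only at the very start of the input.
var withoutM = lines.match(/^\w+/g); // ["first"]

// With m, ^ also matches right after each line break.
var withM = lines.match(/^\w+/gm);   // ["first", "second", "third"]

// The same applies to $ and the end of each line.
var endWithoutM = /line$/.test("first line\nsecond"); // false
var endWithM = /line$/m.test("first line\nsecond");   // true

// Flags are fixed at creation, so a new object is needed to change them.
var base = /^second/;
var relaxed = new RegExp(base.source, "im");
var baseResult = base.test("first\nSECOND");       // false
var relaxedResult = relaxed.test("first\nSECOND"); // true

console.log(withoutM, withM);
console.log(endWithoutM, endWithM, baseResult, relaxedResult);
```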

Examples

The following examples show some uses of regular expressions.

Changing the order in an input string

The following example illustrates the formation of regular expressions and the use of string.split() and string.replace(). It cleans a roughly formatted input string containing names (first name first) separated by blanks, tabs and exactly one semicolon. Finally, it reverses the name order (last name first) and sorts the list.

// The name string contains multiple spaces and tabs,
// and may have multiple spaces between first and last names.
var names = "Harry Trump ;Fred Barney; Helen Rigby ; Bill Abel ; Chris Hand ";

var output = ["---------- Original String\n", names + "\n"];

// Prepare two regular expression patterns and array storage.
// Split the string into array elements.

// pattern: possible white space then semicolon then possible white space
var pattern = /\s*;\s*/;

// Break the string into pieces separated by the pattern above and
// store the pieces in an array called nameList
var nameList = names.split(pattern);

// new pattern: one or more characters then spaces then characters.
// Use parentheses to "memorize" portions of the pattern.
// The memorized portions are referred to later.
pattern = /(\w+)\s+(\w+)/;

// New array for holding names being processed.
var bySurnameList = [];

// Display the name array and populate the new array
// with comma-separated names, last first.
//
// The replace method removes anything matching the pattern
// and replaces it with the memorized string—second memorized portion
// followed by comma space followed by first memorized portion.
//
// The variables $1 and $2 refer to the portions
// memorized while matching the pattern.

output.push("---------- After Split by Regular Expression");

var i, len;
for (i = 0, len = nameList.length; i < len; i++){
  output.push(nameList[i]);
  bySurnameList[i] = nameList[i].replace(pattern, "$2, $1");
}

// Display the new array.
output.push("---------- Names Reversed");
for (i = 0, len = bySurnameList.length; i < len; i++){
  output.push(bySurnameList[i]);
}

// Sort by last name, then display the sorted array.
bySurnameList.sort();
output.push("---------- Sorted");
for (i = 0, len = bySurnameList.length; i < len; i++){
  output.push(bySurnameList[i]);
}

output.push("---------- End");

console.log(output.join("\n"));

Using special characters to verify input

In the following example, the user is expected to enter a phone number. When the user presses the "Check" button, the script checks the validity of the number. If the number is valid (matches the character sequence specified by the regular expression), the script shows a message thanking the user and confirming the number. If the number is invalid, the script informs the user that the phone number is not valid.

The regular expression begins with a non-capturing group (?: ... ) that looks for either three digits \d{3} or, after the alternation |, a left parenthesis \( followed by three digits \d{3} and a right parenthesis \). The group is followed by one dash, forward slash, or decimal point, which is remembered by the capturing group ([-\/\.]), then three digits \d{3}, then the remembered separator \1, and finally four digits \d{4}.

The Change event activated when the user presses Enter sets the value of RegExp.input.

<!DOCTYPE html>
<html>  
  <head>  
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">  
    <meta http-equiv="Content-Script-Type" content="text/javascript">  
    <script type="text/javascript">  
      var re = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;  
      function testInfo(phoneInput){  
        var OK = re.exec(phoneInput.value);  
        if (!OK)  
          window.alert(phoneInput.value + " isn't a phone number with area code!");  
        else
          window.alert("Thanks, your phone number is " + OK[0]);  
      }  
    </script>  
  </head>  
  <body>  
    <p>Enter your phone number (with area code) and then click "Check".
        <br>The expected format is like ###-###-####.</p>
    <form action="#">  
      <input id="phone"><button onclick="testInfo(document.getElementById('phone'));">Check</button>
    </form>  
  </body>  
</html>
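
The same regular expression can be exercised outside a browser. The sketch below runs it against a few invented sample strings; checkPhone is a hypothetical helper, not part of the page above.

```javascript
var phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;

// Hypothetical helper: returns the matched number, or null if none is found.
function checkPhone(value) {
  var result = phoneRe.exec(value);
  return result ? result[0] : null;
}

console.log(checkPhone("555-123-4567"));   // "555-123-4567"
console.log(checkPhone("(555)/123/4567")); // "(555)/123/4567"
console.log(checkPhone("555.123.4567"));   // "555.123.4567"

// The second separator must repeat the first one (\1).
console.log(checkPhone("555-123.4567"));   // null

// At least one separator character is required.
console.log(checkPhone("5551234567"));     // null

// The pattern is not anchored with ^ or $, so a number inside
// a longer string is also accepted.
console.log(checkPhone("call 555-123-45678 now")); // "555-123-4567"
```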

Document Tags and Contributors

 Contributors to this page: enTropy
 Last updated by: enTropy,