PHP Regular Expression Tutorial: 2008

MySQL Regular Expressions Cheat Sheet

MySQL Regular Expressions

Regular Expressions in MySQL are used within the REGEXP and RLIKE sections of WHERE clauses in the selection of records for display, update or deletion. They use Henry Spencer's implementation, which is aimed at conformance with POSIX standard 1003.2, extended version.

Operator Type	Examples	Description
Literal Characters Match a character exactly	a A y 6 % @	Letters, digits and many special characters match exactly
	\$ \^ \+ \\ \?	Precede other special characters with a \ to cancel their regex special meaning
	\n \t \r	Literal new line, tab, return
	\cJ \cG	Control codes
	\xa3	Hex codes for any character
Anchors and assertions	^	Field starts with
	$	Field ends with
	[[:<:]]	Word starts with
	[[:>:]]	Word ends with
Character groups any 1 character from the group	[aAeEiou]	any character listed from [ to ]
	[^aAeEiou]	any character except aAeEio or u
	[a-fA-F0-9]	any hex character (0 to 9 or a to f)
	.	any character at all
	[[:space:]]	any space character (space \n \r or \t)
	[[:alnum:]]	any alphanumeric character (letter or digit)
Counts apply to previous element	+	1 or more ("some")
	*	0 or more ("perhaps some")
	?	0 or 1 ("perhaps a")
	{4}	exactly 4
	{4,}	4 or more
	{4,8}	between 4 and 8
	Add a ? after any count to turn it sparse (match as few as possible) rather than have it default to greedy
Alternation	\|	either, or
Grouping	( )	group for count and save to variable

The above list shows the most commonly used elements of MySQL regular expressions, and is not exhaustive.

POSIX Character Class Definitions

Value	Meaning
[:digit:]	Only the digits 0 to 9
[:alnum:]	Any alphanumeric character 0 to 9 OR A to Z or a to z.
[:alpha:]	Any alpha character A to Z or a to z.
[:blank:]	Space and TAB characters only.
[:xdigit:]	Hexadecimal notation 0-9, A-F, a-f.
[:punct:]	Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } \| ~
[:print:]	Any printable character.
[:space:]	Any whitespace characters (space, tab, NL, FF, VT, CR). Many system abbreviate as \s.
[:graph:]	Exclude whitespace (SPACE, TAB). Many system abbreviate as \W.
[:upper:]	Any alpha character A to Z.
[:lower:]	Any alpha character a to z.
[:cntrl:]	Control Characters NL CR LF TAB VT FF NUL SOH STX EXT EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.

These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d]

Example

Replace URLs with links

 $text = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]",
                     "\\0", $text);
?>

ereg_replace — Replace regular expression

ereg_replace — Replace regular expression
This function scans string for matches to pattern , then replaces the matched text
with replacement
string ereg_replace ( string $pattern , string $replacement , string $string )

pattern - A POSIX extended regular expression.

replacement - If pattern contains parenthesized substrings, replacement may contain substrings
of the form \\digit, which will be replaced by the text matching the digit'th parenthesized
substring; \\0 will produce the entire contents of string. Up to nine substrings may be used.
Parentheses may be nested, in which case they are counted by the opening parenthesis.

string -The input string.

Return Values
The modified string is returned. If no matches are found in string , then it will be returned
unchanged.
Example#1

$str= "This is a test";


echo str_replace(" is", " was", $str);
echo ereg_replace("( )is", "\\1was", $str);
echo ereg_replace("(( )is)", "\\2was", $str);
?>

prints "This was a test" three times:
Example#2

$s = "Coding PHP is fun.";
$pattern = "(.*)PHP(.*)";
$replacement = " They say \\1other languages\\2";
print ereg_replace($pattern, $replacement, $s);

?>

 Output:
They say Coding other languages is fun.
Explanation:
"PHP" is replaced with "other languages", and the sentence is changed a little, using \1 and \2 to access the parts within parentheses.

Example#3
Replace URLs with links

 $text = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]",
                     "\\0", $text);
?>

eregi — Case insensitive regular expression match

eregi — Case insensitive regular expression match
This function is identical to ereg() except that it ignores case distinction when matching alphabetic characters.
int eregi ( string $pattern , string $string [, array &$regs ] )
pattern - Case insensitive regular expression.
string - The input string.
regs - If matches are found for parenthesized substrings of pattern and the function is called
with the third argument regs , the matches will be stored in the elements of the array regs .
$regs[1] will contain the substring which starts at the first left parenthesis; $regs[2] will
contain the substring starting at the second, and so on. $regs[0] will contain a copy of the
complete string matched.
Example#1

 = 'XYZ';
if (eregi('z', $string)) {
  echo "'$string' contains a 'z' or 'Z'!";
}
?>

ereg — Regular expression match

ereg — Regular expression match
Searches a string for matches to the regular expression given in pattern in a case-sensitive way.

int ereg ( string $pattern , string $string [, array &$regs ] )
pattern - Case sensitive regular expression.
string- The input string.

regs -If matches are found for parenthesized substrings of pattern and the function is called with the third argument regs , the matches will be stored in the elements of the array regs .$regs[1] will contain the substring which starts at the first left parenthesis; $regs[2] will contain the substring starting at the second, and so on. $regs[0] will contain a copy of the complete string matched.

Return Values

Returns the length of the matched string if a match for pattern was found in string , or FALSE if no matches were found or an error occurred.
If the optional parameter regs was not passed or the length of the matched string is 0, this function returns 1.

Example#1
The following code snippet takes a date in ISO format (YYYY-MM-DD) and prints it in DD.MM.YYYY format:

 if (ereg ("([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})", $date, $regs)) {
    echo "$regs[3].$regs[2].$regs[1]";
} else {
    echo "Invalid date format: $date";
}
?>

Example#2
Check if string only contains letters and numbers.

if (ereg("[^A-Za-z0-9]", $string)) {
echo "Error: String can only contain letters and numbers!";
exit();
}

Example#3
This is intended to validate fully specified (international) phone numbers without forcing the user to use the full international format and giving them maximum reasonable flexibility including an optional extension number.
Allows numbers plus any of: space():.ext,+-
Example: +44(0)113 249-0442 ext:1234
The code matches any combination of the allowed character set.

php
    $phoneNumber="+44(0)113 249-0442 ext:1234";
    $regex="[0-9 ():.ext,+-]{".strlen($phoneNumber)."}";

    if(ereg($regex,$phoneNumber)){
        echo "ok";
    }else {
        echo "invalid phone number";
    } 
?>

PHP Regex Cheat Sheet

Special Sequences

\w - Any “word” character (a-z 0-9 _)
\W - Any non “word” character
\s - Whitespace (space, tab CRLF)
\S - Any non whitepsace character
\d - Digits (0-9)
\D - Any non digit character
. - (Period) – Any character except newline

Meta Characters

^ - Start of subject (or line in multiline mode)
$ - End of subject (or line in multiline mode)
[ - Start character class definition
] - End character class definition
| - Alternates, eg (a|b) matches a or b
( - Start subpattern
) - End subpattern
\ - Escape character

Quantifiers

n* - Zero or more of n
n+ - One or more of n
n ? - Zero or one occurrences of n
{n} - n occurrences exactly
{n,} - At least n occurrences
{,m} - At most m occurrences
{n,m} - Between n and m occurrences (inclusive)

Pattern Modifiers

i - Case Insensitive
m - Multiline mode - ^ and $ match start and end of lines
s - Dotall - . class includes newline
x - Extended– comments and whitespace
e - preg_replace only – enables evaluation of replacement as PHP code
S - Extra analysis of pattern
U - Pattern is ungreedy
u - Pattern is treated as UTF-8

Point based assertions

\b - Word boundary
\B - Not a word boundary
\A - Start of subject
\Z - End of subject or newline at end
\z - End of subject
\G - First matching position in subject

Assertions

(?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar
(?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar
(?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo
(? - Negative look behind assertion (?
(?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present
(?(x)) - Conditional subpatterns
(?(3)foo|fu)bar - Matches foo if 3rd subpattern has matched, fu if not
(?#) - Comment (?# Pattern does x y or z)

Download PHP Regex Cheat sheet

How To Negate a Character Class

Using the ^ ('carrot') to negate a character class:

Metacharacters [] denote a "character class". Within a 'character class' you can have what is called a 'range', which I have been using in the examples above. Lets take a look:
^[c-f]$
This means "the beginning of the line followed by a lowercase letter between c and f (c,d,e, or f), followed by the end of the line."

The carrot character, ^, can also be used inside a character class to "negate" it. negating a character class basically means "not this". Lets take a look..
Eg: ^(.*)[^a-z5-9]*$
This basically means "the beginning of the line followed by zero or more of any character followed by zero or more of any digit or character that is NOT lowercase or 5-9.".

Introduction to Regular Expressions in PHP

Regular expressions were created by an American mathematician named Stephen Kleene.
PHP supports two different types of regular expressions: POSIX-extended and Perl-Compatible Regular Expressions (PCRE). The PCRE functions are more powerful than the POSIX ones, and faster too, so we will concentrate on them.

Some Important Terms
Let Me Start With an example *.txt [Find all files with extension txt]

Metacharacter
A metacharacter is a special character that the regex engine will use to apply "rules" for

Eg: *.txt

Literal text
Literal text is actual "text" that you are using to be matched in your regular expression.
Eg: *.txt

Character Class []
A character class is something that lets you tell the regex engine what characters (literal
text) that you would like to allow at that point in the regular expression
Eg: [Jj]ohn

Anchor
An anchor is actually a 'metacharacter', but It doesn't actually match text, only the position of text.
Eg: /^[Jj]ohn$/

Whitespace -
Whitespace is actually "literal text", but it is empty space. a string of:
Eg: $str = " "; is comprised of 'whitespace'.

Common Metacharacters and Anchors

Caret symbol (^)
A caret (^) character at the beginning of a regular expression indicates that it must match the beginning of the string.
Eg: ^z searches for a part that begins with z.

Dollar Symbol($)
A dollar sign ($) is used to match strings that end with the given pattern
Eg: z$ searches for a part that ends with z.

Dot (.)
A Dot metacharacter matches any single character except newline (\).
Eg: pattern h.t matches hat, hothit, hut, h7t, etc

The vertical pipe ( | )
The vertical pipe (|) metacharacter is used for alternatives in a regular expression. It behaves much like a logical OR operator and you should use it if you want to construct a pattern that matches more than one set of characters. For instance, the pattern Utah|Idaho|Nevada matches strings that contain "Utah" or "Idaho" or "Nevada". Parentheses give us a way to group sequences. For example, (Nant|b)ucket matches "Nantucket" or "bucket". Using parentheses to group together characters for alternation is called grouping.

Other Meta Characters.
The metacharacters +, *, ?, and {} affect the number of times a pattern should be matched.

Plus (+)
Match one or more of the preceding expression

The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree and tread but not trough.

Asterisk/ Star(*)
Match zero or more of the preceding expression

The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree and tread and trough.

Question Mark(?)
Match zero or one of the preceding expression

The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color and colour.

Curly braces {}
{1} means "match exactly 1 occurrences of the preceding expression", with one
{1,} means "match 1 or more occurrences of the preceding expression",
{1,5} means "match the previous character if it occurs at least 1 times, but no more than 5 times".

{n}

Matches the preceding character n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567.

Note: The - (dash) in this case, because it is outside the square brackets, is a literal. Value is enclosed in braces (curly brackets).

{n,m}

Matches the preceding character at least n times but not more than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly brackets).

Note
Using these metacharacters and a pair of (parentheses) you can create a number of different and complex search patterns. Here are some examples of different search patterns :

abc{3}	searches for abccc
(abc){3}	searches for abcabcabc
on\|off	searches for onff or ooff
(on)\|(off)	searches for on or off

Quick Reference

^z	searches for a part that begins with z.
z$	searches for a part that ends with z.
z+	searches for at least one z in a row.
z?	searches for zero or one z.
(yz)	searches for yz grouped together.
y\|z	searches for y or z.
z{3}	searches for zzz.
z{1,}	searches for z or zz or zzz and so on...
z{1,3}	searches for z or zz or zzz only.

Other metacharacter type searches include...

.	searches for ANY character or letter.
[a-z]	searches for any lowercase letter.
[A-Z]	searches for any uppercase letter.
[0-9]	searches for any digit 0 to 9.
\	escapes the next character.
\n	new line.
\t	tab.

Note: If you want to match a literal metacharacter in a pattern, you have to escape it with a backslash

PHP Regular Expression Tutorial