Introduction to Regular Expressions in PHP

Regular expressions originate in the work of the American mathematician Stephen Kleene.
PHP has offered two different types of regular expressions: POSIX-extended (the ereg functions) and Perl-Compatible Regular Expressions (PCRE, the preg functions). The PCRE functions are more powerful than the POSIX ones, and faster too. The POSIX functions were also deprecated in PHP 5.3 and removed in PHP 7, so we will concentrate on PCRE.
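
As a first taste of the PCRE functions, here is a minimal sketch using preg_match(). The subject string is made up for illustration, and the "/" delimiters around the pattern are a common choice rather than a requirement.

```php
<?php
// Sample subject string (chosen only for illustration).
$subject = 'Regular expressions in PHP';

// A PCRE pattern is a string wrapped in delimiters; "/" is a common choice.
// preg_match() returns 1 if the pattern matches, 0 if not, and false on error.
$found = preg_match('/PHP/', $subject);
$missing = preg_match('/Perl/', $subject);

var_dump($found);   // int(1)
var_dump($missing); // int(0)
```

The later examples in this article can be tried the same way: wrap the pattern in delimiters and pass it to preg_match().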


Some Important Terms
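Let me start with an example: *.txt (find all files with the extension txt). Note that *.txt as written is a shell wildcard rather than a regular expression; a regular expression with roughly the same meaning would be \.txt$. It is still a handy way to introduce the terms below.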


Metacharacter
A metacharacter is a special character that the regex engine will use to apply "rules" for matching, rather than treating it as text to match.

Eg: the * in *.txt


Literal text
Literal text is the actual text that you want to match in your regular expression.
Eg: the txt in *.txt


Character Class []
A character class lets you tell the regex engine which characters (literal text) you would like to allow at that point in the regular expression.
Eg: [Jj]ohn matches John or john.
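
A small sketch of the [Jj]ohn character class in action. The three sample names are invented for illustration.

```php
<?php
// Hypothetical sample names, chosen only for illustration.
$names = ['John', 'john', 'Tohn'];

$results = [];
foreach ($names as $name) {
    // [Jj] allows either "J" or "j" in the first position, nothing else.
    $results[$name] = preg_match('/[Jj]ohn/', $name) === 1;
}

var_export($results);
```

Only "John" and "john" should come back as true, because "T" is not listed inside the square brackets.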


Anchor
An anchor is actually a metacharacter, but it doesn't match any text, only a position in the text.
Eg: /^[Jj]ohn$/


Whitespace
Whitespace is actually "literal text", but it is empty space such as spaces, tabs, and newlines.
Eg: the string $str = " "; consists of whitespace only.


Common Metacharacters and Anchors


Caret symbol (^)
A caret (^) at the beginning of a regular expression indicates that the pattern must match at the beginning of the string.
Eg: ^z matches a string that begins with z.


Dollar Symbol ($)
A dollar sign ($) at the end of a regular expression indicates that the pattern must match at the end of the string.
Eg: z$ matches a string that ends with z.
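
To see the two anchors side by side, here is a sketch that filters a made-up word list with preg_grep(), which returns the array entries that match a pattern.

```php
<?php
// Hypothetical sample words, chosen only for illustration.
$words = ['zebra', 'jazz', 'lazy'];

// ^z: the z must be the first character of the string.
$startsWithZ = array_values(preg_grep('/^z/', $words));

// z$: the z must be the last character of the string.
$endsWithZ = array_values(preg_grep('/z$/', $words));

print_r($startsWithZ); // contains only "zebra"
print_r($endsWithZ);   // contains only "jazz"
```

Note that "lazy" contains a z but matches neither pattern, since its z is neither at the start nor at the end.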


Dot (.)
A dot metacharacter matches any single character except a newline (\n).
Eg: the pattern h.t matches hat, hot, hit, hut, h7t, etc.
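
The sketch below tries h.t against a few made-up candidates. The ^ and $ anchors are added here only so that the whole string has to fit the pattern.

```php
<?php
// Hypothetical candidates; the last one has a newline between "h" and "t".
$candidates = ['hat', 'hot', 'h7t', 'ht', "h\nt"];

// The dot must consume exactly one character, and by default not a newline.
$matched = array_values(preg_grep('/^h.t$/', $candidates));

print_r($matched); // hat, hot, h7t
```

"ht" fails because the dot needs one character to consume, and "h\nt" fails because the dot does not match a newline by default.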


The vertical pipe ( | )
The vertical pipe (|) metacharacter is used for alternatives in a regular expression. It behaves much like a logical OR operator and you should use it if you want to construct a pattern that matches more than one set of characters. For instance, the pattern Utah|Idaho|Nevada matches strings that contain "Utah" or "Idaho" or "Nevada". Parentheses give us a way to group sequences. For example, (Nant|b)ucket matches "Nantucket" or "bucket". Using parentheses to group together characters for alternation is called grouping.
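
Here is a sketch of alternation and grouping with preg_match(). The subject strings are invented for illustration; the optional third argument of preg_match() receives the matched text.

```php
<?php
// Alternation: any one of the three names may appear anywhere in the subject.
$hasState = preg_match('/Utah|Idaho|Nevada/', 'Boise is in Idaho');
$noState = preg_match('/Utah|Idaho|Nevada/', 'Austin is in Texas');

// Grouping: the parentheses limit the alternation to "Nant" or "b".
$island = [];
preg_match('/(Nant|b)ucket/', 'a trip to Nantucket', $island);

$pail = [];
preg_match('/(Nant|b)ucket/', 'bucket', $pail);

// Index 0 holds the full match, index 1 the text captured by the group.
print_r($island);
print_r($pail);
```

Without the parentheses, Nant|bucket would mean "Nant" or "bucket", because alternation applies to everything on each side of the pipe.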


Other Metacharacters
The metacharacters +, *, ?, and {} affect the number of times a pattern should be matched.


Plus (+)
Match one or more of the preceding expression

The + (plus) matches the preceding character 1 or more times. For example, tre+ will find tree and tread but not trough.


Asterisk / Star (*)
Match zero or more of the preceding expression

The * (asterisk or star) matches the preceding character 0 or more times. For example, tre* will find tree, tread, and trough, because the e is optional.


Question Mark (?)
Match zero or one of the preceding expression

The ? (question mark) matches the preceding character 0 or 1 times only. For example, colou?r will find both color and colour.
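
The three quantifiers can be compared with a small helper. This is only a sketch: the function name matchResults is hypothetical, not part of PHP, and the word "colouur" is a made-up counterexample.

```php
<?php
// Hypothetical helper: maps each word to whether the pattern matches it.
function matchResults(string $pattern, array $words): array
{
    $results = [];
    foreach ($words as $word) {
        $results[$word] = preg_match($pattern, $word) === 1;
    }
    return $results;
}

$words = ['tree', 'tread', 'trough'];

// tre+ needs "tr" followed by at least one "e".
$plus = matchResults('/tre+/', $words);

// tre* needs only "tr"; the "e" may be absent.
$star = matchResults('/tre*/', $words);

// colou?r allows at most one "u"; anchors force a whole-string match.
$question = matchResults('/^colou?r$/', ['color', 'colour', 'colouur']);

var_export([$plus, $star, $question]);
```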


Curly braces {}
{1} means "match exactly 1 occurrence of the preceding expression".
{1,} means "match 1 or more occurrences of the preceding expression".
{1,5} means "match the preceding expression if it occurs at least 1 time, but no more than 5 times".

{n}

Matches the preceding character exactly n times. For example, to find a local phone number we could use [0-9]{3}-[0-9]{4}, which would find any number of the form 123-4567.

Note: The - (dash) in this case is a literal, because it is outside the square brackets. The value n is enclosed in braces (curly brackets).


{n,m}

Matches the preceding character at least n times but not more than m times. For example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. The values are enclosed in braces (curly brackets).
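
Both brace forms can be checked with preg_grep(). In this sketch the sample inputs are made up, and the ^ and $ anchors are added so that the whole string must fit the pattern.

```php
<?php
// {n}: exactly three digits, a literal dash, then exactly four digits.
$phones = ['123-4567', '12-34567', '1234567'];
$validPhones = array_values(preg_grep('/^[0-9]{3}-[0-9]{4}$/', $phones));

// {n,m}: between two and three "a" characters between the two "b"s.
$strings = ['bab', 'baab', 'baaab', 'baaaab'];
$validStrings = array_values(preg_grep('/^ba{2,3}b$/', $strings));

print_r($validPhones);  // 123-4567
print_r($validStrings); // baab, baaab
```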


Note
Using these metacharacters and a pair of (parentheses) you can create a number of different and complex search patterns. Here are some examples of different search patterns:

abc{3}

searches for abccc

(abc){3}

searches for abcabcabc

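on|off

searches for on or off (the pipe applies to everything on each side of it)

(on)|(off)

searches for on or off, with each alternative in its own group

o(n|o)ff

searches for onff or ooff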
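
The difference between abc{3} and (abc){3} is easy to confirm in code. This is a sketch with anchors added so that each subject must match in full.

```php
<?php
// Without parentheses, {3} applies only to the "c" directly before it.
$single = '/^abc{3}$/';

// With parentheses, {3} applies to the whole group "abc".
$group = '/^(abc){3}$/';

$checks = [
    'single on abccc'     => preg_match($single, 'abccc'),
    'single on abcabcabc' => preg_match($single, 'abcabcabc'),
    'group on abcabcabc'  => preg_match($group, 'abcabcabc'),
    'group on abccc'      => preg_match($group, 'abccc'),
];

print_r($checks);
```

Each pattern should accept only its own string: a quantifier binds to the single character or group immediately before it.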



Quick Reference

^z

searches for a part that begins with z.

z$

searches for a part that ends with z.

z+

searches for at least one z in a row.

z?

searches for zero or one z.

(yz)

searches for yz grouped together.

y|z

searches for y or z.

z{3}

searches for zzz.

z{1,}

searches for z or zz or zzz and so on...

z{1,3}

searches for z or zz or zzz only.


Other metacharacter type searches include...

.

searches for ANY single character (except a newline).

[a-z]

searches for any lowercase letter.

[A-Z]

searches for any uppercase letter.

[0-9]

searches for any digit 0 to 9.

\

escapes the next character.

\n

new line.

\t

tab.

Note: If you want to match a literal metacharacter in a pattern, you have to escape it with a backslash. For example, \. matches a literal dot.
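
The sketch below shows why escaping matters, using made-up file names and an arithmetic string. It also uses preg_quote(), a built-in PHP function that adds the backslashes for you; passing "/" as its second argument escapes the delimiter as well.

```php
<?php
// Unescaped dot: matches any character before "txt".
$loose = preg_match('/.txt$/', 'notestxt');

// Escaped dot: requires a literal "." before "txt".
$strict = preg_match('/\.txt$/', 'notestxt');
$strictOk = preg_match('/\.txt$/', 'notes.txt');

// Unescaped, the + in 1+1=2 is a quantifier, so the literal text does not match.
$rawPlus = preg_match('/^1+1=2$/', '1+1=2');

// preg_quote() escapes the metacharacters so the text is matched literally.
$quoted = preg_quote('1+1=2', '/');
$quotedPlus = preg_match('/^' . $quoted . '$/', '1+1=2');

var_dump($loose, $strict, $strictOk, $rawPlus, $quotedPlus);
```

preg_quote() is mainly useful when the text to match comes from a variable and may contain metacharacters you did not anticipate.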

4 comments:

satrughan

February 2, 2009 at 2:33 AM

Hi
It's of great help for beginers like
me.Very simple and easy to understand.

Satrughand

biovamps

April 28, 2010 at 11:30 PM

Thanks buddy..
ur valuable comments is highly appreciated.. do let me know what other improvements I can do on this article

None

June 10, 2010 at 9:25 PM

I don't know who r u but here after i never forget u..... Yes Ur article is very useful for me to know about RE.... Really its excellent.... Cheers up dude.....

Alex

October 13, 2010 at 3:53 AM

Careful : *.txt does not only match names ending by ".txt", but every word ending by any character followed by "txt", because . does not match only the dot, but any character.