
November 10 2010


How to use Lookaheads and Lookbehinds in your Regular Expressions

Today, we’ll be reviewing the intricacies of regular expressions. More specifically, we’ll discuss both how and why you should use positive and negative lookaheads and lookbehinds in your regular expressions. Originally meant to be a quick tip, this screencast ended up a bit longer than expected, at around eighteen minutes.


After viewing the video above, keep in mind that, for any given task, there are a plethora of ways to match your desired text. For example, when matching a Twitter username – as we did in the video – you could also use a non-word-boundary rather than a positive lookbehind.


The key is to find the right tool for the job.

April 16 2010


Advanced Regular Expression Tips and Techniques

Regular Expressions are the Swiss Army knife for searching through information for certain patterns. They have a wide arsenal of tools, some of which often go undiscovered or underutilized. Today I will show you some advanced tips for working with regular expressions.

Adding Comments

Sometimes regular expressions can become complex and unreadable. A regular expression you write today may seem too obscure to you tomorrow even though it was your own work. Much like programming in general, it is a good idea to add comments to improve the readability of regular expressions.

For example, here is something we might use to check for US phone numbers.

preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/",$number)


It can become much more readable with comments and some extra spacing.


			(1[-\s.])?	# optional '1-', '1.' or '1 '
			( \( )?		# optional opening parenthesis
			\d{3}		# the area code
			(?(2) \) )	# if there was an opening parenthesis, close it
			[-\s.]?		# followed by '-' or '.' or space
			\d{3}		# first 3 digits
			[-\s.]?		# followed by '-' or '.' or space
			\d{4}		# last 4 digits


Let’s put it within a code segment.

$numbers = array(
	"123 555 6789",
	"1-(123)-555-6789",
	"(123-555-6789",
	"(123).555.6789",
	"123 55 6789");

foreach ($numbers as $number) {
	echo "$number is ";

	if (preg_match("/^

			(1[-\s.])?	# optional '1-', '1.' or '1 '
			( \( )?		# optional opening parenthesis
			\d{3}		# the area code
			(?(2) \) )	# if there was an opening parenthesis, close it
			[-\s.]?		# followed by '-' or '.' or space
			\d{3}		# first 3 digits
			[-\s.]?		# followed by '-' or '.' or space
			\d{4}		# last 4 digits

			$/x",$number)) {

		echo "valid\n";
	} else {
		echo "invalid\n";
	}
}

/* prints

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is invalid
(123).555.6789 is valid
123 55 6789 is invalid

*/


The trick is to use the ‘x’ modifier at the end of the regular expression. It causes whitespace in the pattern to be ignored, unless it is escaped (\s) or sits inside a character class. This makes it easy to add comments. Comments start with ‘#’ and end at a newline.
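For comparison, JavaScript has no ‘x’ modifier, so the same readability trick needs a workaround. The sketch below is only one possible approach, not part of the PHP example: it builds the pattern from individually commented fragments (the names phoneParts and phonePattern are made up for illustration). JavaScript also lacks conditional subpatterns, so the (?(2) \) ) check is swapped for a plain alternation between a parenthesized and a bare area code.

```javascript
// Each fragment carries its own comment; the pieces are joined into one pattern.
const phoneParts = [
  '^',
  '(?:1[-\\s.])?',            // optional '1-', '1.' or '1 '
  '(?:\\(\\d{3}\\)|\\d{3})',  // area code, with or without parentheses
  '[-\\s.]?',                 // optional '-', '.' or whitespace
  '\\d{3}',                   // first 3 digits
  '[-\\s.]?',                 // optional '-', '.' or whitespace
  '\\d{4}',                   // last 4 digits
  '$'
];
const phonePattern = new RegExp(phoneParts.join(''));

const numbers = [
  '123 555 6789',
  '1-(123)-555-6789',
  '(123-555-6789',
  '(123).555.6789',
  '123 55 6789'
];

// One boolean per number: true means the number passed validation.
const results = numbers.map(function (number) {
  return phonePattern.test(number);
});
console.log(results);
```

The trade-off is the doubled backslashes that string fragments require, which is the price of not having a comment-aware pattern syntax.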

Using Callbacks

In PHP preg_replace_callback() can be used to add callback functionality to regular expression replacements.

Sometimes you need to do multiple replacements. If you call preg_replace() or str_replace() for each pattern, the string will be parsed over and over again.

Let’s look at this example, where we have an e-mail template.

$template = "Hello [first_name] [last_name],

Thank you for purchasing [product_name] from [store_name].

The total cost of your purchase was [product_price] plus [ship_price] for shipping.

You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days.";


// assume $data array has all the replacement data
// such as $data['first_name'] $data['product_price'] etc...

$template = str_replace("[first_name]",$data['first_name'],$template);
$template = str_replace("[last_name]",$data['last_name'],$template);
$template = str_replace("[store_name]",$data['store_name'],$template);
$template = str_replace("[product_name]",$data['product_name'],$template);
$template = str_replace("[product_price]",$data['product_price'],$template);
$template = str_replace("[ship_price]",$data['ship_price'],$template);
$template = str_replace("[ship_days_min]",$data['ship_days_min'],$template);
$template = str_replace("[ship_days_max]",$data['ship_days_max'],$template);
$template = str_replace("[store_manager_name]",$data['store_manager_name'],$template);

// this could be done in a loop too,
// but I wanted to emphasize how many replacements were made

Notice that each replacement has something in common. They are always strings enclosed within square brackets. We can catch them all with a single regular expression, and handle the replacements in a callback function.

So here is the better way of doing this with callbacks:

// ...

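// this will call my_callback() every time it sees brackets
// note the ungreedy .*? so that each placeholder is matched separately,
// even when several placeholders share a line
$template = preg_replace_callback('/\[(.*?)\]/','my_callback',$template);

function my_callback($matches) {
	// $data is defined outside the function, so bring it into scope
	global $data;

	// $matches[1] now contains the string between the brackets

	if (isset($data[$matches[1]])) {
		// return the replacement string
		return $data[$matches[1]];
	} else {
		// unknown placeholder: leave it untouched
		return $matches[0];
	}
}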
Now the string in $template is only parsed by the regular expression once.
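The same single-pass idea carries over to JavaScript, where String.prototype.replace accepts a function as its second argument. The following is a minimal sketch under assumed data: the data object, its keys and the template text are hypothetical stand-ins for the $data array above.

```javascript
// Hypothetical replacement data, standing in for the PHP $data array.
const data = { first_name: 'Jane', store_name: 'Acme' };

const template = 'Hello [first_name], thanks for shopping at [store_name]. Ref: [unknown_key]';

// The callback receives the whole match and then each captured group.
// \w+ keeps every placeholder separate, even when several share a line.
const filled = template.replace(/\[(\w+)\]/g, function (match, key) {
  // Known keys are replaced; unknown placeholders are left untouched.
  return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : match;
});

console.log(filled);
```

Returning the original match for unknown keys mirrors the else branch of the PHP callback.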

Greedy vs. Ungreedy

Before I start explaining this concept, I would like to show an example first. Let’s say we are looking to find anchor tags in an html text:

$html = 'Hello <a href="...">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {
	print_r($matches);
}

The result will be as expected:

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="...">World!</a>
        )
)
*/

Let’s change the input and add a second anchor tag:

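$html = '<a href="...">Hello</a>
<a href="...">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {
	print_r($matches);
}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="...">Hello</a>
            [1] => <a href="...">World!</a>
        )
)
*/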


Again, it seems to be fine so far. But don’t let this trick you. The only reason it works is that the anchor tags are on separate lines, and by default the dot in PCRE does not match newline characters (more info on: ‘s’ modifier). If we encounter two anchor tags on the same line, it will no longer work as expected:

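$html = '<a href="...">Hello</a> <a href="...">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {
	print_r($matches);
}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="...">Hello</a> <a href="...">World!</a>
        )
)
*/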


This time the pattern matches the first opening tag, the last closing tag, and everything in between as a single match, instead of making two separate matches. This is due to the default behavior being “greedy”.

“When greedy, the quantifiers (such as * or +) match as many characters as possible.”

If you add a question mark after the quantifier (.*?) it becomes “ungreedy”:

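$html = '<a href="...">Hello</a> <a href="...">World!</a>';

// note the ?'s after the *'s
if (preg_match_all('/<a.*?>.*?<\/a>/',$html,$matches)) {
	print_r($matches);
}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="...">Hello</a>
            [1] => <a href="...">World!</a>
        )
)
*/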


Now the result is correct. Another way to trigger the ungreedy behavior is to use the U pattern modifier.
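The greedy versus ungreedy difference is not specific to PHP. As a rough illustration, here is the same comparison in JavaScript; the sample markup and its placeholder href values are assumptions made for this sketch. Note that JavaScript has no equivalent of the U modifier, so the question mark is the only way to make a quantifier lazy there.

```javascript
// Two anchor tags on the same line; the href values are placeholders.
const html = '<a href="...">Hello</a> <a href="...">World!</a>';

// Greedy: .* grabs as much as it can, so both tags end up in one match.
const greedy = html.match(/<a.*>.*<\/a>/g);

// Lazy: .*? stops at the first point where the rest of the pattern can match.
const lazy = html.match(/<a.*?>.*?<\/a>/g);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```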

Lookahead and Lookbehind Assertions

A lookahead assertion checks whether a pattern matches immediately after the current position, without including that text in the match itself. This might be explained more easily through an example.

The following pattern first matches for ‘foo’, and then it checks to see if it is followed by ‘bar’:

$pattern = '/foo(?=bar)/';

preg_match($pattern,'Hello foo'); // false
preg_match($pattern,'Hello foobar'); // true

It may not seem very useful, as we could have simply checked for ‘foobar’ instead. However, it is also possible to use lookaheads for making negative assertions. The following example matches ‘foo’, only if it is NOT followed by ‘bar’.

$pattern = '/foo(?!bar)/';

preg_match($pattern,'Hello foo'); // true
preg_match($pattern,'Hello foobar'); // false
preg_match($pattern,'Hello foobaz'); // true

Lookbehind assertions work similarly, but they look for patterns before the current match. You may use (?<= for positive assertions, and (?<! for negative assertions.

The following pattern matches if there is a ‘bar’ and it is not preceded by ‘foo’.

$pattern = '/(?<!foo)bar/';

preg_match($pattern,'Hello bar'); // true
preg_match($pattern,'Hello foobar'); // false
preg_match($pattern,'Hello bazbar'); // true
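For readers working in JavaScript, the same four assertions can be sketched as follows. This assumes an engine with ES2018 lookbehind support (current Node.js and evergreen browsers); older engines will reject the lookbehind patterns. The variable names are made up for this example.

```javascript
const followedByBar = /foo(?=bar)/;     // positive lookahead
const notFollowedByBar = /foo(?!bar)/;  // negative lookahead
const precededByFoo = /(?<=foo)bar/;    // positive lookbehind
const notPrecededByFoo = /(?<!foo)bar/; // negative lookbehind

// The assertion only checks; it never becomes part of the match.
const lookaheadMatch = 'Hello foobar'.match(followedByBar);
console.log(lookaheadMatch[0]); // 'foo', not 'foobar'

console.log(notFollowedByBar.test('Hello foobaz')); // true
console.log(notPrecededByFoo.test('Hello foobar')); // false
console.log(precededByFoo.test('Hello foobar'));    // true
```

The first result is the reason lookaheads are more than a shortcut for checking ‘foobar’: the text inside the assertion stays available for whatever follows in the pattern.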

Conditional (If-Then-Else) Patterns

Regular expressions provide the functionality for checking certain conditions. The format is as follows:

(?(condition)true-pattern|false-pattern)

or

(?(condition)true-pattern)

The condition can be a number, in which case it refers to a previously captured subpattern.

For example we can use this to check for opening and closing angle brackets:

$pattern = '/^(<)?[a-z]+(?(1)>)$/';

preg_match($pattern, '<test>'); // true
preg_match($pattern, '<foo'); // false
preg_match($pattern, 'hello'); // true

In the example above, ‘1’ refers to the subpattern (<), which is optional because it is followed by a question mark. A closing bracket is required only if that subpattern actually matched.

The condition can also be an assertion:

// if it begins with 'q', it must begin with 'qu'
// else it must begin with 'f'
$pattern = '/^(?(?=q)qu|f)/';

preg_match($pattern, 'quake'); // true
preg_match($pattern, 'qwerty'); // false
preg_match($pattern, 'foo'); // true
preg_match($pattern, 'bar'); // false
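Not every engine supports conditional subpatterns; JavaScript, for one, does not. A common workaround, sketched here as an assumption rather than a drop-in equivalent, is to spell out both branches as an alternation and guard each with its own lookahead.

```javascript
// "if it begins with 'q' then 'qu', else 'f'" written as two guarded branches
const startsQuOrF = /^(?:(?=q)qu|(?!q)f)/;

console.log(startsQuOrF.test('quake'));  // true
console.log(startsQuOrF.test('qwerty')); // false
console.log(startsQuOrF.test('foo'));    // true
console.log(startsQuOrF.test('bar'));    // false
```

In this tiny case the guards are redundant, since /^(?:qu|f)/ behaves the same, but the guarded form shows how the if/else shape translates when the branches are less obviously exclusive.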

Filtering Patterns

There are various reasons for input filtering when developing web applications. We filter data before inserting it into a database, or outputting it to the browser. Similarly, it is necessary to filter any arbitrary string before including it in a regular expression. PHP provides a function named preg_quote to do the job.

In the following example we use a string that contains a special character (*).

$word = '*world*';

$text = 'Hello *world*!';

preg_match('/'.$word.'/', $text); // causes a warning
preg_match('/'.preg_quote($word, '/').'/', $text); // true; the second argument also escapes the delimiter

The same thing can also be accomplished by enclosing the string between \Q and \E. Any special character after \Q is ignored until \E.

$word = '*world*';

$text = 'Hello *world*!';

preg_match('/\Q'.$word.'\E/', $text); // true

However, this second method is not 100% safe, as the string itself can contain \E.
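JavaScript has traditionally had no built-in counterpart to preg_quote, and it does not support \Q and \E, so a small helper is the usual answer. The escapeRegExp function below is a hypothetical helper written for this sketch, not a library call.

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter in a string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const word = '*world*';
const text = 'Hello *world*!';

// Building the pattern from the raw string throws a SyntaxError.
let rawFailed = false;
try {
  new RegExp(word);
} catch (e) {
  rawFailed = true;
}

// Escaping first makes the asterisks literal.
const safePattern = new RegExp(escapeRegExp(word));
const found = safePattern.test(text);

console.log(rawFailed, found);
```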

Non-capturing Subpatterns

Subpatterns, enclosed by parentheses, get captured into an array so that we can use them later if needed. But there is a way to NOT capture them also.

Let’s start with a very simple example:

preg_match('/(f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'

Now let’s make a small change by adding another subpattern (H.*) to the front:

preg_match('/(H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => Hello'
echo "b* => " . $matches[2]; // prints 'b* => foo'

The $matches array was changed, which could cause the script to stop working properly, depending on what we do with those variables in the code. Now we have to find every occurrence of the $matches array in the code, and adjust the index number accordingly.

If we are not really interested in the contents of the new subpattern we just added, we can make it ‘non-capturing’ like this:

preg_match('/(?:H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'

By adding ‘?:’ at the beginning of the subpattern, we no longer capture it in the $matches array, so the other array values do not get shifted.
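The ‘?:’ syntax is identical in JavaScript. A short sketch of the index shift, using the same sample string:

```javascript
// With a capturing group in front, every later index moves up by one.
const withCapture = 'Hello foobar'.match(/(H.*) (f.*)(b.*)/);

// With a non-capturing group, the original indexes are preserved.
const withoutCapture = 'Hello foobar'.match(/(?:H.*) (f.*)(b.*)/);

console.log(withCapture[1], withCapture[2], withCapture[3]); // Hello foo bar
console.log(withoutCapture[1], withoutCapture[2]);           // foo bar
```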

Named Subpatterns

There is another method for preventing pitfalls like in the previous example. We can actually give names to each subpattern, so that we can reference them later on using those names instead of array index numbers. This is the format: (?P<name>pattern)

We could rewrite the first example in the previous section, like this:

preg_match('/(?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

Now we can add another subpattern, without disturbing the existing matches in the $matches array:

preg_match('/(?P<hi>H.*) (?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

echo "h* => " . $matches['hi']; // prints 'h* => Hello'
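JavaScript supports named groups as well, although the syntax drops the ‘P’: (?<name>pattern). This sketch assumes an ES2018-capable engine; the named results appear on the groups property of the match.

```javascript
const named = 'Hello foobar'.match(/(?<hi>H.*) (?<fstar>f.*)(?<bstar>b.*)/);

console.log(named.groups.fstar); // foo
console.log(named.groups.bstar); // bar
console.log(named.groups.hi);    // Hello

// Numbered access still works alongside the names.
console.log(named[1]);           // Hello
```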

Don’t Reinvent the Wheel

Perhaps it’s most important to know when NOT to use regular expressions. There are many situations where you can find existing utilities that you can use instead.

Parsing [X]HTML

A poster at Stack Overflow has a brilliant explanation of why we should not use regular expressions to parse [X]HTML.

…dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities…

Joking aside, it is a good idea to take some time and figure out what kind of XML or HTML parsers are available, and how they work. For example, PHP offers multiple extensions related to XML (and HTML).

Example: Getting the second link url in an HTML page

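$doc = new DOMDocument();
$doc->loadHTML('
		<a href="">First link</a>
		<a href="">Second link</a>
');

// item(1) is the second anchor, since the list is zero-indexed
echo $doc->getElementsByTagName('a')
		->item(1)
		->getAttribute('href');

// prints: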

Validating Form Input

Again, you can use existing functions to validate user inputs, such as form submissions.

if (!filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {

	$errors []= "Please enter a valid e-mail.";
}

// get supported filters
print_r(filter_list());

/* output
Array
(
    [0] => int
    [1] => boolean
    [2] => float
    [3] => validate_regexp
    [4] => validate_url
    [5] => validate_email
    [6] => validate_ip
    [7] => string
    [8] => stripped
    [9] => encoded
    [10] => special_chars
    [11] => unsafe_raw
    [12] => email
    [13] => url
    [14] => number_int
    [15] => number_float
    [16] => magic_quotes
    [17] => callback
)
*/

More info: PHP Data Filtering


Here are some other utilities to keep in mind, before using regular expressions:

Thanks so much for reading!

November 26 2009


You Don’t Know Anything About Regular Expressions: A Complete Guide

Regular expressions can be scary…really scary. Fortunately, once you memorize what each symbol represents, the fear quickly subsides. If you fit the title of this article, there’s much to learn! Let’s get started.

Section 1: Learning the Basics

The key to learning how to effectively use regular expressions is to just take a day and memorize all of the symbols. This is the best advice I can possibly offer. Sit down, create some flash cards, and just memorize them! Here are the most common:

  • . – Matches any character, except for line breaks if dotall is false.
  • * – Matches 0 or more of the preceding character.
  • + – Matches 1 or more of the preceding character.
  • ? – Preceding character is optional. Matches 0 or 1 occurrence.
  • \d – Matches any single digit
  • \w – Matches any word character (alphanumeric & underscore).
  • [XYZ] – Matches any single character from the character class.
  • [XYZ]+ – Matches one or more of any of the characters in the set.
  • $ – Matches the end of the string.
  • ^ – Matches the beginning of a string.
  • [^a-z] – When inside of a character class, the ^ means NOT; in this case, match anything that is NOT a lowercase letter.

Yep – it’s not fun, but just memorize them. You’ll be thankful if you do!


You can be certain that you’ll want to rip your hair out at one point or another when an expression doesn’t work, no matter how much it should – or you think it should! Downloading the RegExr Desktop app is essential, and is really quite fun to fool around with. In addition to real-time checking, it also offers a sidebar which details the definition and usage of every symbol. Download it!

Section 2: Regular Expressions for Dummies: Screencast Series

The next step is to learn how to actually use these symbols! If video is your preference, you’re in luck! Watch the five lesson video series, “Regular Expressions for Dummies.”

Section 3: Regular Expressions and JavaScript

In this final section, we’ll review a handful of the most important JavaScript methods for working with regular expressions.

1. Test()

This one accepts a single string parameter and returns a boolean indicating whether or not a match has been found. If you don’t necessarily need to perform an operation with a specific matched result – for instance, when validating a username – “test” will do the job just fine.


var username = 'JohnSmith';
alert(/^[A-Za-z_-]+$/.test(username)); // returns true

Above, we begin by declaring a regular expression which only allows upper and lower case letters, an underscore, and a dash. We wrap these accepted characters within brackets, which designates a character class. The “+” symbol, which follows it, signifies that we’re looking for one or more of any of the preceding characters. The “^” and “$” anchors make sure that the entire string, and not just a part of it, consists of those characters. We then test that pattern against our variable, “JohnSmith.” Because there was a match, the browser will display an alert box with the value, “true.”
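When validating, the anchors matter more than they might seem to. As a quick sketch (the names loose and strict are just labels for this example), compare the same character class with and without them:

```javascript
const loose = /[A-Za-z_-]+/;    // true if ANY allowed run appears somewhere
const strict = /^[A-Za-z_-]+$/; // true only if the WHOLE string is allowed

console.log(loose.test('John Smith!'));  // true: "John" is enough
console.log(strict.test('John Smith!')); // false: space and "!" are rejected
console.log(strict.test('JohnSmith'));   // true
```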

2. Split()

You’re most likely already familiar with the split method. It accepts a single regular expression which represents where the “split” should occur. Please note that we can also use a string if we’d prefer.

var str = 'this is my string';
alert(str.split(/\s/)); // alerts "this,is,my,string"

By passing “\s” – representing a single whitespace character – we’ve now split our string into an array. If you need to access one particular value, just append the desired index.

var str = 'this is my string';
alert(str.split(/\s/)[3]); // alerts "string"

3. Replace()

As you might expect, the “replace” method allows you to replace a certain block of text, represented by a string or regular expression, with a different string.


If we wanted to change the string “Hello, World” to “Hello, Universe,” we could do the following:

var someString = 'Hello, World';
someString = someString.replace(/World/, 'Universe');
alert(someString); // alerts "Hello, Universe"

It should be noted that, for this simple example, we could have simply used .replace('World', 'Universe'). Also, using the replace method does not automatically overwrite the value of the variable; we must reassign the returned value back to the variable, someString.

Example 2

For another example, let’s imagine that we wish to perform some elementary security precautions when a user signs up for our fictional site. Perhaps we want to take their username and remove any symbols, quotation marks, semi-colons, etc. Performing such a task is trivial with JavaScript and regular expressions.

var username = 'J;ohnSmith;@%';
username = username.replace(/[^A-Za-z\d_-]+/, '');
alert(username); // JohnSmith;@%

Given the produced alert value, one might assume that there was an error in our code (which we’ll review shortly). However, this is not the case. If you’ll notice, the semi-colon immediately after the “J” was removed as expected. To tell the engine to continue searching the string for more matches, we add a “g” directly after our closing forward-slash; this modifier, or flag, stands for “global.” Our revised code should now look like so:

var username = 'J;ohnSmith;@%';
username = username.replace(/[^A-Za-z\d_-]+/g, '');
alert(username); // alerts JohnSmith

Now, the regular expression searches the ENTIRE string and replaces all necessary characters. To review the actual expression – .replace(/[^A-Za-z\d_-]+/g, ''); – it’s important to notice the caret symbol inside of the brackets. When placed within a character class, this means “find anything that IS NOT…” Now, if we re-read, it says, find anything that is NOT a letter, number (represented by \d), an underscore, or a dash; if you find a match, replace it with nothing, or, in effect, delete the character entirely.

4. Match()

Unlike the “test” method, “match()” will return an array containing each match found.


var name = 'JeffreyWay';
alert(name.match(/e/)); // alerts "e"

The code above will alert a single “e.” However, notice that there are actually two e’s in the string “JeffreyWay.” We, once again, must use the “g” modifier to declare a “global search.”

var name = 'JeffreyWay';
alert(name.match(/e/g)); // alerts "e,e"

If we then want to alert one of those specific values with the array, we can reference the desired index after the parentheses.

var name = 'JeffreyWay';
alert(name.match(/e/g)[1]); // alerts "e"

Example 2

Let’s review another example to ensure that we understand it correctly.

var string = 'This is just a string with some 12345 and some !@#$ mixed in.';
alert(string.match(/[a-z]+/gi)); // alerts "This,is,just,a,string,with,some,and,some,mixed,in"

Within the regular expression, we created a pattern which matches one or more upper or lowercase letters – thanks to the “i” modifier. We also are appending the “g” to declare a global search. The code above will alert “This,is,just,a,string,with,some,and,some,mixed,in.” If we then wanted to trap one of these values within the array inside of a variable, we just reference the correct index.

var string = 'This is just a string with some 12345 and some !@#$ mixed in.';
var matches = string.match(/[a-z]+/gi);
alert(matches[2]); // alerts "just"

Splitting an Email Address

Just for practice, let’s try to split an email address – nettuts@tutsplus.com – into its respective username and domain name: “nettuts,” and “tutsplus.”

var email = 'nettuts@tutsplus.com';
alert(email.replace(/([a-z\d_-]+)@([a-z\d_-]+)\.[a-z]{2,4}/ig, '$1, $2')); // alerts "nettuts, tutsplus"

If you’re brand new to regular expressions, the code above might look a bit daunting. Don’t worry, it did for all of us when we first started. Once you break it down into subsets though, it’s really quite simple. Let’s take it piece by piece.


.replace(/([a-z\d_-]+)

Starting from the middle, we search for any letter, number, underscore, or dash, and match one or more of them (+). We’d like to access the value of whatever is matched here, so we wrap it within parentheses. That way, we can reference this matched set later!


@([a-z\d_-]+)

Immediately following the preceding match, find the @ symbol, and then another set of one or more letters, numbers, underscores, and dashes. Once again, we wrap that set within parentheses in order to access it later.


\.[a-z]{2,4}/ig,

Continuing on, we find a single period (we must escape it with “\” due to the fact that, in regular expressions, it matches any character, sometimes excluding a line break). The last part is to find the “com.” We know that the majority, if not all, domains will have a suffix of two to four characters (com, edu, net, name, etc.). If we’re aware of that specific range, we can forego using a more generic symbol like * or +, and instead wrap the two numbers within curly braces, representing the minimum and maximum, respectively.

 '$1, $2')

This last part represents the second parameter of the replace method, or what we’d like to replace the matched sets with. Here, we’re using $1 and $2 to refer to what was stored within the first and second sets of parentheses, respectively. In this particular instance, $1 refers to “nettuts,” and $2 refers to “tutsplus.”
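If the goal is to work with the two pieces rather than build a new string, match() with the same capture groups may be a better fit than replace(). The sketch below makes a few assumptions: the address is the one implied by the text, the pattern is anchored with ^ and $, and the variable names are made up for illustration.

```javascript
const email = 'nettuts@tutsplus.com';

// Same groups as above, anchored so the whole address must fit the pattern.
const emailMatch = email.match(/^([a-z\d_-]+)@([a-z\d_-]+)\.[a-z]{2,4}$/i);

// match() returns null on failure, so fall back to an empty array.
const [, user, domain] = emailMatch || [];

console.log(user);   // nettuts
console.log(domain); // tutsplus
```

Like the original, this is practice code; the pattern is far too strict for real-world email validation.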

Creating our Own Location Object

For our final project, we’ll replicate the location object. For those unfamiliar, the location object provides you with information about the current page: the href, host, port, protocol, etc. Please note that this is purely for practice’s sake. In a real world site, just use the preexisting location object!

We first begin by creating our location function, which accepts a single parameter representing the url that we wish to “decode;” we’ll call it “loc.”

function loc(url) { }

Now, we can call it like so, and pass in a gibberish url :

var l = loc('');

Next, we need to return an object which contains a handful of methods.

function loc(url) {
	return {

	};
}
Though we won’t create all of them, we’ll mimic a handful or so. The first one will be “search.” Using regular expressions, we’ll need to search the url and return everything within the querystring.

return {
	search : function() {
		// returns "somekey=somevalue&anotherkey=anothervalue#theHashGoesHere"
		return url.match(/\?(.+)/i)[1];
	}
};

Above, we take the passed in url, and try to match our regular expressions against it. This expression searches through the string for the question mark, representing the beginning of our querystring. At this point, we need to trap the remaining characters, which is why the (.+) is wrapped within parentheses. Finally, we need to return only that block of characters, so we use [1] to target it.


Now we’ll create another method which returns the hash of the url, or anything after the pound sign.

hash : function() {
	return url.match(/#(.+)/i)[1]; // returns "theHashGoesHere"
}

This time, we search for the pound sign, and, once again, trap the following characters within parentheses so that we can refer to only that specific subset – with [1].


The protocol method should return, as you would guess, the protocol used by the page – which is generally “http” or “https.”

protocol : function() {
	return url.match(/(ht|f)tps?:/i)[0]; // returns 'http:'
}

This one is slightly more tricky, only because there are a few choices to compensate for: http, https, and ftp. Though we could do something like – (http|https|ftp) – it would be cleaner to do: (ht|f)tps?
This designates that we should first find either an “ht” or the “f” character. Next, we match the “tp” characters. The final “s” should be optional, so we append a question mark, which signifies that there may be zero or one instance of the preceding character. Much nicer.


For the sake of brevity, this will be our last one. It will simply return the url of the page.

href : function() {
	return url.match(/(.+\.[a-z]{2,4})/ig); // returns ""
}

Here we’re matching all characters up to the point where we find a period followed by two-four characters (representing com, au, edu, name, etc.). It’s important to realize that we can make these expressions as complicated or as simple as we’d like. It all depends on how strict we must be.

Our Final Simple Function:

function loc(url) {
	return {
		search : function() {
			return url.match(/\?(.+)/i)[1];
		},

		hash : function() {
			return url.match(/#(.+)/i)[1];
		},

		protocol : function() {
			return url.match(/(ht|f)tps?:/i)[0];
		},

		href : function() {
			return url.match(/(.+\.[a-z]{2,4})/ig);
		}
	};
}

With that function created, we can easily alert each subsection by doing:

var l = loc('');

alert(l.href()); //
alert(l.protocol()); // http:
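One thing to watch for: match() returns null when nothing is found, so a url without a querystring or hash would make the [1] lookups above throw. Below is a hedged sketch of a more forgiving variant. The name safeLoc, the helper first and the example.com address are all assumptions made for illustration, and unlike the original, this search stops at the pound sign.

```javascript
function safeLoc(url) {
  // Return the requested group, or an empty string when there is no match.
  const first = function (re, index) {
    const m = url.match(re);
    return m ? m[index] : '';
  };

  return {
    search: function () { return first(/\?([^#]+)/, 1); },
    hash: function () { return first(/#(.+)/, 1); },
    protocol: function () { return first(/^(?:ht|f)tps?:/i, 0); }
  };
}

// Hypothetical urls, used only to exercise the function.
const sample = safeLoc('http://www.example.com?somekey=somevalue#theHashGoesHere');
const bare = safeLoc('http://www.example.com');

console.log(sample.search());   // somekey=somevalue
console.log(sample.hash());     // theHashGoesHere
console.log(sample.protocol()); // http:
console.log(bare.search());     // empty string instead of an exception
```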



Thanks for reading! I’m Jeffrey Way…signing off.
