Regular Expressions

Regular Expressions (regex or regexp) make it possible to find specific words or letters in large bits of text.

When you use HTML to put together a website, your computer understands what type of content each block of text is (It’s an <h1> headline! It’s a <p> paragraph!), but it doesn’t understand what is actually written inside that block of text.

Regular expressions are a way to parse those blocks of text and look for specific letters or patterns, such as punctuation, letter combinations, or even whole words.
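Here is a minimal sketch of those three kinds of search in JavaScript. The sample sentence and the variable names are made up for illustration; match() with the g flag is a standard string method that returns every match it finds.

```javascript
// Hypothetical sample text, not taken from a real page.
const text = "Hello, world! Regex helps you search text. Hello again?";

// Punctuation: every comma, period, exclamation mark, or question mark.
const punctuation = text.match(/[,.!?]/g);

// A letter combination: every place where "ll" appears.
const doubleL = text.match(/ll/g);

// A whole word: "Hello" with a word boundary (\b) on each side.
const hellos = text.match(/\bHello\b/g);

console.log(punctuation); // [ ',', '!', '.', '?' ]
console.log(doubleL.length); // 2
console.log(hellos.length); // 2
```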

What’s neat about regular expressions is that when they search your text, they can be very particular and very flexible at the same time. Unfortunately, regular expressions are a wee bit confusing to read, especially the first time you come across one, which makes them seem super scary to beginning programmers. But like most things in programming, regular expressions are built from their own alphabet of special characters, and there is nothing wrong with printing the list out and keeping it by your side as a reference.

There are countless uses for regular expressions, but one of the most common is validating the email addresses that users type in.

Now, validating an email seems pretty straightforward, right? It needs to have some letters, then an at (@) symbol, then some more letters, a period (.), and then a Top Level Domain (such as com).

/\S+@\S+\.\S+/

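This regex says “For this to be a valid email, I want to see one or more non-whitespace characters (\S+), then an @ symbol (@), then one or more non-whitespace characters again (\S+), followed by a literal period (\.), and then some more non-whitespace characters (\S+).” The slashes at either end just mark where the regex starts and stops. Note that \S matches any character that isn’t whitespace, so punctuation counts along with letters and numbers.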
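To see that pattern in action, here is a small JavaScript sketch. The addresses are made-up examples, and test() is the standard RegExp method that returns true when the pattern is found somewhere in the string.

```javascript
// The simple pattern from above.
const simpleEmail = /\S+@\S+\.\S+/;

// Hypothetical addresses, chosen only for illustration.
const hasAllParts = simpleEmail.test("ada@example.com");
const missingAt = simpleEmail.test("ada.example.com");
const missingPeriod = simpleEmail.test("ada@example");

console.log(hasAllParts); // true
console.log(missingAt); // false: there is no @ symbol
console.log(missingPeriod); // false: there is no period after the @
```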

So when we put in [email protected], the regex says “Yep, that looks like an email.” and we are good to go!

But what if I wanted to be tricky and put in a fake email address like [email protected]? What would the regex say?

Turns out that because all this regex knows to look for is some non-whitespace characters, followed by an @ symbol, followed by more characters, followed by a period, followed by some more characters, it would still say “Yep, looks like an email.” even though you and I both know that .skillcrush is not a real Top Level Domain (TLD).
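A quick sketch shows just how forgiving the simple pattern is. The inputs below are invented for illustration, and .madeup is an assumed stand-in for any fake ending.

```javascript
// The same simple pattern, under a different name for this sketch.
const looseEmail = /\S+@\S+\.\S+/;

// A made-up ending still passes, because \S+ accepts any characters.
const fakeEnding = looseEmail.test("ada@example.madeup");

// Pure punctuation passes too, since punctuation is not whitespace.
const onlySymbols = looseEmail.test("!!!@???.###");

// With no ^ or $ anchors, a match anywhere inside a longer string counts.
const insideSentence = looseEmail.test("write to ada@example.com today");

console.log(fakeEnding, onlySymbols, insideSentence); // true true true
```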

In order to make sure that your users aren’t making up new TLDs (or using other punctuation where they shouldn’t be), you need to use a slightly more complicated regex:

/^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi)|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i

Alright, I know that looks like a beast, and we don’t quite have time to decode it all, but I want to point you towards one particular bit:

\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi)

In regex speak, a | symbol means OR, so the above says “In order for this to be a valid email, you need a period (\.) and then I want to see aero or arpa or biz or com…” No making up a TLD here, this regex means business.
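Here is a rough JavaScript sketch of both the OR fragment and the full pattern. The $ added to the end of the fragment is an assumption made for this example so that it only checks the very end of the string; the domain names and addresses are invented.

```javascript
// The OR fragment on its own, with an assumed $ so it must end the string.
const tldFragment = /\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi)$/i;

const knownTld = tldFragment.test("example.com");
const upperCase = tldFragment.test("EXAMPLE.ORG"); // the i flag ignores case
const madeUpTld = tldFragment.test("example.madeup");

// The full pattern, copied from above.
const strictEmail = /^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi)|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/i;

const strictAccepts = strictEmail.test("ada@example.com");
const strictRejects = strictEmail.test("ada@example.madeup");

console.log(knownTld, upperCase, madeUpTld); // true true false
console.log(strictAccepts, strictRejects); // true false
```

One trade-off of a fixed list like this: any real ending that isn’t on it gets turned away too, so an address ending in .io would fail this check until someone adds io to the list.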

Cocktail Party Fact

Regular expressions are actually a whole special computer language unto themselves, but because they are so useful, they have been baked into most of the major web programming languages, such as PHP, Ruby, and JavaScript.