Lookaheads are one of the most powerful features in regular expressions. They let you make a match depend on what comes next without including it in the match, and you will need them in many scenarios. For instance, suppose you want to match every “N” in a paragraph that is not followed by an “O”. You cannot simply use a negated character class (^), say /N[^O]/, because it also consumes the second character (and it misses an N at the very end of the text). This is where lookaheads and lookbehinds come in.

1. Lookaheads

    • Positive Lookahead – This matches a group only when it is followed by the group specified in the lookahead. For example, suppose we need to match every N that is followed by an O.

This would be the positive lookahead regex:

Regex :

 N(?=O)

The given example will match every N that is followed by an O. Here (?=O) is the positive lookahead group; the O is checked but is not part of the match.
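As a quick check of the positive lookahead, here is a minimal JavaScript sketch. The sample text is made up for illustration; only the pattern N(?=O) comes from the article.

```javascript
// Sample text is an assumption made for illustration.
const text = "NO NOT NEVER NOW";

// N(?=O): match an N only when the next character is an O.
// The lookahead checks for the O but does not consume it.
const matches = [...text.matchAll(/N(?=O)/g)];

const positions = matches.map((m) => m.index);
const matchedText = matches.map((m) => m[0]);

console.log(positions);   // [ 0, 3, 13 ]
console.log(matchedText); // [ 'N', 'N', 'N' ]
```

Each match is just the single N; the N in NEVER is skipped because it is followed by an E.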

  • Negative Lookahead – This matches a group only when it is not followed by the group specified in the lookahead.

Regex : 

N(?!O)

This matches exactly the opposite of what the positive lookahead matched: every N that is not followed by an O. The negative lookahead group is (?!O), and the ! indicates that it is a negative lookahead.
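To see why the negative lookahead is not the same as the character class /N[^O]/ mentioned at the start, here is a small JavaScript comparison. The sample text is an assumption chosen so that one N sits at the very end.

```javascript
// Sample text is an assumption made for illustration.
const text = "NO NE N";

// N(?!O): match an N only when the next character is NOT an O.
const lookahead = [...text.matchAll(/N(?!O)/g)].map((m) => m[0]);

// N[^O]: the character class must consume a real character,
// so the match is two characters long and the final N is missed.
const charClass = [...text.matchAll(/N[^O]/g)].map((m) => m[0]);

console.log(lookahead); // [ 'N', 'N' ]
console.log(charClass); // [ 'NE' ]
```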

2. Lookbehinds

The concept of lookbehinds is simple: it is just the opposite of the lookahead. Instead of checking what follows, the lookbehind group checks what comes immediately before the match.

  • Positive Lookbehind – This matches a group only when it is preceded by the group specified in the lookbehind.

Example: Consider the case where we need to match every O that comes right after an N.

Regex : 

(?<=N)O

(?<=N) is the positive lookbehind group; it indicates that an O should be matched only if it is preceded by an N.

  • Negative Lookbehind – This matches a group only when it is not preceded by the group specified in the lookbehind.

Regex : 

(?<!N)O

The negative lookbehind group indicates that an O should be matched only if it is not preceded by an N.
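Both lookbehinds can be tried side by side in JavaScript. This is only a sketch: the sample text is invented, the negative pattern is assumed to be (?<!N)O, and lookbehind needs a reasonably recent engine (it was added in ES2018).

```javascript
// Sample text is an assumption made for illustration.
const text = "NO SO NOON";

// (?<=N)O: an O that comes right after an N.
const afterN = [...text.matchAll(/(?<=N)O/g)].map((m) => m.index);

// (?<!N)O: an O that does NOT come right after an N.
const notAfterN = [...text.matchAll(/(?<!N)O/g)].map((m) => m.index);

console.log(afterN);    // [ 1, 7 ]
console.log(notAfterN); // [ 4, 8 ]
```

Every O in the text lands in exactly one of the two lists, which is what "opposite" means here.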

This technique is very helpful for matching the contents inside brackets () or HTML tags (<>).

Practical Examples of Regular Expressions – Match the contents inside brackets.

 (?<=\()(.*)(?=\))

Here the positive lookbehind group (?<=\() and the positive lookahead group (?=\)) make the whole expression match only the content inside the brackets. The backslash "\" is used to escape the brackets. Note that .* is greedy, so if the text contains several bracket pairs, the match runs from the first ( to the last ).
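The bracket pattern can be sketched in JavaScript as follows. The sample text is invented, and the lazy variant (.*?) is an addition that is not in the article; it is shown only to contrast with the greedy original.

```javascript
// Sample text is an assumption made for illustration.
const text = "call(a, b) then call(c)";

// Original pattern: the greedy .* runs from the first "(" to the last ")".
const greedy = text.match(/(?<=\()(.*)(?=\))/)[0];

// Lazy variant (an addition): .*? stops at the nearest ")",
// so each bracket pair gives its own match.
const lazy = [...text.matchAll(/(?<=\()(.*?)(?=\))/g)].map((m) => m[0]);

console.log(greedy); // a, b) then call(c
console.log(lazy);   // [ 'a, b', 'c' ]
```

Neither version handles nested brackets; that would need a real parser rather than a single regex.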


 What is a Regular Expression?

A regular expression (regex) is a specific pattern used to match text. Using such a pattern we can match strings, numbers, letters, symbols or whatever else appears in the text.

 How is it useful?

Regular expressions make your life easy. They extend the normal search procedure to a large range of possibilities. They are very useful in programming, content scraping, linguistics etc. Search engines such as Google very likely use lots of regular expressions to track the content of a website.

   Interesting real life scenarios – Learn Regular Expressions

5 Inevitable Regexes and their explanations

I’ve given 5 regular expressions, with explanations, that we may need most of the time in real life. Each regex is split into sections, and each section is explained.

Regex 1 – To find a particular word and its number of occurrences in a piece of content or a page.

You can block illegal sites, user comments, or spam users by using a regular expression. A tool such as gskinner (a tool used to verify a regex) lets you try the pattern out live.

In this example, we are selecting the word “porn” from the whole text content. After that, we can count the number of occurrences of the word and filter based on that.

The regex we’ve used is “porn” (ignore the quotes), which finds the word porn and all its other occurrences.

Check out how this is useful for a webmaster.

  • It can filter out spam or adult comments, contact form submissions etc.
  • When crawling a site’s content, it can check how much importance the site gives to a specific word.
  • It allows blocking based on IP address, the searched keyword or other criteria.
  • Suppose that you need to match a sentence that mentions a particular phrase, say “Apple iPhone”, where the phrase lies in the middle of the sentence. Consider a Facebook status “I am loving Apple Iphone, its the best phone in the world”. You can match this whole sentence using a regex. This would be really helpful in user/product reviews.
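Counting occurrences and filtering on the count could look like the JavaScript sketch below. The keyword, the comment text and the threshold of 2 are all made up for illustration, and the case-insensitive "i" flag is an addition to the plain-word regex described above.

```javascript
// The keyword, comment text and threshold are assumptions for illustration.
const keyword = "spam";
const comment = "Spam offer! This spam message links to more SPAM.";

// "g" finds every occurrence, "i" ignores upper/lower case.
const pattern = new RegExp(keyword, "gi");

// match() returns null when nothing is found, hence the fallback to [].
const occurrences = (comment.match(pattern) || []).length;

// A simple threshold-based filter.
const flagged = occurrences > 2;

console.log(occurrences); // 3
console.log(flagged);     // true
```

Building the pattern with new RegExp is only safe here because the keyword contains no regex special characters; a keyword taken from user input would need escaping first.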

Regex 2 – Find all the outgoing links and image URLs in a page

You can easily find the outgoing links from a page, so that you can filter based on those links. You can even build a page link checker tool.

Regex: (?i)<a([^>]+)>(.+?)</a>

I’ll split this into sections and explain each one below.

What it does, basically, is select all the “a” tags, case-insensitively. A link whose text is Google will be the example.

  • (?i) – Means it is case-insensitive; it will match both upper and lower case.
  • ([^>]+)> – This part ends with “>”, which closes the opening a tag, so it matches everything up to that “>”. Everything inside round brackets () is a group, so here we have one group. [^>]+ – A square bracket expression matches a single character, and the + after it means that character is matched one or more times, as many as possible. * (star) is similar to +, except that it also allows zero occurrences. So what is inside the square brackets? ^> – the ^ negates the set, meaning “match any character that is not listed”. In our case it matches every character that is not a “>”. So in total <a([^>]+)> matches the start of an a tag and everything after it that is not a >; in short, the opening tag with all its attributes.
  • (.+?) – This section lazily matches everything after the closing > of the opening tag, up to the closing </a>. In the example this is the link text, Google.

Likewise you can match all the img tags and find out the images inside the page.
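Here is a hedged JavaScript sketch of the link extraction. It assumes the intended pattern is <a([^>]+)>(.+?)</a>; the sample HTML is invented. JavaScript does not accept a leading (?i), so the "i" flag does the same job.

```javascript
// Sample HTML is an assumption made for illustration.
const html =
  '<a href="https://example.com">Example</a> and <A HREF="/about">About us</A>';

// "g" collects every link, "i" replaces the inline (?i).
const linkPattern = /<a([^>]+)>(.+?)<\/a>/gi;

const links = [...html.matchAll(linkPattern)].map((m) => ({
  attributes: m[1].trim(), // group 1: everything between "<a" and ">"
  text: m[2],              // group 2: the link text
}));

console.log(links);
// [
//   { attributes: 'href="https://example.com"', text: 'Example' },
//   { attributes: 'HREF="/about"', text: 'About us' }
// ]
```

This is fine for a quick scan, but the pattern is loose (it would also accept a tag such as <abbr ...> because only the leading "a" is checked), so an HTML parser is the safer choice for anything serious.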

Regex 3 – Find all the sentences not beginning with a capital letter.

Punctuation and capitalization are really important for maintaining quality in what you write. It is really hard to scan a whole text (say 3000 words) and check it by hand. By knowing regular expressions, you can solve this tedious task to an extent. For instance, if you need to find the first word of every sentence that doesn’t begin with a capital letter, you can use a regex.

Regex:   \.(\s)*?[a-z]

This regex will find the starting letter of every sentence that doesn’t begin with a capital letter (more precisely, every lower-case letter that follows a dot and optional whitespace).

\.(\s)*?[a-z]

  • \. – Matches a . (dot). The backslash (\) is used to escape the dot, as a dot has another meaning in a regex (it is a special character, much like + and *). “\.” means we are matching a literal dot character.
  • (\s)*? – This matches the whitespace after a dot; most people put one or two spaces after the end of a sentence. \s matches a whitespace character, and the * (star) after it means zero or more of them, so the space is optional (many people don’t put a space between sentences). The question mark (?) after the star makes the quantifier lazy: it takes as few whitespace characters as possible before trying [a-z].
  • [a-z] – Everything inside square brackets [ ] represents a single character, and here it matches one lower-case letter. If it were [A-Z], it would match an upper-case letter.
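The three parts above can be tried together in JavaScript. The sample text is an assumption written to include one offender with a space after the dot and one without.

```javascript
// Sample text is an assumption made for illustration.
const text = "It rained all day. we stayed in.no one minded. Everyone slept.";

// \. a literal dot, (\s)*? optional whitespace, [a-z] a lower-case letter.
const pattern = /\.(\s)*?[a-z]/g;

const offenders = [...text.matchAll(pattern)].map((m) => m[0]);

console.log(offenders); // [ '. w', '.n' ]
```

The sentence starting with "Everyone" is not reported because its first letter is upper case. Abbreviations and decimals such as "e.g. this" would be reported too, so the results still need a human look.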

Regex 4 – Find a word which you forgot.

Just imagine that you forgot a word, and you want to find that particular word in a bulky text. For instance, you are searching for the movie name “titanic”, and you only know that it begins with t and ends with c. You can find this word using a regex.

Regex: \bt[\w]*c\b

Grouping the regex: \bt[\w]*c\b

Meaning

  • The regex is wrapped in \b at both ends, which means that we are looking for a whole word. Placing the \b boundaries eliminates other matches: for example, with the boundaries our regex won’t match the “titanic” inside the word “titanicosis”; it only matches a word which starts with t and ends with c.
  • t[\w]*c – means it will match every word character after t and before c. [\w] represents a single word character (a letter, digit or underscore) and [\w]* matches any number of them until the final c is reached. That means apart from “titanic”, this will also match everything that starts with t and ends in c, for example “tic”, “tac” etc. You can make this regex more accurate by modifying it a little.
    Improving this Regular Expression
  • You know that the word titanic has 7 letters, of which you know t and c, so there should be 5 letters in between. We can add that logic to the regex too.
  • t[\w]{5}c – {5} means that the word character (\w) must be matched exactly 5 times before “c” is reached, so it will eliminate words like “tic”, “tac” etc.
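The difference between the loose and the improved pattern shows up quickly in a JavaScript sketch. The sample text is invented, and the improved pattern is wrapped in \b here as well, which is an assumption carried over from the first version.

```javascript
// Sample text is an assumption made for illustration.
const text =
  "The titanic sank. tic, tac and topic are shorter; titanicosis is not a word.";

// Any whole word that starts with t and ends with c.
const loose = text.match(/\bt[\w]*c\b/g);

// Exactly 5 word characters between t and c, so 7 letters in total.
const exact = text.match(/\bt[\w]{5}c\b/g);

console.log(loose); // [ 'titanic', 'tic', 'tac', 'topic' ]
console.log(exact); // [ 'titanic' ]
```

"titanicosis" is skipped by both patterns because of the trailing \b, and "The" is skipped because the patterns are case sensitive.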

Regex 5 – Money formatting

Regex is very helpful in money formatting too. Suppose you have a bulk of text with money values entered as $3200, $3,200, $3200.00, $5000, $5,200 etc. What if you want to convert all the comma-separated values to normal values (without commas and without the trailing zeroes after the decimal point) for standardisation? It would be a herculean task with a normal search and replace. We can achieve this easily by using a regular expression.

Regex: \$[0-9],[0-9]{3}(\.[0-9]{2})?

Grouping it: \$[0-9],[0-9]{3}(\.[0-9]{2})?

Meaning

  • \$[0-9] – This matches money values that start with a “$” (dollar) and a single digit. $ has another meaning in regex: it marks the end of a string, so /a$/ finds strings that end with a. We therefore need to escape the $ sign with a backslash (\) to indicate that we are looking for a literal dollar sign. That is why we’ve used \$. In $3,200 this part matches “$3”.
  • ,[0-9]{3} – This matches a “,” (comma) and 3 numeric digits. In $3,200.00 this part matches “,200”.
  • (\.[0-9]{2})? – This part of the regex handles the decimal part. The whole expression is placed inside brackets “()” and followed by a question mark (?). Everything placed inside brackets “()” is a group in regular expressions. ? means that the preceding group is optional; in our case it means “it’s not necessary to match the decimal part, but if there is one, match it”. So it matches money values both with and without a decimal part. Inside the group \.[0-9]{2}, the dot (.) has been escaped with a backslash (the dot, like the dollar, is a special character, so we need to escape it). The dot is followed by two numeric digits, which form the decimal part. In $3,200.00 this part matches “.00”.
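The clean-up itself, removing the comma and a trailing .00, might be sketched in JavaScript like this. The sample text is invented, and the two extra capture groups around the digits are an addition to the article's pattern so that the replacement can reuse them.

```javascript
// Sample text is an assumption made for illustration.
const text = "Prices: $3,200 then $5,200.00 and $1,250.50";

// Groups 1 and 2 (added for this sketch) capture the digits around the comma;
// group 3 is the optional decimal part from the original pattern.
const pattern = /\$([0-9]),([0-9]{3})(\.[0-9]{2})?/g;

const normalised = text.replace(pattern, (match, lead, rest, decimals) => {
  // Keep the decimals unless they are missing or exactly ".00".
  const tail = decimals && decimals !== ".00" ? decimals : "";
  return "$" + lead + rest + tail;
});

console.log(normalised); // Prices: $3200 then $5200 and $1250.50
```

Like the original pattern, this only covers values with a single digit before the comma; amounts such as $12,000 or $1,000,000 would need a wider pattern.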

How to use Regular expression to search and replace in Notepad++? 

Notepad++ allows regular expression search, which helps us to search for a word with a regex and replace it with something else. To use regular expressions in Notepad++:

  • Open the Find dialog (CTRL+F).
  • Select the regular expression option.
  • Paste the regex and search.

Using Regular expressions with JavaScript

We can use the match() function in JavaScript to match regular expressions.

Example:

var str = "titanic is a ship";
var matches=str.match(/t[\w]{5}c/g);

Here the variable matches will hold an array containing the match “titanic”.

/g : This means the search is global, so all matches of the expression are returned. Without /g, only the first match is returned.

/i : In JavaScript, we can add /i to the regex to make the match case-insensitive.

General regular expression syntax in JavaScript: /REGEX/

The regex goes between the two forward slashes / /. A g at the end makes it a global match, meaning it matches every occurrence of the pattern.
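The effect of the g and i flags can be compared directly. This is a small sketch; the sample string is an assumption chosen so that the word appears once capitalised and once in lower case.

```javascript
// Sample string is an assumption made for illustration.
const str = "Titanic was a big ship. The titanic hit an iceberg";

// No flags: only the first match, returned with extra details such as index.
const first = str.match(/t[\w]{5}c/);

// g: an array with every match (case sensitive, so "Titanic" is skipped).
const all = str.match(/t[\w]{5}c/g);

// g and i together: every match, ignoring case.
const anyCase = str.match(/t[\w]{5}c/gi);

console.log(first[0]); // titanic
console.log(all);      // [ 'titanic' ]
console.log(anyCase);  // [ 'Titanic', 'titanic' ]
```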

Using Regular expressions in PHP

We can use these regular expressions in PHP as well. PHP provides some great functions for matching regexes.

As in JavaScript, we enclose the regex in forward slashes in PHP; the difference is that the whole pattern, slashes included, is passed as a string.

General regular expression syntax in PHP: "/REGEX/"

PHP functions which deal with regexes

1. preg_match – This finds a single match: it stops at the first match found in the text and stores it in the output array. The general syntax is

preg_match(regex,text,output)

Output:

Array
(
    [0] => titanic
)

2. preg_match_all – This works like preg_match, but it matches globally, the same as the /g we used in the JavaScript match. That means it will match every occurrence of the regex we used. It is very useful for counting the number of times a pattern or word is used in a page. The general syntax is

preg_match_all(regex,text,output)

Output:

Array
(
    [0] => Array
        (
            [0] => titanic
            [1] => titanic
        )

)

3. preg_replace – This acts as find and replace. The general syntax is

preg_replace(find,replace,text)

Output:

Queen Elizabeth was a big ship. Queen Elizabeth hit an iceberg

4. preg_split

General syntax : preg_split(regex,text)

// Split the text wherever the pattern matches; the matched words are dropped.
$text = "titanic was a big ship. titanic hit an iceberg";
$splitText = preg_split('/t[\w]{5}c/', $text);
print_r($splitText);

Output:

Array
(
    [0] => 
    [1] =>  was a big ship. 
    [2] =>  hit an iceberg
)

So that covers all the basics of regex. Learn these regexes and try building new ones by yourself! We love comments, so if you have suggestions, thanks or anything else, leave them as comments 🙂