A regex tool can spare you from wading through confusing learning materials and make writing regular expressions much easier. This is a quick guide for beginners on using regular expressions to extract phone numbers from strings.
What Is RegEx
RegEx stands for Regular Expression, which is an object that describes the pattern of a string. Because the computer can interpret this expression, we can use it to locate the data that matches the pattern and retrieve the information we want.
“A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern.”
Quoted from Wikipedia
How does a Regular Expression help us pull phone numbers out of a long text?
For example, suppose you are looking for a way to extract all phone numbers from a text at once. The text has many phone numbers scattered throughout it. You are probably familiar with the “CONTROL + F” shortcut, which is built into most applications to help users find and highlight a certain string of data.
If you can write a Regular Expression that describes the pattern these phone numbers share, you can enter it into the “find” function of a text editor with built-in regex support, and the editor will locate every piece of data that matches.
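The same "find" idea can be sketched in a few lines of Python. This is only an illustration: the sample text, the `find_phone_numbers` helper, and the single `123-456-7899` style format it targets are assumptions made for the example, not something taken from a particular editor.

```python
import re

# Hypothetical sample text; the names and numbers are made up for illustration.
text = (
    "Call Anna at 123-456-7899 before noon. "
    "The front desk is 555-010-4477, and order 2024-11 is not a phone number."
)

# Three digits, a hyphen, three digits, a hyphen, four digits.
# \b keeps the match from starting or ending inside a longer run of digits.
pattern = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def find_phone_numbers(s):
    """Return every substring of s that fits the pattern, in order."""
    return pattern.findall(s)

print(find_phone_numbers(text))
```

Only the two strings that follow the full pattern are returned; "2024-11" is skipped because it does not fit.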
How to Write A Regular Expression
If you want to extract phone numbers using Regular Expressions but don’t know how to write one, this article can help.
#Learn Some Basics of RegEx
Learning RegEx from scratch might take some time, but if you will use it frequently in your daily work and it significantly improves your productivity, it may be worth a try.
A good place to start is the JS RegEx tutorial on W3Schools. There you will learn the basic syntax of a RegEx and the grammar of modifiers and quantifiers.
As this is rather complicated for total newbies, we will not dive into it in this article. If you want an easy way to take advantage of RegEx right away, a RegEx tool will fit your immediate need.
#Use the RegEx Tool Built into Octoparse
There are some ready-to-use tools that make writing RegEx easier. Octoparse has a built-in tool to do the job.
With this intuitive tool at hand, the only thing you need to care about is finding the pattern of the phone numbers you are looking for throughout the text.
For those who don’t know Octoparse yet, it is an easy-to-use web scraping tool. It helps people who need large amounts of content collect it from websites within a short time and, most importantly, without knowing any coding languages. Compared with other tools, its highlights include a boost mode and an intuitive UI design. Notably, its auto-detection function can save you from tons of confusing clicking around that ends in messed-up data results.
Besides the auto-detection function, the pre-built templates are even more convenient. Using templates, you can obtain the product list information as well as detailed page information on the website you want. Octoparse provides quite a number of pre-built templates that help collect data from various web pages such as Amazon, eBay, Yellowpages, Twitter, etc. It can help you gather information from social media to help you monitor trending topics and people’s interests. On the other hand, you can also create a more customized crawler by yourself under the advanced mode. For that, you can make your own scraping workflows to get the required data in a few steps.
Powerful functions such as cloud service, scheduled automatic scraping, and IP rotation (to prevent IP bans) are offered in a paid plan. If you want to monitor stock numbers, prices, and other information about an array of shops/products on a regular basis, Octoparse is definitely a helpful tool. If you have not started your web scraping journey yet, we recommend you try Octoparse.
Examples of Phone Extraction Using Regex
A single large string may contain multiple phone numbers, and these phone numbers may come in a variety of formats. Here is an example of the file format:
- (021)1234567
- (123) 456 7899
- (123).456.7899
- (123)-456-7899
- 123-456-7899
- 123 456 7899
- 1234567899
- 0511-4405222
- 021-87888822
- +8613012345678
- …
What is the easiest way to extract phone numbers like these? Now we are going to use the Regular Expression tool to generate Regular Expressions and match all the phone numbers quickly.
First, find the common characters that each phone number starts with and ends with. For example, the source code of the targeted text above is shown below.
<p>Here is an example of file format </p>
<ul>
<li>(021)1234567 </li>
<li>(123) 456 7899 </li>
<li>(123).456.7899 </li>
<li>(123)-456-7899 </li>
<li>123-456-7899 </li>
<li>123 456 7899 </li>
<li>1234567899 </li>
<li>0511-4405222 </li>
<li>021-87888822 </li>
<li>+8613012345678 </li>
<li>… </li>
</ul>
Each phone number starts with <li> and ends with </li>, so we can use the RegEx Tool in Octoparse to quickly extract all phone numbers.
Step 1. Run Octoparse and open the RegEx Tool.
Step 2. Copy and paste the source code in the “Source Text” box. Then select the “Start With” option and enter “<li>”.
Step 3. Next, select the “End With” option and enter “</li>”. Don’t forget to select the “Match All” option.
Step 4. Click “Match”. When it’s done, all the matched phone numbers are listed in the box on the left-hand side.
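Outside Octoparse, the "Start With" / "End With" idea can be approximated with a lazy capture group between the two markers. The sketch below is an assumption about how such a rule could be expressed in Python, not a description of how the Octoparse tool works internally. The `extract_between` helper is hypothetical, and `source` is a shortened copy of the HTML above.

```python
import re

# Shortened copy of the source code shown earlier in the article.
source = """<p>Here is an example of file format </p>
<ul>
<li>(021)1234567 </li>
<li>(123) 456 7899 </li>
<li>0511-4405222 </li>
<li>+8613012345678 </li>
</ul>"""

def extract_between(text, start, end):
    """Return the text found between each start/end pair ("Match All")."""
    # re.escape treats the markers as literal text; (.*?) is lazy, so each
    # match stops at the nearest end marker instead of the last one.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    # Strip the trailing space that precedes </li> in the source.
    return [m.strip() for m in re.findall(pattern, text)]

numbers = extract_between(source, "<li>", "</li>")
for number in numbers:
    print(number)
```

The lazy quantifier matters here: a greedy `(.*)` would run to the last `</li>` on a line, which would merge several items if they ever shared one line.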
However, if you can’t find common characters that each phone number starts with and ends with, the tool won’t be sufficient to generate a Regex. You may need to learn more Regex syntax and write a dedicated Regular Expression for each pattern.
I wrote two additional Regular Expressions for two formats of phone numbers.
- Regular Expression 1:
Code: \d{3}-\d{8}|\d{4}-\d{7}
Match: 0511-4405222 | 021-87888822
- Regular Expression 2:
Code: \(\d{2,4}\)\d{6,7}
Match: (021)1234567 | (0411)123456 | (000)000000 | (123)1234567
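To see which of the sample formats each expression covers, the two patterns above can be tried against the example list from earlier. This is a minimal sketch in Python: the `full_matches` helper is hypothetical, and `fullmatch` is used on the assumption that each candidate string holds exactly one phone number.

```python
import re

# The sample phone numbers listed earlier in the article.
samples = [
    "(021)1234567", "(123) 456 7899", "(123).456.7899", "(123)-456-7899",
    "123-456-7899", "123 456 7899", "1234567899", "0511-4405222",
    "021-87888822", "+8613012345678",
]

# Regular Expression 1: 3 digits + hyphen + 8 digits, or 4 digits + hyphen + 7 digits.
regex_1 = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")
# Regular Expression 2: 2 to 4 digits in parentheses, followed by 6 or 7 digits.
regex_2 = re.compile(r"\(\d{2,4}\)\d{6,7}")

def full_matches(pattern, candidates):
    """Keep only the candidates that the pattern matches from start to end."""
    return [c for c in candidates if pattern.fullmatch(c)]

matches_1 = full_matches(regex_1, samples)
matches_2 = full_matches(regex_2, samples)
print(matches_1)
print(matches_2)
```

Each expression picks out only the formats it was written for, which is why the remaining formats in the list would each need a pattern of their own. When the numbers are embedded in running text rather than isolated strings, `findall` would be the more natural call than `fullmatch`.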
Finding a pattern of phone numbers among the text and coming up with a Regex code that describes the pattern is the key to this task.