Making Sense of Text in a Hurry: A Regular Expressions Primer

Photo by Dinnow (Pexels.com)

Introduction

Whether you are brand new to regex and have text data you’d like to make sense of, or you have spent hours combing Stack Overflow for a question that matches your exact use case without quite understanding the jumble of regex you ended up using, this introduction should give you a useful foundation to build on.

Today we’ll cover three key functions and three key patterns.

A Brief Explanation of Regular Expressions

Regular expressions (regex) are a useful tool for matching patterns within text.

Pattern identification can be particularly useful for categorizing given strings of text. For example, let’s say you have your customer’s website data, but you’d like to check for occurrences of a given phrase; regex to the rescue!

Key Functions

We’ll be using Python’s built-in re module and will run through the following three functions:

  • re.search()
  • re.split()
  • re.findall()

Simple enough!

Search

Definition

Search scans a string for a given pattern. If the pattern is present anywhere in the string, the function returns a “match object” for the first occurrence; otherwise it returns None.

Example

Let’s say that you want to search a string called company_description for the inclusion of “tech”. For the sake of the example, let’s imagine you’ll be using the presence of the word “tech” in the company description to classify each record accordingly.

Import the re module, then call the function with the pattern and the string.

import re

company_description = "selling tech products to retailers"
re.search('tech', company_description)

As simple as that, you have a straightforward method of detecting the presence of a pattern. The match object itself records where the first occurrence of the pattern starts and ends in the string.
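To make the match object a little more concrete, here is a minimal sketch of reading the position out of it and of handling the no-match case. The is_tech helper and the "finance" pattern are hypothetical names added for illustration; they are not part of the original example.

```python
import re

company_description = "selling tech products to retailers"

match = re.search("tech", company_description)
if match:
    # group() is the matched text; start() and span() give its position
    print(match.group(), match.start(), match.span())  # tech 8 (8, 12)

# When the pattern is absent, re.search returns None
no_match = re.search("finance", company_description)
print(no_match)  # None


def is_tech(description):
    """Hypothetical helper: True when 'tech' appears anywhere in the text."""
    return re.search("tech", description) is not None


print(is_tech(company_description))  # True
```

Because None is falsy and a match object is truthy, the result of re.search can be used directly in an if statement when classifying records.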

Split

Definition

Split allows us to “split” a string into separate elements of a list. In split we use regex to designate the pattern by which the function will split the string. More on this below.

Example

Let’s say that you want to market to your prospects according to the technology products they currently use. You even have that data, but unfortunately each customer’s technologies are stored as one long continuous string, separated by commas.

A simple solution is to split the string at each comma (and the space that follows it).

technologies = 'salesforce, gainsight, marketo, intercom'
re.split(', ', technologies)

Now you have broken each technology into its own entry within a list.
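As a rough sketch of what comes back, and of why a regex pattern can be handier than a fixed separator, consider the snippet below. The messy string is a made-up input with inconsistent spacing around the commas; it is an assumption for illustration only.

```python
import re

technologies = "salesforce, gainsight, marketo, intercom"

# Splitting on the literal comma-plus-space separator
tech_list = re.split(", ", technologies)
print(tech_list)  # ['salesforce', 'gainsight', 'marketo', 'intercom']

# Hypothetical messier input: spacing around the commas is inconsistent
messy = "salesforce,gainsight ,  marketo, intercom"

# \s* allows any amount of whitespace (including none) on either side of the comma
clean_list = re.split(r"\s*,\s*", messy)
print(clean_list)  # ['salesforce', 'gainsight', 'marketo', 'intercom']
```

The second pattern absorbs the stray whitespace as part of the separator, so the resulting items need no further trimming.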

Findall

Definition

Findall is very similar to search; the critical difference is the “all” in “findall”. Rather than returning a match object for only the first occurrence, findall returns a list containing every non-overlapping occurrence of the pattern. For illustrative purposes we’ve kept things simple with directly quoted patterns, but shortly we’ll review different patterns that you can put to work as well.

Example

Let’s say you are selling a return processing product to e-commerce companies and you’ve scraped the websites of some of your prospects to see whether they offer free returns. The hypothesis in this case would be that more mentions of ‘free returns’ suggest a higher propensity to sign up for this product.

website_text = 'free returns... yada yada yada... free returns... and guess what... free returns'

returns = re.findall('free returns', website_text)
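Since the hypothesis is about the volume of mentions, one way to put the list to work is simply to count its items. The sketch below does that; the mixed string and the use of re.IGNORECASE are additions for illustration, assuming real website text will not always be lowercase.

```python
import re

website_text = 'free returns... yada yada yada... free returns... and guess what... free returns'

returns = re.findall('free returns', website_text)
print(returns)       # ['free returns', 'free returns', 'free returns']
print(len(returns))  # 3

# Hypothetical text with mixed capitalization
mixed = "Free Returns on all orders. FREE RETURNS always."

# re.IGNORECASE makes the pattern match regardless of letter case
mention_count = len(re.findall('free returns', mixed, flags=re.IGNORECASE))
print(mention_count)  # 2
```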

Key Patterns

Now that you have a few key functions in your toolbox, let’s extend their usefulness by talking about patterns.

In each of the above examples, we explicitly defined our patterns; what we’ll do now is review how you can get there more quickly in the case of more complex criteria.

We will review the following:

  • Digits
  • Words
  • Spaces

Digits

Similar to a previous example, we’ll use findall, but in this case we’ll do so to find every occurrence of a number. Let’s say that our monthly sales for Q1 have been recorded in a string and we’d like to extract those numbers. Pay close attention to the pattern that we pass.

string = 'Jan: 52000, Feb: 7000, Mar: 9100'
print(re.findall(r"\d+", string))

Let’s break this command into its different parts:

  • r marks the pattern as a Python raw string, so Python passes the backslash through to the regex engine untouched instead of treating it as the start of a string escape.
  • The backslash (\) changes the meaning of the character that follows it. In front of an ordinary letter such as d it creates a special sequence; in front of a special character such as . or + it does the opposite and makes that character literal.
  • \d is the special sequence that matches any single digit.
  • + means “one or more of the preceding item”, so \d+ matches each full run of consecutive digits. Without the +, every individual digit would be returned as its own item in the list.
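To see the effect of the + side by side, here is a small sketch that runs the pattern both ways and then converts the matches to integers. The variable names and the total are additions for illustration; note that findall returns strings, so a conversion step is needed before doing arithmetic.

```python
import re

string = 'Jan: 52000, Feb: 7000, Mar: 9100'

# With +, each run of consecutive digits is one item
whole_numbers = re.findall(r"\d+", string)
print(whole_numbers)  # ['52000', '7000', '9100']

# Without +, every digit is returned on its own
single_digits = re.findall(r"\d", string)
print(single_digits)  # ['5', '2', '0', '0', '0', '7', '0', '0', '0', '9', '1', '0', '0']

# The matches are strings; convert them before summing
q1_total = sum(int(n) for n in whole_numbers)
print(q1_total)  # 68100
```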

Words

Let’s do this again, just swapping out digits for words. We’ll say we want to extract the month values.

print(re.findall(r"[A-Za-z]\w+", string))

We see a lot of the same things here: r, the backslash, the +; but instead of d, we now have w. \w matches any “word character”, which covers letters but also digits and the underscore, so it’s prudent to be more specific about how a match may begin.

Preceding the \w+ we can add a character class listing the characters we’d like to allow in that position. In this case, [A-Za-z] requires each match to start with an uppercase or lowercase letter. The shorthand [A-z] is best avoided: that range also covers a few punctuation characters that sit between Z and a in ASCII, such as the underscore and the caret.
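The sketch below illustrates why the leading character class matters, and why spelling the letter ranges out as [A-Za-z] is the safer choice. The sample string is a made-up input chosen only to expose the difference between the two ranges.

```python
import re

string = 'Jan: 52000, Feb: 7000, Mar: 9100'

# \w+ on its own also picks up the numbers, since digits are word characters
everything = re.findall(r"\w+", string)
print(everything)  # ['Jan', '52000', 'Feb', '7000', 'Mar', '9100']

# Requiring a letter in the first position leaves only the month names
months = re.findall(r"[A-Za-z]\w+", string)
print(months)  # ['Jan', 'Feb', 'Mar']

# Hypothetical input showing what the [A-z] range lets through
sample = "a_b^c"
print(re.findall(r"[A-z]", sample))     # ['a', '_', 'b', '^', 'c']
print(re.findall(r"[A-Za-z]", sample))  # ['a', 'b', 'c']
```

One caveat: because \w+ needs at least one character after the leading letter, this pattern only matches words of two or more characters.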

Spaces

Let’s revisit the example we made for re.split earlier.

Let’s say that rather than wanting to do the split on comma, we wanted to do so according to spaces.

print(re.split(r"\s", technologies))

As you can see, this isn’t much of an improvement, since the commas are now left attached to the individual items. This approach would be far more useful if the string contained no commas.
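For reference, here is a short sketch of the output, followed by a case where splitting on whitespace is a better fit. The no_commas string is a hypothetical input, and \s+ is used there so that runs of spaces or tabs count as a single separator.

```python
import re

technologies = 'salesforce, gainsight, marketo, intercom'

# Splitting on a single whitespace character leaves the commas behind
with_commas = re.split(r"\s", technologies)
print(with_commas)  # ['salesforce,', 'gainsight,', 'marketo,', 'intercom']

# Hypothetical input separated only by whitespace (two spaces, a space, a tab)
no_commas = "salesforce  gainsight marketo\tintercom"

# \s+ treats any run of whitespace as one separator
tools = re.split(r"\s+", no_commas)
print(tools)  # ['salesforce', 'gainsight', 'marketo', 'intercom']
```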

Conclusion

There you have it! In almost no time at all we’ve covered quite a lot.

You learned:

  • 3 key regex functions:
    • re.search()
    • re.split()
    • re.findall()
  • 3 handy patterns:
    • Digits
    • Words
    • Spaces
  • and a handful of rules that should help you make sense of the world of regex.

I hope this proves to be useful scaffolding as you build out your regex knowledge and experience, and that it helps you in your work! Please share whatever works and whatever doesn’t!

Feel free to check out some of my other posts at datasciencelessons.com

Happy Data Science-ing!
