Selecting DataFrame Rows Based on String Contents

When working with text, it is often useful to select rows that contain a specific string. The .str.contains(...) method tests each row's value and reports whether a specific string appears anywhere in the text.

To explore this function, we'll use a DataFrame of the five tallest mountains in the world:

import pandas as pd

df = pd.DataFrame([
  {"mountain": "Mount Everest", "feet": 29029, "location": "Nepal/China"},
  {"mountain": "K2", "feet": 28255, "location": "Pakistan/China"},
  {"mountain": "Kangchenjunga", "feet": 28169, "location": "Nepal/India"},
  {"mountain": "Lhotse", "feet": 27940, "location": "Nepal"},
  {"mountain": "Makalu", "feet": 27838, "location": "Nepal"},
])

Select All Rows Containing a String

In the DataFrame, we can see that four of the five tallest mountains are in Nepal. If we use a == comparison, our row selection returns only two of those four mountains, since == asks pandas to find rows whose location is EXACTLY equal to a value.

df[df.location == "Nepal"]
Using == to select all rows with the location EXACTLY equal to Nepal
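
To see why only two rows come back, it can help to look at the True/False mask that == produces before it is used for selection. The following is a minimal sketch that rebuilds the same DataFrame; the names exact_mask and exact_rows are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame([
    {"mountain": "Mount Everest", "feet": 29029, "location": "Nepal/China"},
    {"mountain": "K2", "feet": 28255, "location": "Pakistan/China"},
    {"mountain": "Kangchenjunga", "feet": 28169, "location": "Nepal/India"},
    {"mountain": "Lhotse", "feet": 27940, "location": "Nepal"},
    {"mountain": "Makalu", "feet": 27838, "location": "Nepal"},
])

# == compares each whole location value to "Nepal",
# producing one True/False entry per row.
exact_mask = df["location"] == "Nepal"
print(exact_mask.tolist())  # [False, False, False, True, True]

# Indexing with the mask keeps only the rows marked True.
exact_rows = df[exact_mask]
print(exact_rows["mountain"].tolist())  # ['Lhotse', 'Makalu']
```

"Nepal/China" and "Nepal/India" are marked False because the whole value must match, not just part of it.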

Instead, .str.contains(...) allows us to check whether a value contains a specific string anywhere within it. Looking for the locations that contain Nepal, we find four mountains:

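df[df.location.str.contains("Nepal")]
        mountain   feet     location
0  Mount Everest  29029  Nepal/China
2  Kangchenjunga  28169  Nepal/India
3         Lhotse  27940        Nepal
4         Makalu  27838        Nepal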
Using .str.contains(...) to select all rows that contain Nepal
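
As a self-contained sketch of the substring test, the example below rebuilds the DataFrame and prints the mask alongside the selected rows. One detail worth knowing: by default, .str.contains(...) treats its argument as a regular expression, so passing regex=False is a cautious choice when the search text should be matched literally. The names contains_mask and nepal_rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([
    {"mountain": "Mount Everest", "feet": 29029, "location": "Nepal/China"},
    {"mountain": "K2", "feet": 28255, "location": "Pakistan/China"},
    {"mountain": "Kangchenjunga", "feet": 28169, "location": "Nepal/India"},
    {"mountain": "Lhotse", "feet": 27940, "location": "Nepal"},
    {"mountain": "Makalu", "feet": 27838, "location": "Nepal"},
])

# True wherever "Nepal" appears anywhere in the location value.
# regex=False asks pandas to match the text literally.
contains_mask = df["location"].str.contains("Nepal", regex=False)
print(contains_mask.tolist())  # [True, False, True, True, True]

# Keep only the rows where the mask is True.
nepal_rows = df[contains_mask]
print(nepal_rows["mountain"].tolist())
# ['Mount Everest', 'Kangchenjunga', 'Lhotse', 'Makalu']
```

For a plain word like "Nepal" the result is the same with or without regex=False; the difference only matters when the search text includes characters that have a special meaning in regular expressions, such as "." or "(".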

Select All Rows Containing Two Strings

The .str.contains(...) operation can be combined with & to test for the presence of two strings within one field. For example, we can test for all mountains that are located in BOTH Nepal and China:

df[df.location.str.contains("Nepal") & df.location.str.contains("China")]
        mountain   feet     location
0  Mount Everest  29029  Nepal/China
Using .str.contains(...) to select all rows that contain Nepal AND China
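
Storing each test in its own variable can make combined conditions easier to read. The sketch below repeats the Nepal AND China selection and, as an extra illustration that is not in the original example, uses ~ to negate a mask and select mountains in Nepal but not China. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([
    {"mountain": "Mount Everest", "feet": 29029, "location": "Nepal/China"},
    {"mountain": "K2", "feet": 28255, "location": "Pakistan/China"},
    {"mountain": "Kangchenjunga", "feet": 28169, "location": "Nepal/India"},
    {"mountain": "Lhotse", "feet": 27940, "location": "Nepal"},
    {"mountain": "Makalu", "feet": 27838, "location": "Nepal"},
])

# One boolean mask per substring test.
has_nepal = df["location"].str.contains("Nepal", regex=False)
has_china = df["location"].str.contains("China", regex=False)

# & keeps rows where BOTH masks are True.
both = df[has_nepal & has_china]
print(both["mountain"].tolist())  # ['Mount Everest']

# ~ flips a mask, so this keeps rows with Nepal but without China.
nepal_not_china = df[has_nepal & ~has_china]
print(nepal_not_china["mountain"].tolist())
# ['Kangchenjunga', 'Lhotse', 'Makalu']
```

When a comparison such as == is written inline next to &, wrap each condition in parentheses, because & binds more tightly than == in Python.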

Pandas Documentation

The pandas.Series.str.contains reference page has the full pandas documentation for the str.contains method.