Selecting Rows that are IN and NOT IN a DataFrame
The .isin
function is commonly used to check if values in one DataFrame are in another DataFrame. It can also be used to select rows from a DataFrame depending on if they are in another DataFrame.
To explore this function, we'll use a DataFrame of 10 songs released in 2022:
import pandas as pd
df1 = pd.DataFrame([
{'title': 'As It Was', 'artist': 'Harry Styles', 'album': "Harry's House", 'month': 4 },
{'title': 'Bad Habit', 'artist': 'Steve Lacy', 'album': 'Gemini Rights', 'month': 6 },
{'title': 'Constant Repeat', 'artist': 'Charli XCX', 'album': "CRASH", 'month': 3 },
{'title': 'Pink Funeral', 'artist': 'Beach House', 'album': "Once Twice Melody", 'month': 2 },
{'title': 'BREAK MY SOUL', 'artist': 'Beyonce', 'album': "RENAISSANCE", 'month': 7 },
{'title': 'Sidelines', 'artist': 'Phoebe Bridgers', 'album': "Sidelines", 'month': 4 },
{'title': 'Glimpse of Us', 'artist': 'Joji', 'album': "Glimpse of Us", 'month': 6 },
{'title': 'Pink Venom', 'artist': 'BLACKPINK', 'album': "Pink Venom", 'month': 8 },
{'title': 'Vegas', 'artist': 'Doja Cat', 'album': "ELVIS (Original Motion Picture Soundtrack)", 'month': 6 },
{'title': 'First Class', 'artist': 'Jack Harlow', 'album': "Come Home The Kids Miss You", 'month': 5 },
])
We create two DataFrames by choosing 5 random songs from our main DataFrame df1
:
# Create two new DataFrames containing 5 songs from our first DataFrame:
sample1 = df1.sample(5)
title | artist | album | month | |
---|---|---|---|---|
8 | Vegas | Doja Cat | ELVIS (Original Motion Picture Soundtrack) | 6 |
4 | BREAK MY SOUL | Beyonce | RENAISSANCE | 7 |
0 | As It Was | Harry Styles | Harry's House | 4 |
7 | Pink Venom | BLACKPINK | Pink Venom | 8 |
6 | Glimpse of Us | Joji | Glimpse of Us | 6 |
sample2 = df1.sample(5)
title | artist | album | month | |
---|---|---|---|---|
6 | Glimpse of Us | Joji | Glimpse of Us | 6 |
9 | First Class | Jack Harlow | Come Home The Kids Miss You | 5 |
0 | As It Was | Harry Styles | Harry\'s House | 4 |
8 | Vegas | Doja Cat | ELVIS (Original Motion Picture Soundtrack) | 6 |
1 | Bad Habit | Steve Lacy | Gemini Rights | 6 |
df1
.Select all rows from a DataFrame IN in another DataFrame
Since we know both samples came from the same original DataFrame with unique index values, we can use the .index.isin
function to find which songs are in a sample. This function will select the row ONLY when the row is in the target DataFrame contained inside of the function.
For example, we can select all of the rows in df1
that are in sample1
:
# Selects all rows from DataFrame `df1` that are in DataFrame `sample1`:
df1.index.isin( sample1.index )
title | artist | album | month | |
---|---|---|---|---|
8 | Vegas | Doja Cat | ELVIS (Original Motion Picture Soundtrack) | 6 |
4 | BREAK MY SOUL | Beyonce | RENAISSANCE | 7 |
0 | As It Was | Harry Styles | Harry's House | 4 |
7 | Pink Venom | BLACKPINK | Pink Venom | 8 |
6 | Glimpse of Us | Joji | Glimpse of Us | 6 |
.index.isin
to select all of rows from df1
that is in sample1
Selecting all rows contained in MULTIPLE DataFrames
You may have noticed that some of our songs appear in both of our sample DataFrames. By using &
(AND) conditional, we use .index.isin
twice to select only the rows contained in both of two different DataFrames.
Using our two sample DataFrames, we find the rows in df
that are contained in both sample1
and sample2
:
df1[ df1.index.isin(sample1.index) & df1.index.isin(sample2.index) ]
title | artist | album | month | |
---|---|---|---|---|
6 | Glimpse of Us | Joji | Glimpse of Us | 6 |
0 | As It Was | Harry Styles | Harry\'s House | 4 |
8 | Vegas | Doja Cat | ELVIS (Original Motion Picture Soundtrack) | 6 |
.index.isin
to find values in another DataFrameSelect all rows from a DataFrame NOT in another DataFrame
Using the .index.isin
function, we can also select all rows that are NOT in another DataFrame. Using pandas, the ~
operator denotes NOT. Therefore, to find all rows NOT in another DataFrame, we modify the .index.isin
function with the ~
operator.
The following code finds all songs from df1
NOT contained sample1
:
df1[ ~df1.index.isin(sample1.index) ]
title | artist | album | month | |
---|---|---|---|---|
1 | Bad Habit | Steve Lacy | Gemini Rights | 6 |
2 | Constant Repeat | Charli XCX | CRASH | 3 |
3 | Pink Funeral | Beach House | Once Twice Melody | 2 |
5 | Sidelines | Phoebe Bridgers | Sidelines | 4 |
9 | First Class | Jack Harlow | Come Home The Kids Miss You | 5 |
Pandas Documentation
Click Here for the full pandas documentation for the isin
function