Row Selection with DataFrames


One of the most important activities when working with data is to select a subset of the rows. Below is the full Course Catalog dataset, displaying all 8,589 courses offered in Fall 2019 at The University of Illinois. When Python displays the contents of a large DataFrame, it will show three things:

  1. The first five rows of data,
  2. A row of ..., and
  3. The last five rows.

An example of displaying the full Course Catalog dataset:

import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/course-catalog.csv")
df
YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
02019Fall2019-faAAS100Intro Asian American StudiesInterdisciplinary introduction to the basic co...3 hours.NaNSocial & Beh Sci - Soc Sci, and Cultural Studi...
12019Fall2019-faAAS105Introduction to Arab American StudiesInterdisciplinary introduction to the basic co...3 hours.NaNCultural Studies - US Minority course.
22019Fall2019-faAAS120Intro to Asian Am Pop CultureIntroductory understanding of the way U.S. pop...3 hours.NaNCultural Studies - US Minority course.
32019Fall2019-faAAS199Undergraduate Open SeminarMay be repeated to a maximum of 6 hours.1 TO 5 hours.NaNNaN
42019Fall2019-faAAS200U.S. Race and EmpireInvites students to examine histories and narr...3 hours.Same as LLS 200.Humanities - Hist & Phil, and Cultural Studies...
.................................
85842019Fall2019-faZULU202Elementary Zulu IIContinuation of ZULU 201 with introduction of ...5 hours.Same as AFST 252. Participation in the languag...NaN
85852019Fall2019-faZULU403Intermediate Zulu ISurvey of more advanced grammar; emphasis on i...4 hours.NaNNaN
85862019Fall2019-faZULU404Intermediate Zulu IIContinuation of ZULU 403; emphasis on increasi...4 hours.NaNNaN
85872019Fall2019-faZULU405Advanced Zulu IThird year Zulu with emphasis on conversationa...3 hours.NaNNaN
85882019Fall2019-faZULU406Advanced Zulu IIContinuation of Zulu 405 with increased emphas...3 hours.NaNNaN

The full 2019 Course Catalog dataset stored in a DataFrame (8,589 rows).

What is Row Selection?

Row selection is any operation that selects only a subset of all the rows in a DataFrame. All of the following are examples of selecting a subset of rows from the Course Catalog dataset:

  • Selecting all courses offered in the STAT department,
  • Selecting a random 1% sample of all courses offered at Illinois,
  • Selecting five random courses offered at Illinois,
  • Selecting the first ten courses from the dataset,
  • Selecting the last five courses from the dataset,
  • Selecting all courses with a Number in the 100s, or
  • Selecting all courses with a course number greater than or equal to 241.

Quick Reference: Pandas Cheat Sheet

The pandas project provides an official "cheat sheet" of incredibly useful commands that can be found at https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf. This sheet is incredibly useful, and many data scientists even have one printed out at their workspace!

Pandas Cheat Sheet, Page 1
Pandas Cheat Sheet, Page 2

We will explore areas of this cheat sheet so you can become familiar with it as a reference sheet. As you do more and more data science, you will find you will start to memorize the commands you use most often.

Pandas Cheat Sheet Section: "Subset Observation (Rows)"

The section in the bottom-center of the pandas cheat sheet is titled "Subset Observations (Rows)" and provides us with the most common functions to select a subset of the rows (also called observations) from our data:

Pandas Cheat Sheet, Page 1 with the "Subset Observation (Rows)" section highlighted
"Subset Observation (Rows)" section from Page 1 of the Pandas Cheat Sheet

Row Selection Based on the Contents of the Row

One of the most useful forms of row selection is to select only rows that match a specific criteria.

To select the rows that match a criteria, we need a conditional and to follow the first example given on the cheat sheet: df[df.Length > 7]. This line of code has four components:

  1. The outer of df[ ___________ ] specifies that we will be selecting data from the DataFrame named df. The "square brackets" are necessary.

  2. The inner df.Length specifies that we want to look at the column named Length inside of the DataFrame named df.

  3. The > is a logical operation meaning "greater than". This is one of six common options that include:

    • >, for strictly greater than,
    • >=, for greater than or equal to
    • <, for strictly less than,
    • <=, for less than or equal to,
    • ==, for exactly equal to, and
    • !=, for NOT exactly equal to,

  4. The 7 is value we are checking the data against, specifically checking if the data in each row is greater than 7.

    • Since 7 is a number, quotes are NOT used.
    • If we had a string like "Discovery", quotes MUST be used.

Example: Selecting the Courses Taught in Statistics

Let's use this syntax to select all courses taught by the STAT department. Referring to the dataset, the department that teaches the courses in the dataset is contained in the column named Subject:

YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
.................................

Column headers of the Course Catalog dataset, with the Subject column highlighted.

Therefore, we need to check if the contents of each row's Subject data is equal to "STAT":

df[ df.Subject == "STAT" ]
YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
80022019Fall2019-faSTAT100StatisticsFirst course in probability and statistics at ...3 hours.Credit is not given for both STAT 100 and any ...Quantitative Reasoning I course.
80032019Fall2019-faSTAT107Data Science DiscoveryData Science Discovery is the intersection of ...4 hours.Same as CS 107 and IS 107.Quantitative Reasoning I course.
80042019Fall2019-faSTAT199Undergraduate Open SeminarSee course schedule for topics. Approved for L...1 TO 5 hours.NaNNaN
80052019Fall2019-faSTAT200Statistical AnalysisSurvey of statistical concepts, data analysis,...3 hours.Credit is not given for both STAT 200 and STAT...Quantitative Reasoning I course.
80062019Fall2019-faSTAT212BiostatisticsApplication of statistical reasoning and stati...3 hours.Credit is not given for both STAT 212 and STAT...Quantitative Reasoning I course.
80072019Fall2019-faSTAT361Probability &amp; Statistics for Computer ScienceSame as CS 361. See CS 361.3 hours.Same as CS 361. See CS 361.NaN
80082019Fall2019-faSTAT385Statistics Programming MethodsStatisticians must be savvy in programming met...3 hours.Prerequisite: STAT 200 or STAT 212.NaN
80092019Fall2019-faSTAT390Individual StudyMay be repeated to a maximum of 8 hours. Prere...1 OR 2 hours.May be repeated to a maximum of 8 hours. Prere...NaN
80102019Fall2019-faSTAT391Honors Individual StudyMay be repeated to a maximum of 8 hours. Prere...1 OR 2 hours.May be repeated to a maximum of 8 hours. Prere...NaN
80112019Fall2019-faSTAT400Statistics and Probability IIntroduction to mathematical statistics that d...4 hours.Same as MATH 463. 4 undergraduate hours. 4 gra...NaN
80122019Fall2019-faSTAT408Actuarial Statistics IExamines elementary theory of probability, inc...4 hours.NaNNaN
80132019Fall2019-faSTAT409Actuarial Statistics IIContinuation of STAT 408. Examines parametric ...4 hours.Same as ASRM 402. 4 undergraduate hours. 4 gra...NaN
80142019Fall2019-faSTAT410Statistics and Probability IIContinuation of STAT 400. Includes moment-gene...3 OR 4 hours.Same as MATH 464. 3 undergraduate hours. 4 gra...NaN
80152019Fall2019-faSTAT420Methods of Applied StatisticsSystematic, calculus-based coverage of the mor...3 OR 4 hours.Same as ASRM 450. 3 undergraduate hours. 4 gra...NaN
80162019Fall2019-faSTAT424Analysis of VarianceEstimation and hypotheses testing in linear mo...3 OR 4 hours.NaNNaN
80172019Fall2019-faSTAT425Applied Regression and DesignExplores linear regression, least squares esti...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. Prere...NaN
80182019Fall2019-faSTAT426Sampling and Categorical DataSampling: simple random, stratified, systemati...3 OR 4 hours.NaNNaN
80192019Fall2019-faSTAT427Statistical ConsultingStudents, working in groups under the supervis...3 OR 4 hours.NaNNaN
80202019Fall2019-faSTAT428Statistical ComputingExamines statistical packages, numerical analy...3 OR 4 hours.Same as CSE 428. 3 undergraduate hours. 4 grad...NaN
80212019Fall2019-faSTAT429Time Series AnalysisStudies theory and data analysis for time seri...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. Prere...NaN
80222019Fall2019-faSTAT430Topics in Applied StatisticsFormulation and analysis of mathematical model...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. May b...NaN
80232019Fall2019-faSTAT431Applied Bayesian AnalysisIntroduction to the concepts and methodology o...3 OR 4 hours.NaNNaN
80242019Fall2019-faSTAT432Basics of Statistical LearningTopics in supervised and unsupervised learning...3 OR 4 hours.Same as ASRM 451. 3 undergraduate hours. 4 gra...NaN
80252019Fall2019-faSTAT433Stochastic ProcessesA stochastic process is a random process that ...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. Prere...NaN
80262019Fall2019-faSTAT434Survival AnalysisIntroduction to the analysis of time-to-event ...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. Prere...NaN
80272019Fall2019-faSTAT440Statistical Data ManagementThe critical elements of data storage, data cl...3 OR 4 hours.Same as CSE 440. 3 undergraduate hours. 4 grad...NaN
80282019Fall2019-faSTAT443Professional StatisticsThis project-based course emphasizes written, ...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. Prere...NaN
80292019Fall2019-faSTAT448Advanced Data AnalysisSeveral of the most widely used techniques of ...4 hours.Same as CSE 448. 4 undergraduate hours. 4 grad...NaN
80302019Fall2019-faSTAT458Math Modeling in Life SciencesSame as ANSC 448 and IB 487. See ANSC 448.3 OR 4 hours.NaNNaN
80312019Fall2019-faSTAT480Data Science FoundationsExamines the methods of data management and an...3 OR 4 hours.3 undergraduate hours. 4 graduate hours. Prere...NaN
80322019Fall2019-faSTAT510Mathematical Statistics IDistributions, transformations, order-statisti...4 hours.Prerequisite: STAT 410.NaN
80332019Fall2019-faSTAT511Mathematical Statistics IIBayes estimates, minimaxity, admissibility; ma...4 hours.NaNNaN
80342019Fall2019-faSTAT525Computational StatisticsVarious topics, such as ridge regression; robu...4 hours.NaNNaN
80352019Fall2019-faSTAT527Advanced Regression AnalysisAn advanced introduction to regression analysi...4 hours.4 graduate hours. No professional credit. Prer...NaN
80362019Fall2019-faSTAT530BioinformaticsSame as ANSC 543, CHBE 571, and MCB 571. See C...4 hours.NaNNaN
80372019Fall2019-faSTAT534Advanced Survival AnalysisIntroduction to the analysis of time-to-event ...4 hours.NaNNaN
80382019Fall2019-faSTAT538Clinical Trials MethodologyThe topics of the course focus on clinical tri...4 hours.NaNNaN
80392019Fall2019-faSTAT541Predictive AnalyticsSame as ASRM 552. See ASRM 552.4 hours.NaNNaN
80402019Fall2019-faSTAT542Statistical LearningModern techniques of predictive modeling, clas...4 hours.Same as ASRM 551 and CSE 542. 4 graduate hours...NaN
80412019Fall2019-faSTAT543Appl. Multivariate StatisticsSame as CPSC 543. See CPSC 543.4 hours.NaNNaN
80422019Fall2019-faSTAT545Spatial StatisticsTheory and methods for analyzing univariate an...4 hours.NaNNaN
80432019Fall2019-faSTAT546Machine Learning in Data ScienceTrains students to analyze large complex data ...4 hours.NaNNaN
80442019Fall2019-faSTAT551Theory of Probability ISame as MATH 561. See MATH 561.4 hours.NaNNaN
80452019Fall2019-faSTAT552Theory of Probability IISame as MATH 562. See MATH 562.4 hours.Same as MATH 562. See MATH 562.NaN
80462019Fall2019-faSTAT553Probability and Measure IMeasures and probabilities; integration and ex...4 hours.Prerequisite: MATH 447 or consent of instructor.NaN
80472019Fall2019-faSTAT554Probability and Measure IIMeasure extensions, Lebesque-Stieltjes measure...4 hours.NaNNaN
80482019Fall2019-faSTAT555Applied Stochastic ProcessesSame as MATH 564. See MATH 564.4 hours.Same as MATH 564. See MATH 564.NaN
80492019Fall2019-faSTAT558Risk Modeling and AnalysisSame as MATH 563. See MATH 563.4 hours.Same as MATH 563. See MATH 563.NaN
80502019Fall2019-faSTAT571Multivariate AnalysisInference in multivariate statistical populati...4 hours.Prerequisite: STAT 410 and MATH 415, or consen...NaN
80512019Fall2019-faSTAT575Large Sample TheoryLimiting distribution of maximum likelihood es...4 hours.NaNNaN
80522019Fall2019-faSTAT578Topics in StatisticsMay be repeated if topics vary. Prerequisite: ...4 hours.May be repeated if topics vary. Prerequisite: ...NaN
80532019Fall2019-faSTAT587Hierarchical Linear ModelsSame as PSYC 587 and EPSY 587. See EPSY 587.4 hours.NaNNaN
80542019Fall2019-faSTAT588Covar Struct and Factor ModelsSame as EPSY 588, PSYC 588, and SOC 588. See P...4 hours.NaNNaN
80552019Fall2019-faSTAT590Individual Study and ResearchDirected reading and research. Approved for le...0 TO 8 hours.Approved for letter and S/U grading. May be re...NaN
80562019Fall2019-faSTAT593STAT InternshipSupervised, off-campus experience in a field i...0 TO 8 hours.Approved for letter and S/U grading. Prerequis...NaN
80572019Fall2019-faSTAT595Preparing Future FacultyPrepares Ph.D. students who are interested in ...2 hours.NaNNaN
80582019Fall2019-faSTAT599Thesis ResearchApproved for S/U grading only. May be repeated...0 TO 16 hours.Approved for S/U grading only. May be repeated...NaN

The subset of rows where the Subject is exactly equal to STAT (57 rows).

Notice that the first row has an index of 8002 and the subject is STAT for ALL of the rows! 🎉

A few important things to note:

  1. We are checking if the subject is exactly equal to to the string "STAT". To check if something is exactly equal to another value in Python, we use == or "double equals".

  2. The data we're checking is NOT all numbers, so the data type of the subject names are strings. This requires that the value is contained in quotes (either 'STAT' or "STAT").

  3. The column name (Subject) is also a string, but the use of the "dot notation" is a shortcut and one of the few exceptions where a string does not need quotes. If we wanted to be more explicit, the longer way to write this same statement is: df[ df["Subject"] == "STAT" ]. This style will be required if the column name contains a space.

Example: Selecting all Graduate / Professional Courses

At the University of Illinois, courses are specifically graduate or professional courses if their course number is at least 500 (specifically 500-799, but there are no courses higher than that).

The column name that contains the course number is Number and we can select the rows where the course number is greater than or equal to 500:

# Dot Notation:
df[ df.Number >= 500 ]
YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
442019Fall2019-faAAS501Theory and Methods in AASFoundational gateway course for graduate study...4 hours.NaNNaN
452019Fall2019-faAAS539Youth, Culture and SocietySame as EPS 539 and HDFS 539. See HDFS 539.4 hours.NaNNaN
462019Fall2019-faAAS561Race and Cultural CritiqueIntroduction to graduate level theoretical and...4 hours.NaNNaN
472019Fall2019-faAAS589Readings in Asian Am StudiesIndividual guidance in intensive readings in t...1 TO 4 hours.NaNNaN
482019Fall2019-faAAS590Asian Am Studies SeminarApproved for letter and S/U grading. May be re...2 TO 4 hours.NaNNaN
.................................
85612019Fall2019-faVM635Veterinary Medical SpanishIn this course second year veterinary students...2 hours.No graduate credit. 2 professional hours. Appr...NaN
85622019Fall2019-faVM642Contemporary Issues in Vet MedAn introductory course for first year veterina...1 hours.NaNNaN
85632019Fall2019-faVM643Fundamentals of ManagementAn introductory course for second year veterin...1 hours.No graduate credit. 1 professional hour. Appro...NaN
85642019Fall2019-faVM645Communications in PracticeAn introductory course for third year veterina...1 hours.NaNNaN
85652019Fall2019-faVM694Veterinary MedicineTo be used to designate a trial or experimenta...1 TO 4 hours.NaNNaN

The subset of rows at the graduate or professional level (2,773 total rows).

# Standard Notation:
df[ df['Number'] >= 500 ]
YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
442019Fall2019-faAAS501Theory and Methods in AASFoundational gateway course for graduate study...4 hours.NaNNaN
452019Fall2019-faAAS539Youth, Culture and SocietySame as EPS 539 and HDFS 539. See HDFS 539.4 hours.NaNNaN
462019Fall2019-faAAS561Race and Cultural CritiqueIntroduction to graduate level theoretical and...4 hours.NaNNaN
472019Fall2019-faAAS589Readings in Asian Am StudiesIndividual guidance in intensive readings in t...1 TO 4 hours.NaNNaN
482019Fall2019-faAAS590Asian Am Studies SeminarApproved for letter and S/U grading. May be re...2 TO 4 hours.NaNNaN
.................................
85612019Fall2019-faVM635Veterinary Medical SpanishIn this course second year veterinary students...2 hours.No graduate credit. 2 professional hours. Appr...NaN
85622019Fall2019-faVM642Contemporary Issues in Vet MedAn introductory course for first year veterina...1 hours.NaNNaN
85632019Fall2019-faVM643Fundamentals of ManagementAn introductory course for second year veterin...1 hours.No graduate credit. 1 professional hour. Appro...NaN
85642019Fall2019-faVM645Communications in PracticeAn introductory course for third year veterina...1 hours.NaNNaN
85652019Fall2019-faVM694Veterinary MedicineTo be used to designate a trial or experimenta...1 TO 4 hours.NaNNaN

The subset of rows at the graduate or professional level (2,773 total rows). We can see that the output is the same for both statements.

Random Row Selection

In many areas of data science, it will be important to divide the data into random groups or randomly sample a subset of the collected data. In both cases, we are taking a "random sample" of our data. The pandas library provides two methods to randomly sample the data:

  1. Sample a fraction of the data by using df.sample() with a frac parameter:

    • Ex: df.sample(frac=0.5) samples 50% (0.5) of our data randomly.
    • Ex: df.sample(frac=0.01) samples 1% (0.01) of our data randomly.

  2. Sample a specific number of rows/observations from the data by using df.sample() with a n parameter (or no parameter at all!):

    • Ex: df.sample(n=10) samples exactly 10 rows randomly.
    • Ex: df.sample(n=107) samples exactly 107 rows randomly.
    • Ex: df.sample(10) also samples exactly 10 rows randomly (the n is not required, but frac is required).

Since this is a random sample, every time you run df.sample the result will be different!

Example: A Random Course Schedule of Five Courses

An average student at the University of Illinois will take around five courses each semester. Using the Course Catalog dataset, we can randomly pick five courses to create a random schedule:

df.sample(n=5)
YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
36652019Fall2019-faFIN520Financial ManagementIntroduction to financial management and decis...4 hours.Prerequisite: Enrollment in the Executive MBA,...NaN
35232019Fall2019-faESE100Sustainable EarthSame as GCL 124. See GCL 124.3 hours.NaNNat Sci & Tech - Phys Sciences course.
10962019Fall2019-faARTF106Visualization DrawingThis studio course introduces students to the ...3 hours.Additional fees may apply. See Class Schedule....NaN
85582019Fall2019-faVM623Research Project IIIn this lecture/discussion course, which is a ...2 hours.NaNNaN
32612019Fall2019-faENGL486History of TranslationSame as CLCV 430, CWL 430, GER 405, SLAV 430, ...3 OR 4 hours.NaNNaN

A random selection of five rows (5 total rows).

Running the exact same code again, our results are completely different since the selection is random:

df.sample(n=5)
YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
10932019Fall2019-faARTF103Design ITheory and practice in the elements, processes...3 hours.NaNNaN
4632019Fall2019-faAFST231Elementary Swahili ISame as SWAH 201. See SWAH 201.5 hours.Same as SWAH 201. See SWAH 201.NaN
41822019Fall2019-faGWS325Lesbian/Queer Media CulturesDiscusses how various LGBT/Q communities were ...3 hours.NaNNaN
77912019Fall2019-faSOCW200Introduction to Social WorkBroad survey of the field of social work; intr...3 hours.NaNSocial & Beh Sci - Soc Sci course.
77902019Fall2019-faSOCW199Undergraduate Open SeminarApproved for letter and S/U grading. May be re...1 TO 4 hours.Approved for letter and S/U grading. May be re...NaN

Another random selection of five rows (5 total rows).

Interested in learning different methods to select rows in a DataFrame? Check out our Guides on Row Selection using DataFrames:


Example Walk-Throughs with Worksheets

Video 1: Row Selection with Pandas (Part 1)

Follow along with the worksheet to work through the problem:

Video 2: Row Selection with Pandas (Part 2)

Follow along with the worksheet to work through the problem:

Practice Questions

Q1: Which answer shows two statements that will create the same subset of rows?
Q2: I want to randomly select 5 movies from a dataset titled "topMovies.csv" holding the top 100 user rated movies on IMDB. How would I do this in Python?
Q3: You're looking for a new apartment for you and your cat and using a dataset "apartments" to look at listings. How would you filter the dataset in Python to only show pet-friendly listings, using the column "PetFriendly" which is either True or False?
Q4: In order to select the first 10 students from a class roster dataset titled "STAT207", what command would you use in Python?
Q5: A dataset "restaurants" contains the columns "Name", "Location", and "Rating" with ratings ranging from 0-100. How would you filter to only restaurants with a rating greater than 90 in Python?
Q6: If you had a dataframe titled "df" with a row titled "scores", how would you select the rows with the highest 3 scores in Python?