Row Selection with DataFrames


One of the most important activities when working with data is to select a subset of the rows. Below is the full Course Catalog dataset, displaying all 8,589 courses offered in Fall 2019 at The University of Illinois. When Python displays the contents of a large DataFrame, it will show three things:

  1. The first five rows of data,
  2. A row of ..., and
  3. The last five rows.

An example of displaying the full Course Catalog dataset:

Reset Code Python Output:
Year Term YearTerm Subject Number Name Description Credit Hours Section Info Degree Attributes
0 2019 Fall 2019-fa AAS 100 Intro Asian American Studies Interdisciplinary introduction to the basic co... 3 hours. NaN Social & Beh Sci - Soc Sci, and Cultural Studi...
1 2019 Fall 2019-fa AAS 105 Introduction to Arab American Studies Interdisciplinary introduction to the basic co... 3 hours. NaN Cultural Studies - US Minority course.
2 2019 Fall 2019-fa AAS 120 Intro to Asian Am Pop Culture Introductory understanding of the way U.S. pop... 3 hours. NaN Cultural Studies - US Minority course.
3 2019 Fall 2019-fa AAS 199 Undergraduate Open Seminar May be repeated to a maximum of 6 hours. 1 TO 5 hours. NaN NaN
4 2019 Fall 2019-fa AAS 200 U.S. Race and Empire Invites students to examine histories and narr... 3 hours. Same as LLS 200. Humanities - Hist & Phil, and Cultural Studies...
... ... ... ... ... ... ... ... ... ... ...
8584 2019 Fall 2019-fa ZULU 202 Elementary Zulu II Continuation of ZULU 201 with introduction of ... 5 hours. Same as AFST 252. Participation in the languag... NaN
8585 2019 Fall 2019-fa ZULU 403 Intermediate Zulu I Survey of more advanced grammar; emphasis on i... 4 hours. NaN NaN
8586 2019 Fall 2019-fa ZULU 404 Intermediate Zulu II Continuation of ZULU 403; emphasis on increasi... 4 hours. NaN NaN
8587 2019 Fall 2019-fa ZULU 405 Advanced Zulu I Third year Zulu with emphasis on conversationa... 3 hours. NaN NaN
8588 2019 Fall 2019-fa ZULU 406 Advanced Zulu II Continuation of Zulu 405 with increased emphas... 3 hours. NaN NaN

The full 2019 Course Catalog dataset stored in a DataFrame (8,589 rows).

What is Row Selection?

Row selection is any operation that selects only a subset of all the rows in a DataFrame. All of the following are examples of selecting a subset of rows from the Course Catalog dataset:

  • Selecting all courses offered in the STAT department,
  • Selecting a random 1% sample of all courses offered at Illinois,
  • Selecting five random courses offered at Illinois,
  • Selecting the first ten courses from the dataset,
  • Selecting the last five courses from the dataset,
  • Selecting all courses with a Number in the 100s, or
  • Selecting all courses with a course number greater than or equal to 241.

Quick Reference: Pandas Cheat Sheet

The pandas project provides an official "cheat sheet" of incredibly useful commands that can be found at https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf. This sheet is incredibly useful, and many data scientists even have one printed out at their workspace!

Pandas Cheat Sheet, Page 1
Pandas Cheat Sheet, Page 2

We will explore areas of this cheat sheet so you can become familiar with it as a reference sheet. As you do more and more data science, you will find you will start to memorize the commands you use most often.

Pandas Cheat Sheet Section: "Subset Observation (Rows)"

The section in the bottom-center of the pandas cheat sheet is titled "Subset Observations (Rows)" and provides us with the most common functions to select a subset of the rows (also called observations) from our data:

Pandas Cheat Sheet, Page 1 with the "Subset Observation (Rows)" section highlighted
"Subset Observation (Rows)" section from Page 1 of the Pandas Cheat Sheet

Row Selection Based on the Contents of the Row

One of the most useful forms of row selection is to select only rows that match a specific criteria.

To select the rows that match a criteria, we need a conditional and to follow the first example given on the cheat sheet: df[df.Length > 7]. This line of code has four components:

  1. The outer of df[ ___________ ] specifies that we will be selecting data from the DataFrame named df. The "square brackets" are necessary.

  2. The inner df.Length specifies that we want to look at the column named Length inside of the DataFrame named df.

  3. The > is a logical operation meaning "greater than". This is one of six common options that include:

    • >, for strictly greater than,
    • >=, for greater than or equal to
    • <, for strictly less than,
    • <=, for less than or equal to,
    • ==, for exactly equal to, and
    • !=, for NOT exactly equal to,

  4. The 7 is value we are checking the data against, specifically checking if the data in each row is greater than 7.

    • Since 7 is a number, quotes are NOT used.
    • If we had a string like "Discovery", quotes MUST be used.

Example: Selecting the Courses Taught in Statistics

Let's use this syntax to select all courses taught by the STAT department. Referring to the dataset, the department that teaches the courses in the dataset is contained in the column named Subject:

YearTermYearTermSubjectNumberNameDescriptionCredit HoursSection InfoDegree Attributes
.................................

Column headers of the Course Catalog dataset, with the Subject column highlighted.

Therefore, we need to check if the contents of each row's Subject data is equal to "STAT":

Reset Code Python Output:
Year Term YearTerm Subject Number Name Description Credit Hours Section Info Degree Attributes
8002 2019 Fall 2019-fa STAT 100 Statistics First course in probability and statistics at ... 3 hours. Credit is not given for both STAT 100 and any ... Quantitative Reasoning I course.
8003 2019 Fall 2019-fa STAT 107 Data Science Discovery Data Science Discovery is the intersection of ... 4 hours. Same as CS 107 and IS 107. Quantitative Reasoning I course.
8004 2019 Fall 2019-fa STAT 199 Undergraduate Open Seminar See course schedule for topics. Approved for L... 1 TO 5 hours. NaN NaN
8005 2019 Fall 2019-fa STAT 200 Statistical Analysis Survey of statistical concepts, data analysis,... 3 hours. Credit is not given for both STAT 200 and STAT... Quantitative Reasoning I course.
8006 2019 Fall 2019-fa STAT 212 Biostatistics Application of statistical reasoning and stati... 3 hours. Credit is not given for both STAT 212 and STAT... Quantitative Reasoning I course.
8007 2019 Fall 2019-fa STAT 361 Probability &amp; Statistics for Computer Science Same as CS 361. See CS 361. 3 hours. Same as CS 361. See CS 361. NaN
8008 2019 Fall 2019-fa STAT 385 Statistics Programming Methods Statisticians must be savvy in programming met... 3 hours. Prerequisite: STAT 200 or STAT 212. NaN
8009 2019 Fall 2019-fa STAT 390 Individual Study May be repeated to a maximum of 8 hours. Prere... 1 OR 2 hours. May be repeated to a maximum of 8 hours. Prere... NaN
8010 2019 Fall 2019-fa STAT 391 Honors Individual Study May be repeated to a maximum of 8 hours. Prere... 1 OR 2 hours. May be repeated to a maximum of 8 hours. Prere... NaN
8011 2019 Fall 2019-fa STAT 400 Statistics and Probability I Introduction to mathematical statistics that d... 4 hours. Same as MATH 463. 4 undergraduate hours. 4 gra... NaN
8012 2019 Fall 2019-fa STAT 408 Actuarial Statistics I Examines elementary theory of probability, inc... 4 hours. NaN NaN
8013 2019 Fall 2019-fa STAT 409 Actuarial Statistics II Continuation of STAT 408. Examines parametric ... 4 hours. Same as ASRM 402. 4 undergraduate hours. 4 gra... NaN
8014 2019 Fall 2019-fa STAT 410 Statistics and Probability II Continuation of STAT 400. Includes moment-gene... 3 OR 4 hours. Same as MATH 464. 3 undergraduate hours. 4 gra... NaN
8015 2019 Fall 2019-fa STAT 420 Methods of Applied Statistics Systematic, calculus-based coverage of the mor... 3 OR 4 hours. Same as ASRM 450. 3 undergraduate hours. 4 gra... NaN
8016 2019 Fall 2019-fa STAT 424 Analysis of Variance Estimation and hypotheses testing in linear mo... 3 OR 4 hours. NaN NaN
8017 2019 Fall 2019-fa STAT 425 Applied Regression and Design Explores linear regression, least squares esti... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. Prere... NaN
8018 2019 Fall 2019-fa STAT 426 Sampling and Categorical Data Sampling: simple random, stratified, systemati... 3 OR 4 hours. NaN NaN
8019 2019 Fall 2019-fa STAT 427 Statistical Consulting Students, working in groups under the supervis... 3 OR 4 hours. NaN NaN
8020 2019 Fall 2019-fa STAT 428 Statistical Computing Examines statistical packages, numerical analy... 3 OR 4 hours. Same as CSE 428. 3 undergraduate hours. 4 grad... NaN
8021 2019 Fall 2019-fa STAT 429 Time Series Analysis Studies theory and data analysis for time seri... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. Prere... NaN
8022 2019 Fall 2019-fa STAT 430 Topics in Applied Statistics Formulation and analysis of mathematical model... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. May b... NaN
8023 2019 Fall 2019-fa STAT 431 Applied Bayesian Analysis Introduction to the concepts and methodology o... 3 OR 4 hours. NaN NaN
8024 2019 Fall 2019-fa STAT 432 Basics of Statistical Learning Topics in supervised and unsupervised learning... 3 OR 4 hours. Same as ASRM 451. 3 undergraduate hours. 4 gra... NaN
8025 2019 Fall 2019-fa STAT 433 Stochastic Processes A stochastic process is a random process that ... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. Prere... NaN
8026 2019 Fall 2019-fa STAT 434 Survival Analysis Introduction to the analysis of time-to-event ... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. Prere... NaN
8027 2019 Fall 2019-fa STAT 440 Statistical Data Management The critical elements of data storage, data cl... 3 OR 4 hours. Same as CSE 440. 3 undergraduate hours. 4 grad... NaN
8028 2019 Fall 2019-fa STAT 443 Professional Statistics This project-based course emphasizes written, ... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. Prere... NaN
8029 2019 Fall 2019-fa STAT 448 Advanced Data Analysis Several of the most widely used techniques of ... 4 hours. Same as CSE 448. 4 undergraduate hours. 4 grad... NaN
8030 2019 Fall 2019-fa STAT 458 Math Modeling in Life Sciences Same as ANSC 448 and IB 487. See ANSC 448. 3 OR 4 hours. NaN NaN
8031 2019 Fall 2019-fa STAT 480 Data Science Foundations Examines the methods of data management and an... 3 OR 4 hours. 3 undergraduate hours. 4 graduate hours. Prere... NaN
8032 2019 Fall 2019-fa STAT 510 Mathematical Statistics I Distributions, transformations, order-statisti... 4 hours. Prerequisite: STAT 410. NaN
8033 2019 Fall 2019-fa STAT 511 Mathematical Statistics II Bayes estimates, minimaxity, admissibility; ma... 4 hours. NaN NaN
8034 2019 Fall 2019-fa STAT 525 Computational Statistics Various topics, such as ridge regression; robu... 4 hours. NaN NaN
8035 2019 Fall 2019-fa STAT 527 Advanced Regression Analysis An advanced introduction to regression analysi... 4 hours. 4 graduate hours. No professional credit. Prer... NaN
8036 2019 Fall 2019-fa STAT 530 Bioinformatics Same as ANSC 543, CHBE 571, and MCB 571. See C... 4 hours. NaN NaN
8037 2019 Fall 2019-fa STAT 534 Advanced Survival Analysis Introduction to the analysis of time-to-event ... 4 hours. NaN NaN
8038 2019 Fall 2019-fa STAT 538 Clinical Trials Methodology The topics of the course focus on clinical tri... 4 hours. NaN NaN
8039 2019 Fall 2019-fa STAT 541 Predictive Analytics Same as ASRM 552. See ASRM 552. 4 hours. NaN NaN
8040 2019 Fall 2019-fa STAT 542 Statistical Learning Modern techniques of predictive modeling, clas... 4 hours. Same as ASRM 551 and CSE 542. 4 graduate hours... NaN
8041 2019 Fall 2019-fa STAT 543 Appl. Multivariate Statistics Same as CPSC 543. See CPSC 543. 4 hours. NaN NaN
8042 2019 Fall 2019-fa STAT 545 Spatial Statistics Theory and methods for analyzing univariate an... 4 hours. NaN NaN
8043 2019 Fall 2019-fa STAT 546 Machine Learning in Data Science Trains students to analyze large complex data ... 4 hours. NaN NaN
8044 2019 Fall 2019-fa STAT 551 Theory of Probability I Same as MATH 561. See MATH 561. 4 hours. NaN NaN
8045 2019 Fall 2019-fa STAT 552 Theory of Probability II Same as MATH 562. See MATH 562. 4 hours. Same as MATH 562. See MATH 562. NaN
8046 2019 Fall 2019-fa STAT 553 Probability and Measure I Measures and probabilities; integration and ex... 4 hours. Prerequisite: MATH 447 or consent of instructor. NaN
8047 2019 Fall 2019-fa STAT 554 Probability and Measure II Measure extensions, Lebesque-Stieltjes measure... 4 hours. NaN NaN
8048 2019 Fall 2019-fa STAT 555 Applied Stochastic Processes Same as MATH 564. See MATH 564. 4 hours. Same as MATH 564. See MATH 564. NaN
8049 2019 Fall 2019-fa STAT 558 Risk Modeling and Analysis Same as MATH 563. See MATH 563. 4 hours. Same as MATH 563. See MATH 563. NaN
8050 2019 Fall 2019-fa STAT 571 Multivariate Analysis Inference in multivariate statistical populati... 4 hours. Prerequisite: STAT 410 and MATH 415, or consen... NaN
8051 2019 Fall 2019-fa STAT 575 Large Sample Theory Limiting distribution of maximum likelihood es... 4 hours. NaN NaN
8052 2019 Fall 2019-fa STAT 578 Topics in Statistics May be repeated if topics vary. Prerequisite: ... 4 hours. May be repeated if topics vary. Prerequisite: ... NaN
8053 2019 Fall 2019-fa STAT 587 Hierarchical Linear Models Same as PSYC 587 and EPSY 587. See EPSY 587. 4 hours. NaN NaN
8054 2019 Fall 2019-fa STAT 588 Covar Struct and Factor Models Same as EPSY 588, PSYC 588, and SOC 588. See P... 4 hours. NaN NaN
8055 2019 Fall 2019-fa STAT 590 Individual Study and Research Directed reading and research. Approved for le... 0 TO 8 hours. Approved for letter and S/U grading. May be re... NaN
8056 2019 Fall 2019-fa STAT 593 STAT Internship Supervised, off-campus experience in a field i... 0 TO 8 hours. Approved for letter and S/U grading. Prerequis... NaN
8057 2019 Fall 2019-fa STAT 595 Preparing Future Faculty Prepares Ph.D. students who are interested in ... 2 hours. NaN NaN
8058 2019 Fall 2019-fa STAT 599 Thesis Research Approved for S/U grading only. May be repeated... 0 TO 16 hours. Approved for S/U grading only. May be repeated... NaN

The subset of rows where the Subject is exactly equal to STAT (57 rows).

Notice that the first row has an index of 8002 and the subject is STAT for ALL of the rows! 🎉

A few important things to note:

  1. We are checking if the subject is exactly equal to to the string "STAT". To check if something is exactly equal to another value in Python, we use == or "double equals".

  2. The data we're checking is NOT all numbers, so the data type of the subject names are strings. This requires that the value is contained in quotes (either 'STAT' or "STAT").

  3. The column name (Subject) is also a string, but the use of the "dot notation" is a shortcut and one of the few exceptions where a string does not need quotes. If we wanted to be more explicit, the longer way to write this same statement is: df[ df["Subject"] == "STAT" ]. This style will be required if the column name contains a space.

Example: Selecting all Graduate / Professional Courses

At the University of Illinois, courses are specifically graduate or professional courses if their course number is at least 500 (specifically 500-799, but there are no courses higher than that).

The column name that contains the course number is Number and we can select the rows where the course number is greater than or equal to 500:

Reset Code Python Output:
Year Term YearTerm Subject Number Name Description Credit Hours Section Info Degree Attributes
44 2019 Fall 2019-fa AAS 501 Theory and Methods in AAS Foundational gateway course for graduate study... 4 hours. NaN NaN
45 2019 Fall 2019-fa AAS 539 Youth, Culture and Society Same as EPS 539 and HDFS 539. See HDFS 539. 4 hours. NaN NaN
46 2019 Fall 2019-fa AAS 561 Race and Cultural Critique Introduction to graduate level theoretical and... 4 hours. NaN NaN
47 2019 Fall 2019-fa AAS 589 Readings in Asian Am Studies Individual guidance in intensive readings in t... 1 TO 4 hours. NaN NaN
48 2019 Fall 2019-fa AAS 590 Asian Am Studies Seminar Approved for letter and S/U grading. May be re... 2 TO 4 hours. NaN NaN
... ... ... ... ... ... ... ... ... ... ...
8561 2019 Fall 2019-fa VM 635 Veterinary Medical Spanish In this course second year veterinary students... 2 hours. No graduate credit. 2 professional hours. Appr... NaN
8562 2019 Fall 2019-fa VM 642 Contemporary Issues in Vet Med An introductory course for first year veterina... 1 hours. NaN NaN
8563 2019 Fall 2019-fa VM 643 Fundamentals of Management An introductory course for second year veterin... 1 hours. No graduate credit. 1 professional hour. Appro... NaN
8564 2019 Fall 2019-fa VM 645 Communications in Practice An introductory course for third year veterina... 1 hours. NaN NaN
8565 2019 Fall 2019-fa VM 694 Veterinary Medicine To be used to designate a trial or experimenta... 1 TO 4 hours. NaN NaN

The subset of rows at the graduate or professional level (2,773 total rows).

Reset Code Python Output:
Year Term YearTerm Subject Number Name Description Credit Hours Section Info Degree Attributes
44 2019 Fall 2019-fa AAS 501 Theory and Methods in AAS Foundational gateway course for graduate study... 4 hours. NaN NaN
45 2019 Fall 2019-fa AAS 539 Youth, Culture and Society Same as EPS 539 and HDFS 539. See HDFS 539. 4 hours. NaN NaN
46 2019 Fall 2019-fa AAS 561 Race and Cultural Critique Introduction to graduate level theoretical and... 4 hours. NaN NaN
47 2019 Fall 2019-fa AAS 589 Readings in Asian Am Studies Individual guidance in intensive readings in t... 1 TO 4 hours. NaN NaN
48 2019 Fall 2019-fa AAS 590 Asian Am Studies Seminar Approved for letter and S/U grading. May be re... 2 TO 4 hours. NaN NaN
... ... ... ... ... ... ... ... ... ... ...
8561 2019 Fall 2019-fa VM 635 Veterinary Medical Spanish In this course second year veterinary students... 2 hours. No graduate credit. 2 professional hours. Appr... NaN
8562 2019 Fall 2019-fa VM 642 Contemporary Issues in Vet Med An introductory course for first year veterina... 1 hours. NaN NaN
8563 2019 Fall 2019-fa VM 643 Fundamentals of Management An introductory course for second year veterin... 1 hours. No graduate credit. 1 professional hour. Appro... NaN
8564 2019 Fall 2019-fa VM 645 Communications in Practice An introductory course for third year veterina... 1 hours. NaN NaN
8565 2019 Fall 2019-fa VM 694 Veterinary Medicine To be used to designate a trial or experimenta... 1 TO 4 hours. NaN NaN

The subset of rows at the graduate or professional level (2,773 total rows). We can see that the output is the same for both statements.

Random Row Selection

In many areas of data science, it will be important to divide the data into random groups or randomly sample a subset of the collected data. In both cases, we are taking a "random sample" of our data. The pandas library provides two methods to randomly sample the data:

  1. Sample a fraction of the data by using df.sample() with a frac parameter:

    • Ex: df.sample(frac=0.5) samples 50% (0.5) of our data randomly.
    • Ex: df.sample(frac=0.01) samples 1% (0.01) of our data randomly.

  2. Sample a specific number of rows/observations from the data by using df.sample() with a n parameter (or no parameter at all!):

    • Ex: df.sample(n=10) samples exactly 10 rows randomly.
    • Ex: df.sample(n=107) samples exactly 107 rows randomly.
    • Ex: df.sample(10) also samples exactly 10 rows randomly (the n is not required, but frac is required).

Since this is a random sample, every time you run df.sample the result will be different!

Example: A Random Course Schedule of Five Courses

An average student at the University of Illinois will take around five courses each semester. Using the Course Catalog dataset, we can randomly pick five courses to create a random schedule:

Reset Code Python Output:
Year Term YearTerm Subject Number Name Description Credit Hours Section Info Degree Attributes
3665 2019 Fall 2019-fa FIN 520 Financial Management Introduction to financial management and decis... 4 hours. Prerequisite: Enrollment in the Executive MBA,... NaN
3523 2019 Fall 2019-fa ESE 100 Sustainable Earth Same as GCL 124. See GCL 124. 3 hours. NaN Nat Sci & Tech - Phys Sciences course.
1096 2019 Fall 2019-fa ARTF 106 Visualization Drawing This studio course introduces students to the ... 3 hours. Additional fees may apply. See Class Schedule.... NaN
8558 2019 Fall 2019-fa VM 623 Research Project II In this lecture/discussion course, which is a ... 2 hours. NaN NaN
3261 2019 Fall 2019-fa ENGL 486 History of Translation Same as CLCV 430, CWL 430, GER 405, SLAV 430, ... 3 OR 4 hours. NaN NaN

A random selection of five rows (5 total rows).

Running the exact same code again, our results are completely different since the selection is random:

Reset Code Python Output:
Year Term YearTerm Subject Number Name Description Credit Hours Section Info Degree Attributes
1093 2019 Fall 2019-fa ARTF 103 Design I Theory and practice in the elements, processes... 3 hours. NaN NaN
463 2019 Fall 2019-fa AFST 231 Elementary Swahili I Same as SWAH 201. See SWAH 201. 5 hours. Same as SWAH 201. See SWAH 201. NaN
4182 2019 Fall 2019-fa GWS 325 Lesbian/Queer Media Cultures Discusses how various LGBT/Q communities were ... 3 hours. NaN NaN
7791 2019 Fall 2019-fa SOCW 200 Introduction to Social Work Broad survey of the field of social work; intr... 3 hours. NaN Social & Beh Sci - Soc Sci course.
7790 2019 Fall 2019-fa SOCW 199 Undergraduate Open Seminar Approved for letter and S/U grading. May be re... 1 TO 4 hours. Approved for letter and S/U grading. May be re... NaN

Another random selection of five rows (5 total rows).

Interested in learning different methods to select rows in a DataFrame? Check out our Guides on Row Selection using DataFrames:


Example Walk-Throughs with Worksheets

Video 1: DataFrames

Follow along with the worksheet to work through the problem:

Practice Questions

Q1: I want to randomly select 5 movies from a dataset titled "topMovies.csv" holding the top 100 user rated movies on IMDB. How would I do this in Python?
Q2: In order to select the first 10 students from a class roster dataset titled "STAT207", what command would you use in Python?
Q3: If you had a dataframe titled "df" with a row titled "scores", how would you select the rows with the highest 3 scores in Python?