Python for Data Science: Introduction to DataFrames
One key aspect of Data Science is computation. Python, a programming language, is well suited for data science, specifically because of:
- The extensive collection of libraries that extend the functionality of the programming language for data science-related tasks,
- The relative ease of the syntax of the programming language (it does not look too cryptic!), and
- The widespread adoption of the Python language (there are millions of people who know Python and can help us out).
Understanding Python Programs
Running a Python program requires a Python interpreter, which reads your Python code and runs it on a computer's CPU. Many Python interpreters are available in the cloud, such as Google Colab, which we will use for online examples, and you can also install a Python interpreter on your own computer!
You will be using Python as a tool to help you perform data science on real-world data.
Basic Python Data Types
The first and most important bit to know about Python is that there are just three basic data types that we will be using initially. A data type describes the type (or format) of the data and not the data itself. The three data types include:
- Numbers, which include both integers (whole numbers like 0) and numbers with a decimal point (floating point numbers).
- Strings, which include all non-numeric data. Strings must be surrounded by quotes.
- DataFrames, which are tables that are similar to a spreadsheet containing rows and columns of data. Each DataFrame usually consists of an entire dataset or some subset of the full dataset.
Everything we initially work with will be one of the three types of data.
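To make the first two data types concrete, here is a minimal sketch. The variable names and values are made up for illustration and do not come from any dataset in this lesson. The built-in type(...) function reports the data type of a value:

```python
# Illustrative values only; none of these come from a real dataset.
whole_number = 42                        # an integer (int)
decimal_number = 3.75                    # a floating point number (float)
course_name = "Data Science Discovery"   # a string (str), surrounded by quotes

# type(...) reports the data type of a value, not the value itself.
print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>
print(type(course_name))     # <class 'str'>

# Quotes make the difference: 42 is a number, but "42" is a string.
print(type("42"))            # <class 'str'>
```

DataFrames are not built into Python itself; they come from the pandas library, which is introduced later on this page.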
Python allows us to store data inside of named variables, which allows us a way to store and interact with data. This is similar to basic mathematics where you might have:
- x = 5, the value of x is set to 5
- y = x + 2, the value of y is set to the value of x (which is 5) plus 2
- z = x + y, the value of z is set to the sum of x and y
Variables will store the value you assign to them until you assign them a new value. In the example above, x will always have the value of 5 until x is set again.
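The same example can be run as Python code. This is a small sketch; the final reassignment of x to 10 is an added illustration of what "set again" means and is not part of the example above:

```python
x = 5            # x is set to 5
y = x + 2        # y is set to 5 + 2, which is 7
z = x + y        # z is set to 5 + 7, which is 12
print(x, y, z)   # 5 7 12

x = 10           # x is set again, so it now stores 10
print(x, y, z)   # 10 7 12
```

Notice that y and z keep the values they were assigned earlier. Setting x again changes only x, because each variable stores its own value.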
In data science, variables that store DataFrames will contain the data from our datasets. To load data into a variable, we will always perform three steps:
- We must first import the pandas library, giving us access to DataFrames. This is done via the import command and will always be the following line of code:
# Imports the `pandas` library to be used in our Python program:
import pandas as pd
- We must then load the dataset, assigning a variable to become a DataFrame. Unless we are working with many datasets at the same time, it is common to call the variable storing the dataset df, short for DataFrame. The pd.read_csv(...) command loads the dataset from a provided URL. The code we will use is always the following:
# Loads the "GPA dataset" into the variable `df` using `pd.read_csv`:
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/gpa.csv")
- In any Python notebook environment, writing the name of the variable as the last line in a cell will always display the variable's contents to the screen. To verify we loaded our data correctly, we'll display the contents of the variable df that we just assigned to be a DataFrame in the previous step:
# Displays the contents of the variable `df`:
# (`df` contains the GPA dataset after the previous pd.read_csv(...) step)
df
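The three steps above can also be tried without an internet connection. The sketch below assumes pandas is installed and uses a tiny, made-up CSV in place of the real file. The column names and values are hypothetical and are not the contents of the GPA dataset:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file; NOT the real GPA dataset.
csv_text = """Course,Students,Average GPA
Course A,120,3.41
Course B,85,3.10
Course C,40,3.87
"""

# pd.read_csv accepts a URL, a file path, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# In a notebook, writing `df` alone on the last line of a cell displays the table.
# In a plain Python script, use print(df) instead.
print(df)
print(df.shape)  # (3, 3): three rows and three columns
```

Wrapping the text in io.StringIO lets pd.read_csv treat it like a file, so the rest of the workflow is the same as loading from a URL.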
Running Python Programs
When you see Python code, the best way to understand what it's doing is to run it and nerd out with the code! Every page with code in Data Science Discovery will have a notebook just one click away:
► Run this code on Google Colab
The notebooks are read-only, but you can still edit and run the code in them. To keep your changes, create your own copy of a notebook by using File -> Save a Copy in Drive. This will create your own private copy of the code that you can change, modify, and keep on your own private Google Drive!
In the labs and projects, you will use Python on your own computer and learn how to create files offline.
Example Walk-Throughs with Worksheets
Video 1: DataFrames
Practice Questions
Q1: Which of the following are all examples of data types in Python?
Q2: What best describes the code: "x = 4"?
Q3: Before loading a dataset, or using any other functionality in the pandas library, what Python code must run first?
Q4: What is the value of x after the following lines of code are run?
Mastery-Based Assessment
A mastery-based assessment is available for Python for Data Science: Introduction to DataFrames:
- Access PrairieLearn (prairielearn.org)
- Complete the m1-03 Python for Data Science: Introduction to DataFrames mastery assessment on PrairieLearn
- Continue to master material and earn 100% mastery on all assessments in the "Basics of Data Science with Python" section to earn the Basics of Data Science with Python Mastery Badge!