Python for Data Science: Introduction to DataFrames


One key aspect of Data Science is computation. Python, a programming language, is well suited for data science, specifically because of:

The extensive collection of libraries that extend the functionality of the programming language for data science-related tasks
The relative ease of the syntax of the programming language (it does not look too cryptic!)
The wide-scale existing adoption of the Python language (there are millions of people who know Python and can help us out)

Understanding Python Programs

Running a Python program requires a Python interpreter, which reads your Python code and executes it on a CPU. Many Python interpreters are available in the cloud, such as Google Colab, which we will use for online examples, and you can also install a Python interpreter on your own computer!
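To check that an interpreter is available, a short program like the following can be run in Google Colab or on your own computer. This is only a minimal sketch; the message text is made up for illustration.

```python
import sys

# Print a message to confirm the interpreter ran our code:
print("Hello, data science!")

# Show which version of Python the interpreter is running:
print(sys.version_info.major, sys.version_info.minor)
```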

You will be using Python as a tool to help you perform data science on real-world data!

Basic Python Data Types

The first and most important bit to know about Python is that there are three basic data types that we will be using initially. A data type describes the format, or category, of the data and not the data itself. The three data types include:

Numbers, which include both integers (whole numbers like 4, 8, -3, and 0) and numbers with a decimal (floating point numbers like 3.14, 2.0001, and -42.1).
Strings, which store text data. Strings must be surrounded by quotes (ex: "Hello", "Ice Cream", "STAT107").
DataFrames, which are tables that are similar to a spreadsheet containing rows and columns of data. Each DataFrame usually consists of an entire dataset or a subset of the full dataset.

Everything we initially work with will be one of these three types of data.
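The three data types above can be illustrated with a short sketch. Python's built-in type(...) function reports the data type of a value. The variable names and the small hand-built table here are assumptions made up for this example and are not part of any real dataset.

```python
import pandas as pd

# Numbers: an integer and a floating point number
whole_number = 4
decimal_number = 3.14

# String: text surrounded by quotes
course = "STAT107"

# DataFrame: a small table built by hand (hypothetical columns and values)
table = pd.DataFrame({"item": ["Ice Cream", "Cake"], "price": [3.5, 4.25]})

# type(...) tells us the data type of each value:
print(type(whole_number).__name__)    # int
print(type(decimal_number).__name__)  # float
print(type(course).__name__)          # str
print(type(table).__name__)           # DataFrame
```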

Using Variables

Python allows us to store data inside of named variables, which are a way to store and interact with data. This is similar to basic mathematics where you might have:

x = 5, the value of x is set to 5,
y = x + 2, the value of y is set to the value of x (which is 5) plus 2 = 7, and
z = x + y, the value of z is set to the sum of x (5) and y (7) = 12

Variables will store the value you assign to them until you assign them a new value. In the example above, x will always have the value of 5 until x is set again.
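The same example can be written directly as Python code. The sketch below also reassigns x to a new value (10, chosen arbitrarily here) to show that a variable keeps its value only until it is set again, and that y and z are not recalculated automatically.

```python
x = 5        # the value of x is set to 5
y = x + 2    # the value of y is set to 5 + 2 = 7
z = x + y    # the value of z is set to 5 + 7 = 12

print(x, y, z)   # 5 7 12

# Assigning x again replaces its old value.
# y and z keep the values they were given earlier:
x = 10
print(x, y, z)   # 10 7 12
```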

DataFrames

In data science, variables that store DataFrames will contain the data from our datasets. To load data into a variable, we will always perform three steps:

We must first import the pandas library, giving us access to DataFrames. This is done via the import statement and will always be the following line of code:

    # Imports the `pandas` library to be used in our Python program:
    import pandas as pd


We must then load the dataset, assigning a variable to become a DataFrame. Unless we are working with many datasets at the same time, it is common to call the variable storing the dataset df, short for DataFrame. The pd.read_csv(...) command loads the dataset from a provided URL that contains a CSV file.
In any Python notebook environment, writing the name of the variable as the last line in a cell will always display the variable's contents to the screen. To verify we loaded our data correctly, we'll display the contents of the variable df that we just assigned to be a DataFrame in the previous step:

    # Step 2:
    # Loads the "GPA dataset" into the variable `df` using `pd.read_csv`:
    df = pd.read_csv("https://waf.cs.illinois.edu/discovery/gpa.csv")

    # Step 3:
    # Displays the contents of the variable `df`:
    # (`df` contains the GPA dataset after the previous pd.read_csv(...) step)
    df
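The same three steps can also be tried without an internet connection. The sketch below is only an illustration under stated assumptions: the CSV text, its column names, and the variable names csv_text and df_small are made up for this example and are not the GPA dataset. It relies on pd.read_csv accepting a file-like object in place of a URL.

```python
import io
import pandas as pd

# A tiny, made-up CSV file stored as text (hypothetical data, not the GPA dataset):
csv_text = """name,score
Ada,91
Grace,88
Alan,79
"""

# pd.read_csv can read from a file-like object as well as from a URL:
df_small = pd.read_csv(io.StringIO(csv_text))

# Outside of a notebook, print(...) displays the contents of the DataFrame:
print(df_small)

# The shape is (number of rows, number of columns):
print(df_small.shape)   # (3, 2)
```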


