Python for Data Science: Introduction to DataFrames


One key aspect of Data Science is computation. Python, a programming language, is well suited for data science, specifically because of:

  • The extensive collection of libraries that extend the functionality of the programming language for data science-related tasks
  • The relative ease of the syntax of the programming language (it does not look too cryptic!)
  • The wide-scale existing adoption of the Python language (there are millions of people who know Python and can help us out)

Understanding Python Programs

Running Python programs requires a Python interpreter, which reads your Python code and runs it on a CPU. Many Python interpreters are available in the cloud, such as Google Colab, which we will use for online examples, and you can also install a Python interpreter on your own computer!
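
As a small illustration of the interpreter at work, here is a minimal sketch of a complete Python program. It uses only the standard `sys` module, and the exact version text it prints will depend on which interpreter runs it (Google Colab or a local install):

```python
import sys

# The interpreter reads these lines from top to bottom and runs each one.
print("Hello from Python!")

# sys.version is a string describing the interpreter that is running this code.
print(sys.version)
```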

You will be using Python as a tool to help you perform data science on real-world data!

Basic Python Data Types

The first and most important thing to know about Python is that there are three basic data types that we will be using initially. A data type describes the format, or category, of the data and not the data itself. The three data types are:

  1. Numbers, which include both integers (whole numbers like 4, 8, -3, and 0) and numbers with a decimal (floating point numbers like 3.14, 2.0001, and -42.1).

  2. Strings, which hold text data. Strings must be surrounded by quotes (ex: "Hello", "Ice Cream", "STAT107").

  3. DataFrames, which come from the pandas library and are tables similar to a spreadsheet, containing rows and columns of data. Each DataFrame usually holds an entire dataset or a subset of the full dataset.

Everything we initially work with will be one of these three types of data.
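
To see the first two data types in action, here is a minimal sketch using Python's built-in `type(...)` function. The variable names are chosen only for illustration:

```python
# Three values, each stored in a variable.
whole_number = 4         # an integer
decimal_number = 3.14    # a floating point number
course = "STAT107"       # a string (note the quotes)

# type(...) reports the data type of a value, not the value itself.
print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>
print(type(course))          # <class 'str'>

# Quotes matter: "4" is a string, while 4 is a number.
print(type("4"))             # <class 'str'>
```

Python reports integers as `int`, floating point numbers as `float`, and strings as `str`.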

Using Variables

Python allows us to store data inside of named variables, which are a way to store and interact with data. This is similar to basic mathematics where you might have:

  • x = 5, the value of x is set to 5,
  • y = x + 2, the value of y is set to the value of x (which is 5) plus 2 = 7, and
  • z = x + y, the value of z is set to the sum of x (5) and y (7) = 12

Variables store the value you assign to them until you assign them a new value. In the example above, x keeps the value 5 until x is assigned again.
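
The same example can be written directly as Python code. This short sketch also shows what happens when x is assigned a new value: y and z keep the values that were computed earlier.

```python
x = 5        # the value of x is set to 5
y = x + 2    # y is set to 5 + 2 = 7
z = x + y    # z is set to 5 + 7 = 12
print(x, y, z)   # 5 7 12

# Assigning x again replaces its old value...
x = 10
print(x)     # 10

# ...but y was already computed, so it does not change.
print(y)     # 7
```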

DataFrames

In data science, variables that store DataFrames will contain the data from our datasets. To load data into a variable, we will always perform three steps:

  1. We must first import the pandas library, giving us access to DataFrames. This is done via the import command and will always be the following line of code:
# Imports the `pandas` library to be used in our Python program:
import pandas as pd
  2. We must then load the dataset and assign the resulting DataFrame to a variable. Unless we are working with many datasets at the same time, it is common to call the variable storing the dataset df, short for DataFrame. The pd.read_csv(...) command loads the dataset from a provided URL that contains a CSV file. The code we will use is always the following:
# Loads the "GPA dataset" into the variable `df` using `pd.read_csv`:
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/gpa.csv")
  3. In any Python notebook environment, writing the name of a variable as the last line in a cell displays the variable's contents on the screen. To verify we loaded our data correctly, we'll display the contents of the variable df that we just assigned to be a DataFrame in the previous step:
# Displays the contents of the variable `df`:
# (`df` contains the GPA dataset after the previous pd.read_csv(...) step)
df
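
If you want to try these three steps without an internet connection, the sketch below follows the same pattern with a tiny, made-up dataset. The CSV text, its column names, and its values are hypothetical and are not taken from the GPA dataset; `io.StringIO` simply lets `pd.read_csv` read from text in memory instead of from a URL.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real dataset file.
csv_text = """Course,Students,Average GPA
STAT 107,250,3.6
CS 101,400,3.2
MATH 221,300,2.9
"""

# pd.read_csv accepts a file-like object as well as a URL or file path.
df = pd.read_csv(io.StringIO(csv_text))

# The variable df now stores a DataFrame.
print(type(df))

# df.shape is (number of rows, number of columns).
print(df.shape)   # (3, 3)

# Outside of a notebook, print(df) displays the table.
print(df)
```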

Interested in learning different methods to load datasets into a DataFrame? Check out our Guides on Reading and Importing Data into DataFrames.


Practice Questions

Q1: Which of the following are all examples of data types in Python?
Q2: What best describes the code: "x = 4"?
Q3: Before loading a dataset, or using any other functionality in the pandas library, what Python code must run first?
Q4: What is the value of x after the following lines of code are run?