Python for Data Science: Introduction to DataFrames
One key aspect of Data Science is computation. Python, a programming language, is well suited for data science, specifically because of:
- The extensive collection of libraries that extend the functionality of the programming language for data science-related tasks
- The relative ease of the syntax of the programming language (it does not look too cryptic!)
- The wide-scale existing adoption of the Python language (there are millions of people who know Python and can help us out)
Understanding Python Programs
Running Python programs requires a Python interpreter that will interpret your Python code and run it on a CPU. There are many Python interpreters that exist in the cloud -- such as Google Colab which we will use for online examples -- and you can install a Python interpreter on your own computer!
You will be using Python as a tool to help you perform data science on real-world data!
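As a quick check that an interpreter is available, a minimal first program might look like the sketch below. The message text is an arbitrary placeholder, and the version check assumes Python 3, which is what Google Colab provides.

```python
import sys

# A one-line program: the interpreter reads this line and runs it.
message = "Hello, data science!"
print(message)

# Ask the interpreter which major version of Python it is.
major_version = sys.version_info.major
print("Running Python", major_version)
```

If both lines print, the interpreter is working and ready for the examples that follow.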
Basic Python Data Types
The first thing to know about Python is that we will initially use three basic data types. A data type describes the format, or category, of the data and not the data itself. The three data types are:
- Numbers, which include both integers (whole numbers like 4, 8, -3, and 0) and numbers with a decimal (floating point numbers like 3.14, 2.0001, and -42.1).
- Strings, which include all non-numeric data. Strings must be surrounded by quotes (ex: "Hello", "Ice Cream", "STAT107").
- DataFrames, which are tables that are similar to a spreadsheet containing rows and columns of data. Each DataFrame usually consists of an entire dataset or a subset of the full dataset.
Everything we initially work with will be one of these three types of data.
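To see the first two data types in action, the sketch below asks Python for the type of a few values using the built-in type() function. The variable names are made up for illustration. Note that Python reports integers as int and floating point numbers as float, two separate types that this guide groups together as "Numbers".

```python
whole_number = 4         # an integer (whole number)
decimal_number = 3.14    # a floating point number (has a decimal)
course = "STAT107"       # a string (surrounded by quotes)

# type(...) reports the data type; __name__ gives its short name.
print(type(whole_number).__name__)    # int
print(type(decimal_number).__name__)  # float
print(type(course).__name__)          # str

# Quotes decide the type: "4" is a string, while 4 is a number.
quoted_four = "4"
print(type(quoted_four).__name__)     # str
```

The last example shows why the quotes matter: the same characters become a different type of data depending on whether they are quoted.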
Using Variables
Python allows us to store data inside of named variables, which are a way to store and interact with data. This is similar to basic mathematics where you might have:
- x = 5, the value of x is set to 5,
- y = x + 2, the value of y is set to the value of x (which is 5) plus 2 = 7, and
- z = x + y, the value of z is set to the sum of x (5) and y (7) = 12
Variables will store the value you assign to them until you assign them a new value. In the example above, x will keep the value of 5 until x is set again.
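The same example can be written as Python code. The sketch below repeats the three assignments and then sets x again to show what changes. The new value of 10 is an arbitrary choice for illustration.

```python
x = 5          # x is set to 5
y = x + 2      # y is set to 5 + 2
z = x + y      # z is set to 5 + 7

print(x, y, z)  # 5 7 12

# Setting x again replaces its stored value.
x = 10
print(x)        # 10

# y and z were computed earlier, so they keep their stored values.
print(y, z)     # 7 12
```

Assigning a new value to x does not go back and recompute y or z. Each variable holds the value it was given at the moment its assignment ran.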
DataFrames
In data science, variables that store DataFrames will contain the data from our datasets. To load data into a variable, we will always perform three steps:
- We must first import the pandas library, giving us access to DataFrames. This is done via the import command and will always be the following line of code: import pandas as pd
- We must then load the dataset, assigning a variable to become a DataFrame. Unless we are working with many datasets at the same time, it is common to call the variable storing the dataset df, short for DataFrame. The pd.read_csv(...) command loads the dataset from a provided URL that contains a CSV file.
- In any Python notebook environment, writing the name of the variable as the last line in a cell will always display the variable's contents to the screen. To verify we loaded our data correctly, we'll display the contents of the variable df that we just assigned to be a DataFrame in the previous step.
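The three steps above can be sketched as follows. To keep the example runnable without an internet connection, a small made-up CSV held in a string stands in for a real dataset. The column names and values are hypothetical and do not come from any dataset used in this guide. With a real dataset, the URL of the CSV file would be passed to pd.read_csv(...) as a quoted string instead.

```python
import io

# Step 1: import the pandas library, giving us access to DataFrames.
import pandas as pd

# Hypothetical stand-in data (CSV text: first line is the column names).
csv_text = """flavor,scoops,price
Vanilla,2,3.50
Mint,1,2.25
Chocolate,3,4.75
"""

# Step 2: load the dataset into a variable named df.
# pd.read_csv accepts a URL string or, as here, a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Step 3: check what was loaded.
print(df.shape)          # (3, 3), meaning 3 rows and 3 columns
print(list(df.columns))  # ['flavor', 'scoops', 'price']

# In a notebook, ending the cell with the variable name displays the table.
df
```

Outside of a notebook, the final line displays nothing on its own, so print(df) would be used instead.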
Interested in learning different methods to load datasets into a DataFrame? Check out our guides on reading and importing data into DataFrames.
Practice Questions
Q1: Which of the following are all examples of data types in Python?
Q2: What best describes the code: "x = 4"?
Q3: Before loading a dataset, or using any other functionality in the pandas library, what Python code must run first?
Q4: What is the value of x after the following lines of code are run?
