Finding Descriptive Statistics for Columns in a DataFrame

When we're presented with a new DataFrame, it can be a lot to deal with. A great way to familiarize ourselves with all the new information is to look at descriptive statistics (sometimes known as summary statistics) for all applicable variables.

The Movie Dataset

To demonstrate these functions, we'll use a DataFrame of five different movies, including information about their release date, how much money they made in US dollars, and a personal rating out of 10.

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df

Output:

	movie	release date	domestic box office	worldwide box office	personal rating	international box office
0	The Truman Show	1996-06-05	125618201	264118201	10	138500000
1	Rogue One: A Star Wars Story	2016-12-16	532177324	1055135598	9	522958274
2	Iron Man	2008-05-02	318604126	585171547	7	266567421
3	Blade Runner	1982-06-25	32656328	39535837	8	6879509
4	Breakfast at Tiffany's	1961-10-05	9551904	9794721	7	242817

List of Functions

Pandas has a great selection of functions for calculating descriptive statistics. In most cases, we only want to use these on columns with float and int dtypes, not strings. For example, we can't calculate the average movie title!
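
Before applying these functions, it can help to check which columns are actually numeric. The following is a minimal sketch, assuming a trimmed copy of the movie DataFrame (two movies, three columns); the variable name numeric_columns is only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Trimmed copy of the movie data, used only for illustration
df = pd.DataFrame([
    {"movie": "The Truman Show", "release date": "1996-06-05", "personal rating": 10},
    {"movie": "Iron Man", "release date": "2008-05-02", "personal rating": 7},
])

# Keep only the columns whose dtype is numeric (ints or floats)
numeric_columns = [col for col in df.columns if is_numeric_dtype(df[col])]
print(numeric_columns)  # ['personal rating']
```

Note that "release date" is stored as text here, so it is not treated as numeric either.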

We'll go into detail about how to use these later. But for now, here are the most common and useful functions.

.count()
- Returns how many non-null values are in a column
- In other words, how many rows actually have a value for this column?
.sum()
- Returns the sum of all values in a column
.mean()
- Returns the mean (average) of the values in a column
.median()
- Returns the median of the values in a column
.var()
- Returns the variance of the values in a column
.std()
- Returns the standard deviation of the values in the column
- aka the square root of the variance
- NOTE: By default, pandas calculates the sample standard deviation, not the population standard deviation. To calculate the population standard deviation, set the degrees of freedom to 0 by passing ddof=0 inside the parentheses. The same applies to .var().
.min() and .max()
- Return the minimum and maximum values in a column, respectively
- See the guide for Finding Specific Values in a Column
.quantile()
- Returns the quantiles of the values in a column
- Must input a parameter to specify the quantile
- See the guide for Finding Quantiles of a Column in a DataFrame
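
To make the ddof note above concrete, here is a small sketch that compares the sample and population standard deviation. It assumes the five personal ratings from the movie DataFrame, copied into a standalone Series so the example runs on its own.

```python
import pandas as pd

# Personal ratings copied from the movie DataFrame
ratings = pd.Series([10, 9, 7, 8, 7], name="personal rating")

sample_std = ratings.std()            # default ddof=1 (sample standard deviation)
population_std = ratings.std(ddof=0)  # ddof=0 (population standard deviation)

print(round(sample_std, 5))      # 1.30384
print(round(population_std, 5))  # 1.16619

# .var() uses the same ddof convention; std is the square root of var
print(round(ratings.var(), 2))   # 1.7

# .quantile() needs the quantile as a parameter (0.25 is the first quartile)
print(ratings.quantile(0.25))    # 7.0
```

The sample value matches the std row that .describe() reports for this column, since .describe() also uses the sample standard deviation.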

Now, let's see these functions in action.

Finding a Descriptive Statistic for a Single Column

The most practical use of descriptive statistics is to apply the functions to a single column. This allows us to store the result in a variable and save it for future analysis.

We do this by specifying the column in brackets before applying the function. Let's say we wanted to find the average personal rating of these 5 movies.

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df["personal rating"].mean()

Output:

```
8.2
```
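
As a sketch of what "saving the result for future analysis" might look like, the example below stores the mean in a variable and reuses it to filter the rows. It assumes a trimmed copy of the movie DataFrame with only the title and rating columns; the names average_rating and above_average are illustrative.

```python
import pandas as pd

# Trimmed copy of the movie data (titles and ratings only)
df = pd.DataFrame({
    "movie": ["The Truman Show", "Rogue One: A Star Wars Story", "Iron Man",
              "Blade Runner", "Breakfast at Tiffany's"],
    "personal rating": [10, 9, 7, 8, 7],
})

# Store the statistic in a variable so it can be reused later
average_rating = df["personal rating"].mean()

# Reuse it to find the movies rated above the average
above_average = df[df["personal rating"] > average_rating]
print(above_average["movie"].tolist())
# ['The Truman Show', 'Rogue One: A Star Wars Story']
```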

Finding a Descriptive Statistic for All Columns

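If we don't specify a column first, the function returns a Series containing that statistic for each column. But be careful: not every column in the DataFrame contains floats and ints, and in recent versions of pandas (2.0 and later) calling a function such as .median() on a DataFrame that includes string columns raises an error. Passing numeric_only=True tells pandas to skip the non-numeric columns.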

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df.median(numeric_only=True)

Output:

```
domestic box office         125618201.0
worldwide box office        264118201.0
personal rating                     8.0
international box office    138500000.0
dtype: float64
```
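
An alternative to numeric_only=True is to select the numeric columns first and then apply the function. The sketch below assumes a trimmed copy of the movie DataFrame (three movies, three columns) and compares the two approaches; medians_a and medians_b are illustrative names.

```python
import pandas as pd

# Trimmed copy of the movie data, used only for illustration
df = pd.DataFrame({
    "movie": ["The Truman Show", "Iron Man", "Blade Runner"],
    "personal rating": [10, 7, 8],
    "domestic box office": [125618201, 318604126, 32656328],
})

# Option 1: ask the function to skip non-numeric columns
medians_a = df.median(numeric_only=True)

# Option 2: select the numeric columns first, then apply the function
medians_b = df.select_dtypes(include="number").median()

print(medians_a)
print(medians_b)
```

Both options leave out the "movie" column, so the two results should contain the same medians.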

The Holy Grail: Finding All of the Basic Descriptive Statistics

All of the aforementioned functions find one descriptive statistic at a time. But if we want a simple way to see all this information at once, there's also a function for that: .describe(). There are a few different ways to use this function, which are detailed below.

Entire DataFrame

If we apply .describe() to an entire DataFrame, it returns a brand new DataFrame with rows that correspond to all essential descriptive statistics. By default, it will only include the columns with integer and float dtypes.

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df.describe()

Output:

	domestic box office	worldwide box office	personal rating	international box office
count	5.000000e+00	5.000000e+00	5.00000	5.000000e+00
mean	2.037216e+08	3.907512e+08	8.20000	1.870296e+08
std	2.203103e+08	4.369559e+08	1.30384	2.172975e+08
min	9.551904e+06	9.794721e+06	7.00000	2.428170e+05
25%	3.265633e+07	3.953584e+07	7.00000	6.879509e+06
50%	1.256182e+08	2.641182e+08	8.00000	1.385000e+08
75%	3.186041e+08	5.851715e+08	9.00000	2.665674e+08
max	5.321773e+08	1.055136e+09	10.00000	5.229583e+08

That one line of code returns something pretty powerful.
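
Because the result of .describe() is itself a DataFrame, individual statistics can be looked up by row and column label. Here is a minimal sketch, assuming a two-column copy of the movie data; summary, mean_rating, and max_domestic are illustrative names.

```python
import pandas as pd

# Two numeric columns copied from the movie DataFrame
df = pd.DataFrame({
    "personal rating": [10, 9, 7, 8, 7],
    "domestic box office": [125618201, 532177324, 318604126, 32656328, 9551904],
})

summary = df.describe()

# Rows are the statistics; columns are the original column names
mean_rating = summary.loc["mean", "personal rating"]
max_domestic = summary.loc["max", "domestic box office"]

print(mean_rating)
print(max_domestic)
```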

One Column

If you want to find all descriptive statistics for a single column at once, .describe() can do that, too. With only one column, the results are returned as a Series.

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df["worldwide box office"].describe()

Output:

```
count    5.000000e+00
mean     3.907512e+08
std      4.369559e+08
min      9.794721e+06
25%      3.953584e+07
50%      2.641182e+08
75%      5.851715e+08
max      1.055136e+09
Name: worldwide box office, dtype: float64
```

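However, when we apply .describe() to a column of strings, we don't get an error. Instead, .describe() gives us a Series of statistics that are more applicable to the string dtype: the number of non-null values (count), the number of distinct values (unique), the most common value (top), and how often that value appears (freq).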

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df["movie"].describe()

Output:

```
count                   5
unique                  5
top       The Truman Show
freq                    1
Name: movie, dtype: object
```
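
Since every movie title above is unique, the top and freq rows are not very informative there. The sketch below uses a hypothetical Series of genre labels (not part of the movie DataFrame) to show what these rows look like when a value repeats.

```python
import pandas as pd

# Hypothetical genre labels, not part of the movie DataFrame
genres = pd.Series(["sci-fi", "sci-fi", "drama"], name="genre")

summary = genres.describe()

print(summary["count"])   # 3 non-null values
print(summary["unique"])  # 2 distinct values
print(summary["top"])     # sci-fi, the most frequent value
print(summary["freq"])    # 2, how often the top value appears
```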

Subsets of Columns

We can describe smaller subsets of columns, too. Just use double brackets to insert a list of the column names, with each name separated by a comma. The result will be a DataFrame.

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df[["domestic box office", "worldwide box office"]].describe()

Output:

	domestic box office	worldwide box office
count	5.000000e+00	5.000000e+00
mean	2.037216e+08	3.907512e+08
std	2.203103e+08	4.369559e+08
min	9.551904e+06	9.794721e+06
25%	3.265633e+07	3.953584e+07
50%	1.256182e+08	2.641182e+08
75%	3.186041e+08	5.851715e+08
max	5.321773e+08	1.055136e+09
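
The difference between single and double brackets can be checked directly. This is a small sketch, assuming a two-row copy of the box office columns; one_column and two_columns are illustrative names.

```python
import pandas as pd

# Two-row copy of the box office columns, used only for illustration
df = pd.DataFrame({
    "domestic box office": [125618201, 532177324],
    "worldwide box office": [264118201, 1055135598],
})

# Single brackets select one column, so .describe() returns a Series
one_column = df["domestic box office"].describe()

# Double brackets select a list of columns, so .describe() returns a DataFrame
two_columns = df[["domestic box office", "worldwide box office"]].describe()

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```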

Describing a subset of columns is most useful when all of the selected columns contain numbers (floats and/or ints) or when all of them contain strings. If the selected columns have mixed dtypes, .describe() shows only the numerical descriptive statistics by default.

    import pandas as pd

    # Creates a DataFrame of "movie", "release date", "domestic box office", "worldwide box office", "personal rating", and "international box office" columns
    df = pd.DataFrame([
      {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
      {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
      {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
      {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
      {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
    ])
    df[["movie", "personal rating"]].describe()

Output:

	personal rating
count	5.00000
mean	8.20000
std	1.30384
min	7.00000
25%	7.00000
50%	8.00000
75%	9.00000
max	10.00000
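
If statistics for both the string and the numeric column are wanted in one table, .describe() accepts an include parameter. The sketch below passes include="all", assuming a copy of the movie DataFrame with only the title and rating columns; mixed_summary is an illustrative name.

```python
import pandas as pd

# Title and rating columns copied from the movie DataFrame
df = pd.DataFrame({
    "movie": ["The Truman Show", "Rogue One: A Star Wars Story", "Iron Man",
              "Blade Runner", "Breakfast at Tiffany's"],
    "personal rating": [10, 9, 7, 8, 7],
})

# include="all" describes every selected column, whatever its dtype
mixed_summary = df[["movie", "personal rating"]].describe(include="all")
print(mixed_summary)
```

Statistics that don't apply to a column (such as the mean of "movie" or the unique count of "personal rating") show up as NaN in the combined table.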