Introduction to Exploratory Data Analysis

DISCOVERY lab assignments and mastery platform homeworks are many students' very first exposure to programming! As exciting as this is, learning something new always leads to something much less exciting: getting stuck. For many students, the first major roadblocks begin as we transition from introductory programming into Exploratory Data Analysis, usually referred to as EDA.

EDA is our first opportunity to answer real questions about real data. Out in industry, it is used as a preliminary step to get to know the dataset we're about to work with; the information uncovered in EDA will be crucial in guiding our decisions for further analysis. Mastering the common methods here is our first big step in developing the toolbox we'll bring with us everywhere as data scientists. In other words, we're starting a very important part of our journey in DISCOVERY!

That being said, EDA is a big step! Using huge datasets to answer tough questions can be just that--tough! Suddenly, we really need to understand the rules of programming, the functions we've learned, the data we're working with, the question we've been asked, and at least one way to put all of these things together. In the beginning, EDA may seem convoluted and complicated and frustrating. This is completely normal. Remember, these frustrations are just growing pains: temporary discomforts to remind you of how much you're about to grow--or in our case, learn. Every time you get stuck, you have an opportunity to do something more challenging than you've done before as a data scientist. This should be exciting! Try to remember this moving forward. :)

All that being said, let's discuss some ways to make this process easier.

I like to break this down into three questions.

Three questions necessary for any EDA:
1. What are you trying to find?
2. What were you given to try to find it?
3. How can you turn what you've been given into what you're trying to find?

What are you trying to find?

While this first question might seem like a simple one, there can be more to it than meets the eye. Most people will eventually find themselves poking at a problem, only to realize after a few minutes that they're not sure what they're supposed to be doing!

Whether or not you've been in this position yet, you will eventually need to ask yourself the first question: What am I trying to find?

To answer this, there are a few important sub-questions to consider:

  1. Most importantly, what actually is it?
    Make sure you can say in your own words what it is that you're looking for. A thorough understanding of both the question and the goal is crucial for designing a game plan. If you're really not sure what a question is asking, ask for help!
  2. What format should my answer be in?
    Another DataFrame? A single number? A plot? A word? A single column? Some additional columns in a DataFrame? Make sure to know what you're trying to make before you try to make it.
  3. What is a valid range I expect my answer to be in?
    Proportions (0-1) or percentages (0-100)? Count values (non-negative integers)? Whole numbers? Decimals? An average of some kind? Having an idea of what this range should be will help you decide if you believe your answer is correct.
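To make the format and range checks concrete, here is a minimal sketch. The DataFrame and its column names are made up for illustration; they are not from any DISCOVERY dataset.

```python
import pandas as pd

# Hypothetical data: one row per student (invented for this example).
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "passed": [True, False, True, True],
})

# A proportion: the mean of a True/False column is the fraction of True values.
pass_rate = float(df["passed"].mean())

# A count: the sum of a True/False column is the number of True values.
num_passed = int(df["passed"].sum())

# Check the range before trusting the answer.
assert 0 <= pass_rate <= 1          # a proportion must fall between 0 and 1
assert 0 <= num_passed <= len(df)   # a count cannot exceed the number of rows

print(pass_rate, num_passed)
```

If one of these checks fails, that is a good sign that something went wrong earlier in the calculation, not in the check itself.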

Now that you have a strong understanding of your goal, we can move on to the next question...

What were you given to try to find it?

This question is essentially encouraging a thorough investigation of the data provided to you. Sometimes, you may have a very large, complicated dataset (or potentially multiple datasets!) to work with. In the age of big data, this is often the case for data scientists, and this step can take considerable time.

Luckily for us, the datasets in DISCOVERY tend to be on the simpler, cleaner side, so this step can be as straightforward as a quick skim of the variable names and some of their values. The .head() and .sample() functions and the .columns attribute (no parentheses!) will be your friends here.
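As a quick illustration, a first look at a dataset might go something like the sketch below. The DataFrame is a small made-up stand-in; in practice you would load your own data first.

```python
import pandas as pd

# Hypothetical stand-in for a dataset you were given.
df = pd.DataFrame({
    "student": ["Ana", "Ben", "Cy", "Dee", "Eli"],
    "section": ["A", "B", "A", "B", "A"],
    "score": [88, 92, 76, 85, 90],
})

first_rows = df.head(3)                      # the first 3 rows
random_rows = df.sample(2, random_state=0)   # 2 random rows (random_state makes it repeatable)
column_names = list(df.columns)              # .columns is an attribute, so no parentheses

print(first_rows)
print(random_rows)
print(column_names)
print(df.shape)   # (number of rows, number of columns)
```

Looking at a few random rows with .sample() can reveal things that the first few rows hide, such as values that only show up later in the data.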

Regardless of the complexity of your dataset, look at your data before starting your EDA! Familiarize yourself with the information available to you, where it is, and how you can access it so that you can properly make a plan for finding your answer.

Speaking of which, we have our third, final, and most important question...

How can you turn what you've been given into what you're trying to find?

This question is where the real action is. You know what you were given, you know what you're trying to find, and now you need to find a way to get from one to the other.

🚨 🚨 Don't just start writing code! 🚨 🚨

Unless the goal can be completed in a line or two of code, it is always best to think through your process in words before thinking about it in Python. Talk through it, draw it out, do whatever it takes for you to understand what your code will be doing before trying to write the code itself.
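One way to do this, sketched below, is to write the plan as comments first and only then fill in the code under each step. The question, data, and column names here are invented for the example.

```python
import pandas as pd

# Hypothetical data: one row per student.
df = pd.DataFrame({
    "student": ["Ana", "Ben", "Cy", "Dee"],
    "section": ["A", "B", "A", "B"],
    "score": [88, 92, 76, 85],
})

# Question: what is the average score of students in section A?
#
# Plan, in words:
#   1. Keep only the rows where the section is "A".
#   2. From those rows, pull out the score column.
#   3. Average that column.

# Step 1: keep only the rows where the section is "A".
section_a = df[df["section"] == "A"]

# Step 2: pull out the score column.
scores_a = section_a["score"]

# Step 3: average that column.
avg_a = float(scores_a.mean())

print(avg_a)
```

Once the plan exists in words, each step tends to turn into only a line or two of code.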

Here are some sub-questions to aid your planning:

  1. What information do I need to calculate my answer?
    Figuring out the pieces necessary to perform your final calculation should be your first step. Are you calculating an average? A total? A GPA? Maybe you're making a histogram or box plot? Once you know the pieces of information necessary for your final step (and how many pieces there are), you can start to think about how to access and store them.

  2. Does my answer involve all of the data?
    Sometimes, the answers we're looking for only involve a subset of the data. In other words, we may only want to look at rows of the DataFrame that meet certain criteria.

  3. Is the data in the format I need to get my answer?
    Think about the type of information in each row of the DataFrame you were given. Maybe each row is a student, a section, a team, or a client. Then think about the type of information you'll need to get your final answer. Are we looking for information on specific groups of these students, sections, teams, or clients? Or will we continue to be looking at them individually?

  4. How many intermediate calculations will be involved?
    Do these intermediate calculations involve everyone, just a subset, or entire groups within the data? Which column holds the information necessary for my calculations? Do these calculations need to be stored somewhere to use later?

  5. What order should I be doing this in?
    Read the given question and instructions carefully. Often, these can give hints about the order you should be going in. Do you want to be looking at data for groups composed of only certain individuals? Or maybe you only want to see entire groups meeting certain criteria. While these two tasks involve the same functions, the order of the functions would be different. Which is which?
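The "which is which?" question above can be sketched with a small example. This is only one possible reading, using made-up data and column names: subsetting rows and grouping are the same two tools in both cases, but the order changes the answer.

```python
import pandas as pd

# Hypothetical data: one row per student.
df = pd.DataFrame({
    "section": ["A", "A", "B", "B", "C", "C"],
    "year": ["first", "second", "first", "first", "second", "second"],
    "score": [90, 70, 80, 60, 85, 95],
})

# Order 1: subset first, then group.
# "Average score per section, counting only first-year students."
first_years = df[df["year"] == "first"]
avg_first_year_by_section = first_years.groupby("section")["score"].mean()

# Order 2: group first, then subset the groups.
# "Which sections have an overall average score of at least 80?"
avg_by_section = df.groupby("section")["score"].mean()   # intermediate result, stored for later
high_sections = avg_by_section[avg_by_section >= 80]

print(avg_first_year_by_section)
print(high_sections)
```

In the first case, section C disappears entirely because it has no first-year students. In the second case, every student counts toward their section's average, and only afterward are whole sections kept or dropped.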

Chances are, at least a few of those questions made you think of specific actions in Python. In case you didn't catch them all, here's my list of translations and some resources:

That's it!
With time and practice, practice, practice, your ability to translate common EDA tasks into lines of Python code will become second nature. For now, the pandas cheat sheet is an excellent resource to remind you of the tools you have available. And of course, always remember that there are course staff excited to help you make this step.

Now get out there and find some answers, data scientist!