Starting Your Own Data Science Project


Starting a data science project of your own can be intimidating. This guide gives you a head start: we'll walk through how to find a dataset, ways to start analyzing it, and some steps to spark ideas for you to explore!

Collecting Your Data

First, we need to find a dataset to analyze. It's best to pick a topic you are passionate about. Prior knowledge of the subject also helps you understand the context behind each column, which informs your analysis. Kaggle.com is a great resource for finding a dataset on just about anything. If you are interested in sports, for example, you can find more specific datasets at websites such as sports-reference.com.

For this example, we'll use a dataset of songs from our favorite artist at DISCOVERY: Taylor Swift!

The dataset we'll be using can be found on Kaggle, and contains information about every Taylor Swift song. Kaggle datasets also contain information about what the columns in the dataset represent.

The columns in this dataset are name (song name), album, artist, release_date, length, popularity, danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo. This may seem like a lot of information, so we'll take it slow and analyze a few of the variables at a time. Our first step is to load the data into a DataFrame and become familiar with it so we can perform our analysis.

import pandas as pd

df = pd.read_csv('spotify_taylorswift.csv')
df.head(5)
   name                                         album         artist        release_date  length  popularity  danceability  acousticness  energy  instrumentalness  liveness  loudness  speechiness  valence  tempo
0  Tim McGraw                                   Taylor Swift  Taylor Swift  2006-10-24    232106  49          0.580         0.575         0.491   0.0               0.1210    -6.462    0.0251       0.425    76.009
1  Picture To Burn                              Taylor Swift  Taylor Swift  2006-10-24    173066  54          0.658         0.173         0.877   0.0               0.0962    -2.098    0.0323       0.821    105.586
2  Teardrops On My Guitar - Radio Single Remix  Taylor Swift  Taylor Swift  2006-10-24    203040  59          0.621         0.288         0.417   0.0               0.1190    -6.941    0.0231       0.289    99.953
3  A Place in this World                        Taylor Swift  Taylor Swift  2006-10-24    199200  49          0.576         0.051         0.777   0.0               0.3200    -2.881    0.0324       0.428    115.028
4  Cold As You                                  Taylor Swift  Taylor Swift  2006-10-24    239013  50          0.418         0.217         0.482   0.0               0.1230    -5.769    0.0266       0.261    175.558
Loading Taylor Swift Dataframe

Exploring the Dataset

This dataset has 171 rows. The pandas attribute df.dtypes (an attribute, so it takes no parentheses) shows the data type of each column. Text variables such as the song name, album, and artist are stored as objects, while the audio information is numerical and is stored as integer or float values.

When working with datasets from the internet, the data might not always be perfectly formatted. There could be missing values, or data may have been entered incorrectly due to human error. Due to the imperfect nature of some datasets, it is important to understand what each column is supposed to be representing. This dataset seems to have the data types we'd expect for each column, so we can proceed.

If your dataset has null or missing values, or an incorrect data type (such as an object where an int or float value should be), you can look at our guide on handling missing data here to see how to correct it.

print('Dataframe Length: ' + str(len(df)))
print(df.dtypes)
Dataframe Length: 171
name                 object
album                object
artist               object
release_date         object
length                int64
popularity            int64
danceability        float64
acousticness        float64
energy              float64
instrumentalness    float64
liveness            float64
loudness            float64
speechiness         float64
valence             float64
tempo               float64
dtype: object
Dataframe Length and Column Types
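
If a check like the one above does turn up problems, the cleanup might look something like the sketch below. The tiny DataFrame is hypothetical toy data (not rows from the Taylor Swift dataset), and dropping incomplete rows is only one of several reasonable choices; filling in values is another.

```python
import pandas as pd

# Hypothetical toy data standing in for a messy download (assumption, not the real dataset).
raw = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "length": ["232106", "173066", "n/a", "199200"],  # numbers stored as text
    "popularity": [49, None, 59, 49],                  # one missing value
})

# Count missing values in each column before cleaning.
missing_before = raw.isna().sum()

# Convert text to numbers; entries that cannot be parsed become NaN instead of raising.
clean = raw.copy()
clean["length"] = pd.to_numeric(clean["length"], errors="coerce")

# Drop rows that lack a value in the columns we plan to analyze.
clean = clean.dropna(subset=["length", "popularity"])

print(missing_before)
print(clean.dtypes)
print(len(clean))
```

Note that the text "n/a" is not counted as missing until the column is converted, which is why checking data types and missing values together is useful.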

Descriptive Analysis

Our next step will be to explore the dataset in order to find relationships that interest us. It can be helpful to plot out variables of interest, and as you can see below, we plotted histograms for the popularity, danceability, and tempo variables.

Popularity is given as an integer ranging from 0 to 100, danceability is a metric ranging from 0 to 1 that measures how suitable a song is for dancing, and tempo measures beats per minute. Note that danceability appears to be roughly normally distributed.

We can use functions such as df.describe() or df.corr() to see if there are any strong numerical relationships or variables with especially high/low values that we may want to inspect further.

df.describe()
       length     popularity  danceability  acousticness  energy  instrumentalness  liveness  loudness  speechiness  valence  tempo
count  171.0      171.0       171.0         171.0         171.0   171.0             171.0     171.0     171.0        171.0    171.0
mean   236663.52  61.228      0.589         0.322         0.586   0.002             0.146     -7.322    0.066        0.423    124.141
std    40456.72   11.905      0.115         0.334         0.19    0.019             0.09      2.879     0.106        0.193    31.484
min    107133.0   0.0         0.292         0.0           0.118   0.0               0.034     -17.932   0.023        0.05     68.534
25%    211833.0   58.0        0.527         0.03          0.462   0.0               0.093     -8.862    0.03         0.278    96.052
50%    234000.0   63.0        0.593         0.156         0.606   0.0               0.115     -6.698    0.037        0.416    121.956
75%    254447.0   67.0        0.656         0.674         0.732   0.0               0.168     -5.336    0.055        0.545    146.04
max    403887.0   82.0        0.897         0.971         0.944   0.179             0.657     -2.098    0.912        0.942    207.476
Descriptive Analysis
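# numeric_only=True skips text columns such as name and album (needed in pandas 2.0+)
df.corr(numeric_only=True)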
                  length   popularity  danceability  acousticness  energy   instrumentalness  liveness  loudness  speechiness  valence  tempo
length            1.0      0.0118      -0.3016       0.0387        -0.1148  -0.0813           -0.1484   0.0441    -0.4144      -0.4204  0.0104
popularity        0.0118   1.0         0.0726        -0.1178       0.1275   0.0356            -0.4067   0.1226    -0.4783      0.0342   -0.0157
danceability      -0.3016  0.0726      1.0           -0.1431       0.0627   -0.0518           -0.0158   0.0026    0.1839       0.3798   -0.2354
acousticness      0.0387   -0.1178     -0.1431       1.0           -0.7101  0.1407            -0.0654   -0.7366   0.1431       -0.2312  -0.1345
energy            -0.1148  0.1275      0.0627        -0.7101       1.0      0.0003            0.0464    0.785     -0.1793      0.4904   0.2099
instrumentalness  -0.0813  0.0356      -0.0518       0.1407        0.0003   1.0               -0.0591   -0.0842   -0.0297      0.0201   0.0433
liveness          -0.1484  -0.4067     -0.0158       -0.0654       0.0464   -0.0591           1.0       0.0163    0.3579       -0.0173  0.0349
loudness          0.0441   0.1226      0.0026        -0.7366       0.785    -0.0842           0.0163    1.0       -0.4096      0.2999   0.1715
speechiness       -0.4144  -0.4783     0.1839        0.1431        -0.1793  -0.0297           0.3579    -0.4096   1.0          0.1204   -0.0278
valence           -0.4204  0.0342      0.3798        -0.2312       0.4904   0.0201            -0.0173   0.2999    0.1204       1.0      -0.0061
tempo             0.0104   -0.0157     -0.2354       -0.1345       0.2099   0.0433            0.0349    0.1715    -0.0278      -0.0061  1.0
Correlation Matrix
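
A full correlation matrix can be hard to scan by eye. One possible way to pull out the strongest relationships is to keep each pair once and rank the pairs by absolute correlation. The sketch below uses synthetic data (the column names energy, loudness, and tempo are borrowed from the dataset, but the values are randomly generated for illustration) and is only one of many ways to do this.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): loudness is built to track energy, tempo is noise.
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)
demo = pd.DataFrame({
    "energy": energy,
    "loudness": 10 * energy - 12 + rng.normal(0, 0.5, 200),
    "tempo": rng.uniform(70, 200, 200),
})

corr = demo.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to the real DataFrame, the same steps would start from df.corr(numeric_only=True) instead of the synthetic demo frame.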
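import matplotlib.pyplot as plt
# Show each histogram before drawing the next so the plots don't overlap
df.popularity.plot.hist()
plt.xlabel('Popularity')
plt.show()
df.danceability.plot.hist()
plt.xlabel('Danceability')
plt.show()
df.tempo.plot.hist()
plt.xlabel('Tempo')
plt.show()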

Popularity Histogram
Danceability Histogram
Tempo Histogram

Histograms of Columns

Generating and Exploring Hypotheses

After looking over the dataset as a whole, we can start to manipulate it to look into our areas of interest.

By this point, we should have a solid understanding of the structure of our dataset. Now, think of two or three hypotheses or questions. These will be the basis of your project.

For example, how does X affect Y in this dataset? Can we predict Z using A? After formulating our hypothesis, creating DataFrames that take subsets of our main df, or grouping the data, is an excellent way to test our theories.

In this example, we'll examine how various categories of audio information differ by album. First, let's group the data by album and create separate DataFrames for individual albums. This will allow us to examine differences in Taylor Swift songs depending on the album a song is from.

One question we might like to answer is: "Does the median danceability of Taylor Swift songs differ from album to album?"

As we can see in the box plot below, the median danceability of her songs on Lover is quite different from that on Folklore. Creating plots and visualizations like this can help us see patterns in our data that we might otherwise miss.

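# numeric_only=True averages only the numeric columns (text columns cannot be averaged)
df.groupby('album').mean(numeric_only=True).reset_index()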
   album                        length       popularity  danceability  acousticness  energy  instrumentalness  liveness  loudness  speechiness  valence  tempo
0  1989 (Deluxe)                217139.3684  54.4211     0.6332        0.2446        0.6248  0.0007            0.2032    -7.9241   0.1735       0.4542   127.0331
1  Fearless (Taylor's Version)  245865.0     65.5769     0.551         0.2141        0.6391  0.0               0.1624    -6.1965   0.0379       0.4219   131.2372
2  Lover                        206187.8333  72.1111     0.6582        0.3337        0.5452  0.0007            0.1152    -8.0133   0.0991       0.4814   119.9727
3  Red (Deluxe Edition)         247294.3182  60.5        0.6334        0.1488        0.6008  0.0018            0.1191    -7.38     0.0366       0.4681   110.2965
4  Speak Now (Deluxe Package)   275969.5     49.7273     0.559         0.2265        0.6594  0.0001            0.167     -4.8069   0.0352       0.4297   132.8357
5  Taylor Swift                 213971.1333  50.1333     0.5453        0.183         0.6643  0.0001            0.1608    -4.7317   0.0327       0.4265   126.0538
6  evermore (deluxe version)    243816.2353  65.4706     0.5268        0.7941        0.4941  0.0206            0.1136    -9.7816   0.0579       0.4335   120.7073
7  folklore (deluxe version)    236964.4706  62.6471     0.5419        0.7176        0.4158  0.0003            0.1105    -10.3361  0.0395       0.3614   119.8844
8  reputation                   223020.0     71.8667     0.6579        0.1385        0.5829  0.0               0.1522    -7.6724   0.0951       0.2934   127.5401
Dataframe Grouped by Album
# Subset the main DataFrame into one DataFrame per album of interest
evermore = df[df.album == 'evermore (deluxe version)']
lover = df[df.album == 'Lover']
print(lover.danceability.mean())
print(evermore.danceability.mean())

0.6582222222222223
0.5268235294117648

Mean Danceability of Lover and evermore
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize': (5, 5)})
sns.boxplot(x='album', y='danceability', data=df)
plt.show()  # finish the first figure before drawing the second
sns.boxplot(x='album', y='acousticness', data=df)
plt.show()

Danceability Boxplot
Acousticness Boxplot

Boxplots of Danceability and Acousticness by Album
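
The question about median danceability per album can also be checked numerically rather than read off a plot. The sketch below shows one way to do that with groupby. The toy DataFrame is hypothetical: its column names match the dataset, but the album labels "A" and "B" and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows (assumption): real column names, invented values.
toy = pd.DataFrame({
    "album": ["A", "A", "A", "B", "B", "B"],
    "danceability": [0.60, 0.70, 0.65, 0.50, 0.55, 0.40],
})

# Median danceability and song count per album, most danceable album first.
summary = (
    toy.groupby("album")["danceability"]
       .agg(["median", "count"])
       .sort_values("median", ascending=False)
)
print(summary)
```

Including the count alongside the median is a design choice: a median computed from only a handful of songs deserves less weight than one computed from a full album.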

Statistical and Predictive Analysis

After diving further into our hypotheses, the time has come to perform our final analysis.

Now we can take an observation we made (or something we want to predict) and run hypothesis tests, build confidence intervals, or fit linear regressions or clustering models on our data. The steps above should help you identify variables that may be related to, or may affect, one another. Use these variables to conduct the tests mentioned above and find evidence for or against your hypotheses.
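
As one illustration of a hypothesis test, the sketch below runs a two-sided permutation test for a difference in mean danceability between two albums. This is a technique chosen here for its simplicity, not necessarily the test you should use; the function name permutation_test and the two lists of scores are hypothetical and do not come from the dataset.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the pooled values and split them into two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical danceability scores for two albums (invented values).
album_a = [0.66, 0.71, 0.62, 0.69, 0.64, 0.73, 0.60, 0.68]
album_b = [0.52, 0.49, 0.58, 0.55, 0.47, 0.53, 0.56, 0.50]

p_value = permutation_test(album_a, album_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests that a difference this large would rarely arise if album labels were assigned at random. With real data, the two lists could be replaced by columns such as lover.danceability and evermore.danceability.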

This guide is only a suggestion for how to get started. Your datasets will likely look quite different from the one we used here, but the principles we discussed can help you if you aren't sure how to begin. Good luck, future data scientist!