Starting Your Own Data Science Project

Starting a data science project of your own can be an intimidating task. In this guide, we'll give you a head start by walking through how to find a dataset, ways to start analyzing it, and some steps to spark ideas for you to explore!

Collecting Your Data

First, we need to find a dataset to analyze. It's best to pick a topic you are passionate about: prior knowledge of the subject helps you understand the context behind each column, which in turn informs your analysis. Kaggle.com is a great resource for finding a dataset on just about anything. If you are interested in sports, for example, you can find more specialized datasets at websites such as sports-reference.com.

For this example, we'll use a dataset of songs from our favorite artist at DISCOVERY: Taylor Swift!

The dataset we'll be using can be found on Kaggle, and contains information about every Taylor Swift song. Kaggle datasets also contain information about what the columns in the dataset represent.

The columns in this dataset are name (song name), album, artist, release_date, length, popularity, danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo. This may seem like a lot of information, so we'll take it slow and analyze a few of the variables at a time. Our first step will be to load the data into a DataFrame and become familiar with it so we can perform our analysis.

import pandas as pd

df = pd.read_csv('spotify_taylorswift.csv')
df.head(5)

	name	album	artist	release_date	length	popularity	danceability	acousticness	energy	liveness	loudness	speechiness	valence	tempo
0	Tim McGraw	Taylor Swift	Taylor Swift	2006-10-24	232106	49	0.580	0.575	0.491	0.1210	-6.462	0.0251	0.425	76.009
1	Picture To Burn	Taylor Swift	Taylor Swift	2006-10-24	173066	54	0.658	0.173	0.877	0.0962	-2.098	0.0323	0.821	105.586
2	Teardrops On My Guitar - Radio Single Remix	Taylor Swift	Taylor Swift	2006-10-24	203040	59	0.621	0.288	0.417	0.1190	-6.941	0.0231	0.289	99.953
3	A Place in this World	Taylor Swift	Taylor Swift	2006-10-24	199200	49	0.576	0.051	0.777	0.3200	-2.881	0.0324	0.428	115.028
4	Cold As You	Taylor Swift	Taylor Swift	2006-10-24	239013	50	0.418	0.217	0.482	0.1230	-5.769	0.0266	0.261	175.558

Loading Taylor Swift Dataframe

Exploring the Dataset

This dataset has 171 rows, and the pandas attribute df.dtypes (an attribute, not a function, so it takes no parentheses) shows the data type of each column. Text variables such as the song name, album, and artist are stored as objects, as is release_date, while the audio information is numerical and is represented by integer or float values.

When working with datasets from the internet, the data might not always be perfectly formatted. There could be missing values, or data may have been entered incorrectly due to human error. Due to the imperfect nature of some datasets, it is important to understand what each column is supposed to be representing. This dataset seems to have the data types we'd expect for each column, so we can proceed.

If your dataset has null or missing values, or an incorrect data type (such as an object where an int or float value should be), you can look at our guide on handling missing data here to see how to correct it.

print('Dataframe Length: ' + str(len(df)))
print(df.dtypes)

Dataframe Length: 171
name                 object
album                object
artist               object
release_date         object
length                int64
popularity            int64
danceability        float64
acousticness        float64
energy              float64
instrumentalness    float64
liveness            float64
loudness            float64
speechiness         float64
valence             float64
tempo               float64
dtype: object

Dataframe Length and Column Types
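The missing-value and data-type checks described in this section can be sketched as follows. This is only an illustration, not part of the original analysis: it retypes three rows from the df.head() output by hand so it runs without the CSV file, and it deliberately blanks one popularity value (an assumption made purely for demonstration; the real data shows no such gap) to show what a missing entry looks like.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head() output above.
# The NaN in popularity is NOT in the real data; it is inserted here
# only to demonstrate the missing-value check.
sample = pd.DataFrame({
    "name": ["Tim McGraw", "Picture To Burn", "A Place in this World"],
    "release_date": ["2006-10-24", "2006-10-24", "2006-10-24"],
    "length": [232106, 173066, 199200],
    "popularity": [49, np.nan, 49],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# release_date is read in as object (text); convert it to a datetime
# so it can be sorted or grouped by year later on.
sample["release_date"] = pd.to_datetime(sample["release_date"])
print(sample.dtypes)
```

On the full dataset, the same two calls would be df.isna().sum() and pd.to_datetime(df['release_date']).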

Descriptive Analysis

Our next step will be to explore the dataset in order to find relationships that interest us. It can be helpful to plot out variables of interest, and as you can see below, we plotted histograms for the popularity, danceability, and tempo variables.

Popularity is given as an integer ranging from 0-100, danceability is a metric ranging from 0-1 measuring how suitable a song is for dancing, and tempo measures beats per minute. Note that danceability appears to be roughly normally distributed.
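Histograms like the ones described here could be drawn along the following lines. This is a minimal sketch using matplotlib (an assumption; the guide does not say which plotting library was used) and only the five rows from the df.head() output, so the shapes will not match the full 171-song histograms.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the df.head() output above; the real
# analysis would pass the full DataFrame instead.
sample = pd.DataFrame({
    "popularity": [49, 54, 59, 49, 50],
    "danceability": [0.580, 0.658, 0.621, 0.576, 0.418],
    "tempo": [76.009, 105.586, 99.953, 115.028, 175.558],
})

columns = ["popularity", "danceability", "tempo"]

# One histogram per column, side by side.
fig, axes = plt.subplots(1, len(columns), figsize=(12, 3))
for ax, column in zip(axes, columns):
    ax.hist(sample[column], bins=10)
    ax.set_title(column)
    ax.set_xlabel(column)
    ax.set_ylabel("number of songs")
fig.tight_layout()
```

In a notebook the figure appears automatically; in a script, call plt.show() or fig.savefig(...) afterwards.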

We can use functions such as df.describe() or df.corr() to see if there are any strong numerical relationships or variables with especially high/low values that we may want to inspect further.

df.describe()

index	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo
count	171.0	171.0	171.0	171.0	171.0	171.0	171.0	171.0	171.0	171.0	171.0
mean	236663.52	61.228	0.589	0.322	0.586	0.002	0.146	-7.322	0.066	0.423	124.141
std	40456.72	11.905	0.115	0.334	0.19	0.019	0.09	2.879	0.106	0.193	31.484
min	107133.0	0.0	0.292	0.0	0.118	0.0	0.034	-17.932	0.023	0.05	68.534
25%	211833.0	58.0	0.527	0.03	0.462	0.0	0.093	-8.862	0.03	0.278	96.052
50%	234000.0	63.0	0.593	0.156	0.606	0.0	0.115	-6.698	0.037	0.416	121.956
75%	254447.0	67.0	0.656	0.674	0.732	0.0	0.168	-5.336	0.055	0.545	146.04
max	403887.0	82.0	0.897	0.971	0.944	0.179	0.657	-2.098	0.912	0.942	207.476

Descriptive Analysis
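One way to follow up on an extreme value in the summary, such as the minimum popularity of 0, is to pull out the rows behind it. The sketch below does this on the five rows from the df.head() output only, so the values it finds are the extremes of that small sample rather than of the full dataset.

```python
import pandas as pd

# Five rows copied from the df.head() output earlier in this guide.
sample = pd.DataFrame({
    "name": [
        "Tim McGraw",
        "Picture To Burn",
        "Teardrops On My Guitar - Radio Single Remix",
        "A Place in this World",
        "Cold As You",
    ],
    "popularity": [49, 54, 59, 49, 50],
    "tempo": [76.009, 105.586, 99.953, 115.028, 175.558],
})

# The two songs with the lowest popularity scores.
least_popular = sample.nsmallest(2, "popularity")
print(least_popular[["name", "popularity"]])

# The single song with the highest tempo.
fastest = sample.loc[sample["tempo"].idxmax()]
print(fastest["name"], fastest["tempo"])
```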

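# numeric_only=True skips text columns such as name and album (needed in pandas 2.0+)
df.corr(numeric_only=True)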

index	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo
length	1.0	0.0118	-0.3016	0.0387	-0.1148	-0.0813	-0.1484	0.0441	-0.4144	-0.4204	0.0104
popularity	0.0118	1.0	0.0726	-0.1178	0.1275	0.0356	-0.4067	0.1226	-0.4783	0.0342	-0.0157
danceability	-0.3016	0.0726	1.0	-0.1431	0.0627	-0.0518	-0.0158	0.0026	0.1839	0.3798	-0.2354
acousticness	0.0387	-0.1178	-0.1431	1.0	-0.7101	0.1407	-0.0654	-0.7366	0.1431	-0.2312	-0.1345
energy	-0.1148	0.1275	0.0627	-0.7101	1.0	0.0003	0.0464	0.785	-0.1793	0.4904	0.2099
instrumentalness	-0.0813	0.0356	-0.0518	0.1407	0.0003	1.0	-0.0591	-0.0842	-0.0297	0.0201	0.0433
liveness	-0.1484	-0.4067	-0.0158	-0.0654	0.0464	-0.0591	1.0	0.0163	0.3579	-0.0173	0.0349
loudness	0.0441	0.1226	0.0026	-0.7366	0.785	-0.0842	0.0163	1.0	-0.4096	0.2999	0.1715
speechiness	-0.4144	-0.4783	0.1839	0.1431	-0.1793	-0.0297	0.3579	-0.4096	1.0	0.1204	-0.0278
valence	-0.4204	0.0342	0.3798	-0.2312	0.4904	0.0201	-0.0173	0.2999	0.1204	1.0	-0.0061
tempo	0.0104	-0.0157	-0.2354	-0.1345	0.2099	0.0433	0.0349	0.1715	-0.0278	-0.0061	1.0

Correlation Matrix
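A large correlation matrix is easier to read once the pairs are ranked by strength. The sketch below is one possible way to do that; it retypes a four-column slice of the matrix above (acousticness, energy, loudness, valence) so it runs on its own, but the same steps would apply to the full result of df.corr().

```python
import numpy as np
import pandas as pd

# Four-column slice of the correlation matrix shown above.
cols = ["acousticness", "energy", "loudness", "valence"]
corr = pd.DataFrame(
    [
        [1.0, -0.7101, -0.7366, -0.2312],
        [-0.7101, 1.0, 0.785, 0.4904],
        [-0.7366, 0.785, 1.0, 0.2999],
        [-0.2312, 0.4904, 0.2999, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Keep only the upper triangle (k=1 excludes the diagonal) so each
# pair of variables appears once and self-correlations are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this slice the strongest relationships are energy with loudness (positive) and acousticness with loudness and energy (negative), which could each be a starting point for a hypothesis.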

Popularity Histogram: Histograms of Columns

Danceability Histogram: Histograms of Columns

Generating and Exploring Hypotheses

After looking over the dataset as a whole, we can start to manipulate it to look into our areas of interest.

By this point, we should have a solid understanding of the structure of our dataset. Now, think of 2-3 hypotheses or questions that will form the basis of your project.

For example, how does X affect Y in this dataset? Can we predict Z using A? After formulating our hypothesis, creating DataFrames that take subsets of our main df, or grouping the data, is an excellent way to test our theories.

In this example, we'll examine differences in various categories of audio information based on album. First, let's group the data by album and create a separate DataFrame for each album. This will allow us to examine how Taylor Swift songs differ depending on the album they come from.

One question we might like to answer is: "Does the median danceability of Taylor Swift songs differ from album to album?"

As we can see in the box plot below, the median danceability of her songs differs noticeably between Lover and Folklore. Creating plots and visualizations like this can help us to see patterns in our data that we might otherwise miss.

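# numeric_only=True averages only the numeric columns (text columns raise an error in pandas 2.0+)
df.groupby('album').mean(numeric_only=True).reset_index()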

index	album	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo
0	1989 (Deluxe)	217139.3684	54.4211	0.6332	0.2446	0.6248	0.0007	0.2032	-7.9241	0.1735	0.4542	127.0331
1	Fearless (Taylor's Version)	245865.0	65.5769	0.551	0.2141	0.6391	0.0	0.1624	-6.1965	0.0379	0.4219	131.2372
2	Lover	206187.8333	72.1111	0.6582	0.3337	0.5452	0.0007	0.1152	-8.0133	0.0991	0.4814	119.9727
3	Red (Deluxe Edition)	247294.3182	60.5	0.6334	0.1488	0.6008	0.0018	0.1191	-7.38	0.0366	0.4681	110.2965
4	Speak Now (Deluxe Package)	275969.5	49.7273	0.559	0.2265	0.6594	0.0001	0.167	-4.8069	0.0352	0.4297	132.8357
5	Taylor Swift	213971.1333	50.1333	0.5453	0.183	0.6643	0.0001	0.1608	-4.7317	0.0327	0.4265	126.0538
6	evermore (deluxe version)	243816.2353	65.4706	0.5268	0.7941	0.4941	0.0206	0.1136	-9.7816	0.0579	0.4335	120.7073
7	folklore (deluxe version)	236964.4706	62.6471	0.5419	0.7176	0.4158	0.0003	0.1105	-10.3361	0.0395	0.3614	119.8844
8	reputation	223020.0	71.8667	0.6579	0.1385	0.5829	0.0	0.1522	-7.6724	0.0951	0.2934	127.5401

Dataframe Grouped by Album

evermore = df[ df.album == 'evermore (deluxe version)']
lover = df[ df.album == 'Lover']
print(lover.danceability.mean())
print(evermore.danceability.mean())

0.6582222222222223
0.5268235294117648

Mean Danceability for Lover and evermore

Danceability Boxplot: Boxplots of Danceability and Acousticness by Album

Acousticness Boxplot: Boxplots of Danceability and Acousticness by Album
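A box plot by album, together with the per-album medians it displays, might be produced roughly as follows. The danceability values here are invented for illustration (the guide does not list per-song values for these albums), and the use of matplotlib through DataFrame.boxplot is an assumption about tooling, so treat this as a sketch of the technique rather than the code behind the figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-song danceability values, invented for this
# example only; they are NOT taken from the real dataset.
toy = pd.DataFrame({
    "album": ["Lover"] * 4 + ["folklore (deluxe version)"] * 4,
    "danceability": [0.70, 0.62, 0.66, 0.74, 0.50, 0.58, 0.46, 0.54],
})

# Median danceability per album (the line inside each box).
medians = toy.groupby("album")["danceability"].median()
print(medians)

# One box per album on a shared axis.
fig, ax = plt.subplots(figsize=(6, 4))
toy.boxplot(column="danceability", by="album", ax=ax)
ax.set_ylabel("danceability")
fig.suptitle("")  # remove the automatic "grouped by" super-title
```

With the real data, replacing toy with df gives one box for each of the nine albums.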

Statistical and Predictive Analysis

After diving further into our hypotheses, the time has come to perform our final analysis.

Now, we can take an observation we made (or something we want to predict) and apply hypothesis tests, confidence intervals, linear regressions, or clustering models to our data. The steps above should help you identify variables that may be related or may affect another variable. Use these variables to conduct the above-mentioned tests and see whether the evidence supports or contradicts your hypotheses.
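As one concrete example of a hypothesis test, the sketch below runs a two-sided permutation test on the difference in mean danceability between two albums. The permutation test is our choice for illustration, not a method the guide prescribes, and the danceability values are invented; with real data you would pass in the columns from the per-album DataFrames instead.

```python
import numpy as np

# Hypothetical danceability values for two albums, invented for
# illustration only; they are NOT taken from the real dataset.
lover = np.array([0.70, 0.62, 0.66, 0.74, 0.68, 0.60])
evermore = np.array([0.50, 0.58, 0.46, 0.54, 0.52, 0.56])

# Observed difference in mean danceability.
observed = lover.mean() - evermore.mean()

# Null hypothesis: album labels are unrelated to danceability.
# Shuffle the pooled values many times and count how often a random
# split gives a difference at least as large as the observed one.
rng = np.random.default_rng(seed=0)
pooled = np.concatenate([lover, evermore])
n_permutations = 10_000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[: len(lover)].mean() - shuffled[len(lover):].mean()
    if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float rounding
        count += 1

p_value = count / n_permutations
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

A small p-value suggests the difference would be unlikely if album made no difference; it does not prove the hypothesis. A t-test such as scipy.stats.ttest_ind is a common alternative when its assumptions are reasonable.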

This guide is only a suggestion on ways to get started. Your datasets will likely look quite different from the one we used here, but the principles we discussed can help you get started if you aren't sure how to begin! Good luck, future data scientist!