Types of Data


Before starting any data analysis, it is important for the data scientist to know what type of data they are working with! There are many ways to categorize data, and we are going to start with the broadest categorization.

There are two broad categories of data:

  • Structured Data
    AND
  • Unstructured Data

Structured Data

Structured data refers to data that has been organized and categorized in a well-defined format. In DISCOVERY, we work with a lot of structured data in the form of a CSV or "comma-separated values" file. CSV files are easily read by most data science tools, making them one of the most widely supported formats for structured data.

The format of a CSV file has two basic rules:

  1. Each line contains a row in the dataset
  2. Each value in the row, also known as the column value, is separated by a comma (a value that itself contains commas is wrapped in double quotes)
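
These two rules can be sketched with Python's built-in csv module. The two-line sample below is hypothetical: it is modeled on the Course Catalog dataset, with fewer columns and a shortened description. It also shows how a value wrapped in double quotes can contain commas without starting a new column.

```python
import csv
import io

# Hypothetical two-line sample modeled on the Course Catalog dataset;
# the columns and description are shortened for illustration.
raw_csv = (
    "Subject,Number,Name,Description\n"
    'AAS,100,Intro Asian American Studies,"Surveys history, literature, arts, and politics."\n'
)

# Rule 1: each line is one row.
# Rule 2: commas separate the values within a row.
rows = list(csv.reader(io.StringIO(raw_csv)))
header, first_row = rows[0], rows[1]

print(header)          # the column names
print(len(first_row))  # number of values in the first data row
print(first_row[3])    # the quoted description, commas included
```

Splitting the second line on every comma would give more pieces than there are columns, which is why a CSV-aware reader is usually a safer choice than a plain string split.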

Below are the first four lines of the 2019 Course Catalog dataset, which contains every course at Illinois, in raw CSV format:

Year,Term,YearTerm,Subject,Number,Name,Description,Credit Hours,Section Info,Degree Attributes
2019,Fall,2019-fa,AAS,100,Intro Asian American Studies,"Interdisciplinary introduction to the basic concepts and approaches in Asian American Studies. Surveys the various dimensions of Asian American experiences including history, social organization, literature, arts, and politics.",3 hours.,,"Social & Beh Sci - Soc Sci, and Cultural Studies - US Minority course."
2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies,"Interdisciplinary introduction to the basic concepts and approaches in Arab American Studies. Addresses the issues of history, race, social organization, politics, literature, and art related to Arab American experiences.",3 hours.,,Cultural Studies - US Minority course.
2019,Fall,2019-fa,AAS,120,Intro to Asian Am Pop Culture,Introductory understanding of the way U.S. popular culture has affected Asian Americans and the contributions Asian Americans have made to U.S. media and popular culture since the mid 1880's.,3 hours.,,Cultural Studies - US Minority course.
The first four lines of course-catalog.csv, the 2019 Course Catalog dataset. The full dataset has 8,590 total lines (one line of headers and 8,589 lines of data).

When we highlight the commas, it's easier to notice the comma-separated columns:

The same first four lines of course-catalog.csv, but with commas highlighted.

We can use Python to transform the raw CSV file into a DataFrame, an organized table of rows and columns with column headers. Each column is often referred to as a variable and each row is referred to as an observation:

|      | Year | Term | YearTerm | Subject | Number | Name | Description | Credit Hours | Section Info | Degree Attributes |
| 0    | 2019 | Fall | 2019-fa | AAS | 100 | Intro Asian American Studies | Interdisciplinary introduction to the basic co... | 3 hours. | NaN | Social & Beh Sci - Soc Sci, and Cultural Studi... |
| 1    | 2019 | Fall | 2019-fa | AAS | 105 | Introduction to Arab American Studies | Interdisciplinary introduction to the basic co... | 3 hours. | NaN | Cultural Studies - US Minority course. |
| 2    | 2019 | Fall | 2019-fa | AAS | 120 | Intro to Asian Am Pop Culture | Introductory understanding of the way U.S. pop... | 3 hours. | NaN | Cultural Studies - US Minority course. |
| 3    | 2019 | Fall | 2019-fa | AAS | 199 | Undergraduate Open Seminar | May be repeated to a maximum of 6 hours. | 1 TO 5 hours. | NaN | NaN |
| 4    | 2019 | Fall | 2019-fa | AAS | 200 | U.S. Race and Empire | Invites students to examine histories and narr... | 3 hours. | Same as LLS 200. | Humanities - Hist & Phil, and Cultural Studies... |
| ...  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8584 | 2019 | Fall | 2019-fa | ZULU | 202 | Elementary Zulu II | Continuation of ZULU 201 with introduction of ... | 5 hours. | Same as AFST 252. Participation in the languag... | NaN |
| 8585 | 2019 | Fall | 2019-fa | ZULU | 403 | Intermediate Zulu I | Survey of more advanced grammar; emphasis on i... | 4 hours. | NaN | NaN |
| 8586 | 2019 | Fall | 2019-fa | ZULU | 404 | Intermediate Zulu II | Continuation of ZULU 403; emphasis on increasi... | 4 hours. | NaN | NaN |
| 8587 | 2019 | Fall | 2019-fa | ZULU | 405 | Advanced Zulu I | Third year Zulu with emphasis on conversationa... | 3 hours. | NaN | NaN |
| 8588 | 2019 | Fall | 2019-fa | ZULU | 406 | Advanced Zulu II | Continuation of Zulu 405 with increased emphas... | 3 hours. | NaN | NaN |

The full 2019 Course Catalog dataset, as displayed by Python in a DataFrame.
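
The transformation from raw CSV to a DataFrame can be sketched with the pandas library and its `pd.read_csv` function. Using pandas is an assumption here, since the text above only says "Python". The sample is a shortened, hypothetical stand-in for course-catalog.csv (the Description and Degree Attributes columns are omitted), read from a string so the example runs without the file; with the real file, its path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Shortened stand-in for course-catalog.csv; two columns are omitted for brevity.
raw_csv = """Year,Term,YearTerm,Subject,Number,Name,Credit Hours,Section Info
2019,Fall,2019-fa,AAS,100,Intro Asian American Studies,3 hours.,
2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies,3 hours.,
2019,Fall,2019-fa,AAS,200,U.S. Race and Empire,3 hours.,Same as LLS 200.
"""

# The first line becomes the column headers; each later line becomes one row.
df = pd.read_csv(io.StringIO(raw_csv))

print(df.shape)          # (number of observations, number of variables)
print(list(df.columns))  # the variable names taken from the header line

# Empty values in the CSV text are read as NaN, as in the display above.
print(df["Section Info"].isna().sum())
```

Each column of `df` is one variable and each row is one observation, matching the terminology used above.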

CSV files can also be opened in Microsoft Excel, Google Sheets, or any other spreadsheet application. For example, below is the same Course Catalog dataset displayed in Excel:

The Course Catalog dataset opened in Microsoft Excel.

Unstructured Data

Unstructured data refers to all other data: data that is not organized, not categorized, or not in a well-defined format. Some examples of unstructured data include videos, images, and Word documents.

An image of the Data Science DISCOVERY website from Fall 2019. This is an example of unstructured data since it's not organized into rows and columns with column headers.

Practice Questions

Q1: A spreadsheet or CSV file containing one row for each game of football played by Illinois, with columns for the date, Illini score, opponent score, and location, is an example of:
Q2: The return addresses on all of the mail delivered to your apartment/dorm/house in the past 30 days.
Q3: Text of a Daily Illini newspaper article.
Q4: A spreadsheet of UIUC gender demographics where every row contains a single major and each column holds the same kind of data (e.g., Major Name, Gender, Year, …).
Q5: Lyrics to a Song/Parody.
Q6: An image/screenshot of a phone/computer web page.