Types of Data
Before starting any data analysis, it is important for the data scientist to know what type of data they are working with! There are many ways to categorize data, and we are going to start with the broadest categorization.
There are two broad categories of data:
- Structured Data
- Unstructured Data
Structured Data
Structured data refers to data that has been organized and categorized in a well-defined format. In DISCOVERY, we work with a lot of structured data in the form of a CSV or "comma-separated values" file. CSV files can be read by nearly every data science tool, making them one of the most widely used formats for structured data.
The format of a CSV file has two basic rules:
- Each line contains a row in the dataset
- Each value in the row, also known as the column value, is separated by a comma (a value wrapped in double quotes may contain commas of its own)
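The two rules can be tried out with a minimal sketch using Python's built-in `csv` module. The sample lines here are shortened stand-ins written for this illustration, not lines copied from a real file. The sketch also compares the result with a naive split on every comma, which breaks apart a quoted value that contains commas.

```python
import csv
import io

# Two shortened sample lines, made up for illustration (assumption: they
# only mimic the layout of a course catalog).
raw_text = (
    'AAS,100,Intro Asian American Studies,"history, literature, and politics"\n'
    'AAS,105,Introduction to Arab American Studies,"history, race, and art"\n'
)

# Rule 1: each line is one row. Rule 2: commas separate the values in a row.
# csv.reader treats commas inside double quotes as part of the value.
rows = list(csv.reader(io.StringIO(raw_text)))

# A naive split on every comma ignores the quotes and over-splits the line.
naive_pieces = raw_text.splitlines()[0].split(",")

for row in rows:
    print(len(row), row)
print(len(naive_pieces), naive_pieces)
```

Both rows come back with the same number of values from `csv.reader`, while the naive split produces extra pieces. That is why a CSV-aware reader is usually a safer choice than splitting text by hand.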
Below are the first three rows of data from the 2019 Course Catalog dataset, a dataset containing every course at Illinois, in raw CSV format (the header line is not shown):
2019,Fall,2019-fa,AAS,100,Intro Asian American Studies,"Interdisciplinary introduction to the basic concepts and approaches in Asian American Studies. Surveys the various dimensions of Asian American experiences including history, social organization, literature, arts, and politics.",3 hours.,,"Social & Beh Sci - Soc Sci, and Cultural Studies - US Minority course."
2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies,"Interdisciplinary introduction to the basic concepts and approaches in Arab American Studies. Addresses the issues of history, race, social organization, politics, literature, and art related to Arab American experiences.",3 hours.,,Cultural Studies - US Minority course.
2019,Fall,2019-fa,AAS,120,Intro to Asian Am Pop Culture,Introductory understanding of the way U.S. popular culture has affected Asian Americans and the contributions Asian Americans have made to U.S. media and popular culture since the mid 1880's.,3 hours.,,Cultural Studies - US Minority course.
course-catalog.csv, the 2019 Course Catalog dataset. The full dataset has 8,590 total lines (one line of headers and 8,589 lines of data).
When we highlight the commas, it's easier to notice the comma-separated columns:
2019,Fall,2019-fa,AAS,100,Intro Asian American Studies,"Interdisciplinary introduction to the basic concepts and approaches in Asian American Studies. Surveys the various dimensions of Asian American experiences including history, social organization, literature, arts, and politics.",3 hours.,,"Social & Beh Sci - Soc Sci, and Cultural Studies - US Minority course."
2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies,"Interdisciplinary introduction to the basic concepts and approaches in Arab American Studies. Addresses the issues of history, race, social organization, politics, literature, and art related to Arab American experiences.",3 hours.,,Cultural Studies - US Minority course.
2019,Fall,2019-fa,AAS,120,Intro to Asian Am Pop Culture,Introductory understanding of the way U.S. popular culture has affected Asian Americans and the contributions Asian Americans have made to U.S. media and popular culture since the mid 1880's.,3 hours.,,Cultural Studies - US Minority course.
course-catalog.csv, but with commas highlighted.
We can use Python to transform the raw CSV file into a DataFrame, an organized table of rows and columns with column headers. Each column is often referred to as a variable and each row is referred to as an observation:
| | Year | Term | YearTerm | Subject | Number | Name | Description | Credit Hours | Section Info | Degree Attributes |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2019 | Fall | 2019-fa | AAS | 100 | Intro Asian American Studies | Interdisciplinary introduction to the basic co... | 3 hours. | NaN | Social & Beh Sci - Soc Sci, and Cultural Studi... |
1 | 2019 | Fall | 2019-fa | AAS | 105 | Introduction to Arab American Studies | Interdisciplinary introduction to the basic co... | 3 hours. | NaN | Cultural Studies - US Minority course. |
2 | 2019 | Fall | 2019-fa | AAS | 120 | Intro to Asian Am Pop Culture | Introductory understanding of the way U.S. pop... | 3 hours. | NaN | Cultural Studies - US Minority course. |
3 | 2019 | Fall | 2019-fa | AAS | 199 | Undergraduate Open Seminar | May be repeated to a maximum of 6 hours. | 1 TO 5 hours. | NaN | NaN |
4 | 2019 | Fall | 2019-fa | AAS | 200 | U.S. Race and Empire | Invites students to examine histories and narr... | 3 hours. | Same as LLS 200. | Humanities - Hist & Phil, and Cultural Studies... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8584 | 2019 | Fall | 2019-fa | ZULU | 202 | Elementary Zulu II | Continuation of ZULU 201 with introduction of ... | 5 hours. | Same as AFST 252. Participation in the languag... | NaN |
8585 | 2019 | Fall | 2019-fa | ZULU | 403 | Intermediate Zulu I | Survey of more advanced grammar; emphasis on i... | 4 hours. | NaN | NaN |
8586 | 2019 | Fall | 2019-fa | ZULU | 404 | Intermediate Zulu II | Continuation of ZULU 403; emphasis on increasi... | 4 hours. | NaN | NaN |
8587 | 2019 | Fall | 2019-fa | ZULU | 405 | Advanced Zulu I | Third year Zulu with emphasis on conversationa... | 3 hours. | NaN | NaN |
8588 | 2019 | Fall | 2019-fa | ZULU | 406 | Advanced Zulu II | Continuation of Zulu 405 with increased emphas... | 3 hours. | NaN | NaN |
The full 2019 Course Catalog dataset, as displayed by Python in a DataFrame.
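As a hedged sketch of that transformation, the pandas library (an assumption here, since the text only says "Python") can read CSV text into a DataFrame with `pd.read_csv`. The inline sample is a shortened copy of the first three rows of the dataset that keeps only six of the ten columns. With the actual file, one would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Shortened stand-in for the real file: one header line plus three data lines.
# Only six of the ten columns are kept so the sketch stays small.
csv_text = (
    "Year,Term,YearTerm,Subject,Number,Name\n"
    "2019,Fall,2019-fa,AAS,100,Intro Asian American Studies\n"
    "2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies\n"
    "2019,Fall,2019-fa,AAS,120,Intro to Asian Am Pop Culture\n"
)

# read_csv accepts a file path or a file-like object. The first line becomes
# the column headers and each following line becomes one row.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of observations, number of variables)
print(list(df.columns))  # variable names taken from the header line
print(df)
```

Each column name in `df.columns` is a variable, and each numbered row is an observation.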
CSV files can also be opened in Excel, Google Sheets, or any other spreadsheet application. For example, the same Course Catalog dataset can be displayed in Excel.

Unstructured Data
Unstructured data refers to all other data: data that is not organized, not categorized, or not in a well-defined format. Some examples of unstructured data include videos, images, and Word documents.
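One rough way to see the difference in code is to check whether every line of a text splits into the same number of comma-separated values. The helper `looks_like_csv` is a hypothetical name written for this illustration, and both sample inputs are made up. The check is only a heuristic, not a formal definition of structured data.

```python
import csv
import io


def looks_like_csv(text, min_columns=2):
    """Heuristic: True if every non-empty line has the same number of values."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return False
    width = len(rows[0])
    # Require at least min_columns so a single sentence does not count as a table.
    return width >= min_columns and all(len(row) == width for row in rows)


# Made-up sample inputs for illustration only.
score_table = "Date,Home,Away\n2024-01-01,10,7\n2024-01-08,21,14\n"
article_text = (
    "The team opened the season at home.\n"
    "Fans, students, and alumni filled the stands.\n"
)

print(looks_like_csv(score_table))
print(looks_like_csv(article_text))
```

The table passes because every line has three values, while the article fails because its lines do not share a consistent column layout.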

Practice Questions
Q1: A spreadsheet or CSV file containing one row for each game of football played by Illinois, with columns for the date, Illini score, opponent score, and location, is an example of:
Q2: The return addresses on all of the mail delivered to your apartment/dorm/house in the past 30 days.
Q3: Text of a Daily Illini newspaper article.
Q4: A spreadsheet of UIUC gender demographics where every row contains a single major and each column holds the same kind of data (e.g., Major Name, Gender, Year, …).

Q5: Lyrics to a Song/Parody.

Q6: An image/screenshot of a phone/computer web page.
