Types of Data
Before starting any data analysis, it is important that the data scientist knows what type of data they are working with! There are many ways to categorize data, and we are going to start with the broadest categorization.
There are two broad categories of data:
- Structured Data
— AND —
- Unstructured Data
Structured data refers to data that has been organized and categorized in a well-defined format. In DISCOVERY, we work with a lot of structured data in the form of a CSV or "comma-separated values" file. CSV files are among the most universal data formats and are easily read by data science tools.
The format of a CSV file has two basic rules:
- Each line contains a row in the dataset
- Each value in the row is separated by a comma
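The two rules above can be sketched in a few lines of Python using the standard library's `csv` module. The sample text here is an assumption made for illustration: it reuses a handful of columns and rows from the Course Catalog dataset rather than the full file, and the variable names are hypothetical.

```python
import csv
import io

# Illustrative sample only: a header line plus three rows, limited to
# six of the Course Catalog columns (the real file has more).
raw_csv = """Year,Term,YearTerm,Subject,Number,Name
2019,Fall,2019-fa,AAS,100,Intro Asian American Studies
2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies
2019,Fall,2019-fa,AAS,120,Intro to Asian Am Pop Culture
"""

# Rule 1: each line contains one row of the dataset.
lines = raw_csv.splitlines()

# Rule 2: each value in a row is separated by a comma.
# csv.reader applies this rule and also copes with quoted values
# that contain commas themselves.
rows = list(csv.reader(io.StringIO(raw_csv)))

header = rows[0]    # the first line holds the column names
records = rows[1:]  # every following line is one observation

print(len(lines))   # number of lines in the raw text
print(header)
print(records[0])
```

Note that every value comes back as a string at this stage; turning `"100"` into a number is left to whichever tool reads the file next.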
Below are the first four lines of the Course Catalog dataset, a dataset containing every course at Illinois, in raw CSV format:
When we highlight the commas, it's easier to notice the comma-separated columns:
We will use Python to transform the raw CSV file into a DataFrame, an organized table of rows and columns with column headers. Each column is often referred to as a variable and each row is referred to as an observation:
| |Year|Term|YearTerm|Subject|Number|Name|Description|Credit Hours|Section Info|Degree Attributes|
|---|---|---|---|---|---|---|---|---|---|---|
|0|2019|Fall|2019-fa|AAS|100|Intro Asian American Studies|Interdisciplinary introduction to the basic co...|3 hours.|NaN|Social & Beh Sci - Soc Sci, and Cultural Studi...|
|1|2019|Fall|2019-fa|AAS|105|Introduction to Arab American Studies|Interdisciplinary introduction to the basic co...|3 hours.|NaN|Cultural Studies - US Minority course.|
|2|2019|Fall|2019-fa|AAS|120|Intro to Asian Am Pop Culture|Introductory understanding of the way U.S. pop...|3 hours.|NaN|Cultural Studies - US Minority course.|
|3|2019|Fall|2019-fa|AAS|199|Undergraduate Open Seminar|May be repeated to a maximum of 6 hours.|1 TO 5 hours.|NaN|NaN|
|4|2019|Fall|2019-fa|AAS|200|U.S. Race and Empire|Invites students to examine histories and narr...|3 hours.|Same as LLS 200.|Humanities - Hist & Phil, and Cultural Studies...|
|...|...|...|...|...|...|...|...|...|...|...|
|8584|2019|Fall|2019-fa|ZULU|202|Elementary Zulu II|Continuation of ZULU 201 with introduction of ...|5 hours.|Same as AFST 252. Participation in the languag...|NaN|
|8585|2019|Fall|2019-fa|ZULU|403|Intermediate Zulu I|Survey of more advanced grammar; emphasis on i...|4 hours.|NaN|NaN|
|8586|2019|Fall|2019-fa|ZULU|404|Intermediate Zulu II|Continuation of ZULU 403; emphasis on increasi...|4 hours.|NaN|NaN|
|8587|2019|Fall|2019-fa|ZULU|405|Advanced Zulu I|Third year Zulu with emphasis on conversationa...|3 hours.|NaN|NaN|
|8588|2019|Fall|2019-fa|ZULU|406|Advanced Zulu II|Continuation of Zulu 405 with increased emphas...|3 hours.|NaN|NaN|
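As a minimal sketch of how raw CSV text becomes a DataFrame like the one above, the example below uses the pandas library's `read_csv` function. It assumes pandas is installed, and it reads a small in-memory sample (six columns and three rows taken from the table) instead of the real Course Catalog file, whose name and location are not given here.

```python
import io

import pandas as pd

# Illustrative sample only: three rows and six columns from the
# Course Catalog table, written out as raw CSV text.
raw_csv = """Year,Term,YearTerm,Subject,Number,Name
2019,Fall,2019-fa,AAS,100,Intro Asian American Studies
2019,Fall,2019-fa,AAS,105,Introduction to Arab American Studies
2019,Fall,2019-fa,AAS,120,Intro to Asian Am Pop Culture
"""

# read_csv accepts a file path or a file-like object; StringIO lets
# the sample text stand in for a file on disk.
df = pd.read_csv(io.StringIO(raw_csv))

# Each column is a variable and each row is an observation.
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column headers taken from the first line
print(df)
```

With a real file, the same call would take the file's path as its argument. pandas also adds the row index (0, 1, 2, ...) seen in the leftmost column of the table, and it infers a numeric type for columns such as `Year` and `Number`.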
CSV files can also be opened in Excel, Google Sheets, or any other spreadsheet application. For example, below is the same Course Catalog dataset displayed in Excel:
Unstructured data refers to all other data: data that is not organized, not categorized, or not in a well-defined format.
Practice Questions
Q1: A spreadsheet or CSV file containing one row for each game of football played by Illinois, with columns for the date, Illini score, opponent score, and location, is an example of:
Q2: The return addresses on all of the mail delivered to your apartment/dorm/house in the past 30 days.
Q3: Text of a Daily Illini newspaper article.
Q4: A spreadsheet of UIUC gender demographics where every row contains a single major and each column holds the same kind of data (e.g., Major Name, Gender, Year, …).
Q5: Lyrics to a Song/Parody.
Q6: An image/screenshot of a phone/computer web page.