Lesson 5: Cleaning Data
External Tools | Analyzing | Group Skill Building
In this lesson, students begin working with the data that they have been collecting since the first lesson of the chapter in the class "data tracker." They are introduced to the first step in analyzing data: cleaning the data. Students will follow a guide in Code Studio, which demonstrates the common techniques of filtering and sorting data to familiarize themselves with its contents. Then they will correct errors they find in the data by either hand-correcting invalid values or deleting them. Finally they will categorize any free-text columns that were collected to prepare them for analysis. This lesson introduces many new skills with spreadsheets and reveals the sometimes subjective nature of data analysis.
The main purpose here is to have students independently apply some of the data manipulation skills (in spreadsheets) that they've learned over the past few lessons to a new dataset that is relatively uncurated. This is the beginning of the process of "extracting knowledge from data": look at the data and clean it up so that you can process it using computational tools.
Using computational tools to analyze data has made it much easier to find trends and patterns in large datasets. When preparing data for this kind of analysis, however, it’s important to remember that the computer is much less “intelligent” than we might imagine. Small discrepancies in the data may prevent accurate interpretation of trends and patterns or can even make it impossible to use the data in computation in the first place. Cleaning data is therefore an important step in analyzing it, and in many contexts, it may actually take the largest amount of time.
Getting Started (5 mins)
Activity (40 mins)
Wrap Up (5 mins)
Students will be able to:
- Filter and sort a dataset using a spreadsheet tool.
- Identify and correct invalid values in a dataset with the aid of computational tools.
- Justify the need to clean data prior to analyzing it with computational tools.
Preparation
- Prepare the data collected from the survey to share with students. Ensure that a "Teacher only" master copy is kept safely somewhere.
- Student pairs will carry through the next lesson and the Practice PT. You may wish to select these pairs beforehand.
Getting Started (5 mins)
Survey Reminder - Last one!
If you need to prepare the data ahead of time, you might not be able to squeeze in one last entry.
Do one more entry in the class data tracker; you'll be using this data today. Give students a few minutes to fill out the class tracker survey that you started in Lesson 7 - Introduction to Data.
Discuss: Why we need to clean data
Introduce the activity of the day and motivate the need to clean data before using it for analysis.
We have been collecting data about ourselves for several days. Now it’s time to look at that data and see if we can find any interesting patterns or trends within it.
"Before we get started, what challenges do you think we’ll encounter as we begin to peek into the data we've been collecting?"
Ask students to share their ideas with small groups or as a class.
- Students will likely mention many challenges. Some of their comments will probably relate to the state of the data that was collected - in other words, how "clean" it is for analysis.
There are many challenges associated with analyzing data. Today we’re going to look at one that a lot of people don’t often think about. When we collect data, it’s usually “dirty,” which means that, for one reason or another, it’s not ready for analysis. We’re going to investigate what this looks like and learn to use some tools to help us look at and “clean” the data.
Activity (40 mins)
Clean Your Data
Place Students in Pairs:
Students will clean and categorize their data in pairs. They will use this cleaned data later in the unit for the Practice PT.
Sharing the Data:
Pairs are going to need their own copy of the data collected from the survey. You should make your own master copy that will not be changed. To share the data with students, you can:
- Send a copy by email.
- Post a link to a Google Spreadsheet (make sure it’s “View Only”).
- Note: Instructions in Code Studio explain to students how they can “Make a copy” of a Google Sheet for themselves. If you are using a different spreadsheet tool, you should still share a copy.
Students will be guided through a series of activities that walk them through filtering, sorting, cleaning, and categorizing data.
The activity should be done in three parts.
- You may wish to work through this set of activities as a class.
- When using Google Sheets or other online spreadsheet tools, it is possible for two students to clean the same dataset at the same time.
- Students should consult with their partners as they make their categorizations. Remind them that the goal is to have something they could analyze or chart later.
1. Familiarizing Yourself with the Data:
Students learn how to sort and filter in a spreadsheet tool. There is no need yet to actually change any of the values; students should simply learn how these tools work in the spreadsheet tool you are using. Students can move on when they know how to filter and sort data.
2. Cleaning the Data:
Ignore “freeform text” responses for now -- for example, the “What did you do to relax?” column -- and focus attention on values that should be numeric or single words. Students will use sorting and filtering to find invalid values and will either fix or delete them. Students can move on when they have cleaned all “non-freeform” columns.
3. Categorizing Data:
Now focus attention on “freeform text” columns. Students will need to manually create new columns that categorize the inputs. This is a necessary step in order to perform computation with the data but it won’t feel very “algorithmic.” They will need to make choices, which is fine and will be addressed in the wrap up. Students can move on when they have cleaned all “freeform” columns by creating new columns of categories.
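The cleaning step above happens in a spreadsheet during the lesson, but its logic can be sketched in code. Below is a minimal Python sketch (not part of the lesson's spreadsheet workflow) of finding invalid values in a numeric column and deleting rows that can't be salvaged. The "hours_slept" column name, the sample values, and the valid range of 0–24 hours are all illustrative assumptions.

```python
# Minimal sketch of the "cleaning" step using plain Python.
# The "hours_slept" column and its 0-24 valid range are assumptions
# for illustration; real class data will differ.

rows = [
    {"name": "A", "hours_slept": "7"},
    {"name": "B", "hours_slept": "seven"},  # invalid: not a number
    {"name": "C", "hours_slept": "70"},     # invalid: outside 0-24
    {"name": "D", "hours_slept": "8.5"},
]

def clean_hours(value):
    """Return a float in [0, 24], or None if the value can't be salvaged."""
    try:
        hours = float(value)
    except ValueError:
        return None   # non-numeric: this sketch deletes rather than guesses
    if 0 <= hours <= 24:
        return hours
    return None       # out of range: delete rather than guess

cleaned = []
for row in rows:
    hours = clean_hours(row["hours_slept"])
    if hours is not None:  # keep only rows with valid values
        cleaned.append({**row, "hours_slept": hours})

print([r["name"] for r in cleaned])  # names of rows that survived cleaning
```

Note the judgment call baked into the code: out-of-range values are deleted, but a human cleaner might instead hand-correct "70" to 7.0. That choice is exactly the kind of decision students face in this step.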
Wrap Up (5 mins)
Reflection: Is data analysis objective?
Students should reflect on the often subjective nature of cleaning data. Even as data is being cleaned to be used by computers, there will often be a “human element” to how it is cleaned.
Prompt: (Also found on Code Studio)
"In order to analyze data with a computer, we need to clean the data first. Based on your experience today, would you say that data analysis is a perfectly objective process? Why or why not?"
Students should share their ideas in small groups before discussing as a class. The key ideas to touch on are:
- Data cleaning usually requires a human to make decisions about the data.
- There often will not be one “right” way to clean the data and different people will do it differently.
- Any categorizing in particular is quite subjective.
NOTE: Make sure to save the cleaned up data:
Pairs should save their data somewhere they can both access it. They will be using it in the following lesson.
Score or review a written response to the reflection prompt from the wrap up, "Is data analysis objective?" (also found in Code Studio).
Make a simple rubric (basically a checklist) for the steps of the activity that students were supposed to go through:
- Used sorting in a spreadsheet
- Used filtering to help identify outliers for cleaning
- Added a column to categorize some form of free form text.
Multiple Choice (also on Code Studio)
Which of the following is the most accurate statement about cleaning and filtering data?
- Using computing tools to filter and clean raw data makes it impossible to analyze or draw accurate conclusions
- Filtering and cleaning data is a fully automated process that should not require human input or intervention
- Filtering and cleaning data is a human process that does not require the use of computers
- Filtering and cleaning data is necessary to ensure that data is in a form that is better for computers to process
- Getting to Know Your Data_2018
- Student Overview
Getting to Know Your Data
Before analyzing a dataset you'll want to familiarize yourself with the data. Computational tools like a spreadsheet make it easy to quickly get a sense of what your data looks like.
Make a copy of the data
You will need to get your own copy of the data to clean. If your data is currently in a Google Sheet you'll need to "Make a copy" of the data, as shown below.
Do This: Make your own copy of the data collected by your class
Making a copy of the dataset
Filtering Your Data
Your data may be too large to look at all at once. One way to address this problem is to filter the data so that you are only looking at some of the rows. Here's how you can do it in Google Sheets.
Do This: Follow the example in the animation below. Filter your own data on one of its columns
Filtering the dataset so that only rows in which a person slept for 7 hours are shown.
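The idea behind a single-column filter can be sketched in plain Python: keep only the rows whose value in one column matches what you chose. The column name and sample values below are illustrative assumptions, echoing the 7-hours-of-sleep example.

```python
# Sketch of a single-column filter: keep only rows where hours slept == 7.
# The "hours_slept" column and values are illustrative assumptions.

rows = [
    {"name": "A", "hours_slept": 7},
    {"name": "B", "hours_slept": 5},
    {"name": "C", "hours_slept": 7},
]

filtered = [row for row in rows if row["hours_slept"] == 7]

print([r["name"] for r in filtered])  # only the 7-hour rows remain
```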
More Complex Filtering
Filters may include more than one value, just check all the values you'd like to include. You can even use a conditional statement as a filter. For example, you can make a filter that will only show values if the value is less than some number you specify. It's also possible to filter on multiple columns at the same time. More complex filters help refine how you look at your data.
Do This: Add filters to two different columns. In one column choose at least two values. In the other column use a Conditional filter, as shown below.
Filtering to include people who feel "Good" or "Great" and who slept 7 hours or more.
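The same two-part filter described above can be sketched in plain Python: a checkbox-style filter allowing multiple values on one column, combined with a conditional filter on a second column. The column names ("mood", "hours_slept") and sample rows are illustrative assumptions.

```python
# Sketch of a two-column filter: keep rows where mood is "Good" or "Great"
# AND hours slept is 7 or more. Column names and values are assumptions.

rows = [
    {"mood": "Good",  "hours_slept": 8},
    {"mood": "Okay",  "hours_slept": 9},
    {"mood": "Great", "hours_slept": 6},
    {"mood": "Great", "hours_slept": 7},
]

allowed_moods = {"Good", "Great"}       # multi-value filter on one column
filtered = [
    row for row in rows
    if row["mood"] in allowed_moods     # checkbox-style filter
    and row["hours_slept"] >= 7         # conditional filter on a second column
]

print(filtered)  # only rows passing BOTH filters remain
```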
Sorting Your Data
Sorting will reorder your rows of data by one of the columns. This makes it easy to see the smallest or largest value in each column. You may also notice patterns in your data once it is in sorted order.
Do This: Sort your data by at least one of the columns, both A -> Z and Z -> A
Sorting the dataset by number of hours worked, both from smallest to largest and from largest to smallest
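Sorting in both directions can be sketched the same way: reorder the rows by one column, then reverse for the opposite order. The "hours_worked" column and sample rows are illustrative assumptions.

```python
# Sketch of sorting rows by one column, both A -> Z and Z -> A.
# The "hours_worked" column and values are illustrative assumptions.

rows = [
    {"name": "A", "hours_worked": 5},
    {"name": "B", "hours_worked": 2},
    {"name": "C", "hours_worked": 9},
]

ascending = sorted(rows, key=lambda r: r["hours_worked"])                 # A -> Z
descending = sorted(rows, key=lambda r: r["hours_worked"], reverse=True)  # Z -> A

print([r["name"] for r in ascending])   # smallest hours first
print([r["name"] for r in descending])  # largest hours first
```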
- Cleaning Data_2018
- Student Overview
Why is it Important to Clean Data?
Data and data analytics have become key factors in problem solving. Thanks to the internet, people have been able to crowdsource data collection and develop even more insights about many prominent issues around the world, ranging from election statistics to restaurant reviews. Leveraging the internet is an effective way to collect and share data, but it can introduce problems when users input their information in different ways. To counteract the issues that come along with non-standardized data, it is necessary to 'clean' it. Through this process, all inaccuracies, inconsistencies, and irrelevant data are either removed or corrected.
Limitations of the Computer
Using computational tools to analyze data has made it much easier to find trends and patterns in large datasets. When preparing data for this kind of analysis, however, it's important to remember that the computer is much less "intelligent" than we might imagine. Small discrepancies in the data may prevent accurate interpretation of trends and patterns and can even make it impossible to use the data at all. Cleaning data is therefore an important step in analyzing it, and in many contexts, it may actually take the largest amount of time.
In the Curriculum
In lesson 13 students will follow a guide in Code Studio, which demonstrates the common techniques of filtering and sorting data, to familiarize themselves with its contents. Then they will correct errors they find in the data by either hand-correcting invalid values or deleting them. Finally they will categorize any free-text columns that were collected to prepare them for analysis. This lesson introduces many new skills with spreadsheets and reveals the sometimes subjective nature of data analysis. The main purpose here is to have students independently apply some of the data manipulation skills (in spreadsheets) that they've learned over the past few lessons to a new dataset that is relatively uncurated. This is the beginning of the process of "extracting knowledge from data": look at the data and clean it up so that you can process it using computational tools.
- Categorizing Data_2018
- Student Overview
In order to use computers to analyze data, we usually need it to be standardized in some way.
Data collected as "free form text" is particularly susceptible to this problem. If you ask people "What did you do last night?" you will likely get a different response from every single person. Making charts or tables with this information would be meaningless and confusing.
Free form text data like this may be useful for a human to read but cannot easily be used by a computer.
In order to fix this you might need to create new columns of data by hand which categorize free form text into data that is more useful for computation.
In the example below you'll see one possible way to categorize responses to the question, "What did you do to relax last night?" In this case the data is being categorized by what the person was doing. The resulting column of data will be much easier to use when analyzing this data with a computer.
Standardizing how people relaxed by how they do the activity. This is only one of many different ways this data could be categorized.
Note: This process is being done by hand using only one of many possible methods for creating categories. There are always many different ways to categorize data and there will rarely be one clearly "right" way. You'll have to make judgment calls about what makes the most sense.
Create at least one new column in your dataset that standardizes a column of data collected as "free form text".
To do this you will need to invent a set of categories that applies to your data. This may take a while to do. Consult with your partners and make sure you have a generally agreed-upon set of rules as you start.
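The hand-built rules described above can be sketched in code as a keyword-to-category lookup. The keywords, categories, and sample responses below are one possible (and deliberately subjective) choice, made up for illustration; a different pair of students could reasonably pick different rules.

```python
# Sketch of hand-built categorization rules for free-form text.
# The keyword-to-category mapping is one subjective choice among many;
# keywords, categories, and responses are illustrative assumptions.

keyword_categories = {
    "tv": "Screen time",
    "video games": "Screen time",
    "read": "Reading",
    "walk": "Exercise",
    "ran": "Exercise",
}

def categorize(response):
    """Map a free-form response to a category, or 'Other' if no rule matches."""
    text = response.lower()
    for keyword, category in keyword_categories.items():
        if keyword in text:
            return category
    return "Other"

responses = ["Watched TV", "I read a book", "went for a walk", "baked cookies"]
print([categorize(r) for r in responses])  # new "category" column values
```

Notice that "baked cookies" falls into "Other" because no rule covers it; deciding whether to add a "Cooking" category, fold it into something broader, or leave it as "Other" is exactly the judgment call the wrap up discussion addresses.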
For reference, this question is based on the CSP Framework essential knowledge statement:
3.1.1B Digital information can be filtered and cleaned by using computers to process information.
In order to analyze data with a computer, we need to clean the data first. Based on your experience today, would you say that data analysis is a perfectly objective process? Why or why not? (Limit to about 100 words)
Computer Science Principles
1.2 - Computing enables people to use creative development processes to create computational artifacts for creative expression or to solve a problem.
1.2.1 - Create a computational artifact for creative expression. [P2]
- 1.2.1A - A computational artifact is anything created by a human using a computer and can be, but is not limited to, a program, an image, audio, video, a presentation, or a web page file.
- 1.2.1B - Creating computational artifacts requires understanding and using software tools and services.
- 1.2.1C - Computing tools and techniques are used to create computational artifacts and can include, but are not limited to, programming IDEs, spreadsheets, 3D printers, or text editors.
- 1.2.1D - A creatively developed computational artifact can be created by using nontraditional, nonprescribed computing techniques.
- 1.2.1E - Creative expressions in a computational artifact can reflect personal expressions of ideas or interests.
1.2.4 - Collaborate in the creation of computational artifacts. [P6]
- 1.2.4A - A collaboratively created computational artifact reflects effort by more than one person.
- 1.2.4B - Effective collaborative teams consider the use of online collaborative tools.
3.1 - People use computer programs to process information to gain insight and knowledge.
3.1.1 - Use computers to process information, find patterns, and test hypotheses about digitally processed information to gain insight and knowledge. [P4]
- 3.1.1A - Computers are used in an iterative and interactive way when processing digital information to gain insight and knowledge.
- 3.1.1B - Digital information can be filtered and cleaned by using computers to process information.
3.1.2 - Collaborate when processing information to gain insight and knowledge. [P6]
- 3.1.2A - Collaboration is an important part of solving data driven problems.
- 3.1.2B - Collaboration facilitates solving computational problems by applying multiple perspectives, experiences, and skill sets.
- 3.1.2C - Communication between participants working on data driven problems gives rise to enhanced insights and knowledge.
- 3.1.2D - Collaboration in developing hypotheses and questions, and in testing hypotheses and answering questions, about data helps participants gain insight and knowledge.
- 3.1.2E - Collaborating face-to-face and using online collaborative tools can facilitate processing information to gain insight and knowledge.
- 3.1.2F - Investigating large data sets collaboratively can lead to insight and knowledge not obtained when working alone.
3.2 - Computing facilitates exploration and the discovery of connections in information.
3.2.1 - Extract information from data to discover and explain connections, patterns, or trends. [P1]
- 3.2.1A - Large data sets provide opportunities and challenges for extracting information and knowledge.
- 3.2.1B - Large data sets provide opportunities for identifying trends, making connections in data, and solving problems.
- 3.2.1C - Computing tools facilitate the discovery of connections in information within large data sets.
- 3.2.1D - Search tools are essential for efficiently finding information.
- 3.2.1E - Information filtering systems are important tools for finding information and recognizing patterns in the information.
- 3.2.1F - Software tools, including spreadsheets and databases, help to efficiently organize and find trends in information.
3.2.2 - Use large data sets to explore and discover information and knowledge. [P3]
- 3.2.2B - The storing, processing, and curating of large data sets is challenging.
- 3.2.2C - Structuring large data sets for analysis can be challenging.
- 3.2.2G - The effective use of large data sets requires computational solutions.
7.1 - Computing enhances communication, interaction, and cognition.
7.1.2 - Explain how people participate in a problem solving process that scales. [P4]
- 7.1.2C - Human computation harnesses contributions from many humans to solve problems related to digital data and the Web.
- 7.1.2D - Human capabilities are enhanced by digitally enabled collaboration.
CSTA K-12 Computer Science Standards (2017)
DA - Data & Analysis
- 3A-DA-11 - Create interactive data visualizations using software tools to help others better understand real-world phenomena.
- 3B-DA-05 - Use data analysis tools and techniques to identify patterns in data representing complex systems.
- 3B-DA-06 - Select data collection tools and techniques to generate data sets that support a claim or communicate information.
- 3B-DA-07 - Evaluate the ability of models and simulations to test and support the refinement of hypotheses.
IC - Impacts of Computing
- 3A-IC-25 - Test and refine computational artifacts to reduce bias and equity deficits.