Lesson 3: Check Your Assumptions

Overview

This lesson asks students to consider carefully the assumptions they make when interpreting data and data visualizations. The class begins by examining how the Google Flu Trends project tried and failed to use search trends to predict flu outbreaks. Students then read a report on the digital divide, which highlights how access to technology differs widely by personal characteristics like race and income. This report challenges the widespread assumption that data collected online is representative of the population at large. To practice identifying assumptions in data analysis, students are given a series of scenarios in which data-driven decisions are made based on flawed assumptions. They will need to identify the assumptions being made (most notably those related to the digital divide) and explain why these assumptions lead to incorrect conclusions.

Purpose

In this lesson we look more deeply at why we separate the what from the why when looking at data. The main purpose is to raise awareness of the assumptions that we (all people) make when looking at data and to call them out. Some of these assumptions lie hidden beneath the surface, and we want to shed light on them by looking at examples from the news. This is a useful mode of reflection that will serve students well when doing reflective writing on the performance tasks.

Analyzing and interpreting data typically requires some assumptions about the accuracy of the data and the cause of the relationships observed within it. When decisions are made based on a collection of data, they often rest just as much on that set of assumptions as on the data itself. Identifying and validating (or disproving) assumptions is therefore an important part of data analysis. Furthermore, clear communication about how data was interpreted should include an account of the assumptions made along the way.

Agenda

Getting Started (15 mins)

Activity (25 mins)

Wrap-up (15 mins)

Assessment

Extended Learning

Objectives

Students will be able to:

  • Define the digital divide as the variation in access to or use of technology by various demographic characteristics.
  • Identify assumptions made when drawing conclusions from data and data visualizations.

Teaching Guide

Getting Started (15 mins)

Discussion Goal

Introduce the idea that incorrect assumptions about a dataset can lead to faulty conclusions.

Show this Google Trends video, which describes how Google used the trending data students saw earlier in the unit to predict outbreaks of the flu.

Thinking Prompt: What are the potential beneficial effects of using a tool like Google Flu Trends?

Discuss: Students should share their responses in small groups or as a class. In general, responses should center on the following ideas:

  • Earlier prediction of flu outbreaks could limit the number of people who get sick or die from the flu each year.
  • More accurate and earlier detection of flu outbreaks can ensure resources for combating outbreaks are allocated and deployed earlier (e.g., clinics could be deployed to affected neighborhoods).

Distribute:

Share one or more of these articles with the class. They detail why Google Flu Trends eventually failed and should serve as a basis of discussion for some of the potential negative effects of large-scale data analysis.

Teaching Tip

Reading Strategy: Most of these articles analyze the problems with Google Flu Trends in more depth than the discussion requires. You may wish to read one of these articles together as a class and just touch on the key points outlined below.

Thinking Prompt:

"Why did Google Flu Trends eventually fail? What assumptions did they make about their data or their model that ultimately proved not to be true?"

Discuss:

Once students have read one of the articles, review its key points as a class. The most important points about Google Flu Trends are listed below:

  • Google Flu Trends worked well in some instances but often over-estimated, under-estimated, or entirely missed flu outbreaks. A notable example occurred when Google Flu Trends largely missed the outbreak of the H1N1 flu virus.

  • Just because someone is reading about the flu doesn’t mean they actually have it.

  • Some search terms like “high school basketball” might be good predictors of the flu one year but clearly shouldn’t be used to measure whether someone has the flu.

  • In general, many terms may have been good predictors of the flu for a while only because, like high school basketball, they are more searched in the winter when more people get the flu.

  • Google began recommending searches to users, which skewed what terms people searched for. As a result, the tool was measuring Google-generated suggested searches as well, which skewed results.
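
If students would benefit from seeing the seasonal-coincidence point concretely, a minimal Python sketch along the following lines may help. It assumes two made-up weekly series, flu cases and searches for "high school basketball", that both peak in midwinter. The function names (pearson, seasonal) and every number are hypothetical and do not come from Google Flu Trends data. The sketch shows that two quantities can be almost perfectly correlated simply because they share a season, and that the correlation vanishes if a hypothetical outbreak peaks 13 weeks off-season.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def seasonal(week, low, high, peak_week=0):
    """Hypothetical yearly curve that is highest at peak_week and lowest half a year later."""
    phase = 2 * math.pi * (week - peak_week) / 52
    return low + (high - low) * (1 + math.cos(phase)) / 2

weeks = range(52)

# Made-up numbers, for illustration only: week 0 stands for midwinter.
basketball_searches = [seasonal(w, low=20, high=80) for w in weeks]
winter_flu = [seasonal(w, low=100, high=900) for w in weeks]
# A hypothetical outbreak that peaks 13 weeks away from midwinter.
off_season_flu = [seasonal(w, low=100, high=900, peak_week=13) for w in weeks]

r_winter = pearson(basketball_searches, winter_flu)
r_off_season = pearson(basketball_searches, off_season_flu)

print(f"basketball vs. winter flu:     r = {r_winter:.2f}")
print(f"basketball vs. off-season flu: r is about {abs(r_off_season):.2f}")
```

In the first comparison the search term looks like an excellent predictor, yet nothing about basketball causes the flu. A model that relied on it would miss the off-season outbreak in the second comparison.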

Transitional Remarks

The amount of data now available makes it very tempting to draw conclusions from it. There are certainly many beneficial results of analyzing this data, but we need to be very careful. To interpret data usually means making key assumptions. If those assumptions are wrong, our entire analysis may be wrong as well. Even when you’re not conducting the analysis yourself, it’s important to start thinking about what assumptions other people are making when they analyze data, too.

Activity (25 mins)

The Digital Divide and Checking Your Assumptions

Distribute: Activity Guide - Digital Divide and Checking Assumptions

Part 1: The Digital Divide

This activity guide begins with a link to a report from Pew Research which examines the “digital divide.” Students should look through the visualizations in this report and record responses to the questions found in the activity guide.

Discuss:

In small groups or as a class, students should discuss the answers they have recorded in their activity guides. Key points for the following discussion include:

  • Access to and use of the Internet differ by income, race, education, age, disability, and geography.
  • As a result, some groups are over- or under-represented when looking at activity online.
  • When we see behavior on the Internet, like search trends, we may be tempted to assume that access to the Internet is universal and so we are taking a representative sample of everyone.
  • In reality, a “digital divide” leads to some groups being over- or under-represented. Some people may not be on the Internet at all.
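
To make the over- and under-representation point concrete, here is a minimal Python sketch under stated assumptions. The two groups, their sizes, their Internet access rates, and their support for a proposal are all invented for illustration and are not taken from the Pew Research report. The function names (population_support, online_support) are likewise hypothetical. The sketch compares the true level of support in the whole population with the level that would be observed if only people who are online were counted.

```python
# Hypothetical groups, for illustration only:
# (name, number of people, share with Internet access, share who support a proposal)
groups = [
    ("higher income", 400, 0.95, 0.30),
    ("lower income", 600, 0.60, 0.70),
]

def population_support(groups):
    """Share of the whole population that supports the proposal."""
    total = sum(people for _, people, _, _ in groups)
    supporters = sum(people * support for _, people, _, support in groups)
    return supporters / total

def online_support(groups):
    """Share of supporters among only those people who are online."""
    online = sum(people * access for _, people, access, _ in groups)
    supporters = sum(people * access * support for _, people, access, support in groups)
    return supporters / online

print(f"whole population: {population_support(groups):.0%}")  # prints 54%
print(f"online only:      {online_support(groups):.0%}")      # prints 49%
```

With these made-up numbers, a majority of the population supports the proposal, but data collected online suggests a minority does, because the group with less Internet access is under-represented.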

Part 2: Checking Your Assumptions

Students should complete the second half of the activity guide. They are presented with a set of scenarios in which data was used to make a decision. Students will be asked to examine and critique the assumptions used to make these decisions. Then they will suggest additional data they would like to collect or other ways the decision could be made more reliably.

Wrap-up (15 mins)

Discussion Goal

Students should practice identifying when data is being interpreted and what assumptions are made to do so, by sharing their work from the activity guide.

Discuss: In small groups or as a class, students should share their responses on the activity guide. Use this opportunity to reinforce a group understanding of what kinds of assumptions are being made to interpret the data. Some possible types of assumptions are listed below.

  • The data collected is representative of the population at large (e.g., ignoring the “digital divide”).
  • Activity online will lead to activity in the real world (e.g., people expressing interest in a candidate online means they will vote for him or her in real life).
  • Data is being collected in the manner intended (e.g., ratings are generated by actual customers, instead of business owners or robots).
  • Many other assumptions regarding data are possible.

Teaching Tip

Leading the Discussion:

The answer key to the activity guide contains possible assumptions that could be made in each data scenario presented. In most instances, there will be many other possible assumptions. The focus here should be primarily on building a habit of checking assumptions before jumping to conclusions about trends in data.

Closing Remarks

Would anyone like to revise the explanation they gave for their Google Trends research in the previous lesson?

Has what you’ve learned today changed your perspective on the “story” you thought the data was telling?

In this course, we will be looking at a lot of data, so it is important early on to get in the habit of recognizing what assumptions we are making when we interpret that data.

In general, it is a good idea to state your assumptions explicitly and to think critically about what assumptions other people are making when they interpret data.

We may not become expert data analysts in this class, and even organizations like Google can make mistakes when interpreting data. Sometimes, the best we can do is just be honest with ourselves and other people about what assumptions we’re making, correct our wrong assumptions where we can, and keep an eye out for the assumptions other people are making when they try to tell us “what the data is saying.”

Assessment

Assessment Possibilities

Code Studio: Assessment questions are available on Code Studio.

Score or peer review the activity guide:

  • There is an answer key to the questions listed in the activity guide.

Extended Learning

Share with students this article, which criticizes inaccurate or misleading uses of Google Trends in news stories: https://medium.com/@dannypage/stop-using-google-trends-a5014dd32588#.dd7bifrl5

Check Your Understanding

Student Instructions

Consider the following statement from the CS Principles course framework:

7.4.1C The global distribution of computing resources raises issues of equity, access, and power.

Briefly describe one of these issues that you learned about in the lesson and how it affects your life or the lives of people you know. Keep your response to about 100 words (about 3-5 sentences).

Student Instructions

A survey asks high school seniors the following questions:

  • What state do you live in?
  • How likely are you to attend college in your home state? (on a scale of 1-5, 5 meaning "very likely")
  • What do you plan to study?

A student, Amara, plans to use the survey data to create a visualization and short summary of students' plans for college. First she wants to learn more about how the data was collected. Of the following things she might learn about the survey, which are the most likely sources of bias in the results based on how it was collected?

Choose two answers.

Standards Alignment

Computer Science Principles

3.1 - People use computer programs to process information to gain insight and knowledge.
3.1.1 - Use computers to process information, find patterns, and test hypotheses about digitally processed information to gain insight and knowledge. [P4]
  • 3.1.1E - Patterns can emerge when data is transformed using computational tools.
3.1.2 - Collaborate when processing information to gain insight and knowledge. [P6]
  • 3.1.2A - Collaboration is an important part of solving data driven problems.
  • 3.1.2B - Collaboration facilitates solving computational problems by applying multiple perspectives, experiences, and skill sets.
  • 3.1.2C - Communication between participants working on data driven problems gives rise to enhanced insights and knowledge.
  • 3.1.2D - Collaboration in developing hypotheses and questions, and in testing hypotheses and answering questions, about data helps participants gain insight and knowledge.
  • 3.1.2F - Investigating large data sets collaboratively can lead to insight and knowledge not obtained when working alone.
3.2 - Computing facilitates exploration and the discovery of connections in information.
3.2.1 - Extract information from data to discover and explain connections, patterns, or trends. [P1]
  • 3.2.1A - Large data sets provide opportunities and challenges for extracting information and knowledge.
  • 3.2.1B - Large data sets provide opportunities for identifying trends, making connections in data, and solving problems.
  • 3.2.1C - Computing tools facilitate the discovery of connections in information within large data sets.
7.4 - Computing innovations influence and are influenced by the economic, social, and cultural contexts in which they are designed and used.
7.4.1 - Explain the connections between computing and economic, social, and cultural contexts. [P1]
  • 7.4.1A - The innovation and impact of social media and online access is different in different countries and in different socioeconomic groups.
  • 7.4.1B - Mobile, wireless, and networked computing have an impact on innovation throughout the world.
  • 7.4.1C - The global distribution of computing resources raises issues of equity, access, and power.
  • 7.4.1D - Groups and individuals are affected by the “digital divide” — differing access to computing and the Internet based on socioeconomic or geographic characteristics.

CSTA K-12 Computer Science Standards (2017)

DA - Data & Analysis
  • 3B-DA-06 - Select data collection tools and techniques to generate data sets that support a claim or communicate information.
IC - Impacts of Computing
  • 3A-IC-24 - Evaluate the ways computing impacts personal, ethical, social, economic, and cultural practices.
  • 3B-IC-26 - Evaluate the impact of equity, access, and influence on the distribution of computing resources in a global society.