Lesson 5: Lossy Compression and File Formats

Overview

This lesson is mostly an investigation of different kinds of file formats that exist in the real world. The lesson begins with students exploring a mock “lossy” text compression scheme as a way to learn about “lossy” compression. Then we do a jigsaw “rapid research” activity in which pairs of students research a real image, text, or sound encoding file format and determine what kind of compression it uses and the theory behind it. This lesson also sets the stage for the practice Performance Task (Encode a Complex Thing) that follows this lesson.

Purpose

The main purpose of this lesson is straightforward: understand what lossy compression is and when/why it might be used. It's mostly used in visual or audio formats where a loss in precision is largely undetectable to human eyes and ears. Beyond that, we want to continue to build students' skills and comfort with rapidly doing research online, reporting back, and verifying that the information they got was good. This is a good life skill, but it will also serve students well for the Explore Performance Task. The hope with this lesson is that students will have greater insight into these technical articles now that they know a bit about the binary makeup of things; many of the image file format articles actually show the binary file format and what bits mean what.

In particular, students might discover, or you might point out, that the BMP image format is basically the image encoding format used in a previous lesson, and that the GIF image format and ZIP compression scheme are versions of the text compression scheme we used as well. In the case of GIF, it uses a dictionary of up to 256 different colors, and each pixel is stored as a small number that refers to the dictionary.

Agenda

Getting Started (10 mins)

Activity

Wrap-up (5 mins)

Assessment

Extended Learning

Objectives

Students will be able to:

  • Explain the difference between lossy and lossless compression.
  • Identify common computer file types and whether they are compressed or not, and whether compression is lossy or lossless.
  • Read a technical article on the web and sift its contents for targeted information.

Preparation

  • Copies of File Formats Rapid Research worksheet for students

Links

Heads Up! Please make a copy of any documents you plan to share with students.

For the Teachers

For the Students

Vocabulary

  • Lossless Compression - a data compression algorithm that allows the original data to be perfectly reconstructed from the compressed data.
  • Lossy Compression - (or irreversible compression) a data compression method that uses inexact approximations, discarding some data to represent the content. Most commonly seen in image formats like .jpg.

Teaching Guide

Getting Started (10 mins)

Quick Discovery: Lossy Text Compression

  • With a partner, go to the Lossy Text Compression App - App Lab.
  • Answer the following questions:
    • What is happening in the app?
    • Should this “count” as text compression? Why or why not?
    • What do you think “lossy” refers to?

Group discussion (brief)

  • Verify that students saw the text was being reduced by keeping the first letter of every word and throwing out all the vowels.
  • Get some opinions about whether it should count as text compression.
    • Opinions might vary, but it is true that the amount of text was reduced.
    • However, the work of reconstructing was left to human intelligence and intuition.
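
If students want to see the rule spelled out, the behavior described above can be sketched in a few lines of Python. This is only an illustration, not the App Lab app's actual code: the function names are made up here, and it assumes the rule is "keep the first character of each word, then drop every later vowel."

```python
VOWELS = set("aeiouAEIOU")

def lossy_compress_word(word):
    """Keep the first character; drop every later vowel (assumed rule)."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)

def lossy_compress(text):
    """Apply the word rule to each space-separated word."""
    return " ".join(lossy_compress_word(w) for w in text.split(" "))

original = "the quick brown fox jumps over the lazy dog"
compressed = lossy_compress(original)
print(compressed)  # th qck brwn fx jmps ovr th lzy dg
print(len(original), "->", len(compressed), "characters")
```

Notice that there is no matching "decompress" function. Any attempt to write one would have to guess which vowels were removed, which is exactly the work the app leaves to the human reader.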

Lossless vs. Lossy compression

Remarks

  • The text compression we did a few lessons ago is known as lossless compression because, in compressing and then reconstructing the original text, nothing was lost; every character that was part of the original text could be recovered.
  • Lossy compression -- yes, that’s the official word -- does something else. Lossy compression schemes are ones in which “useless” or less-than-totally-necessary information is thrown out in order to reduce the size of the data.
    • The lossy text compression app did that, and for the most part, you could probably make out what the text was supposed to say.
    • But it’s not perfect. If you saw the word “fd” it could be “food”, “feed”, “feud”, or “fad”. By reading it in context, you might know what it was supposed to be, but there’s no real way to know what the original word was. The original word is lost.
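
The difference between the two kinds of compression can be checked directly. The sketch below is an illustration under stated assumptions: drop_vowels is a hypothetical stand-in for the app's rule, and Python's standard zlib module stands in for "a lossless scheme" (it is not the scheme used in the earlier lesson).

```python
import zlib

def drop_vowels(word):
    # Hypothetical stand-in for the app's rule:
    # keep the first letter, drop every later vowel.
    return word[:1] + "".join(c for c in word[1:] if c.lower() not in "aeiou")

# Lossy: four different words collapse to the same output.
candidates = ["food", "feed", "feud", "fad"]
lossy_outputs = {drop_vowels(w) for w in candidates}
print(lossy_outputs)  # {'fd'}

# Lossless: decompressing gives back exactly the bytes we started with.
text = "food feed feud fad".encode("utf-8")
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(restored == text)  # True
```

Because four inputs map to one output, no program can tell which word "fd" came from. The lossless round trip always returns the original bytes, although for an input this short the compressed form is not necessarily smaller.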

Transition:

We’ve been looking at image file formats. And we’ve also seen text compression. Both of those attempted to render perfectly every piece of information.

Both the image file format and the text compression scheme we used were lossless. Lossy compression schemes usually take advantage of the fact that a human is supposed to interpret the data at the other end, and human brains are good at filling the gaps when information is missing.

Activity

Today you and a partner will do some rapid research and reporting on some of the most common file formats. Use the web as your research tool.

Optional:

Content Corner

  • Students might discover, or you might point out, that the BMP image format is basically the image encoding format used in a previous lesson.
  • The GIF image format and ZIP compression scheme are versions of the text compression scheme we used as well. In the case of GIF, it uses a dictionary of up to 256 different colors, and each pixel is stored as a small number that refers to the dictionary.
  • The bit layouts of BMP and GIF should be understandable for students.
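
The color dictionary idea can be sketched as follows. This is a simplified illustration, not the real GIF byte layout: the function names are invented for this example, and an actual GIF file goes on to compress the list of index numbers with LZW, which is not shown.

```python
def to_indexed(pixels):
    """Turn a list of (r, g, b) pixels into (palette, indices)."""
    palette = []   # the "dictionary" of distinct colors
    lookup = {}    # color -> position in the palette
    indices = []   # one small number per pixel
    for color in pixels:
        if color not in lookup:
            if len(palette) == 256:
                raise ValueError("more than 256 distinct colors")
            lookup[color] = len(palette)
            palette.append(color)
        indices.append(lookup[color])
    return palette, indices

def from_indexed(palette, indices):
    """Rebuild the original pixel list from the palette and indices."""
    return [palette[i] for i in indices]

RED, WHITE = (255, 0, 0), (255, 255, 255)
pixels = [RED, RED, WHITE, RED, WHITE, WHITE, RED, RED]
palette, indices = to_indexed(pixels)
print(palette)  # [(255, 0, 0), (255, 255, 255)]
print(indices)  # [0, 0, 1, 0, 1, 1, 0, 0]
assert from_indexed(palette, indices) == pixels  # nothing was lost
```

Each pixel now costs one small index instead of three color values, and the round trip is exact as long as the image has no more than 256 distinct colors.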

Jigsaw research.

  • Distribute File Formats Rapid Research - Worksheet.
  • Assign pairs or small groups one of the file format types listed in the table. It’s OK if two groups research the same type.
  • Each pair/group should research the file format assigned to it and fill in one row of the table.

Teaching Tip

You can use any sharing strategy you like. The goal is for every student to have her file format table filled in for the first two columns (data type and compression type). Knowing how they work is also good, but some are rather complicated. It might have to be left a mystery.

Share results.

Ask for a volunteer to read what he found for the file type he was assigned. Ask if anyone else who researched that type has anything to add (or clarify) about what the first person said. Do this for each of the file types.

Wrap-up (5 mins)

Content Corner

The file extension you often see on a file (for example: myPhoto.jpg) is really just an indicator to the computer of how the underlying bits are organized, so the computer can interpret them. If you change the name of the file to myPhoto.gif, that does not magically change the underlying bits; all you’ve done is confuse the computer. A program that trusts the extension will attempt to interpret the file as a GIF when the bits are really in JPEG format, so it may fail to open the file.
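
One way to make this concrete is to look at the first few bytes of a file, which for many formats follow a fixed pattern regardless of the file's name. The sketch below is a minimal illustration: sniff_format and the SIGNATURES table are names made up for this example, the table covers only four image formats, and the sample data is a fake header rather than a real photo.

```python
# Leading byte patterns that identify some common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"BM", "BMP"),
]

def sniff_format(data):
    """Guess a format from the leading bytes, ignoring any file name."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return "unknown"

# Pretend these bytes came from a file that was renamed to myPhoto.gif.
jpeg_like = b"\xff\xd8\xff\xe0" + b"\x00" * 16
print(sniff_format(jpeg_like))  # JPEG
```

Renaming the file changes none of these bytes, so the check still reports JPEG. To try it on a real file, read the first bytes with open(path, "rb").read(16) and pass them to sniff_format.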

  • There was a question at the bottom of the worksheet that asked if you had ever heard of any other file type that you were curious about. What were those?
  • Do a whip around and write what students say on the board. Types might include: .doc, .pdf, .docx, .mp4, .mov, .html, etc.
  • All of these are specialized file formats in which some person or group decided how to organize (and in some cases, compress) the bits that make up the file type. There is nothing magical about them.

Assessment

Assessment Possibilities

Matching: Match the encoding type with the data type and compression. (In Code Studio.)

Extended Learning

  • GIF and PNG are both lossless image compression formats. Which one is better?
  • Read Blown to Bits (www.bitsbook.com), Chapter 3, Ghosts in the Machine, pp. 88-90 (Reducing Data, Sometimes Without Losing Information), then answer the following question:
    • Do you think file compression will always be needed, considering the advances in data storage, the speed of computers, and the speed of the Internet?
  • Read Blown to Bits (www.bitsbook.com), Chapter 3, Ghosts in the Machine, pp. 90-94 (Technological Birth and Death), then answer the following questions:
    • Data formats are constantly changing. What challenges does this present for historians? For a given document, movie, or audio file, what are all the component pieces that need to be preserved along with it?
    • There is concern about Microsoft’s de-facto “.doc” format. Do similar concerns exist for cloud services such as Cloud Data formats and Cloud APIs? What are some such APIs and what will the dangers be if those de-facto standards are adopted?

Lesson Vocabulary & Resources

Student Instructions

Unit 2: Lesson 5 - Lossy Compression and File Formats

Background

File formats such as JPEG or WAV or MP3 are encoding schemes for organizing and saving the bits that represent images, sounds, or other data. Sometimes all of the bits in data need to be saved, and sometimes they don’t.

Vocabulary

  • Lossless: A compression scheme in which every bit of the original data can be recovered from the compressed file.
  • Lossy: A compression scheme in which “useless” or less-than-totally-necessary information is thrown out in order to reduce the size of the data. The eliminated data is unrecoverable.

Lesson

  • Jigsaw Rapid Research on file formats
  • Share your findings

Resources

  • Check Your Understanding

Student Instructions

Standards Alignment

CSTA K-12 Computer Science Standards (2011)

CD - Computers & Communication Devices
  • CD.L2:4 - Use developmentally appropriate, accurate terminology when communicating about technology.
CL - Collaboration
  • CL.L2:3 - Collaborate with peers, experts and others using collaborative practices such as pair programming, working in project teams and participating in-group active learning activities.
CT - Computational Thinking
  • CT.L2:7 - Represent data in a variety of ways including text, sounds, pictures and numbers.
  • CT.L3A:6 - Analyze the representation and trade-offs among various forms of digital information.

Computer Science Principles

3.3 - There are trade offs when representing information as digital data.
3.3.1 - Analyze how data representation, storage, security, and transmission of data involve computational manipulation of information. [P4]
  • 3.3.1A - Digital data representations involve trade offs related to storage, security, and privacy concerns.
  • 3.3.1C - There are trade offs in using lossy and lossless compression techniques for storing and transmitting data.
  • 3.3.1D - Lossless data compression reduces the number of bits stored or transmitted but allows complete reconstruction of the original data.
  • 3.3.1E - Lossy data compression can significantly reduce the number of bits stored or transmitted at the cost of being able to reconstruct only an approximation of the original data.
  • 3.3.1G - Data is stored in many formats depending on its characteristics (e.g., size and intended use).