Lesson 2: Text Compression

Overview

At some point we reach a physical limit of how fast we can send bits and if we want to send a large amount of information faster, we have to find a way to represent the same information with fewer bits - we must compress the data.

In this lesson, students will use the Text Compression Widget to compress segments of English text by looking for patterns and substituting symbols for larger patterns of text. After some experimentation students are asked to come up with a process (or algorithm) for arriving at a "good" amount of compression despite the fact that there is no way to know what is best or optimal. In developing a so-called "heuristic approach" to this problem, students will grapple with the tradeoffs in compressing data and begin to develop a sense of computing problems that are “hard” to solve.

Purpose

This is a big lesson that covers a lot of bases. It should easily take 2 or more days of class. First and foremost it covers two or three topics directly from the CSP framework.

1. lossless compression

The basic principle behind compression is to develop a method or protocol for using fewer bits to represent the original information. The way we represent compressed data in this lesson, with a “dictionary” of repeated patterns is similar to the LZW compression scheme, but it should be noted that LZW is slightly different from what students do in this lesson. Students invent their own way here. LZW is used not only for text (zip files), but also with the GIF image file format.

2. heuristics

The lesson touches on computationally hard problems and heuristics but please note that computationally hard problems and heuristics will be revisited later on. A general "hand-wavy" understanding is all that's needed from this lesson.

We do want students to see, however, that there is no single correct way to compress text using the method we use in this lesson because a) there is no known algorithm for finding an optimal solution, and b) we don’t even know a way to verify whether a given solution is optimal. There is no way to prove it or derive it beyond trying all possibilities by brute force. This is an example of an algorithm that cannot run in a “reasonable amount of time” - one of the CSP learning objectives.

3. Foreshadowing programming behaviors

Lastly, the Text Compression Activity is an important lesson to refer back to when students start programming. The activity engages students in thinking and problem solving behaviors that foreshadow skills that are particularly useful for programming later down the line. In particular, when students recognize patterns that repeat, and then represent those patterns as abstract symbols, and then further recognize patterns within those patterns, it is very similar to the kinds of abstractions we develop when writing functions and procedures when programming. Decoding the message in the warm-up activity is very similar to tracing a sequence of function calls in a program.

Agenda

Getting Started (5-7 mins)

Activity (45 mins)

Activity 2 (30 mins)

Wrap-up (20 mins)

Assessment

Extended Learning

View on Code Studio

Objectives

Students will be able to:

  • Collaborate with a peer to find a solution to a text compression problem using the Text Compression Widget (lossless compression scheme).
  • Explain why the optimal amount of compression is impossible or “hard” to identify.
  • Explain some factors that make compression challenging.
  • Develop a strategy (heuristic algorithm) for compressing text.
  • Describe the purpose and rationale for lossless compression.

Preparation

  • Test out the Text Compression Widget
  • Review the teaching tips to decide which options you want to use

Links

Heads Up! Please make a copy of any documents you plan to share with students.

For the Teachers

For the Students

Vocabulary

  • Heuristic - a problem solving approach (algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible.
  • Lossless Compression - a data compression algorithm that allows the original data to be perfectly reconstructed from the compressed data.

Teaching Guide

Getting Started (5-7 mins)

Warm up: Abbr In Ur Txt Msgs (5-7 mins)

Discussion Goal

As a warm up to thinking about Text Compression, connect to ways that most people already compress text in their lives, through abbreviations and acronyms with which most people have some experience in text messages.

Motivate some ideas about why someone would want to compress text.

Prompt:

  • "When you send text messages to a friend, do you spell every word correctly?"
    • Do you use abbreviations for common words? List as many as you can.
    • Write some examples of things you might see in a text message that are not proper English.

Give students a minute to write, and to share with a neighbor?

  • "Why do you use these abbreviations? What is the benefit?"
    • Possible answers:
      • to save characters/keystrokes
      • to hide from parents/teachers
      • to be cool, clever, funny
      • to “speak in code”
      • to say the same thing in less space

What's this about? - Compression: Same Data, Fewer Bits

  • Today's class is about compression
  • When you abbreviate or use coded language to shorten the original text, you are “compressing text.” Computers do this too, in order to save time and space.
  • The art and science of compression is about figuring out how to represent the SAME DATA with FEWER BITS.
  • Why is this important? One reason is that storage space is limited and you'd always prefer to use fewer bits if you could. A much more compelling reason is that there is an upper limit to how fast bits can be transmitted over the Internet.
  • What if we need to send a large amount of text faster over the Internet, but we’ve reached the physical limit of how fast we can send bits? Our only choice is to somehow capture the same information with fewer bits; we call this compression.

Transition:

Let's look at an example of a text message that's been compressed in a clever way.

Activity (45 mins)

Decode this Mystery Text (10-15 mins)

  • Distribute or Display the Activity guide: Decode this message - Activity Guide
  • Put students into partners or work individually.
  • Task: What was the original text?
  • Give students a few minutes to decode the text. The text should be a short poem (see activity recap below)
Student Activity Guide Activity Recap
Distribute or Display Activity guide: Decode this message - Activity Guide (Display or draw yourself) Activity Recap: Activity Recap - Decode this Message - Activity Recap

Recap: How much was it compressed?

To answer, we need to compare the number of characters in the original poem to the number of characters needed to represent the compressed version.

Let's break it down.

  • Display or Demonstrate yourself ideas from: Activity Recap - Decode this Message - Activity Recap (shown in table above)

  • Important Note:

    • The compressed poem is not just this part: If you were to send this to someone over the Internet they would not be able to decode it.
    • The full compressed text includes BOTH the compressed text and the key to solve it.
    • Thus, you must account for the total number of characters in the message plus the total number of characters in the key to see how much you've compressed it over the original.

Transition

Now you're going to get to try your hand at compressing some things on your own.

Use theText Compression Widget

Content Corner

The video explains a little bit about compression in general - the difference between lossless compression and lossy compression. Todays class is about lossless compression we'll do lossy compression in a class or two after looking at image encoding.

Teaching Tip

Teacher's Choice whether to show the video to the whole class or let students watch it from within Code Studio. There are benefits and drawbacks to each.

Option to Consider: Get students into the text compression tool BEFORE showing the video. You might find students are more receptive to some of the information in the video if they have tried to use the tool first.

Communication and Collaboration: To develop communication and collaboration between students, include one of the following scenarios in class:

  • Have students who were assigned the same poem compare results, or seat them in the same area of the room.
  • Have a little friendly competition - but be careful not to let “bad” competition seep in - to see which pair can compress a poem the most. Use a poem that none of the students have compressed yet.
  • For each poem, have the group(s) who did it figure out the best in the class, and record it on the board or somewhere that people can see.
    • Have a class goal of getting the compression percentages for the four poems as high as possible.
    • The groups with the best compression percentages may be asked to share their strategy with the class.

Students may be reluctant to share if they feel they don’t have the best results, but students should see others’ work and offer advice and strategies.

Video: Text Compression with Aloe Blacc - Video

  • Video explains compression
  • Demonstrates the use of the Text Compression Tool.
  • NOTE: This video pops up automatically when students visit the text compression stage in Code Studio.
  • Divide students into groups of 2
  • Assign each pair one of the poems provided and challenge them, as a pair to compress their poem as much as possible.
  • Deliver or put simple instructions on the board so students can follow.
    • Challenge: compress your assigned poem as much as possible.
    • Compare with other groups to see if you can do better.
    • Try to develop a general strategy that will lead to a good compression.

  • After some time, have pairs that did the same poem get together to compare schemes. As a group their job is to come up with the best compression for that poem for the class.

Discuss properties and challenges with compression.

Ask groups to pause to discuss the questions at the end of the activity.

Prompts:

  • "What makes doing this compression hard?"

    • Invite responses. Some of these issues should surface: You can start in lots of different ways. Early choices affect later ones. Once you find one set of patterns, others emerge.
    • There is a tipping point: you might be making progress compressing, but at some point the scale tips and the dictionary starts to get so big that you lose the benefit of having it. But then you might start re-thinking the dictionary to tweak some bits out.
  • "Do we think that these compression amounts that we’ve found are the the best? Is there a way to know what the best compression is?"

    • We probably don’t know what’s best.
    • There are so many possibilities it’s hard to know. It turns out the only way to guarantee perfect compression is brute force. This means trying every possible set of substitutions. Even for small texts this will take far too long. The “best” is really just the best we’ve found so far.
  • "But is there a process a person can follow to find the best (or a pretty good) compression for a piece of text?"

    • Yes, but it’s imprecise -- you might leave this as a lingering question that leads to the next student task.

Activity 2 (30 mins)

Teaching Tip

You may elect to not do this heuristic activity and instead get the key take-aways (see Activity Goal below) across through discussion following the previous activity.

Develop a heuristic for doing compression

Distribute or Display: Activity Guide - Text Compression Heuristics - Activity Guide

In computer science there is a word for strategies to use when you're not sure what the exact or best solution to a problem is.

Vocabulary: heuristic a problem solving approach (typically an algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible.

Instructions:

  • Continue working on compressing your poem using the Text Compression Widget. As you do so, develop a set of rules, or a “heuristic” that generally seems to provide good results.

  • Record your heuristic as a list of steps that someone else unfamiliar with the problem could follow and still end up with decent compression.

Activity Goal

The point here is to establish:

  • There is no real way to determine for sure that you've got the best compression besides trying everything possible by brute force.
  • Heuristics are techniques for at least making progress toward a "good enough" solution.
  • Following the same heuristic might lead to different results.
  • Trade your heuristics with another group. Are they clear and specific enough that you always know what to do? If not, provide feedback to one another and improve your heuristics to provide clearer instructions.

  • Using another group’s heuristic, attempt to compress one or more of the poems in the tool. Record the amount of compression you achieve.

What's best?

Share Findings:

Have one member of each group give a summary of their heuristic and the results on each of the poems. If time is limited, these presentations can be done between groups instead in front of the entire class. The discussion questions below could also be done group to group.

Reflection Prompts (from the Activity Guide)

"Do you think it’s possible to describe (or write) a specific set of instructions that a person could follow that would always result in better text compression than your heuristic? Why or why not?"

  • Some compression programs (like zip) do a great job if the file is sufficiently large and has reasonable amounts of repetition.
  • However, it is also possible to create a “compressed file” that is larger than the original because the heuristic does work in every single case.

"Is there a way to know that a compressed piece of text is compressed the most possible? If yes, describe how you could determine it. If no, why not?"

  • Stress that there is no perfect solution.
  • The size and shape of the data will determine what the “best” answer is and we often cannot even be sure it is the best answer (only that it is better than other answers we have tried.)

Wrap-up (20 mins)

Recap Questions

"What did all groups’ processes for compression have in common?"

  • Pattern Recognition
  • Abstraction (patterns referring to other patterns)

"Will following this process always lead to the same compression? (i.e. two people following the process for the same poem, will result in the same compression?)"

  • No. It’s imprecise, but still OK. The text still gets compressed, no matter what.
  • Since there is no way to know what’s best, all we need is a process that comes up with some solution, and a way to make progress.

Terminology: Verify students know or use an *exit ticket on this vocabulary:

  • lossless compression v. lossy compression
  • heuristic

Compression in the Real World (.zip)

Teaching Tip

  • You do not have to review or demo LZW compression in depth here. It is an interesting real-world application of the activity done in class.
  • While details of LZW compression are not part of the AP course content, but the idea of lossless compression is.
  • Recommendation: demonstrate zip quickly.
  • Have a large text file at the ready, such as the plaintext version of Hamlet
  • Use the .zip utility on your computer to compress into a zip file and then compare the file size to the original. (We learned how to do this in the previous lesson).

Zip Compression

  • There is a compression algorithm called LZW compression upon which the common “zip” utility is based. Zip compression does something very similar to what you did today with the text compression widget.

  • Here is an animation of lzw in action. You can see the algorithm doesn't compress it the most, but it is following a heuristic that will lead to better and better compression over time.

  • Do you want to use zip compression for real? Most computers have it built in:

    • Windows: select a file or group of files, right-click, and choose “Send To...Compressed (zipped) Folder.”
    • Mac: select a file or group of files, ctrl+click, and choose “Compress Items.”
  • Warning: if you try this results may vary.

    • Zip works really well for text, but only on large files. If you try to compress the simple hello.txt file we used in a previous lesson, you'll see the resulting file is actually bigger.
    • Zip is meant for text. It might not work well on non-text files very well because they are already compressed or don’t have the same kinds of embedded patterns that text documents do.

Assessment

Questions:

  • If you send the compressed poem, would your friend will be able to read it? Why is the dictionary important?
    • Your friend would only be able to read it if she knew how it was encoded. The dictionary is necessary because it tells her how to decompress the information that she has.
  • Why do you want to compress anything? What’s the point?
    • It is useful for sending things faster or for smaller storage. It allows for optimization of limited resources.
  • For a piece of text, what is a “good” amount of compression? Is there a way to know when you’ve compressed it the most? Explain how you would know, or why you can’t know.

Case Study: A simple message has been compressed below:

Compressed message

  • What was the original message?
    • the_big_bug_bit_the_bull_but_the_bull_bit_the_big_bug_back
  • Approximately what was the percentage of compression? (count bytes in original vs. total bytes in compressed version)
    • approximately: 25% compression

Extended Learning

Real World: Zip Compression

  • Experiment with zip using text files with different contents. Are the results for small files as good as for large files? (On Macs, in the Finder choose “get info” for a file to see the actual number of bytes in the file, since the Finder display will show 4KB for any file that’s less than that.)
    • Warning: results may vary. Zip works really well for text, but it might not compress other files very well because they are already compressed or don’t have the same kinds of embedded patterns that text documents do.

Challenge: Research the LZW algorithm

  • .zip compression is based on the LZW Compression Scheme

  • While the idea behind the text compression tool is similar to LZW (zip) algorithm, tracing the path of compression and decompression is somewhat challenging. Learning more about LZW and what happens in the course of this algorithm would be an excellent extension project for some individuals.

  • Lesson Vocabulary & Resources
  • 1
  • (click tabs to see student view)
View on Code Studio

Student Instructions

Unit 2: Lesson 2 - Text Compression

Background

Compression is a method or protocol for using fewer bits to represent the original information. Compression can be achieved in a variety of methods including looking for patterns and substituting symbols for the larger patterns of data. Compression can be a "hard problem" for computers because it is difficult to know whether or not the compression you've found is optimal - if you keep trying would it get better? It's hard to know when to stop, and hard to verify that you've compressed it "enough". When it's impossible, or would take an unreasonable amount of time, to know an exact solution you can come with a strategy called a "heuristic" to define some rules about when the solution is good enough. See vocab below.

Vocabulary

  • Heuristic: a problem solving approach (algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible.
  • Lossless Compression: a data compression algorithm that allows the original data to be perfectly reconstructed from the compressed data.

Lesson

  • Compress a piece of text using the Text Compression Tool.
  • Develop a heuristic for text compression.
  • Describe why compression is challenging.

Resources

  • Decode this message - Activity Guide (PDF | DOCX)
  • Activity Guide - Text Compression - Activity Guide (PDF | DOCX)
  • Activity Guide - Text Compression Heuristics - Activity Guide (PDF | DOCX)
View on Code Studio

Look for patterns (repeated words or phrases) in the text. Enter the patterns you see into the dictionary on the right. As you type entries into the dictionary, the symbol for the entry is inserted into the text in place of the pattern.

  • Check Your Understanding
  • 4
  • 5
  • 6
  • 7
  • (click tabs to see student view)
View on Code Studio

Teaching Tip

Student Instructions

Below is a piece of text that has already been compressed, and shows some of the information about it. Show you know how this works by reconstructing the original text from the dictionary and compressed version.

In the text box below, enter the original text you've reconstructed from the compressed version above. Make sure you use _ (underscore) instead of spaces in your answer.

View on Code Studio

Student Instructions

Here's the same compressed text that you saw on the last level, but now we also see the size of the original, uncompressed text. On the previous level you reconstructed the text by tracing back through the dictionary. Now we're going to think about if this is a "good" compression rate.

In the text box below, answer the following two questions:

  • What is the compression rate, or the compressed text size + dictionary size compared to the original text size? (as a percentage)
  • Is this a "good" compression rate? Why or why not?
View on Code Studio

Student Instructions

Why do you want to compress anything? What's the point?

View on Code Studio

Student Instructions

Why is compression a "hard problem" for computers? Draw on your own experience compressing text with the text compression widget. Is there a way to know when you've compressed it the most? Explain why you can or can't know.

Standards Alignment

View full course alignment

CSTA K-12 Computer Science Standards (2011)

CL - Collaboration
  • CL.L2:3 - Collaborate with peers, experts and others using collaborative practices such as pair programming, working in project teams and participating in-group active learning activities.
CPP - Computing Practice & Programming
  • CPP.L2:4 - Demonstrate an understanding of algorithms and their practical application.
CT - Computational Thinking
  • CT.L2:9 - Interact with content-specific models and simulations (e.g., ecosystems, epidemics, molecular dynamics) to support learning and research.
  • CT.L3B:8 - Use models and simulations to help formulate, refine, and test scientific hypotheses.
  • CT.L3B:9 - Analyze data and identify patterns through modeling and simulation.

Computer Science Principles

2.1 - A variety of abstractions built upon binary sequences can be used to represent all digital data.
2.1.1 - Describe the variety of abstractions used to represent data. [P3]
  • 2.1.1A - Digital data is represented by abstractions at different levels.
  • 2.1.1B - At the lowest level, all digital data are represented by bits.
  • 2.1.1C - At a higher level, bits are grouped to represent abstractions, including but not limited to numbers, characters, and color.
2.2 - Multiple levels of abstraction are used to write programs or create other computational artifacts
2.2.1 - Develop an abstraction when writing a program or creating other computational artifacts. [P2]
  • 2.2.1B - An abstraction extracts common features from specific examples in order to generalize concepts.
3.1 - People use computer programs to process information to gain insight and knowledge.
3.1.1 - Use computers to process information, find patterns, and test hypotheses about digitally processed information to gain insight and knowledge. [P4]
  • 3.1.1A - Computers are used in an iterative and interactive way when processing digital information to gain insight and knowledge.
  • 3.1.1D - Insight and knowledge can be obtained from translating and transforming digitally represented information.
  • 3.1.1E - Patterns can emerge when data is transformed using computational tools.
3.1.2 - Collaborate when processing information to gain insight and knowledge. [P6]
  • 3.1.2A - Collaboration is an important part of solving data driven problems.
  • 3.1.2B - Collaboration facilitates solving computational problems by applying multiple perspectives, experiences, and skill sets.
  • 3.1.2C - Communication between participants working on data driven problems gives rise to enhanced insights and knowledge.
  • 3.1.2D - Collaboration in developing hypotheses and questions, and in testing hypotheses and answering questions, about data helps participants gain insight and knowledge.
3.1.3 - Explain the insight and knowledge gained from digitally processed data by using appropriate visualizations, notations, and precise language. [P5]
  • 3.1.3A - Visualization tools and software can communicate information about data.
  • 3.1.3E - Interactivity with data is an aspect of communicating.
3.3 - There are trade offs when representing information as digital data.
3.3.1 - Analyze how data representation, storage, security, and transmission of data involve computational manipulation of information. [P4]
  • 3.3.1A - Digital data representations involve trade offs related to storage, security, and privacy concerns.
4.2 - Algorithms can solve many but not all computational problems.
4.2.1 - Explain the difference between algorithms that run in a reasonable time and those that do not run in a reasonable time. [P1]
  • 4.2.1A - Many problems can be solved in a reasonable time.
  • 4.2.1B - Reasonable time means that as the input size grows, the number of steps the algorithm takes is proportional to the square (or cube, fourth power, fifth power, etc.) of the size of the input.
  • 4.2.1C - Some problems cannot be solved in a reasonable time, even for small input sizes.
  • 4.2.1D - Some problems can be solved but not in a reasonable time. In these cases, heuristic approaches may be helpful to find solutions in reasonable time.
4.2.2 - Explain the difference between solvable and unsolvable problems in computer science. [P1]
  • 4.2.2A - A heuristic is a technique that may allow us to find an approximate solution when typical methods fail to find an exact solution.
  • 4.2.2B - Heuristics may be helpful for finding an approximate solution more quickly when exact methods are too slow.
4.2.3 - Explain the existence of undecidable problems in computer science. [P1]
  • 4.2.3A - An undecidable problem may have instances that have an algorithmic solution, but there is no algorithmic solution that solves all instances of the problem.
  • 4.2.3B - A decidable problem is one in which an algorithm can be constructed to answer “yes” or “no” for all inputs (e.g., “is the number even?”)
  • 4.2.3C - An undecidable problem is one in which no algorithm can be constructed that always leads to a correct yes or no answer
4.2.4 - Evaluate algorithms analytically and empirically for efficiency, correctness, and clarity. [P4]
  • 4.2.4A - Determining an algorithm’s efficiency is done by reasoning formally or mathematically about the algorithm.
  • 4.2.4C - The correctness of an algorithm is determined by reasoning formally or mathematically about the algorithm, not by testing an implementation of the algorithm.
  • 4.2.4D - Different correct algorithms for the same problem can have different efficiencies.