Lesson 2: Text Compression
At some point we reach a physical limit of how fast we can send bits and if we want to send a large amount of information faster, we have to find a way to represent the same information with fewer bits - we must compress the data. In this lesson, students will use the Text Compression Widget to compress segments of English text by looking for patterns and substituting symbols for larger patterns of text.
The basic principle behind compression is to develop a method or protocol for using fewer bits to represent the original information. The way we represent compressed data in this lesson, with a “dictionary” of repeated patterns is similar to the LZW compression scheme, but it should be noted that LZW is slightly different from what students do in this lesson. Students invent their own way here. LZW is used not only for text (zip files), but also with the GIF image file format.
Getting Started (5-7 mins)
Activity (45 mins)
Wrap-up (10 mins)
Students will be able to:
- Collaborate with a peer to find a solution to a text compression problem using the Text Compression Widget (lossless compression scheme).
- Explain why the optimal amount of compression is impossible or “hard” to identify.
- Explain some factors that make compression challenging.
- Describe the purpose and rationale for lossless compression.
- Test out the Text Compression Widget
- Review the teaching tips to decide which options you want to use
For the Teacher
- Activity Recap - Decode this Message - Activity Recap
For the Students
- Decode this message - Activity Guide
- Activity Guide - Text Compression - Activity Guide
- Video: Text Compression with Aloe Blacc - Video (download)
- Text Compression Widget on Code Studio - Widget
- Lossless Compression - a data compression algorithm that allows the original data to be perfectly reconstructed from the compressed data.
Getting Started (5-7 mins)
Warm up: Abbr In Ur Txt Msgs (5-7 mins)
As a warm up to thinking about Text Compression, connect to ways that most people already compress text in their lives, through abbreviations and acronyms with which most people have some experience in text messages.
Motivate some ideas about why someone would want to compress text.
- "When you send text messages to a friend, do you spell every word correctly?"
- Do you use abbreviations for common words? List as many as you can.
- Write some examples of things you might see in a text message that are not proper English.
Give students a minute to write, and to share with a neighbor?
- "Why do you use these abbreviations? What is the benefit?"
- Possible answers:
- to save characters/keystrokes
- to hide from parents/teachers
- to be cool, clever, funny
- to “speak in code”
- to say the same thing in less space
- Possible answers:
What's this about? - Compression: Same Data, Fewer Bits
- Today's class is about compression
- When you abbreviate or use coded language to shorten the original text, you are “compressing text.” Computers do this too, in order to save time and space.
- The art and science of compression is about figuring out how to represent the SAME DATA with FEWER BITS.
- Why is this important? One reason is that storage space is limited and you'd always prefer to use fewer bits if you could. A much more compelling reason is that there is an upper limit to how fast bits can be transmitted over the Internet.
- What if we need to send a large amount of text faster over the Internet, but we’ve reached the physical limit of how fast we can send bits? Our only choice is to somehow capture the same information with fewer bits; we call this compression.
Let's look at an example of a text message that's been compressed in a clever way.
Activity (45 mins)
Decode this Mystery Text (10-15 mins)
- Distribute or Display the Activity guide: Decode this message - Activity Guide
- Put students into partners or work individually.
- Task: What was the original text?
- Give students a few minutes to decode the text. The text should be a short poem (see activity recap below)
|Student Activity Guide||Activity Recap|
|Distribute or Display Activity guide: Decode this message - Activity Guide||(Display or draw yourself) Activity Recap: Activity Recap - Decode this Message - Activity Recap|
Recap: How much was it compressed?
To answer, we need to compare the number of characters in the original poem to the number of characters needed to represent the compressed version.
Let's break it down.
Display or Demonstrate yourself ideas from: Activity Recap - Decode this Message - Activity Recap (shown in table above)
- The compressed poem is not just this part: If you were to send this to someone over the Internet they would not be able to decode it.
- The full compressed text includes BOTH the compressed text and the key to solve it.
- Thus, you must account for the total number of characters in the message plus the total number of characters in the key to see how much you've compressed it over the original.
Now you're going to get to try your hand at compressing some things on your own.
Use the Text Compression Widget
The video explains a little bit about compression in general - the difference between lossless compression and lossy compression. Todays class is about lossless compression we'll do lossy compression in a class or two after looking at image encoding.
Teacher's Choice whether to show the video to the whole class or let students watch it from within Code Studio. There are benefits and drawbacks to each.
Option to Consider: Get students into the text compression tool BEFORE showing the video. You might find students are more receptive to some of the information in the video if they have tried to use the tool first.
Communication and Collaboration: To develop communication and collaboration between students, include one of the following scenarios in class:
- Have students who were assigned the same poem compare results, or seat them in the same area of the room.
- Have a little friendly competition - but be careful not to let “bad” competition seep in - to see which pair can compress a poem the most. Use a poem that none of the students have compressed yet.
- For each poem, have the group(s) who did it figure out the best in the class, and record it on the board or somewhere that people can see.
- Have a class goal of getting the compression percentages for the four poems as high as possible.
- The groups with the best compression percentages may be asked to share their strategy with the class.
Students may be reluctant to share if they feel they don’t have the best results, but students should see others’ work and offer advice and strategies.
- Video explains compression
- Demonstrates the use of the Text Compression Tool.
- NOTE: This video pops up automatically when students visit the text compression stage in Code Studio.
- Divide students into groups of 2
- Assign each pair one of the poems provided and challenge them, as a pair to compress their poem as much as possible.
- Deliver or put simple instructions on the board so students can follow.
- Challenge: compress your assigned poem as much as possible.
- Compare with other groups to see if you can do better.
- Try to develop a general strategy that will lead to a good compression.
- After some time, have pairs that did the same poem get together to compare schemes. As a group their job is to come up with the best compression for that poem for the class.
- Optionally: you may hand out Text Compression (optional) - Activity Guide, which also explains the instructions and gives students tasks. It may work well as an out-of-class activity or assessment.
Wrap-up (10 mins)
Discuss properties and challenges with compression
Ask groups to pause to discuss the questions at the end of the activity.
"What makes doing this compression hard?"
- Invite responses. Some of these issues should surface: You can start in lots of different ways. Early choices affect later ones. Once you find one set of patterns, others emerge.
- There is a tipping point: you might be making progress compressing, but at some point the scale tips and the dictionary starts to get so big that you lose the benefit of having it. But then you might start re-thinking the dictionary to tweak some bits out.
"Do we think that these compression amounts that we’ve found are the the best? Is there a way to know what the best compression is?"
- We probably don’t know what’s best.
- There are so many possibilities it’s hard to know. It turns out the only way to guarantee perfect compression is brute force. This means trying every possible set of substitutions. Even for small texts this will take far too long. The “best” is really just the best we’ve found so far.
"But is there a process a person can follow to find the best (or a pretty good) compression for a piece of text?"
- Yes, but it’s imprecise -- you might leave this as a lingering question.
Real World: Zip Compression
- Experiment with zip using text files with different contents. Are the results for small files as good as for large files? (On Macs, in the Finder choose “get info” for a file to see the actual number of bytes in the file, since the Finder display will show 4KB for any file that’s less than that.)
- Warning: results may vary. Zip works really well for text, but it might not compress other files very well because they are already compressed or don’t have the same kinds of embedded patterns that text documents do.
Challenge: Research the LZW algorithm
.zip compression is based on the LZW Compression Scheme
While the idea behind the text compression tool is similar to LZW (zip) algorithm, tracing the path of compression and decompression is somewhat challenging. Learning more about LZW and what happens in the course of this algorithm would be an excellent extension project for some individuals.
CSTA K-12 Computer Science Standards (2011)
CL - Collaboration
- CL.L2:3 - Collaborate with peers, experts and others using collaborative practices such as pair programming, working in project teams and participating in-group active learning activities.
CPP - Computing Practice & Programming
- CPP.L2:4 - Demonstrate an understanding of algorithms and their practical application.
CT - Computational Thinking
- CT.L2:9 - Interact with content-specific models and simulations (e.g., ecosystems, epidemics, molecular dynamics) to support learning and research.
- CT.L3B:8 - Use models and simulations to help formulate, refine, and test scientific hypotheses.
- CT.L3B:9 - Analyze data and identify patterns through modeling and simulation.
Computer Science Principles
2.1 - A variety of abstractions built upon binary sequences can be used to represent all digital data.
2.1.1 - Describe the variety of abstractions used to represent data. [P3]
- 2.1.1A - Digital data is represented by abstractions at different levels.
- 2.1.1B - At the lowest level, all digital data are represented by bits.
- 2.1.1C - At a higher level, bits are grouped to represent abstractions, including but not limited to numbers, characters, and color.
2.2 - Multiple levels of abstraction are used to write programs or create other computational artifacts
2.2.1 - Develop an abstraction when writing a program or creating other computational artifacts. [P2]
- 2.2.1B - An abstraction extracts common features from specific examples in order to generalize concepts.
3.1 - People use computer programs to process information to gain insight and knowledge.
3.1.1 - Use computers to process information, find patterns, and test hypotheses about digitally processed information to gain insight and knowledge. [P4]
- 3.1.1A - Computers are used in an iterative and interactive way when processing digital information to gain insight and knowledge.
- 3.1.1D - Insight and knowledge can be obtained from translating and transforming digitally represented information.
- 3.1.1E - Patterns can emerge when data is transformed using computational tools.
3.1.2 - Collaborate when processing information to gain insight and knowledge. [P6]
- 3.1.2A - Collaboration is an important part of solving data driven problems.
- 3.1.2B - Collaboration facilitates solving computational problems by applying multiple perspectives, experiences, and skill sets.
- 3.1.2C - Communication between participants working on data driven problems gives rise to enhanced insights and knowledge.
- 3.1.2D - Collaboration in developing hypotheses and questions, and in testing hypotheses and answering questions, about data helps participants gain insight and knowledge.
3.1.3 - Explain the insight and knowledge gained from digitally processed data by using appropriate visualizations, notations, and precise language. [P5]
- 3.1.3A - Visualization tools and software can communicate information about data.
- 3.1.3E - Interactivity with data is an aspect of communicating.
3.3 - There are trade offs when representing information as digital data.
3.3.1 - Analyze how data representation, storage, security, and transmission of data involve computational manipulation of information. [P4]
- 3.3.1A - Digital data representations involve trade offs related to storage, security, and privacy concerns.