Lesson 1: Bytes and File Sizes
In this lesson students are introduced to the standard units for measuring the sizes of digital files, from a single byte, all the way up to terabytes and beyond. Students begin the lesson by comparing the size of a plain text file containing “hello” to a Word document with the same contents. Students are introduced to the units kilobyte, megabyte, gigabyte, and terabyte, and research the sizes of files they make use of every day, using the appropriate terminology. This lesson foreshadows an investigation of compression as a means for combatting the rapid growth of digital data.
The simple purposes of this lesson are:
- Get terminology out in the open
- Become somewhat conversant with file types and sizes
- Grapple with orders-of-magnitude differences between things.
The 8-bit byte has become the de facto fundamental unit with which we measure the “size” of data on computers, and in fact, today most computers only let you save data as combinations of whole bytes; even if you only want to store 1 bit of information, you have to use a whole byte to do it. And many computer systems will require you to store even more than that. Messages sent over the Internet are also typically structured in whole bytes, with each part of a message located at a byte offset.
Paralleling the explosion of computing power and speed, the sheer size of the digital data now created and consumed every day is staggering. Units of measure (terabytes) that previously seemed unfathomably large are now making their way into personal computing. This rapid growth of digital data presents many new opportunities and also poses new challenges to engineers and programmers. The implications of so-called Big Data will not be investigated until later in the course, but it's good and interesting to be thinking about the size of things now.
Getting Started (10 mins)
Activity (30 mins)
Students will be able to:
- Use appropriate terminology when describing the size of digital files.
- Identify and compare the size of familiar digital media.
- Solve small word problems that require reasoning about file sizes.
- You should verify that you know how to look at the sizes of files on computers that your students are using (see activity).
- For the getting started activity you might want a word processing program (such as MS Word) and a plain text editor (such as Notepad or TextEdit) open and ready.
- The teaching remarks and content corners in this lesson contain lots of little bits of history that you might choose to share at various points in the lesson.
For the Teacher
- Activity Guide KEY - Bytes and File Sizes - 2018 - Answer Key
For the Students
Getting Started (10 mins)
Why is a Byte 8 bits?: The 8-bit byte was not always standard. Computers used many different "byte" sizes over the course of history, depending on hardware and how addressable memory worked. However, much of the early computing world relied on representing data and computer instructions encoded as ASCII text, where each character is typically stored in 8 bits. Thus, 8 bits was such a common chunk size for representing information that it stuck and was given its own name: byte.
There are various accounts about why it was called a “byte”, but most point to the early days at IBM, where “bite” was used to refer to groups of 8 bits that a computer was processing, as in it could “bite” off 8 bits at a time. The spelling was changed to “byte” to avoid confusion with “bit”.
Bytes became the fundamental unit with which we measure the “size” of data on computers, and in fact, today most computers only let you save data as combinations of whole bytes; even if you only want to store 1 bit of information, you have to use a whole byte to do it.
As we start a new unit about Data and Digital Information we need to get familiar with terminology about data and different types of data files.
Vocabulary: Recall that a single character of ASCII text requires 8 bits. The technical term for 8 bits of data is a byte.
A byte is the standard fundamental unit (or “chunk size”) underlying most computing systems today. You may have heard "kilobyte", "megabyte", "gigabyte", etc., which are all different amounts of bytes. We're going to learn more about them today.
File Size Comparison: .txt vs .doc
Prompt: In addition to the actual text of a document, it is usually necessary to store the formatting information that allows the text to be displayed correctly. We might wonder just how much extra information, i.e. how many extra bytes, we need to store when we include all of this formatting. Let's find out!
If a single ASCII character is one byte, then storing the word “hello” in a plain ASCII text file should require 5 bytes (or 40 bits) of memory.
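If you want to confirm the 5-byte prediction before the demo, a minimal Python sketch like the one below counts the bytes directly. It assumes one byte per ASCII character, as stated above; the variable names are illustrative only.

```python
# Encode the word as ASCII and count the bytes and bits it occupies.
text = "hello"
encoded = text.encode("ascii")  # one byte per ASCII character

num_bytes = len(encoded)
num_bits = num_bytes * 8        # 8 bits per byte

print(num_bytes, "bytes")  # 5 bytes
print(num_bits, "bits")    # 40 bits
```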
What about a Microsoft Word document that contains the single word "hello"? How many more bytes will a Word document require to store the word “hello” than a plain text document?
Discuss: Have students silently make their prediction, then share with a partner, then share with the group. Prompt a couple of students to share why they chose the size they did.
Try a Live Demo: If you wish, it might be more fun to create these files in front of your students, saving them on the desktop for a quick demo. To make a plain ASCII text file you’ll need to use the correct program:
- PC/Windows: use Notepad
- Mac: use TextEdit (Note: TextEdit needs to be switched into plain text mode from rich text. Go to Format → Make Plain Text)
Demonstrate: Do a live demo where you show the size of the different files. Here are some files you can download to use.
NOTE: A 5-byte file is so small that some computers won't allocate a chunk of disk space that small. For example, you might see the file reported as 5 bytes in size but 4 KB on disk, which indicates that even though the file is 5 bytes, it's taking up 4 kilobytes of storage on your computer.
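The rounding behind that note can be sketched in a few lines of Python. This is only an illustration: the 4096-byte block size is an assumption chosen to match the 4 KB figure in the note, and real file systems vary in how they allocate space.

```python
import math

BLOCK_SIZE = 4096  # assumed allocation unit in bytes (4 KB); varies by file system


def size_on_disk(file_size, block_size=BLOCK_SIZE):
    """Round a file size up to a whole number of allocation blocks."""
    return math.ceil(file_size / block_size) * block_size


print(size_on_disk(5))     # 4096: a 5-byte file still occupies one full block
print(size_on_disk(4097))  # 8192: one byte over a block needs a second block
```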
To find the actual size of a file on your computer, do one of the following:
- PC/Windows: Right-click and choose “Properties”
- Mac: Ctrl+click and choose “Get Info”
In general, the Word Doc should be thousands of times larger than the plain text. For the files above:
- hello.txt - 5 bytes
- hello.docx - 21,969 bytes
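To put a number on "thousands of times larger", the two sizes listed above can be compared directly. The short sketch below uses only those two figures; your own demo files will likely give somewhat different numbers.

```python
txt_size = 5        # bytes, hello.txt
docx_size = 21_969  # bytes, hello.docx

extra_bytes = docx_size - txt_size  # bytes that are not the word itself
ratio = docx_size / txt_size        # how many times larger the .docx is

print(f"Extra bytes: {extra_bytes:,}")  # Extra bytes: 21,964
print(f"About {ratio:.0f}x larger")     # About 4394x larger
```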
Review: Review students' predictions to see how close they were.
The big difference in file size between .txt and .docx is due to the extensive formatting information included along with the actual text in .docx. Modern data files typically measure in the thousands, millions, billions or trillions of bytes. Let's get a little practice looking at files and how big they are.
Activity (30 mins)
There are some discrepancies in common usage of the kilo, mega, giga prefixes.
From the Stanford CS 101 website:
It's convenient within the computer to organize things in groups of powers of 2. For example, 2^10 is 1024, and so a program might group 1024 items together, as a sort of "round" number of things within the computer. The term "kilobyte" above refers to this group size of 1024 things. However, people also group things by thousands -- 1 thousand or 1 million items.
There's this problem with the word "megabyte" .. does it mean 1024 * 1024 bytes, i.e. 2^20 which is 1,048,576, or does it mean exactly 1 million, 1000 * 1000. It's just a 5% difference, but marketers tend to prefer the 1 million, interpretation, since it makes their hard drives etc. appear to hold a little bit more. In an attempt to fix this, the terms "kibibyte" "mebibyte" "gibibyte" "tebibyte" have been introduced to specifically mean the 1024 based units (see wikipedia kibibyte article). These terms do not seem to have caught on very strongly thus far.
If nothing else, remember that terms like "megabyte" have this little wiggle room in them between the 1024 and 1000 based meanings. For purposes of CS Principles the distinction is not important - "about a million bytes" is a fine, close-enough interpretation for "megabyte".
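If students ask how big the "wiggle room" actually is, the following Python sketch compares the 1000-based and 1024-based meanings of each prefix. The function names are made up for this illustration.

```python
PREFIXES = ["kilo", "mega", "giga", "tera"]


def unit_sizes(power):
    """Return (1000-based, 1024-based) byte counts for the given prefix power."""
    return 1000 ** power, 1024 ** power


def percent_gap(power):
    """How much larger the 1024-based unit is, as a percentage of the 1000-based one."""
    decimal, binary = unit_sizes(power)
    return (binary - decimal) / decimal * 100


for power, prefix in enumerate(PREFIXES, start=1):
    decimal, binary = unit_sizes(power)
    print(f"{prefix}byte: {decimal:,} vs {binary:,} bytes "
          f"({percent_gap(power):.1f}% gap)")
```

For "megabyte" this gives 1,000,000 vs 1,048,576 bytes, the roughly 5% difference mentioned in the quote, and the gap widens with each larger prefix.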
Finding Solutions: Note that answers to 3 of the 6 questions on the activity guide can be found on the Stanford CS 101 page linked to in the activity guide.
Perfect accuracy is not important for some sections in this activity, but using the correct terminology and achieving a rough estimate of size (one million bytes vs. one billion) is important. Encourage students to practice using terms like megabyte, gigabyte, and terabyte to gain comfort with them.
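As one way to practice the terminology, a small helper can turn a raw byte count into the unit a person would normally say aloud. This is a sketch, not part of the activity guide: `describe_size` is a hypothetical name, and it uses the "about a thousand / million / billion" (1000-based) interpretation.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def describe_size(num_bytes):
    """Express a byte count in the largest 1000-based unit that keeps the value >= 1."""
    value = float(num_bytes)
    unit_index = 0
    while value >= 1000 and unit_index < len(UNITS) - 1:
        value /= 1000
        unit_index += 1
    return f"{value:.1f} {UNITS[unit_index]}"


print(describe_size(21_969))         # 22.0 KB
print(describe_size(3_500_000))      # 3.5 MB
print(describe_size(8_000_000_000))  # 8.0 GB
```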
Activity Guide: Bytes and File Sizes - Activity Guide
Group: Put students in pairs to find answers or work individually.
Distribute: Activity Guide: Bytes and File Sizes - Activity Guide
- Introduces the terminology
- Refers to websites for students to use as reference
- Has questions and space for students to write answers to questions like:
- How many bytes are in a Megabyte?
- Give an example of a file type that is measured in Gigabytes
- What is the typical size of a .jpg image, .mp3 audio, etc.?
Allow students time to finish this activity either individually or in pairs by conducting online research.
- There are 6 practice questions on the 2nd page of the activity guide.
Share: Provide students an opportunity to clear up any remaining confusion and share interesting pieces of information they came across.
Review: Answers to the questions on the Activity Guide.
Time Saving Tip: Time permitting, you could do the warm-up activity from the next lesson (Text Compression) here. That warm-up activity asks students to write down common abbreviations they use when sending text messages to friends and family, and then asks why they do that. The answer is compression: to save time and space.
As you have seen, data files can grow very quickly in size. In the modern world there is a lot of data around us, and usually we want it transmitted over the Internet.
There is a problem though: If you want to transmit a lot of data, you are limited by the speed of your Internet connection. Even if you have a fast Internet connection, there is a physical limit to how fast you can transmit bits.
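To make the limit concrete, transmission time can be estimated as the number of bits divided by the connection speed. The sketch below is a simplification under stated assumptions: the 100 megabits-per-second speed is an arbitrary example, and real connections add overhead that this ignores.

```python
def transfer_seconds(size_bytes, megabits_per_second):
    """Estimate the seconds needed to send a file, ignoring network overhead."""
    bits = size_bytes * 8  # connection speeds are quoted in bits, not bytes
    return bits / (megabits_per_second * 1_000_000)


SPEED_MBPS = 100  # assumed connection speed, for illustration only

print(transfer_seconds(1_000_000, SPEED_MBPS))          # 1 MB -> 0.08 seconds
print(transfer_seconds(1_000_000_000, SPEED_MBPS))      # 1 GB -> 80.0 seconds
print(transfer_seconds(1_000_000_000_000, SPEED_MBPS))  # 1 TB -> 80000.0 seconds
```

At this assumed speed a terabyte takes about 22 hours, which shows how quickly larger files become impractical to send.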
What if the data you want to send is big enough that it takes an unreasonable amount of time to transmit it, even with a really fast Internet connection? Assuming you can't make the Internet connection any faster, could you still transmit the data faster somehow?
The answer is yes, and it's probably something you've done, or do, every day!
Use the last 3 questions on the activity guide for assessment.
Respond to this prompt or to another as directed by your teacher.
The salesperson in a cell phone store is telling me that the phone I'm considering has 8GB of memory, which means I can save 10,000 photos taken with the phone's camera!
Is the salesperson telling me the truth? Why or why not?
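For teachers checking responses, the arithmetic behind this prompt can be sketched as below. It only works out how many bytes each photo could use. Whether the claim holds depends on the typical photo size students found in their research, and the sketch assumes all 8 GB (1000-based) is available for photos, which is unlikely on a real phone.

```python
storage_bytes = 8 * 1000 ** 3  # 8 GB, using the 1000-based meaning
photo_count = 10_000

bytes_per_photo = storage_bytes // photo_count

print(f"{bytes_per_photo:,} bytes per photo")           # 800,000 bytes per photo
print(f"{bytes_per_photo / 1_000_000} MB per photo")    # 0.8 MB per photo
```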
Respond to this prompt or to another as directed by your teacher.
Shakespeare’s complete works have approximately 3.5 million characters. Which is bigger in file size: Shakespeare’s complete works stored in plain ASCII text or a 4 minute song on mp3? How much bigger?
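One way to check answers to this prompt is sketched below. The Shakespeare figure follows from one byte per ASCII character. The song size is an assumption: it uses a bitrate of 128 kilobits per second, which is a common mp3 setting but not given in the prompt, so students who find a different typical mp3 size will get a different comparison.

```python
# Plain ASCII text: one byte per character.
shakespeare_bytes = 3_500_000

# mp3 size estimated from an ASSUMED bitrate; not specified in the prompt.
BITRATE_KBPS = 128           # assumed kilobits per second
song_seconds = 4 * 60        # a 4 minute song
song_bytes = BITRATE_KBPS * 1000 * song_seconds // 8  # bits -> bytes

print(f"Shakespeare: {shakespeare_bytes:,} bytes")  # 3,500,000 bytes
print(f"Song:        {song_bytes:,} bytes")         # 3,840,000 bytes
print(f"Song is {song_bytes / shakespeare_bytes:.2f}x the size of the text")
```

Under this assumption the two are close in size, both around 3.5 to 4 MB, with the song slightly larger.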
CSTA K-12 Computer Science Standards (2011)
CT - Computational Thinking
- CT.L2:14 - Examine connections between elements of mathematics and computer science including binary numbers, logic, sets and functions.
- CT.L3A:6 - Analyze the representation and trade-offs among various forms of digital information.
- CT.L3A:7 - Describe how various types of data are stored in a computer system.
Computer Science Principles
2.1 - A variety of abstractions built upon binary sequences can be used to represent all digital data.
2.1.1 - Describe the variety of abstractions used to represent data. [P3]
- 2.1.1B - At the lowest level, all digital data are represented by bits.
- 2.1.1C - At a higher level, bits are grouped to represent abstractions, including but not limited to numbers, characters, and color.
2.1.2 - Explain how binary sequences are used to represent digital data. [P5]
- 2.1.2B - In many programming languages, the fixed number of bits used to represent characters or integers limits the range of integer values and mathematical operations; this limitation can result in overflow or other errors.
- 2.1.2C - In many programming languages, the fixed number of bits used to represent real numbers (as floating point numbers) limits the range of floating point values and mathematical operations; this limitation can result in round
- 2.1.2E - A sequence of bits may represent instructions or data.
- 2.1.2F - A sequence of bits may represent different types of data in different contexts.
3.3 - There are trade offs when representing information as digital data.
3.3.1 - Analyze how data representation, storage, security, and transmission of data involve computational manipulation of information. [P4]
- 3.3.1G - Data is stored in many formats depending on its characteristics (e.g., size and intended use)
CSTA K-12 Computer Science Standards (2017)
DA - Data & Analysis
- 3A-DA-10 - Evaluate the tradeoffs in how data elements are organized and where data is stored.