Mobile MDR Has Arrived – Safeguard Your Execs from Zero-Day Threats Today.

Log in

Get Demo

Home

Reader’s Question: How can a text file be a number?

Reader’s Question: How can a text file be a number?

Security Cybersecurity

Reader’s Question: How can a text file be a number?

Paul Ducklin

03/21/2024

Share this article:

Fun with source code and prime numbers.

Duck answers a reader’s question with some lightly technical fun for the weekend, along with some intriguing but gentle mathematics.

Learning from history

Earlier this week, we published Part 1 of an article about a practice known in the jargon as security through obscurity, which is pretty much exactly what it says: setting up your IT infrastructure in an undocumented or unusual way, keeping the details of how you did it a secret, and hoping that no one figures it out.

As our case study in Part 1, we recounted a fascinating tale from the turn of the millennium that surrounded a DVD ‘security’ system called CSS, short for content scramble system.

CSS was meant to control piracy and to ‘protect’ rights holders and legitimate users alike, but it also prevented you from backing up your own legally-purchased DVDs, from watching them on open-source operating systems, from using them in other countries, and more.

When the algorithm was extracted from a poorly-protected DVD player and published online in the form of working C code, the consortium behind CSS took the unpopular step of trying to use the legal system to force the genie back into the bottle.

It seems they thought that making it illegal to publish, or even to possess, the now-public source code would therefore restore its obscurity, and magically re-establish its security at the same time.

(Public scrutiny of the algorithm, something that is actively encouraged in the cryptographic community, had already revealed it to be 16 million times weaker than intended.)

One privacy advocate and cryptography enthusiast set out to demonstrate the pointlessness of this chase-it-through-the-courts approach to cybersecurity as follows:

Convert the source code to an number. A truly enormous number, to be sure, but the number of numbers is infinite.
Find a nearby number that happened to be prime. Prime numbers are a vital part of both mathematics and cybersecurity, and are therefore important in their own right.
Publish it on a website dedicated to interesting primes. The size of the number involved was such that formally proving it to be prime was, at the time, a noteworthy achievement in mathematical computation, making its publication both useful and informative.

Prime numbers, to be clear, are numbers that can only be divided by themselves and 1, such as 2=1×2, 3=1×3, 5=1×5, 7=1×7 and 11=1×11, but not 4=2×2, 6=2×3, 8=2x2x2, 9=3×3 or 10=2×5.

The leaked CSS source code could be directly, and fairly easily, recovered from this prime number, so publishing it was also a provocative way to ask, “How can a number be illegal?”

Cybersecurity history is full of wackily ironic questions of this sort, and we can learn enormously from being aware of them, while being entertained at the same time.

How does it work?

Well, a reader emailed us (use amos@solcyber.com if you want to get in touch; we won’t reveal your name or alias unless you want us to) to ask the following:

I enjoyed Paul Ducklin’s latest article about obscurity in security. I’m looking forward to Part 2 next week.

Before the next article comes out, could you ask him to explain with some examples how source code can be turned into a number, let alone a prime number?

Yours sincerely,

[NAME REDACTED]

PS. I like your Amos Avatar avatar.

Here’s how it works.

Computer text is usually stored in files as a string of bytes, which we’ll assume are 8-bit memory storage locations.

(Strictly speaking, bytes can be any length the computer hardware designer likes, but have been as good as officially 8 bits each for decades. Nevertheless, internet standards documents use the word octet to denote an 8-bit chunk, thus ensuring clarity and sidestepping pedantic complaints.)

Numerous different encodings, or tables, exist for deciding which numbers represent which characters, and how they’re converted into strings of bytes; as you can imagine, the Chinese, Korean and Japanese writing systems need far bigger code tables than the basic Roman alphabet.

The best-known format is ASCII, the American Standard Code for Information Interchange, devised in the 1960s to provide just enough range to allow A to Z, a to z, 0 to 9, some common punctuation marks, and various so-called control codes such as carriage return, backspace, line feed and bell, which on early teletypes was ‘output’ by literally dinging a bell inside the printer.

Here’s the ASCII table with its numeric codes in decimal (base 10, 0-127) and hexadecimal (base 16, 0x00-0xFF):

As you can see, ASCII was constructed to use just 7 bits of each byte (some devices ignored the eighth bit, or used it for other purposes such as error detection to help with electrical decoding), which is fine for C source code, but not for binary data that isn’t basic ASCII text.

In real life, a non-text file can have any number from 0-255 in each byte of the file, like this file of pseudorandom bytes churned out by the Linux kernel’s random number generator getrandom().

I’ve shown the data here in hexadecimal (base 16), widely used by programmers because it conveniently packs every byte into exactly two hexadecimal digits, making data dumps easier to format and align:

For clarity, here are the same 128 bytes (8 rows of 16 bytes each, as above) in base 10, where by convention we don’t write zeros at the left-hand end of a number, in the same way that although we write 2024 for the year, we don’t usually talk about the 0012 days of Christmas or the 0007 colours of the rainbow:

But how to turn this sequence of numbers, each of which ranges from 0-255, into a single massive number?

The answer is surprisingly easy: we use the same trick that turns the year 2024 into 2×1000 + 0x100 + 2×10 + 4×1.

But instead of using the multipliers 1, 10, 100 and 1000, each of which is 10 times bigger that the last because decimal numbers count in base 10…

…we treat the individual numbers in the table above as ‘digits’ in base 256, because 256 is the maximum number of different values you can pack into a single 8-bit byte.

The number 256 and the range 0-255 come about because each bit, short for binary digit, gives you two options (0 or 1), and with 8 bits in a byte, you therefore get 2x2x2x2x2x2x2x2 = 2⁸ = 256 choices.

This works in the same way that three decimal digits each of 0-9 give you 10x10x10 = 1000 choices and let you denote one thousand different numbers from 0 to 999 inclusive. (Don’t forget to include the zero! Many software bugs over the years have come from this mistake.)

So, if we start from the bottom right-hand end of the table above, because we usually write numbers with the least significant digits at the right, just as we put cents to the right of the dollars, we can build up an ever-larger number like this: 71×1 + 221x(256) + 240x(256×256) + 90x(256x256x256)… and so on.

You’ve probably spotted a problem for regular programming languages such as C, which I have demonstrated here by putting the numbers into a table called array[] and multiplying out all the base-256 ‘digits’ from right to left.

I’ve used the language Lua because it’s fun, compact, powerful and simple, and quite a few cybersecurity tools, including Wireshark and Nmap, support it as a built-in scripting language:

The number huge has gone haywire!

The sum of a sequence of positive integers all multiplied by positive integers can’t end up negative, so something didn’t work out.

The problem is that many programming languages limit the maximum number of bits they use for numeric values in order to control memory use.

In Lua, conventional integers are signed 64-bit numbers, so you get what’s called an integer overflow (a bit like the Millennium bug, where 1999+1 wrapped back to 1900 instead of advancing to 2000), and every calculation from then on is incorrect.

Lua’s smallest and biggest integers are as follows…

…which just happen to be -2⁶³ and 2⁶³-1 respectively, the range available to signed 64-bit integers. (If you’re wondering why the positive side of the range is one shorter than the negative side, remember what we said above: don’t forget about zero!)

We need software that just keeps on allocating more and more memory as our huge number gets bigger, such as the bignum library in OpenSSL, used for handling the giant numbers needed in cryptography:

That’s more like it: a 308-digit number that is a decimal version of the hex dump above.

Now let’s turn to the CSS source code, which looks something like this, though I have cut out most of the file to save space, showing just part of the data at the start and some of the cryptographic code in the middle and at the end:

We could convert that file into a number easily enough, but the code size, including the needed data tables and the C functions that manipulate them, is about 10KB, which is more than 25,000 digits in decimal.

(Data chunks turned into decimal digits are about 2.4 times longer than the raw ‘base 256’ file they represent. That number happens to be 8/log₂10, though I shan’t explain why here. That’s a story for another day.)

I decided to do what Phil Carmody, the creator of the original ‘illegal prime’, did, and gzip the source code first, compressing it down to just 2617 bytes, or 6302 decimal digits.

Unfortunately, there’s almost no chance that a gzip file converted to a number will end up prime; indeed, any input file smaller than 16MB will end up, after gzipping, with a zero byte at the end, which leaves you with an even number that is therefore divisible by 2 and not prime.

(The original file size is tacked onto the end of a gzip file as a 32-bit integer written ‘backwords’, in other words with the least significant byte at the left, not at the right, and integers less than 16,777,216 (2²⁴) fit into just 3 bytes, because 3×8=24, leaving their last byte in 32-bit format empty, or zero.)

Fortunately, however, the gzip format includes internal information to denote when the decompressor has reached the end of its input data, so you can happily tack bytes onto the end of a gzip file and they’ll just be ignored.

I therefore padded my file out with zeros to a multiple of 16 bytes long, or 6319 decimal digits, giving me space at the end to fiddle with the contents without affecting decompression.

I then steadily increased the value of my huge file-in-the-form-of-a-number until I hit upon a value that was prime.

In Phil Carmody’s original work, he cheated slightly by leaving the data tables out of the code he used, simply because proving the primality of a number with more than 6000 digits would have taken too long back in 2001, when he did the work.

Nearly 25 years later, working with the raw source code as 25,000 decimal digits would still be too much for my laptop, no matter that it is about as powerful as you can get in early 2024, but working with 6000 digits is well within reach.

I cheated by not doing a formal proof at the end, as Carmody did to qualify for the ‘primes of interest’ site; the time taken to run the formal prime proving system he used goes up proportionally to the sixth power of the length of the number, so you can see why he squeezed the data down as much as he reasonably could!

With a few zeros added to give me prime number wiggling room, as described above, my starting number came out as:

Now I needed to find a nearby but slightly larger prime number.

OpenSSL’s bignum programming library is intended for use in cryptography, where prime numbers are important, so it comes with a built-in isPrime() function that uses various probabilistic tests.

These don’t formally prove whether a number is prime, but will demonstrate beyond the limits of statistical doubt that you can consider the number prime for cryptographic purposes.

The simplest way to do this is as follows, assuming that the variable b in the loop below is the 6319-digit even number above:

When doing the calculations myself, I used slightly different code, splitting the problem into eight separate parts, based on the shortcut that in any consecutive group of 30 numbers, any numbers that do not have the remainder 1, 7, 11, 13, 17, 19, 23 or 29 when divided by 30 cannot be prime.

This is a consequence, if you’re wondering, that 30 = 2x3x5, the product of the first three primes, and any multiple of 2, 3 or 5 clearly can’t itself be prime.

You can show with simple arithmetic that for any number n from 1 up:

30n + 0 (remainder 0) is a multiple of 2, 3 and 5 
30n + 2 = 2 x (15n+1), a multiple of 2
30n + 3 = 3 x (10n+1), a multiple of 3
30n + 4 = 2 x (15n+2), a multiple of 2
30n + 5 = 5 x ( 6n+1), a multiple of 5
30n + 6 = 6 x ( 5n+1), a multiple of 2 and 3
30n + 8 = 2 x (15n+4), a multiple of 2

…and carrying on in a similar fashion to eliminate any remainder that isn’t in the list (1, 7, 11, 13, 17, 19, 23, 29) above.

This immediately rules out 22 out of every 30 numbers, instead of ruling out just 15 out of every 30 by skipping only the even numbers, as in our simple code above; it also makes it really easy to split the problem into 8 parallel tasks, each running on its own processor core, to speed up the effort overall.

In a few minutes, I hit this prime, this time in hexadecimal (base 16) instead of base 10:

In hexadecimal, we end up with a 5248-digit number (two digits per byte) starting with 1F8B08.., which is the byte sequence used as a file header in compressed gzip files. (The 1F8B denotes gzip, and 08 denotes the DEFLATE compression method, which is the only one you are ever likely to encounter.)

Looks like we’re now on the right track!

The trailing bytes are ..004DA3, and 0x4DA3 in hexadecimal is 19,875, which is the number we ultimately needed to add to the starting value above to find the nearest prime.

We now have an essentially unremarkable prime number that is difficult to object to legally, given that it is just a number, which we can write out and publish in any base we like.

But it just so happens that we can trivially extract and recover the CSS source code from it, even though some people thought that should be illegal back at the turn of the millennium.

For a final bit of fun, we converted the number to binary (base 2) form, making the 0 bits dark blue and the 1 bits yellow, and thereby created this graphical ‘artwork’ that just happens also to contain the CSS code:

Less amusingly, that image could have left you with a bunch of explaining to do if you’d had a copy of it back in 2001.

In the end, wiser minds seem to have prevailed, with the criminal charges against the Norwegian teenager who had written some of the code (the other contributors managed to remain anonymous) eventually abandoned and the leaked CSS code accepted as providing a legitimate outcome for cybersecurity research.

And all of that serves to reiterate the maxim that those who cannot remember history are condemned to repeat it, which is an especially troublesome problem in cybersecurity.

Don’t be that person, and don’t be afraid to ask for help from a human-centric managed security service provider to ensure that you keep on top of cybersecurity without distracting yourself from your actual business!

About Duck

Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!

Paul Ducklin

03/21/2024

Share this article:

Table of contents:

The world doesn’t need another traditional MSSP  or MDR or XDR.

What it requires is practicality and reason.

tour the product

Related articles

Google’s cloud outage: Move slowly and break things anyway

Google’s cloud outage: Move slowly and break things anyway

Why have one bug when you can have a series of them?

How to Prevent Email Spear Phishing Attacks

How to Prevent Email Spear Phishing Attacks

Phishing emails are often sent out en masse and target people indiscriminately. According to a recent Gone Phishing Report, 7% of users clicked a phishing email in 2022. Of those, 44% submitted credentials in a subsequent online form. As high as these statistics are, spear phishing attacks–a highly targeted form of phishing–are much more successful. According to research by Barracuda, 11% of spear phishing email recipients click the links inside them. Let’s dive into what spear phishing is, why it’s […]

End-to-end encryption: How it works, and why it breaks (Part 1 of 2)

End-to-end encryption: How it works, and why it breaks (Part 1 of 2)

End-to-end encryption is everywhere, yet we still have breaches and eavesdropping. Why?

HTTPS: How secure is it, and do we really need it? (Part 1 of 2)

HTTPS: How secure is it, and do we really need it? (Part 1 of 2)

Phishing and Smishing: How they work and what to do about them (Part 2 of 2)

Phishing and Smishing: How they work and what to do about them (Part 2 of 2)

Good cybersecurity habits for startups and SMEs

Good cybersecurity habits for startups and SMEs

Choose identity-first managed security.

We start with identity and end with transparency — protecting where attacks begin and keeping you informed, with as much visibility as you want. No black boxes, just clear, expert-driven security.

Get Started

No more paying for useless bells and whistles.

No more time wasted on endless security alerts.

No more juggling multiple technologies and contracts.

Follow us!

SOLUTIONS

Mobile Protection Foundational Coverage XDR++ MDR++ Security Monitoring Extended Coverage Cyber Insurance+ Program Why SolCyber Pricing

WHO WE HELP

From Scratch Scaling Up Total Protection Overburdened

PROMO

Refer-a-Peer
Program Slogans & Grumbles Winners

PORTAL

Log in

CONNECT

Contact About SolCyber Partnerships Careers Customer Portal

LEARN

Case Studies Blog Tales from an Armadilllo Tales From The SOC News Downloads

Subscribe

Join our newsletter to stay up to date on features and releases.

By subscribing you agree to our Privacy Policy and provide consent to receive updates from our company.

CONTACT

©

2025

SolCyber. All rights reserved

|

Made with

by

Privacy Statement Cookies Settings