Low-level exploits and vulnerabilities that lead to spyware takeovers on mobile phones sound like hard problems to solve, and they are. In contrast, malware that spreads in plain old text files, created with plain old text editors, often by self-taught programmers with little or no experience, sounds as though it should be easy to deal with.
How hard can text-based scripts really be? The fact that we’ve had to split this article into two parts probably gives you a hint at the answer: very hard indeed.
Some of the cybersecurity vulnerabilities that cybercriminals exploit are notoriously complex, being deviously difficult to uncover and analyse, and demanding what can only be described as grudging respect for their fiendishness.
Finding them in the first place, or devising fixes for them once they’re known, often requires detailed knowledge of low-level assembler programming, arcane machine code tricks, and little-known or entirely undocumented operating system and hardware details.
Sometimes, just explaining new exploits to yourself, let alone writing them up in a way that counts as repeatable research, is surprisingly difficult, even though that is a vital part of creating a solution that you know fixes the underlying security hole reliably, thus delivering security by design and not just security by accident (where a lucky trick that worked once ends up adopted as a cure-all, even though it isn’t).
But malware analysis and threat detection is full of surprises.
One such surprise is what you might call “the practicability paradox”, where problems that feel as though they ought to be easy to deal with turn out to be as hard as anything you’ve seen before, and keep on getting harder in a dizzying series of twists and turns, sometimes for months, years, or even decades. Ironically, this sort of surprise shouldn’t really be a surprise, because it’s been part of computer science since the very beginning, right back to the 1820s, when Charles Babbage from London in England announced plans for his legendary Difference Engine Number 1.
The difference engine was to be a digital computer, albeit a mechanically operated one, designed to eliminate costly and dangerous errors from the printed mathematical tables then critical in astronomy, engineering, and navigation.
We now know in practice, not just in theory, that Babbage’s design was correct, because by the early 1990s, the Science Museum in London had built a difference engine that actually worked – an item of the most astonishing industrial beauty even when it is at rest, and quite breathtaking when it is running.
But Babbage was stymied in his own lifetime by two parallel challenges: his initial design required finer manufacturing tolerances than could be achieved in the 19th century; and he kept changing, improving and extending his designs even as his engineers failed to build a working model of his first proposal.
Soon, Babbage had leapt ahead of himself to design an Analytical Engine controlled by punched cards, which was a truly general-purpose programmable computing device in the modern sense, complete with a mill and a store, Babbage’s industrial revolution metaphors for components we now call the CPU and RAM.
He also came up with Difference Engine Number 2, which was more compact and needed fewer parts than the original, a manufacturing trend that we expect computers and mobile phones to follow to this day.
In an irony worthy of a time-travel science fiction novel, it wasn’t until we had sufficiently powerful electronic digital computers, and the engineering precision they made possible, that we were able to create and build the very computer that was their progenitor.
In the modern era, script malware has turned out to be a similarly troublesome problem: one that in theory we know how to solve, but that in practice keeps throwing up ever more challenges, making it ever more difficult and time-consuming to deal with.
To be clear, many of the problems we face with script malware are quite deliberately introduced by cybercriminals, as we knew that the cat-and-mouse game was inevitable.
Indeed, cybersecurity is not for those who are faint-hearted or short of stamina, because you can think of it as an eternal card game where the cards are dealt from a deck of unlimited size, with an unlimited number of different suits and ranks available.
Your King of Hearts may beat my Queen of Clubs today, but my Emperor of Escutcheons will beat your King tomorrow, and your Grand Vizier of Galaxies will take out my Emperor the day after that, and so on forever.
Sadly, however, the script malware problem is not down to the cybercrooks alone.
Just as with Babbage and his ever-evolving computing machines, some of our script malware challenges are problems of our own making, caused by the relentless march of what we call progress.
Over many years, we have shown a collective willingness to accept programming practices in our own world that are surprisingly similar to, and sometimes exactly the same as, the tricks used by cyberattackers in their world.
But let’s rewind to the early days of malware, when viruses with names such as Elk Cloner, Brain, Vienna, Stoned, Jerusalem, Ping Pong and Michelangelo were doing the rounds.
(New viruses were sufficiently rare in the late 1980s that it was considered safe to name them after the city or town in which they first appeared, and sometimes even after the country, so that Stoned was also known as the New Zealand virus, and Ping Pong by the name Italian.)
Many of those early viruses had a lot in common with the devious and technically treacherous modern exploits we mentioned right at the start.
They used undocumented tricks and guesswork that weren’t in any books or magazines (there were no search engines back then, of course), and they were often hand-crafted in assembly language or machine code to fit into minuscule hiding places where you might not notice them even if you knew where to look.
Early viruses also often included programming tricks to make themselves as good as invisible once they were active, by lying to the rest of the computer about their appearance, a treachery known to this day as stealth, and also by the splendidly self-descriptive jargon term anti-anti-virus.
To become a successful malware researcher therefore required you to be an inventive programmer (there was no GitHub back then), and to know how to disassemble machine code at sight and to code directly in hex (development tools were slow and bulky).
You also needed a prodigious memory for hardware and operating system arcana, because there was little or no pop-up help, and where documentation existed, a lot of it was available only in printed form, so memorising it was much faster than paging through clumsy ring-bound BIOS listings or system manuals.
As you can imagine, it was tempting to assume that the same sort of entry requirements would apply to malware writers, albeit without any commitment to quality, correctness or fair play on their part.
Perhaps it was reasonable to assume that the amount of malware, and just as importantly the range of different malware types showing up, might be limited, too?
The cat-and-mouse game would be inevitable, but there was early hope that it would turn out to be a tractable problem, like Babbage’s assumption that he could build a Difference Engine Number 1 that really worked.
Many malware writers were, as it happened, driven by a desire to show off their coding prowess and their crafty assembly language skills, because there were few ways, if any, of making money out of malware back then.
But even in the 1980s and 1990s, not all coding tasks required traditional programming languages such as C and assembler, which have notoriously steep learning curves.
So-called scripting languages bridge the gap between the arcane complexity of programming in C and the straightforwardness of saying what you want to do in plain words.
For example, in C, a programmer might end up copying files with a function like the code shown below, which is long and fussy even if you’re familiar with C:
In hexadecimal machine code form and decompiled to Intel assembly language, it’s even less tractable:
Despite the apparent attention to detail in the code above, if you assume that copious comments imply thoughtful programming, it is awash with bugs.
One bug is obvious and could lead to denial-of-service problems in a long-running program; a second is rather less obvious, but could lead to undetected data corruption; and two further problems make the second bug very much more problematic if the code is compiled on Windows.
The code is perfectly legal on Windows, but missing an annoying and unportable detail that isn’t needed on Linux, where it was originally written. It also contains resource management and error-handling flaws. If you can find some or any of the bugs, write to amos@solcyber.com and tell us! We’ll send an Amos the Armadillo plush toy to the best answer. (Editor’s discretion and decision is final.)
In contrast, users of MS DOS (and indeed of Windows 11 today) could achieve a similar result in a so-called batch file or .BAT script containing a single line of text like this:
Implanting the above command inside a batch script achieves exactly the same result as typing it yourself at the DOS prompt:
Microsoft’s batch language is much more powerful today, and has been superseded by the yet more powerful PowerShell scripting language, but even back in the MS DOS days it was perfectly possible to write dangerous and destructive malware in .BAT scripts.
Despite the underlying simplicity of batch files, batch programmers could access commands to navigate around the hard disk, to delete existing files, to create new ones, to search for text in files, to modify existing files as if editing them by hand, and even to reformat or wipe the hard disk entirely.
They could even create short machine code programs in assembly language by running the built-in MS DOS DEBUG command, which included a very simple assembler tool.
Today’s cybersecurity news stories may give you the impression that so-called fileless malware, where executable files get launched but never touch your hard disk, is something scarily new, but this short batch file, running on MS DOS 3.30 from 1987, generates a machine code program directly in memory and then runs it without saving it to disk at all:
The script works as follows, where the lines that start with hexadecimal memory addresses show where DEBUG placed the code in memory, the line -g tells DEBUG to ‘go’, which means to run the generated code, and the text Hello, world on the next line is the output of the assembler code above it:
(INT 21h function AH=09h is the DOS PrintString function, if you are interested in operating system API history.)
On Linux and Unix, the situation is even more dramatic, because Unix-style script language tools have had this sort of power and much, much more for decades.
Fileless malware in a Unix script can be as short and as simple as this:
The curl command is a command-line internet download tool that fetches the given URL and prints it to the screen, seen here fetching a rogue script from an arbitrary website.
The vertical bar character (‘|’) is known as a pipe, and represents a special memory storage area, allocated by the operating system, that behaves as if it were the output file for the first program in the command (known in this context as a pipeline), and as if it were the input file for the second program.
The first program thinks it is writing to the screen and the second program thinks it is reading from the keyboard, but in real life the pipe is acting as if it were a temporary file used to convey data between the two programs, without actually ever saving any data to disk.
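To make the pipe idea concrete, here is a minimal sketch in Python rather than in shell script. The two tiny programs are hypothetical stand-ins that we made up for illustration (one prints a line of text, the other upper-cases whatever it reads), but the plumbing between them is a genuine operating system pipe, so the data travels from one process to the other without being saved to disk.

```python
import subprocess
import sys

# Hypothetical stand-ins for the two programs in a pipeline: the "producer"
# prints a line of text, and the "consumer" upper-cases whatever it reads.
PRODUCER = [sys.executable, "-c", "print('hello from the first program')"]
CONSUMER = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_pipeline() -> str:
    """Connect PRODUCER's standard output to CONSUMER's standard input."""
    first = subprocess.Popen(PRODUCER, stdout=subprocess.PIPE)
    second = subprocess.Popen(CONSUMER, stdin=first.stdout,
                              stdout=subprocess.PIPE, text=True)
    # Drop our own handle on the pipe so that the producer is notified
    # if the consumer exits early.
    first.stdout.close()
    output, _ = second.communicate()
    first.wait()
    return output


if __name__ == "__main__":
    print(run_pipeline(), end="")
```

Neither child program needs to know that a pipe is involved: the first simply writes to its standard output, and the second simply reads from its standard input.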
This single line of script downloads a hitherto unknown and untrusted script from outside your network, and feeds the script directly into the bash script processing engine, as if you had typed the script in and executed it yourself.
At first sight, it might seem as though script malware ought to have produced what you might call a swings-and-roundabouts effect, where the increased ease of creating malware in the first place would be offset by the increased ease of scanning those text files for rogue content.
Executable files (.EXE files on Windows, Mach-O binaries on macOS, and ELF-format binaries on other Unixes) have surprisingly involved structures, starting off with complex headers that need to be read in and deconstructed by threat scanning tools before they can begin to tease the rest of the file apart to investigate how it works and what it does.
For example, Windows programs are stored in what’s known as Portable Executable (PE) format, though in this context ‘portable’ means that the file can be used on various versions of Windows, not portably between different operating systems. PE files start with data headers that look like this:
In contrast, batch files and scripts are, by and large, plain text files in which command execution starts at the first line and continues onwards line-by-line through the file (though possibly with jumps, loops or early exits).
This makes it sound as though scanning script files for rogue content should be much easier and faster than picking apart highly structured files such as executables, images, archives and so on.
But the apparent simplicity of text-based scripts hides many complexities.
Firstly, script files are generally much easier to modify than raw executables, and cybercrooks who want to devise new malware variants can do so by using just a text editor, without needing tools such as preprocessors, compilers, assemblers and linkers, which are typically used to rebuild executable files after their source code is changed.
Modifying raw executable files is possible using what’s known as a hex editor, which is a bit like a text editor but designed to work with any byte sequences, whether they make up legible lines of human-readable text or not.
But .EXE files modified in this way are prone to damage that stops them working, for example if a byte is added or removed by mistake.
That’s because executable files are full of internal references such as ‘read the data that is stored 64 bytes back in memory from here’, or ‘in the event of an error, leap forward 99 bytes in memory and carry on from there instead’, so that inept modifications may cause the file to crash if used, or to be rejected outright by Windows as incorrectly formed.
Secondly, script files are often trivial to rearrange so they are vastly larger and look very different, for example by adding hundreds, thousands or even millions of blank lines, which are ignored when the program runs, or by inserting so-called text comments that are ignored.
In .BAT files, any line on which the first non-space characters are REM, short for remark, is ignored as a comment; in most Unix scripts, the hash (‘#’) character serves the same purpose; PHP script comments start with // or are wrapped by /*...*/ markers; and so on.
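As a rough illustration of the extra work this creates, here is a minimal Python sketch of the sort of clean-up a scanner might apply to a .BAT file before looking at what is left. The function name and the sample file names are our own inventions, and the rules are deliberately simplified: only blank lines and REM lines are dropped, with none of the batch language’s other quirks taken into account.

```python
def strip_bat_noise(script_text: str) -> list[str]:
    """Return the lines of a batch script that might actually execute.

    Simplifying assumptions (this is not a full batch parser): a line is
    treated as a comment if its first token, after an optional leading '@',
    is REM in any mix of upper and lower case; blank lines are dropped.
    """
    kept = []
    for raw_line in script_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank padding line
        candidate = line.lstrip("@").lstrip()
        first_token = candidate.split(None, 1)[0].upper() if candidate else ""
        if first_token == "REM":
            continue  # remark line, ignored when the script runs
        kept.append(line)
    return kept


# Two hypothetical scripts that look different but do the same thing.
plain = "COPY A.TXT B.TXT\n"
padded = "\n\nREM harmless-looking remark\n   rem another one\n\nCOPY A.TXT B.TXT\n\n"

if __name__ == "__main__":
    print(strip_bat_noise(plain) == strip_bat_noise(padded))
```

Note that the scanner has to read every line of the padded file to reach the same conclusion it reaches immediately for the plain one, which is exactly the asymmetry the attackers are counting on.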
Redundant lines of code can also easily be inserted, such as setting variables that are never used, performing calculations that are not required, and more.
Although this makes the script take longer to run, possibly even much longer, the cybercriminals who create obfuscated malware in this way don’t have to worry about those longer runtimes, because their malware typically only needs to run once to achieve its malevolent result, so a few extra seconds simply doesn’t matter.
But malware scanners are faced with much more work to process every possible text file on the system, because they are looking for script malware needles in a giant haystack of innocent text files.
A batch file like the one below, and trillions of different but easily created variants of it, will perform exactly the same function as the one-line COPY script we showed above, even though at casual glance it looks like a mish-mash of .BAT keywords that could never possibly make any sort of programmatic sense:
Here’s another example, where the injected comments are entirely nonsensical to the human eye, but look like pseudo-random non-text (which is indeed how they were generated) to an automated analytical tool.
The very possibility of scripts like this forces scanners to read the entire file, teasing out the content line by line, discarding lines that won’t be executed if the file is ever used as a .BAT script, and taking account of what’s left:
And here’s a yet more dramatic example, which exploits the fact that .BAT scripts, unlike scripts in many other command languages, aren’t pre-processed by the operating system to ensure they represent usable programs.
Instead, each line is read in and tried out in sequence.
Lines that are not legal commands produce a Bad command or file name error, but the system ploughs on regardless, ignoring the errors and helpfully (for the cybercrooks, at least) running any lines that it can:
Those ‘@’ characters at the start of each line tell the operating system not to print out each command before running it, so the output looks like this, with a string of errors masking the single line that actually worked:
Even more dramatically, perhaps, the Windows batch language allows commands to be constructed at run-time, directly from variables established with the SET command, making scrambled scripts surprisingly easy to create, even though .BAT scripts are considered unsophisticated from a programmatic point of view:
The .BAT program above constructs the command ECHO from the four variables created with the SET commands at the top (the variables are confusingly and deliberately named after genuine batch commands), adds some text consisting of words that originally appeared in a completely different order, and thereby prints out a meaningful result from a heavily obfuscated input file:
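To see why run-time construction of this sort is awkward for a scanner, consider the following Python sketch, which resolves simple variable references so that the real command becomes visible. It is an illustration under heavy assumptions, not a batch interpreter: it understands only plain SET NAME=value lines and %NAME% references, and the sample script is a hypothetical one of our own, with variables named after genuine batch commands in the same spirit as the technique described above.

```python
import re

# Matches lines such as "SET NAME=value", optionally prefixed with '@'.
_SET_LINE = re.compile(r"^\s*@?\s*SET\s+([^=\s]+)=(.*)$", re.IGNORECASE)
# Matches references such as "%NAME%".
_VAR_REF = re.compile(r"%([^%\s]+)%")


def expand_bat_variables(script_text: str) -> list[str]:
    """Resolve simple %NAME% references and return the non-SET lines.

    Assumptions: variable names are case-insensitive, and unknown
    variables expand to nothing in this sketch.
    """
    variables = {}
    resolved = []
    for line in script_text.splitlines():
        # Expand references first, using values set on earlier lines.
        expanded = _VAR_REF.sub(
            lambda m: variables.get(m.group(1).upper(), ""), line)
        match = _SET_LINE.match(expanded)
        if match:
            variables[match.group(1).upper()] = match.group(2)
        else:
            resolved.append(expanded.strip())
    return resolved


# Hypothetical scrambled script: four variables spell out one command.
scrambled = "\n".join([
    "SET DEL=E",
    "SET COPY=C",
    "SET DIR=H",
    "SET REN=O",
    "%DEL%%COPY%%DIR%%REN% hello, world",
])

if __name__ == "__main__":
    print(expand_bat_variables(scrambled))
```

Searching the raw text of the scrambled script for the word ECHO would find nothing; only after emulating the variable handling does the command appear, and real-world scripts can chain many more layers of substitution than this.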
Although scanning for malware in files such as executables, images and archives can be complex and time-consuming, most of those file types have some sort of magic marker at the start, without which they won’t be accepted as executables, images or archives by the operating system.
Windows executables famously start with the bytes MZ, which are the initials of Mark Zbikowski, the Microsoft programmer who invented the .EXE file format, so files that don’t start that way can be assumed (with some limitations) not to be executables, and therefore don’t need analysing as if they were.
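A magic-marker check of this sort takes only a few lines of code, as the following Python sketch suggests. The function name is our own, and the test is deliberately cheap: it confirms the MZ marker at the start, then follows the pointer stored at offset 0x3C of the old DOS header to see whether the PE signature is where it should be. A real scanner would go on to validate far more than this.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Cheap pre-filter: does this buffer start like a Windows PE file?"""
    # DOS and Windows .EXE files begin with the two-byte 'MZ' marker,
    # and the DOS header that follows is 0x40 bytes long.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # Offset 0x3C holds a 32-bit little-endian pointer (e_lfanew)
    # to the 'PE\0\0' signature of the newer-style header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


# A fabricated, minimal header used purely to exercise the check.
fake_pe = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40) + b"PE\x00\x00"
batch_text = b"@ECHO OFF\r\nCOPY A.TXT B.TXT\r\n"

if __name__ == "__main__":
    print(looks_like_pe(fake_pe), looks_like_pe(batch_text))
```

The important point is the asymmetry: a file that fails this test can be ruled out as a PE executable almost instantly, but there is no equivalent two-byte test that rules a text file out as a script.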
But almost any text file can be turned into a batch script, and many if not most of those files, regardless of their size or how unlikely they look at first glance, could include harmful commands.
Script malware therefore tilted the programming burden in favour of the cybercriminals by flattening the learning curve for malware creation and modification.
At the same time, it also tilted the complexity of detection and prevention against the cybersecurity industry, given that almost any file could, in theory, act as an infectious host for some sort of malware code.
Sadly, long as this article has been, we have only scratched the surface of this thorny issue.
Scanning text files for potential malware sounds as though it ought to be easier, faster and less prone to mistakes than scanning more complex files, but as far back as the late 1980s and early 1990s, we had a clear warning that script malware could turn out to be a thunderingly difficult problem to deal with.
And that is exactly what happened.
Microsoft Word and Office macros on Windows 95, JavaScript in our browsers, scripting tools such as PHP and Visual Basic Script on web servers, and PowerShell on Windows 10 and 11, have brought dramatic and still-evolving challenges for threat detection and remediation…
…and you can read all about them right now in Part 2 of this article.
PS. If there are any knotty topics you’re keen to see us cover, from malware analysis and exploit explanation all the way to cryptographic correctness and secure coding, please let us know. DM us on social media or email the writing team directly at amos@solcyber.com.
Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!
–Image of PE file header originally by ByteBiter, then modified and uploaded to Wikimedia and licensed under CC-BY-4.0.