Data Protection in a Nutshell
Read also Roger Mester's article: All About Copy Protection.
VARIOUS KINDS OF STANDARD DATA PACKAGES
We humans have various sensors that can receive signals, for example sound waves and light waves. These bring us information, much of which is generated by other human beings and makes sense to us: language, music, pictures and movies.
Information is fluid. We receive it and it's gone, except what gets stored in our memory. Information can also be stored externally. Then it's called data.
Data can be stored in various forms. The simplest is ink on paper. Our eyes can understand this directly, provided we know how to interpret the information. The text you are reading now only makes sense if you understand the language it is written in. Often data is stored in a more indirect way: groove modulation on phonograph records, magnetized particles on tape, or hills and valleys on a plastic disc. None of these can be immediately understood by a human being. To understand the phonograph record, you need a pick-up. To understand a magnetic tape, you need a tape player. In modern times much data is converted into digital form before being recorded. The pictures on a chip in a digital camera are stored as bits - on or off - totally impossible to understand without a device that translates them into something that "makes sense".
Data stored in digital form must be read and translated by a program. We call such a program a viewer, browser, reader or player. To complete the picture, a program is itself a kind of data that is understood by the computer's processor.
Digital information is stored in various file formats depending on whether it is sound, graphics, video, database, spreadsheet, or document.
So we have two components: a digital data file and a viewer that can read and present data to the user. Examples are a .PDF file and Acrobat Reader, or a .DOC file and MS Word.
Generating data involves work and money. Need I mention Hollywood film budgets? Sometimes data is critical and must be limited to a few users.
Digital data's greatest strength and weakness is that it is easily copied and spread across the world in a matter of minutes. There are various reasons for limiting the distribution of data - usually secrecy or money. If we're talking about company sales figures and military strategy, then only those with a "need to know" should see the data. If we're talking about songs and movies, then only those who have paid should have access.
If I want data to be secret, I can keep it in my head. Then, as long as I keep my mouth shut, no one but me can access it. However, most data is meant to be communicated. As soon as the data is written down (stored) someone can steal it. You can save it in your own format that no one else understands, but then you also need to make a proprietary viewer.
You don't necessarily want to keep data secret. Sometimes you just don't want it copied. Some data is printed with black ink on red paper to foil photocopiers. Think about money - special paper and ink, so it can't be copied.
You can encrypt data using a secret code and prevent the wrong people from reading it by inventing a sophisticated code machine to do the decryption. Now the data can be transmitted openly, but only owners of the machine can understand it. We all know the story of the famous Enigma cipher machine.
In our digital world, however, the code machine is just software and can be easily copied and passed around. Since the decoder itself is no longer physical, it must be locked to a physical key - one that cannot be copied and that is given to authorized users only. If you know and trust the receiver, you can use a password. But for general publication, you must assume that some of your users are hostile; therefore you need a key that cannot be copied. The key can for instance be the user's computer.
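As a rough illustration of using the user's computer as the key, the sketch below derives a decryption key from a machine identifier. It is only a sketch under assumptions: the function names and the product salt are invented for this example, and the network hardware address returned by Python's uuid.getnode() stands in for whatever hardware traits a real product would measure. That address can be changed by a determined user, so a single trait like this is weak on its own.

```python
import hashlib
import uuid


def machine_key(machine_id: int, product_salt: bytes) -> bytes:
    """Derive a 32-byte key from a machine identifier and a per-product salt.

    Both parameters are hypothetical: a real scheme would combine several
    hardware traits rather than rely on a single number.
    """
    material = machine_id.to_bytes(8, "big") + product_salt
    return hashlib.sha256(material).digest()


def key_for_this_computer(product_salt: bytes) -> bytes:
    # uuid.getnode() returns the hardware (MAC) address as a 48-bit integer.
    # If no address can be read, Python falls back to a random number,
    # so this is an illustration, not a reliable fingerprint.
    return machine_key(uuid.getnode(), product_salt)
```

The same data encrypted for one machine cannot be decrypted with the key derived on another, because a different identifier yields a different key.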
If you want to use a standard data format, and use standard viewers, then you need to encrypt data and then decrypt it again before the viewer sees the data.
So a protected data package consists of the encrypted data, the decryptor and the key. Often the decryptor is divided into two parts: the viewer which understands and displays the decrypted data and the actual decryptor itself.
When evaluating the quality of data protection, the following must be considered.
Very few companies in the world master the art of program (executable) protection. One could name a handful - dedicated bit-twisters that dig deep into the heart of the computer's operating system. See my article All About Copy Protection.
Even fewer companies master the difficult art of data protection. Data protection utilizes copy protection to lock the decryptor to a physical key, but also consists of encryption and decryption. Moreover, designing a decryptor and integrating it into the package is a science in itself.
VARIOUS KINDS OF STANDARD DATA PACKAGES
Data can be characterized by its availability and the format used to store it. This can be a public (standard) format such as html or a private (proprietary) format. If data is located on a single computer which can be accessed by authorized personnel only, then the data is restricted. On the other end of the scale, if it is on a website, it is open.
The most common need for protection comes with distribution of standard format data to a wide audience. Users are either paying customers or (more or less) trusted employees.
PROTECTING STANDARD DATA. HOW IT'S DONE
Standard data, depending on how it's used, can be protected in various ways. Take PDF for example. PDF can be read by the latest Adobe Reader and a few previous versions. What can be done to protect a PDF file? Well, you start by encrypting the document. This is easy enough. Invent some kind of algorithm, XOR with this and that and swap some bits around. You don't have to be an expert to make life difficult for the casual hacker.
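To make "XOR with this and that and swap some bits around" concrete, here is a toy scrambler. It is a sketch, not a recommendation: the function names and the fixed rotation of 3 bits are arbitrary choices for this example, and a repeating-key XOR like this only deters the casual hacker. It is not real cryptography.

```python
from itertools import cycle


def _rotl8(b: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    n %= 8
    return ((b << n) | (b >> (8 - n))) & 0xFF


def _rotr8(b: int, n: int) -> int:
    """Rotate an 8-bit value right by n bits."""
    n %= 8
    return ((b >> n) | (b << (8 - n))) & 0xFF


def scramble(data: bytes, key: bytes) -> bytes:
    """XOR each byte with the repeating key, then rotate its bits left by 3."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(_rotl8(b ^ k, 3) for b, k in zip(data, cycle(key)))


def unscramble(data: bytes, key: bytes) -> bytes:
    """Undo scramble(): rotate right by 3, then XOR with the repeating key."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(_rotr8(b, 3) ^ k for b, k in zip(data, cycle(key)))
```

Calling unscramble(scramble(document, key), key) returns the original bytes.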
The problem is making the Reader understand the data. If the Reader were your own software, you could change the source so that decryption was built in. If you were some kind of super-hacker, you could reverse-engineer the Reader and patch in some decryption code. This might work, but from a professional point of view, Adobe would not be happy about your releasing hacked versions of their software. In fact, this is explicitly forbidden by their user license.
At this point, we have encrypted data and have chosen a standard browser. The only thing missing to complete the scheme is a decryptor - a module that will take the encrypted data and feed it to the browser on demand. There are various ways of getting the job done.
Device driver
The simplest is to install a driver in the path data takes from the hard disk to the browser. In this case, the driver would talk directly to the hard disk, reading encrypted data, decrypting it and passing it on. Being a low-level component, the driver does not know who is calling it, and this is a problem: homemade software could call the driver and receive decrypted data.
The disadvantages of this simple solution are compatibility and security. Drivers are different for each operating system, and a driver cannot easily make sure that the request comes from the genuine browser rather than from a hacker's homemade one.
Plug-in
Some browsers, for example Adobe Reader, Netscape and Internet Explorer, support plug-ins. A plug-in is basically a DLL with a clearly defined interface that can be used by the browser. The plug-in decrypts the data and feeds it to the Reader. The problem here is that the plug-in has a well-documented interface that helps the hacker find out what's going on. Also, the browser has no code security, leaving it wide open to debugging and reverse-engineering. And you have to learn a completely different interface every time you want to develop a plug-in for a new browser. Another big limitation is that many data applications do not support plug-ins.
Monitor or Filter
Probably the best, but also most complex way is to launch a monitor module, hook into all the DLLs used by the Reader (and there are plenty) and watch every open, read or write operation. Then intercept read calls and decrypt data when necessary. Write calls have to be trapped and disabled, otherwise users can just save the decrypted data.
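The filtering idea can be sketched at a much smaller scale than real DLL hooking. The example below wraps a file object so that reads are decrypted on the fly and writes are refused. Everything here is an assumption made for illustration: the class name FilteredFile is invented, and a position-dependent XOR stands in for the real cipher. A real monitor would hook the operating system's file calls instead of wrapping a Python object.

```python
import io


class FilteredFile:
    """Wrap a binary file: decrypt on read, refuse every write."""

    def __init__(self, raw, key: bytes):
        if not key:
            raise ValueError("key must not be empty")
        self._raw = raw
        self._key = key

    def read(self, size: int = -1) -> bytes:
        offset = self._raw.tell()  # absolute position of the first byte read
        chunk = self._raw.read(size)
        # Stand-in cipher: XOR each byte with the key byte chosen by its
        # absolute file offset, so seeking and partial reads still decrypt.
        klen = len(self._key)
        return bytes(b ^ self._key[(offset + i) % klen]
                     for i, b in enumerate(chunk))

    def seek(self, pos: int, whence: int = io.SEEK_SET) -> int:
        return self._raw.seek(pos, whence)

    def write(self, data: bytes) -> int:
        # Trap write calls so decrypted data cannot simply be saved.
        raise io.UnsupportedOperation("writing is disabled for protected data")
```

Keying the XOR to the file offset is a design choice for the sketch: a viewer rarely reads a file front to back, so decryption has to work for any read at any position.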
The monitoring software that keeps an eye on the DLLs must itself be encrypted to prevent reverse engineering, and it must also be locked to some kind of physical key (e.g. an original CD-ROM) so that it can be used by authorized customers only.
This scheme is a bit tricky, but feasible - and can be reasonably safe and reliable if the following conditions are met.
The above are all security considerations. Compatibility is just as important. The scheme must be robust enough to function reliably on all computers and operating systems.
Back-up should be considered. Encrypted data is worthless if you lose the key. Moreover, if data is to be transmitted, note that transfer rates drop as the quality of the encryption increases. This problem can be solved by combining encryption and compression.
We now have a good idea of how a practical data protection scheme operates. You launch a copy-protected module that will run only if a genuine CD or other hardware (e.g. the PC itself) is present. This filter monitors the Reader, watching all data I/O and decrypting on demand. The monitor must also know when the Reader is handling a non-encrypted document and let this pass unmodified.
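Letting non-encrypted documents pass unmodified can be as simple as checking for a marker at the start of the file. In the sketch below the marker LOCK1 and both function names are hypothetical, and a repeating-key XOR again stands in for the real cipher.

```python
MAGIC = b"LOCK1\n"  # hypothetical marker placed in front of protected files


def _xor_body(body: bytes, key: bytes) -> bytes:
    """Stand-in cipher: repeating-key XOR (symmetric, so it also decrypts)."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(body))


def protect_document(plain: bytes, key: bytes) -> bytes:
    """Encrypt a document and prefix it with the marker."""
    return MAGIC + _xor_body(plain, key)


def filter_document(raw: bytes, key: bytes) -> bytes:
    """Decrypt marked documents; hand everything else through untouched."""
    if not raw.startswith(MAGIC):
        return raw  # ordinary document: pass unmodified
    return _xor_body(raw[len(MAGIC):], key)
```

An ordinary PDF begins with %PDF-, so it never matches the marker and reaches the Reader exactly as stored.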
There are various other details to account for.
You must consider the case where the PDF file is run from inside an internet browser - embedded in the html. Here the Reader executable is not started, but a plug-in is used instead. Again, you have to find a way to monitor all read/write calls.
This should give you an idea of the problems involved - and PDF is one of the easy ones. HTML itself is probably the most difficult of all. It's hard to place a filter on html browsers. First of all, there are so many of them. You would need different monitoring software for each one. And internet browsers - especially Internet Explorer - are so tightly integrated with the operating system that they're nearly impossible to monitor. Not to mention that it is difficult to prevent decrypted data from being saved or exported from the browser.
For html, all you can really do is develop your own browser and have your customers use it to read encrypted html pages. This is not an ideal situation, but workable.
Each data protection job is unique. Even for PDF, which is a basic situation, you have all sorts of variations: for example, a startup file that automatically checks for the latest Reader before loading and displaying the PDF. So you have to get the startup file to launch the monitor as well as the Reader.
As you might expect, things can go inexplicably wrong. Testing on multiple platforms is therefore mandatory to ensure the desired degree of compatibility. This means that both implementation and testing costs are increased.
One wonders why any sane company would get into this area. On the other hand, there is more and more data all the time, and it gets more expensive to produce. You've got graphical applications, entertainment of all kinds, maps and educational material. Besides public distribution there is also an increased need for in-house security: technical documentation, training videos, sales charts, customer databases and budget spreadsheets.
Data protection is like program protection, but more so. Compatibility is more difficult. Security is tougher. If the hacker can decrypt the data, he doesn't need to crack the code security in the monitor module. Also, due to the additional processing, there are often speed considerations.
Data protection is more than just technology. It is also an economic consideration.
If you're supplying data to a mass market at a very low price, then your chief concern is compatibility, and only a reasonable amount of protection is needed. Customers probably won't try too hard to steal the data if the price is right.
For expensive data going out to a small number of customers, you're not too concerned about compatibility. If there's one machine it doesn't work on, they can find another, or you can help them out on an individual basis. Safety is important, because you don't want expensive data getting out. On the other hand, if it's very technical data, it might not have wide appeal.
The ideal system - one that always works and is unbreakable - doesn't exist. Think about how much compatibility you need, balanced against security. Data protection always means more work and expense. You have to be prepared for support and the possibility that the protection will be cracked. Weigh this against the price of the protection and the extra sales (or in-house security) gained.
Another important factor is longevity. If last year's data is irrelevant, the protection need only last for a limited period. If you can hold off the hackers for 6 or 8 months, you're OK. Other data might have a shorter or longer lifetime.
In the end, anything that can be displayed on a screen can be stolen. If a thief is willing to do the work of making endless screenshots, saving them one-by-one, there's nothing you can do to stop him. However, you can make it difficult.
If possible, data should be used interactively. If a pirate can grab the data and doesn't need the browser, that makes it easier for him. Take the case of an electronic textbook with several chapters - each being a separate Shockwave application. If a pirate can decrypt the individual chapters, he doesn't need the overall browser at all. On the other hand, if the textbook is presented in a more interactive way, where the individual SWF files can't stand on their own, then the hacker cannot use the data separately and is forced to crack the entire scheme. Then, even if he cracks the scheme and makes copies of the whole application, he still cannot get the data into a form where it can be modified or used in other places.
Generally, published protected data is delivered with the key; otherwise the user could not use it. The weak point is the software that performs the decryption, since after decryption the data is wide open to pirate distribution. In order to avoid this, the decrypting device is now being built into chips directly connected to the output media. In the foreseeable future we may have to invest in special video boards, screens, printers, sound cards and maybe even speakers. Assuming that a hacker is not able to get inside these chips, the security depends on the encryption algorithm and a high quality key. At the same time, new processors will become mandatory; they will decrypt program code to ensure that the code comes from a trusted source. All this in an attempt to avoid viruses, worms and "Trojan horses".
Still, any security system sold in volume will be exposed to a massive amount of hacking. If nothing else, employees can be bribed or secret keys stolen. Security is a fluid item and must be changed constantly. Any scheme that ignores this is doomed to fail. Believing any code to be unbreakable is a classic mistake that has lost wars in the past.
About the author
Roger Mester was educated in the USA at Rensselaer Polytechnic Institute and Stanford University. He has worked for General Electric, Lockheed, Radiometer, and Great Northern Telegraph. He was one of the two founders of Link Computer - now Link Data Security - in 1982.