Saving clients money on electronic discovery processing is one of the challenges facing attorneys, service bureaus and their clients. Because so much data is collected when imaging custodian hard drives, the resulting processing and labor costs can be significant and potentially prohibitive.
Reduction of 30%+ Through DeNISTing
Many firms have discovered that deNISTing is a relatively easy way to reduce the overall EED processing costs for imaged custodian drives by an average of 30%. How do they accomplish this reduction without missing potential evidence? By removing ‘known’ files for Microsoft Windows, Linux, Mac OS and other systems, the overall production is substantially reduced.
The NIST (National Institute of Standards and Technology) NSRL list contains more than 115 million known files. Using this list to filter custodian hard drive files prior to EED processing can yield a significant reduction.
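The filtering step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular product implements it. It assumes the known-file list has already been loaded into a set of uppercase SHA-1 values; the function names and the way the set is loaded are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the uppercase SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def denist(root, known_hashes):
    """Split the files under root into (kept, removed) lists.

    known_hashes is assumed to be a set of uppercase SHA-1 strings that the
    caller has already loaded from a known-file list (hypothetical input).
    """
    kept, removed = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if sha1_of_file(path) in known_hashes:
            removed.append(path)  # matches a known system or application file
        else:
            kept.append(path)     # unknown file, stays in the collection
    return kept, removed
```

Only files whose hash appears in the known set are dropped, so a custodian document that merely shares a name with a system file is still kept.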
What Brought on DeNISTing’s Recent Popularity?
‘DeNISTing’ has become a requested service in just the last few years. Until recently, there were no tools that could handle the processing without significantly increasing turnaround time or requiring an investment in expensive computer forensic software.
Pinpoint Labs’ Harvester Software Makes deNISTing a Reality
Harvester from Pinpoint Labs is an affordable and easy-to-use application that leverages the more than 115 million known hash values in the NIST list to filter custodian data and dramatically reduce the costs and processing time associated with imaged hard drives. While deNISTing, Harvester can also dedupe, create a chain of custody and safely copy the filtered files. By performing these processes simultaneously, Pinpoint Harvester reduces electronic discovery processing costs and labor.
A hash value is the result of a calculation (hash algorithm) that can be performed on a string of text, an electronic file or an entire hard drive’s contents. The result is also referred to as a checksum, hash code or hash. Hash values are used to identify and filter duplicate files (e.g. email, attachments, and loose files) from an ESI collection or to verify that a forensic image or clone was captured successfully.
Each hashing algorithm uses a specific number of bytes to store a “thumbprint” of the contents. Regardless of the amount of data fed into a specific hash algorithm or checksum, it will return the same number of characters. For example, an MD5 hash uses 32 hexadecimal characters for the thumbprint whether the input is a single character in a text file or an entire hard drive. The following is a list of hash values for the same text file.
HASH
MD5: 464668D58274A7840E264E8739884247
SHA-1: 4698215F643BECFF6C6F3D2BF447ACE0C067149E
SHA-256: F2ADD4D612E23C9B18B0166BBDE1DB839BFB8A376ED01E32FADB03A0D1B720C7
SHA-384: 2707F06FE57800134129D8E10BBE08E2FEB622B76537A7C4295802FBB94755BBEE814B101ED18CC2D0126BD66E5D77B6
SHA-512: C526BC709E2C771F9EC039C25965C91EAA3451A8CB43651EA4CD813F338235F495D37891DD25FE456FE2A8CA89457629378BE63FB3A9A5AD54D9E11E4272D60C
RIPEMD-128: A868B98EAEC84891A7B7BA620EDDE621
TIGER: F31A22CEED5848E69316649D4BAFBE8F9274DED53E25C02D
PANAMA: 7E703B1798A26A0AF21ECD661CBADB9C72B419455814CA7B82E29EE0C03FA493
CHECKSUM
CRC16: 117C
CRC32: FA2D47D4
ADLER32: CF7D65FF
As you can see, hash lengths also vary within a family (SHA-1, SHA-256, etc.). The most common hash values are MD5, SHA-1 and SHA-256. The longer hash values require more time to calculate and are designed to reduce the probability of a collision.
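The fixed-length behavior is easy to confirm with Python’s standard hashlib and zlib modules. The sketch below covers only the algorithms those modules provide; the helper name is made up for this example.

```python
import hashlib
import zlib


def digest_lengths(data):
    """Return {algorithm: number of hex characters} for the given bytes."""
    lengths = {}
    for name in ("md5", "sha1", "sha256", "sha384", "sha512"):
        lengths[name] = len(hashlib.new(name, data).hexdigest())
    # CRC32 and Adler-32 are 32-bit checksums, so 8 hex characters each.
    lengths["crc32"] = len(format(zlib.crc32(data), "08X"))
    lengths["adler32"] = len(format(zlib.adler32(data), "08X"))
    return lengths


one_character = b"a"
one_megabyte = b"a" * 1_000_000

# The output length depends on the algorithm, never on the input size.
print(digest_lengths(one_character))
print(digest_lengths(one_character) == digest_lengths(one_megabyte))  # True
```

MD5 yields 32 characters, SHA-1 yields 40 and SHA-256 yields 64, which matches the lengths in the list above.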
A few other ways that hash values are used:
- Verify a downloaded file was created by the publisher (as opposed to a virus-infected version)
- Identify and filter files on the NSRL/NIST list (“deNISTing”)
- Locate known contraband (illegal images and videos)
Here are a few reasons why hash values are so widely used as a means to validate and compare content:
1) Privileged Data – There would be obvious issues with storing and providing multiple copies of a company’s files or an entire hard drive’s data in a database to perform a byte comparison. Not to mention that illegal images and videos (child pornography) would have to be stored and used in each system scan. These scenarios are unacceptable.
2) Speed – Comparing an indexed hash value is much quicker than comparing what could be billions or trillions of bytes of source data. Optimized hash engines (such as Pinpoint Harvester) can compare thousands of hash values in a second.
3) Security – Hashing data is a one-way trip. The original data can’t be recreated or reverse engineered from the hash value. This provides additional security, because a person can’t determine the source data from the hash.
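To illustrate the speed point in the list above, here is a minimal Python sketch of hash-based deduplication. It keeps an index of hash values only, so no second copy of the source data is stored for comparison. The function name and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def deduplicate(items):
    """Keep the first item seen for each distinct content hash.

    items is an iterable of (name, content_bytes) pairs. Returns
    (unique_names, duplicates), where duplicates maps each dropped name
    to the name of the earlier item with the same content.
    """
    seen = {}  # hash value -> name of the first item with that content
    unique, duplicates = [], {}
    for name, content in items:
        key = hashlib.sha256(content).hexdigest()
        if key in seen:
            duplicates[name] = seen[key]
        else:
            seen[key] = name
            unique.append(name)
    return unique, duplicates


unique, duplicates = deduplicate([
    ("inbox/offer.msg", b"Please see the attached offer."),
    ("sent/offer.msg", b"Please see the attached offer."),
    ("inbox/lunch.msg", b"Lunch on Friday?"),
])
print(unique)      # ['inbox/offer.msg', 'inbox/lunch.msg']
print(duplicates)  # {'sent/offer.msg': 'inbox/offer.msg'}
```

Each lookup is a single dictionary access on a short string, regardless of how large the underlying file is.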
The argument that different data sources could have the same hash value has raised a lot of concern. There are countless threads related to this issue on the litigation support and computer forensic forums. The bottom line is that the only way to do an exact comparison of the original data is to store it everywhere you need to deduplicate or verify the information. However, as mentioned above, this isn’t a practical alternative.
More complex hashing functions have been introduced (SHA-256, SHA-512, etc.) which further reduce the likelihood of a collision. It is also worth noting that even in those cases where scientists have created collisions, it was a result of exploiting the weaknesses in a specific hash algorithm. The same alterations would not create a collision in a different hashing algorithm.
So, if you still aren’t satisfied with the incredibly remote possibility that a collision could happen using a single hash value, the easiest extra precaution is to have your processes calculate hash values from two separate algorithms (e.g. MD5 and SHA-256) for each item. Unfortunately, most EED applications and forensic imaging tools don’t support this option, especially in a single pass.
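For readers who script their own checks, calculating two hash values in a single pass is straightforward in Python. This is a minimal sketch with hypothetical function names: each chunk read from the file is fed to both algorithms, so the file is only read once.

```python
import hashlib


def dual_hash(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha256_hex) for a file, reading it only once."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)     # same chunk goes to both algorithms
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def same_content(hashes_a, hashes_b):
    """Treat two items as identical only if both hash values agree."""
    return hashes_a == hashes_b
```

Two items are then considered duplicates only when the MD5 and the SHA-256 values both match.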
What to Remember
Hash values are a reliable, fast, and secure way to compare the contents of individual files and media. Whether it’s a single text file containing a phone number or five terabytes of data on a server, calculating hash values is an invaluable process for deduplication and evidence verification in electronic discovery and computer forensics.
As our electronically stored information (ESI) data universe continues to grow, we are hearing about ever larger storage capacities. The size of a project in terabytes (1 TB = 1,024 gigabytes) comes up frequently and is often the amount of data that has to be collected, culled or processed on a corporate server. However, you can now purchase a 1 TB drive that will fit in a laptop computer.
Have you heard of a job that will reach or exceed a petabyte? If not, you most likely will in the near future and the following will help if you aren’t familiar with the larger capacities.
Equivalent Storage in Terabytes
Petabyte = 1,024 TB
Exabyte = 1,048,576 TB
Zettabyte = 1,073,741,824 TB
Yottabyte = 1,099,511,627,776 TB
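Each step up in the list above multiplies the previous unit by 1,024. A short Python sketch reproduces the figures; the helper name is made up for illustration.

```python
UNITS = ["TB", "PB", "EB", "ZB", "YB"]


def terabytes_in(unit):
    """Return how many terabytes are in one of the given unit (1,024-based)."""
    return 1024 ** UNITS.index(unit)


for unit in UNITS[1:]:
    print(f"1 {unit} = {terabytes_in(unit):,} TB")
# 1 PB = 1,024 TB
# 1 EB = 1,048,576 TB
# 1 ZB = 1,073,741,824 TB
# 1 YB = 1,099,511,627,776 TB
```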
As the size of electronic data at client sites increases so will the need for refined, targeted ESI collections. Many litigation support and computer forensic professionals have encountered collection jobs that are several terabytes and are provided keyword search terms and other criteria to help identify relevant data and decrease the amount being collected, processed and hosted.
Email Collection refers to the identification and isolation of electronic mail (email) messages that pertain to a specific legal matter in civil litigation cases.
What gets collected
What is actually being collected during email collections can be one of two things:
1. Files representing the contents of the transmitted email messages themselves (usually in MSG, HTML, EML or RTF format).
2. Container (or store) files that hold the contents and data associated with multiple email messages, usually all of the emails for a specific custodian.
Whether files for individual emails or container files are collected depends mostly on the type of email system being used by the custodian. If the custodian is a user of Microsoft Outlook, for instance, then either container files or individual email files may be produced. If the custodian is a user of a webmail service, such as Gmail or Yahoo!, then it is likely that only individual email files can be collected.
How it’s done
Software such as Harvester from Pinpoint Labs can search the PST store files produced by Microsoft Outlook and Exchange email systems for individual emails that meet specific criteria, such as who sent the email, who received it, when it was sent or received, and whether the subject, body, or attachments contain specified key words. It can also output the results either as individual email files or as whole, reconstructed container files; the latter is known as PST regeneration.
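The kind of criteria described above can be illustrated with Python’s standard email package. This sketch does not read PST files, which require specialized tools. It assumes each message is already available as individual EML-style text, and it searches only the subject and a plain-text body. The function and parameter names are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def matches(raw_message, sender=None, recipient=None, after=None, keywords=()):
    """Return True if a message's text satisfies every criterion given.

    after, if supplied, must be a timezone-aware datetime.
    """
    msg = message_from_string(raw_message)
    if sender and sender.lower() not in msg.get("From", "").lower():
        return False
    if recipient:
        to_addresses = [addr.lower() for _, addr in getaddresses(msg.get_all("To", []))]
        if recipient.lower() not in to_addresses:
            return False
    if after:
        date_header = msg.get("Date")
        if date_header is None or parsedate_to_datetime(date_header) < after:
            return False
    # Attachments and multipart bodies are ignored in this sketch.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    text = (msg.get("Subject", "") + "\n" + body).lower()
    return all(word.lower() in text for word in keywords)
```

A call such as matches(raw, sender="alice@example.com", keywords=["merger"]) returns True only when every supplied criterion is met, so adding criteria narrows the collection.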
With other email systems, either the whole container file can be copied and sorted through manually, or the individual emails can be manually identified and exported as individual email files.
What to remember
As with any data being collected, the two concepts to remember are preservation and validation.
Preservation refers to keeping the metadata about the individual messages as well as the metadata contained within each of the messages intact so as to maintain their admissibility. PST regeneration is especially desirable in this case because it maintains both the email data and the data that linked it to contact data, task list data and other data integrated with these types of email messages.
Validation refers to the policy of ensuring, either by hash value comparison (analogous to fingerprints for data) or bitwise comparison, that the contents of the copy are the same as the contents of the original.
Software such as Harvester and SafeCopy 2, both from Pinpoint Labs, has built-in preservation and validation systems to certify that both of these conditions are always met.
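The two validation approaches mentioned above can be sketched with Python’s standard library. This is a generic illustration rather than a description of how Harvester or SafeCopy 2 work internally, and the function names are assumptions.

```python
import filecmp
import hashlib


def file_hash(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_by_hash(original, copy):
    """Hash comparison: only the two short hash values are compared."""
    return file_hash(original) == file_hash(copy)


def validate_bitwise(original, copy):
    """Bitwise comparison: both files must be present and are read in full."""
    # shallow=False makes filecmp compare contents, not just os.stat() data.
    return filecmp.cmp(original, copy, shallow=False)
```

The hash approach has the practical advantage that the value can be recorded once and checked later by anyone who has the copy, without access to the original.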
ESI (Electronically Stored Information) is the general term for all of the data stored on hard drives, camera cards, cell phones, GPS devices, digital video recorders, digital answering systems, thumb drives, RAID arrays and any other form of electronic media capable of storing data.
Types of Electronically Stored Information:
Files – Files are by far the most common arrangement for ESI data. Files (also referred to as loose files or active files) can be thought of as data containers similar to files in the real world. They can be copied, moved, and distributed freely on a variety of different media from DVDs to hard disk drives.
Emails – Emails are messages sent from one user to another. In their raw form, they are simply a stream of data that contains everything needed to get the message from one user to another user. Since emails are a form of documented communication, they comprise highly sought-after data when it comes to legal matters. Emails themselves may be contained in databases, files, or unallocated space.
Database Entries – Database entries are data stored in a database. This type of data is usually context-specific and may be information pertaining to financial records, personnel entries or other data that is interrelated. Single entries in a database require export to another format in order to be useful or even readable by humans. Most databases include this ability.
Log Entries – Log entries are lines in files or entries in databases that contain information about activity on a particular computer. The more commonly useful log entries pertain to users logging into and out of a computer, accessing specific internet sites, the sending or receiving of email or other messages and the moving, copying or accessing of files on the computer. Log entries may require conversion into human-readable form before they can be processed.
Raw or Unallocated Data – Raw or unallocated data is data that resides in segments of the storage media (hard drive, camera card, etc.) that are not being used by files. This data can contain all or part of files that were once referenced in the file allocation table but were subsequently deleted. It can also contain deleted internet history, old information from the computer’s RAM (Random Access Memory) or even old configuration data about the computer itself. Much of this data can even survive a reformatting of the disk itself. Since this data can come from any number of sources that had once been active on the drive, it can make or break a case where it is suspected that deletions may have occurred.
Tools for Collecting ESI
With the exception of unallocated space, tools such as One Click Collect Harvester from Pinpoint Labs have the ability to collect loose files, emails and whole databases with the added benefits of being able to specify key words, date ranges, domains and email addresses among other very useful filters.
Tools for collecting the unallocated space on a drive usually require an experienced forensic examiner in order to get useful interpretations of the data collected. In cases where this is necessary, it is recommended that a certified examiner be hired for the collection and analysis of the data.
Active File Collection refers to the collection of files that are active (not deleted) and pertain to a legal matter or legal hold. In most civil litigation cases, extensive forensic investigations that look at deleted files are unnecessary or too expensive. Thus, most ESI collections are active file collections and/or email collections.
How active file collections are performed
Active files are those that can be seen by normal users. They may include hidden or system files, but they do not include the computer’s Random Access Memory or any deleted files. Files in the Windows Recycle Bin are considered active files and are subject to collection using active file collection methods.
The first step is defining which files need to be collected. This definition can range from “everything” to files of a few specific types containing only certain key words. Since the cost of processing is usually related to the size of the data being processed, it is generally more economical to be as specific as possible without leaving out relevant files.
Once the files have been identified, it is mostly a matter of copying them in a manner that both avoids spoliation and provides a means of certifying the contents of the copies.
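As a rough illustration of the definition step described above, the Python sketch below selects files by type and key word. It is deliberately simplified and all names are hypothetical. It searches raw bytes, so formats such as DOCX or PDF would need text extraction first. Simply reading a source file can also update its last-accessed time on some systems, which is one reason purpose-built collection tools are used in practice.

```python
from pathlib import Path


def select_files(root, extensions, keywords):
    """Return files under root that have a wanted extension and contain
    at least one key word (case-insensitive, plain byte search)."""
    wanted = {ext.lower() for ext in extensions}
    terms = [word.lower().encode("utf-8") for word in keywords]
    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        content = path.read_bytes().lower()
        if any(term in content for term in terms):
            selected.append(path)
    return selected
```

Narrowing the extensions and key words shrinks the set that has to be copied and processed, which is where the cost savings come from.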
What to remember
The one thing to remember about active file collections is that they can be a potential minefield of spoliation. To avoid this, use software that is designed to preserve the metadata, the timestamps, and the data within the copied files. Some products, such as SafeCopy 2 from Pinpoint Labs, are designed specifically for this purpose. Others, like Harvester, also from Pinpoint Labs, offer this feature as well as the ability to cull data by key word search, and also support deduplication, email collection, and deNISTing.
The most important aspects of active file collections are preservation and validation.
Preservation refers to the preservation of the file data, its timestamps (when the file was created, last modified, and last accessed), and any other metadata contained within the file. If any of this data is compromised, the usefulness and admissibility of the file comes into question.
Validation refers to the ability to certify that the contents of the copy are the same as the contents of the original. This is usually done using a hash (analogous to a fingerprint of the file’s data). It may also be done using a bitwise comparison of the data in both the file and the copy, but since this method requires the same amount of storage as the files themselves and offers no means of independent verification, it is not in common use.
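The following Python sketch shows preservation and validation together for a single file. It is a simplified illustration, not a substitute for a purpose-built collection tool. The function and field names are hypothetical. It relies on shutil.copy2, which attempts to carry over the modification and access times but is not documented to preserve the creation time.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_copy(source, destination):
    """Copy a file with its timestamps and verify the copy by hash.

    Returns a record that could be written to a collection log.
    Raises OSError if the copy's hash does not match the original's.
    """
    info = os.stat(source)                # capture metadata before reading
    source_hash = sha256_of(source)
    shutil.copy2(source, destination)     # copies data plus mtime/atime
    if sha256_of(destination) != source_hash:
        raise OSError(f"hash mismatch copying {source}")
    return {
        "source": str(source),
        "destination": str(destination),
        "sha256": source_hash,
        "size": info.st_size,
        "modified": info.st_mtime,
    }
```

Because the record holds the hash value, anyone who later receives the copy can recompute the hash and confirm it independently, which a bitwise comparison cannot offer without the original.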
‘Imaging a hard drive’ is a phrase that is commonly used for preserving the contents of a custodian hard drive or server. It can also be used to describe when a custodian hard drive is cloned. It is worth taking some time to understand the differences and the advantages and disadvantages of each process.
Forensic Imaging
A forensic image or evidence file container (such as EnCase, DD, Expert Witness, and SMART) is often created using software that is running on a computer forensic examiner’s laptop or lab computer. The examiner will connect the drive to a write blocker and use software to create a forensic image of the entire contents of the source drive on a separate target hard drive. The process may also capture multiple forensic images to a single hard drive.
Hard Drive Cloning
Cloning a hard drive during collection uses a target drive to make an exact duplicate (bit stream copy) of the original hard drive. This process is normally completed using hardware referred to as hard drive cloning equipment.
A primary difference between imaging and cloning is that the files in a forensic image can’t be accessed by common litigation support applications or electronic discovery software (such as LAW PreDiscovery, Discovery Cracker, and IPRO) or litigation support databases (such as Concordance, Summation, and Ringtail).
Forensic images are designed to be accessed by computer forensic software (such as EnCase, FTK, WinHex, and ProDiscover). If you need to access the original custodian information in a forensic image without using computer forensic software, then you will need to have it restored to a hard drive in the original native format. You could also look into purchasing the Mount Image Pro software (http://www.mountimage.com/purchase-forensic-software.php), which will allow you to view the contents of a forensic image without converting or restoring it to the native format.
Cost and Redundancy Considerations
If you want to compare the cost of different computer examiners, keep in mind that the lowest hourly rate doesn’t mean the lowest total price. An examiner using hardware-based cloning equipment can usually complete the process faster than using software to create a forensic image.
If you rely on a single forensic image or hard drive clone and find out later that there was a problem, you probably won’t have a second chance to preserve and collect the information. It’s well worth the additional cost to create a second backup of the source hard drive. When comparing examiner rates, you will need to compare the hourly and per-drive costs to determine the total price. Also, consider what you will be charged to restore a forensic image to a new drive, because this may have to be completed before the custodian files can be processed.
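A quick calculation shows why the lowest hourly rate does not guarantee the lowest total price. All of the figures below are hypothetical and chosen only to illustrate the arithmetic: total price = hourly rate x hours + (per-drive fee + restore fee) x number of drives.

```python
def total_price(hourly_rate, hours, per_drive_fee, drives, restore_fee_per_drive=0.0):
    """Total cost of a collection job from hourly and per-drive charges."""
    return hourly_rate * hours + (per_drive_fee + restore_fee_per_drive) * drives


# Hypothetical quotes for the same three-drive job.
# Examiner A: lower hourly rate, software imaging, images must be restored later.
examiner_a = total_price(hourly_rate=150, hours=12, per_drive_fee=0,
                         drives=3, restore_fee_per_drive=200)
# Examiner B: higher hourly rate, hardware cloning, no restore step needed.
examiner_b = total_price(hourly_rate=250, hours=4, per_drive_fee=100, drives=3)

print(examiner_a)  # 2400
print(examiner_b)  # 1300
```

In this made-up comparison, the examiner with the higher hourly rate finishes faster and avoids the restore charge, so the total comes out lower.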