Electronically Stored Information (ESI) self collection drives and kits have become popular in the last few years because they offer an affordable means of collecting electronic data for a legal matter without the need to hire expensive forensic experts. This article covers what should be included in an ESI collection drive kit as well as some tips to ensure the collections are completed properly.
ESI Self Collection Tips and Resources
Here are a few tips to help ensure a successful ESI self collection:
1) IT Assistance – Have someone on hand with knowledge of the products, how they work, and how to overcome any issues encountered. This could be an individual within the legal department, corporate IT, a forensic computer examiner, or a competent vendor.
2) Hard Drives – If the ESI self collection drive is being connected directly to a custodian PC or server, consider 2.5-inch enclosed external hard drives that are powered from a USB port. If collecting data across a network, a Network Attached Storage (NAS) device should be considered.
3) Software – Require these key features from active file collection software (like SafeCopy 2 or Harvester from Pinpoint Labs):
4) Evidence Bags – Tamper-proof evidence bags provide additional security and defensibility. The following antistatic bags from Packaging Horizons (http://www.alertsecurityproducts.com/antistaticsecuritybag/index.shtml) are designed for hard drives.
5) Paper Chain of Custody – Most firms are familiar with transferring evidence and have forms already created. Include this form with the drives used in an ESI collection kit.
Larger Collection Alternatives
Putting together ESI self collection kits can save money and eliminate delay and additional costs. Harvester from Pinpoint Labs is offered at a flat rate (you own it) or per collection.
Unease with ESI Self Collections
There has been some concern over custodian self collections. Relying on untrained employees to find and then properly collect the relevant data may present a defensibility problem. This problem can be reduced with the automation features of data collection software. These features minimize the number of human errors that can occur by minimizing the amount of employee interaction with the collection process.
What you should know
ESI self collections and kits are here to stay. They significantly reduce discovery costs, perform targeted collections, and are the modern equivalent of boxing up relevant files. However, it is critical to ensure that the process is defensible by preserving the original content, with the correct process, products, and procedures. For further assistance designing an ESI self collection kit for specific project needs, contact one of the project leaders at Pinpoint Labs.
A hash value is the result of a calculation (hash algorithm) that can be performed on a string of text, an electronic file, or an entire hard drive’s contents. The result is also referred to as a checksum, hash code, or hash. Hash values are used to identify and filter duplicate files (e.g., email, attachments, and loose files) from an ESI collection or to verify that a forensic image or clone was captured successfully.
Each hashing algorithm uses a specific number of bytes to store a “thumbprint” of the contents. Regardless of the amount of data fed into a specific hash algorithm or checksum, it will return the same number of characters. For example, an MD5 hash uses 32 hexadecimal characters for the thumbprint whether it’s a single character in a text file or an entire hard drive. The following is a list of hash values for the same text file.
HASH
MD5: 464668D58274A7840E264E8739884247
SHA-1: 4698215F643BECFF6C6F3D2BF447ACE0C067149E
SHA-256: F2ADD4D612E23C9B18B0166BBDE1DB839BFB8A376ED01E32FADB03A0D1B720C7
SHA-384: 2707F06FE57800134129D8E10BBE08E2FEB622B76537A7C4295802FBB94755BBEE814B101ED18CC2D0126BD66E5D77B6
SHA-512: C526BC709E2C771F9EC039C25965C91EAA3451A8CB43651EA4CD813F338235F495D37891DD25FE456FE2A8CA89457629378BE63FB3A9A5AD54D9E11E4272D60C
RIPEMD-128: A868B98EAEC84891A7B7BA620EDDE621
TIGER: F31A22CEED5848E69316649D4BAFBE8F9274DED53E25C02D
PANAMA: 7E703B1798A26A0AF21ECD661CBADB9C72B419455814CA7B82E29EE0C03FA493
CHECKSUM
CRC16: 117C
CRC32: FA2D47D4
ADLER32: CF7D65FF
As you can see, hash lengths also vary within a family (SHA-1, SHA-256, etc.). The most commonly used hash algorithms are MD5, SHA-1, and SHA-256. The longer hash values generally require more time to calculate and are designed to reduce the probability of a collision.
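The fixed-length behavior described above can be checked with a short sketch. It assumes Python's standard `hashlib` module; the helper name `digest_lengths` and the sample inputs are illustrative only and are not taken from any Pinpoint Labs product.

```python
import hashlib

def digest_lengths(data: bytes) -> dict:
    """Return the number of hex characters each algorithm produces for `data`."""
    names = ["md5", "sha1", "sha256", "sha384", "sha512"]
    return {name: len(hashlib.new(name, data).hexdigest()) for name in names}

# One byte of input versus roughly five megabytes of input.
small = digest_lengths(b"a")
large = digest_lengths(b"a" * 5_000_000)

print(small)
print(small == large)  # the thumbprint length does not depend on input size
```

The lengths differ between algorithms, but for any one algorithm they stay the same no matter how much data is hashed.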
A few other ways that hash values are used:
- Verify that a downloaded file matches the version released by the publisher (as opposed to a virus-infected version)
- Identify and filter files on the NSRL/NIST list (“deNISTing”)
- Locate known contraband (illegal images and videos)
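As a rough illustration of filtering against a known-file list, the sketch below removes any item whose SHA-1 value appears in a reference set. The file names, contents, and the tiny `known` set are all hypothetical stand-ins; a real deNISTing pass would load a much larger published reference list, and the choice of SHA-1 here is an assumption.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 of `data` as an uppercase hex string."""
    return hashlib.sha1(data).hexdigest().upper()

def filter_known(items: dict, known_hashes: set) -> dict:
    """Keep only the items whose SHA-1 is NOT in the known-hash set."""
    return {name: data for name, data in items.items()
            if sha1_hex(data) not in known_hashes}

# Hypothetical in-memory "files"; a real job would read them from disk.
collected = {
    "notepad.exe": b"pretend system binary",
    "contract.docx": b"custodian document",
}
# Hypothetical known-file list standing in for a published reference set.
known = {sha1_hex(b"pretend system binary")}

remaining = filter_known(collected, known)
print(sorted(remaining))
```

The same comparison, run in the other direction (keep only the matches), is how a list of known files can be located rather than removed.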
Here are a few reasons why hash values are so widely used as a means to validate and compare content:
1) Privileged Data – There would be obvious issues with storing and providing multiple copies of the contents of a company’s files or an entire hard drive’s data in a database to perform a byte comparison. Not to mention that illegal images and videos (child pornography) would have to be stored and used in each system scan. These scenarios are unacceptable.
2) Speed – Comparing an indexed hash value is much quicker than comparing what could be billions or trillions of bytes of source data. Optimized hash engines (Pinpoint Harvester) can compare thousands of hash values in a second.
3) Security – Hashing data is a one-way trip. The original data can’t be recreated or reverse engineered from the hash value. This provides additional assurance that a person can’t determine the source data from the hash.
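The speed point can be sketched with a simple hash index: once each item's hash is stored as a dictionary key, finding duplicates is a lookup rather than a byte-by-byte comparison. This is a minimal illustration using MD5 and made-up item names, not a description of how any particular hash engine is built.

```python
import hashlib
from collections import defaultdict

def build_hash_index(items):
    """Map each MD5 value to the list of item names that share it."""
    index = defaultdict(list)
    for name, data in items:
        index[hashlib.md5(data).hexdigest()].append(name)
    return index

# Hypothetical items: two copies of the same content and one distinct file.
items = [
    ("inbox/report.pdf", b"quarterly report"),
    ("sent/report_copy.pdf", b"quarterly report"),
    ("notes.txt", b"call the client"),
]

index = build_hash_index(items)
unique = [names[0] for names in index.values()]           # first copy of each
duplicates = [names[1:] for names in index.values() if len(names) > 1]

print(unique)
print(duplicates)
```

Only the hash values are held in the index, so the source content never needs to be stored alongside it.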
The argument that different data sources could have the same hash value has raised a lot of concern. There are countless threads related to this issue on the litigation support and computer forensic forums. The bottom line is that the only way to do an exact comparison of the original data is to store it everywhere you need to deduplicate or verify the information. However, as mentioned above, this isn’t a practical alternative.
More complex hashing functions have been introduced (SHA-256, SHA-512, etc.), which further reduce the likelihood of a collision. It is also worth noting that even in those cases where researchers have created collisions, it was the result of exploiting weaknesses in a specific hash algorithm. The same alterations would not create a collision in a different hashing algorithm.
So, if you still aren’t satisfied with the incredibly remote possibility that a collision could happen using a single hash value, then the easiest way to implement an extra precaution is to have your processes calculate hash values from two separate algorithms (e.g., MD5 and SHA-256) for each item. Unfortunately, most EED applications and forensic imaging tools don’t support this option, especially in a single pass.
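Calculating two hash values in a single pass can be sketched as follows, assuming Python's `hashlib`. Each chunk is read once and fed to both algorithms; the function name `dual_hash` and the chunk size are illustrative choices.

```python
import hashlib
import io

def dual_hash(stream, chunk_size=1024 * 1024):
    """Read the stream once and return (MD5, SHA-256) hex digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The same chunk updates both digests, so the data is read only once.
        md5.update(chunk)
        sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

# An in-memory stream stands in for a file opened in binary mode.
md5_hex, sha256_hex = dual_hash(io.BytesIO(b"abc"))
print(md5_hex)
print(sha256_hex)
```

Because reading the data is usually the slow part, adding the second algorithm this way costs far less than hashing the item twice.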
What to Remember
Hash values are a reliable, fast, and secure way to compare the contents of individual files and media. Whether it’s a single text file containing a phone number or five terabytes of data on a server, calculating hash values is an invaluable process for deduplication and evidence verification in electronic discovery and computer forensics.
As our electronically stored information (ESI) data universe continues to grow, we are hearing about increasing storage capacities. The size of a project in terabytes (TB – 1024 Gigabytes) comes up frequently and is often the amount of data that has to be collected, culled or processed on a corporate server. However, now you can purchase a 1TB drive that will fit in a laptop computer.
Have you heard of a job that will reach or exceed a petabyte? If not, you most likely will in the near future and the following will help if you aren’t familiar with the larger capacities.
Equivalent Storage in Terabytes
Petabyte = 1,024 TB
Exabyte = 1,048,576 TB
Zettabyte = 1,073,741,824 TB
Yottabyte = 1,099,511,627,776 TB
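The figures above follow from each unit being 1,024 times the one before it. A minimal sketch that reproduces them (the `UNITS` list and function name are just for illustration):

```python
UNITS = ["Terabyte", "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

def size_in_terabytes(unit: str) -> int:
    """Each step up the list multiplies the capacity by 1,024."""
    return 1024 ** UNITS.index(unit)

for unit in UNITS[1:]:
    print(f"{unit} = {size_in_terabytes(unit):,} TB")
```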
As the size of electronic data at client sites increases so will the need for refined, targeted ESI collections. Many litigation support and computer forensic professionals have encountered collection jobs that are several terabytes and are provided keyword search terms and other criteria to help identify relevant data and decrease the amount being collected, processed and hosted.
ESI (Electronically Stored Information) is the general term for all of the data stored on the hard drives, camera cards, cell phones, GPS devices, digital video recorders, digital answering systems, thumb drives, RAID arrays and any other form of electronic media capable of storing data.
Types of Electronically Stored Information:
Files – Files are by far the most common arrangement for ESI data. Files (also referred to as loose files or active files) can be thought of as data containers similar to files in the real world. They can be copied, moved, and distributed freely on a variety of different media from DVDs to hard disk drives.
Emails - Emails are messages sent from one user to another. In their raw form, they are simply a stream of data that contains everything needed to get the message from one user to another user. Since emails are a form of documented communication, they comprise highly sought-after data when it comes to legal matters. Emails themselves may be contained in databases, files, or unallocated space.
Database Entries - Database entries are data stored in a database. This type of data is usually context-specific and may be information pertaining to financial records, personnel entries, or other data that is interrelated. Single entries in a database require export to another format in order to be useful or even readable by humans. Most databases include this ability.
Log Entries – Log entries are lines in files or entries in databases that contain information about activity on a particular computer. The more commonly useful log entries pertain to users logging into and out of a computer, accessing specific internet sites, the sending or receiving of email or other messages and the moving, copying or accessing of files on the computer. Log entries may require conversion into human-readable form before they can be processed.
Raw or Unallocated Data - Raw or unallocated data is data that resides in segments of the storage media (hard drive, camera card, etc) that are not being used by files. This data can contain all or part of files that were once referenced in the file allocation table but were subsequently deleted. It can also contain deleted internet history, old information from the computer’s RAM (Random Access Memory) or even old configuration data about the computer itself. Much of this data can even survive a reformatting of the disk itself. Since this data can come from any number of sources that had once been active on the drive, it can make or break a case where it is suspected that deletions may have occurred.
Tools for Collecting ESI
With the exception of unallocated space, tools such as One Click Collect Harvester from Pinpoint Labs have the ability to collect loose files, emails and whole databases with the added benefits of being able to specify key words, date ranges, domains and email addresses among other very useful filters.
Tools for collecting the unallocated space on a drive usually require an experienced forensic examiner in order to get useful interpretations of the data collected. In cases where this is necessary, it is recommended that a certified examiner be hired for the collection and analysis of the data.
Active File Collection refers to the collection of files that are active (not deleted) and pertain to a legal matter or legal hold. In most civil litigation cases, extensive forensic investigations that look at deleted files are unnecessary or too expensive. Thus, most ESI collections are active file collections and/or email collections.
How active file collections are performed
Active files are those that can be seen by normal users. They may include hidden or system files, but they do not include the computer’s Random Access Memory or any deleted files. Files in the Windows Recycle Bin are considered active files and are subject to collection using active file collection methods.
The first step is defining which files need to be collected. This definition can range from “everything” to files of a few specific types containing only certain key words. Since the cost of processing is usually related to the size of the data being processed, it is generally more economical to be as specific as possible without leaving out relevant files.
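A collection definition of this kind can be sketched as a filter over file extension and last-modified date. The example below is only an illustration: the function name, the sample file names, and the date range are made up, and it builds its own temporary folder so that it can run anywhere.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path

def select_files(root, extensions, start, end):
    """Yield files under `root` that match an extension and a modified-date range."""
    wanted = {ext.lower() for ext in extensions}
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if start <= modified <= end:
            yield path

with tempfile.TemporaryDirectory() as root:
    # Hypothetical custodian files with hand-set modified dates.
    samples = {
        "memo.docx": datetime(2009, 6, 15),   # right type, in range
        "old.docx": datetime(2005, 1, 1),     # right type, too old
        "photo.jpg": datetime(2009, 6, 15),   # in range, wrong type
    }
    for name, when in samples.items():
        path = Path(root, name)
        path.write_bytes(b"sample")
        os.utime(path, (when.timestamp(), when.timestamp()))

    selected = sorted(
        p.name for p in select_files(
            root, [".docx", ".xlsx"], datetime(2009, 1, 1), datetime(2009, 12, 31)
        )
    )

print(selected)
```

Keyword criteria would add a further test on each file's contents; that step is left out here because it depends on the file formats involved.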
Once the files have been identified, it is mostly a matter of copying them in a manner that both avoids spoliation and provides a means of certifying the contents of the copies.
What to remember
The one thing to remember about active file collections is that they can be a potential minefield of spoliation. To avoid this, use software that is designed to preserve the metadata, the timestamps, and the data within the copied files. Some products, such as SafeCopy 2 from Pinpoint Labs, are designed specifically for this purpose. Others, like Harvester, also from Pinpoint Labs, offer this feature as well as the ability to cull data by keyword search, and also support deduplication, email, and deNISTing.
The most important aspects of active file collections are preservation and validation.
Preservation refers to the preservation of the file data, its timestamps (when the file was created, last modified, and last accessed), and any other metadata contained within the file. If any of this data is compromised, the usefulness and admissibility of the file comes into question.
Validation refers to the ability to certify that the contents of the copy are the same as the contents of the original. This is usually done using a hash (analogous to a fingerprint of the file’s data). It may also be done using a bitwise comparison of the data in both the file and the copy, but since this method requires the same amount of storage as the files themselves and offers no means of independent verification, it is not in common use.
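Preservation and validation together can be sketched as "copy, carry over the timestamps, then compare hashes." The example below uses Python's `shutil.copy2` and `hashlib` purely as an illustration. It is not how SafeCopy 2 or Harvester work internally. Note also that `copy2` does not carry over the creation time on every platform, and that reading the source file may update its last-accessed time, which purpose-built collection tools are designed to avoid.

```python
import hashlib
import os
import shutil
import tempfile

def file_hash(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(source, destination):
    """Copy a file with its timestamps, then confirm the hashes match."""
    source_hash = file_hash(source)
    shutil.copy2(source, destination)  # copies data plus modified/accessed times
    return source_hash == file_hash(destination), source_hash

with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "original.txt")
    dst = os.path.join(folder, "copy.txt")
    with open(src, "wb") as handle:
        handle.write(b"555-0100")
    os.utime(src, (1_000_000_000, 1_000_000_000))  # hand-set timestamps for the demo

    verified, digest = copy_and_verify(src, dst)
    same_mtime = int(os.stat(src).st_mtime) == int(os.stat(dst).st_mtime)

print(verified, same_mtime, digest)
```

Recording the hash at collection time is what allows someone else to verify the copy later without access to the original.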
Each day, corporate IT managers, computer forensic examiners, and litigation support professionals are tasked with performing ESI collections for relevant files which reside in file shares, on client systems, and other popular data sources. The content may include Microsoft Exchange mailboxes, departmental data, individual custodian files, internet logs, telephone logs, or other critical corporate content.
Over 4 years ago, Pinpoint Labs released SafeCopy version 2.0 (SafeCopy 2) which alleviated several common problems encountered when using alternative copy utilities to collect client files. Here are a few of those problems that the SafeCopy 2 upgrade addressed:
In September 2009, Pinpoint Labs released One Click Collect – Harvester (Portable/Server), which was a new product that included the proven SafeCopy 2 engine. The Pinpoint Harvester 2.0 ESI collection software includes:
- Great for Legal Holds
- Preserve Metadata and Time Stamps
- Filter by Extension and Date Range
- Select from multiple data sources
- Compatible with all electronic and litigation platforms
- 100% File copy verification
- Extensive chain of custody report
- Process file lists
- Resume easily
- Supports path lengths greater than 255 characters
- Transfer licenses quickly to another location
- Create and deploy remote collections
- Keyword Filter MS Outlook PSTs
- Keyword Filter Loose Files
- Keyword Filter Attachments
- Keyword Filter Archives
- Dedupe and Filter Multiple PSTs
- Regenerate New PSTs
- Export Emails to 8 Different Message Formats
- Remove System Files Listed in NSRL (deNISTing)
- Filter by Header Signature
- Create Portable and Automated Collection Jobs
- Preconfigured Work Orders
- Can Be Used for In-House, Production-Level Culling (deNIST/dedupe)
- Scriptable Profiles and Collection Jobs
- Easily Save and Reuse Job Settings
Pinpoint Labs has a proven record of developing defensible, affordable ESI collection software. Many Fortune 500 companies, government agencies, and computer forensic professionals rely on SafeCopy 2 and One Click Collect – Harvester every day.
‘Imaging a hard drive’ is a phrase that is commonly used for preserving the contents of a custodian hard drive or server. It can also be used to describe when a custodian hard drive is cloned. It is worth taking some time to understand the differences and the advantages and disadvantages of each process.
Forensic Imaging
A forensic image or evidence file container (such as EnCase, DD, Expert Witness, and SMART) is often created using software that is running on a computer forensic examiner’s laptop or lab computer. The examiner will connect the drive to a write blocker and use software to create a forensic image of the entire contents of the source drive on a separate target hard drive. The process may also capture multiple forensic images to a single hard drive.
Hard Drive Cloning
Cloning a hard drive during collection uses a target drive to make an exact duplicate (bit stream copy) of the original hard drive. This process is normally completed using hardware referred to as hard drive cloning equipment.
A primary difference between imaging and cloning is that the files in a forensic image can’t be accessed by common litigation support applications or electronic discovery software (such as LAW PreDiscovery, Discovery Cracker, and IPRO) or litigation support databases (such as Concordance, Summation, and Ringtail).
Forensic images are designed to be accessed by computer forensic software (such as EnCase, FTK, WinHex, and ProDiscover). If you need to access the original custodian information in a forensic image without using computer forensic software, then you will need to have it restored to a hard drive in the original native format. You could also look into purchasing the Mount Image Pro software (http://www.mountimage.com/purchase-forensic-software.php), which will allow you to view the contents of a forensic image without converting or restoring it to the native format.
Cost and Redundancy Considerations
If you want to compare the cost of different computer examiners, keep in mind that the lowest hourly rate doesn’t mean the lowest total price. An examiner using hardware-based cloning equipment can usually complete the process faster than using software to create a forensic image.
If you rely on a single forensic image or hard drive clone and find out later that there was a problem, you probably won’t have a second chance to preserve and collect the information. It’s well worth the additional cost to create a second backup of the source hard drive. When comparing examiner rates, you will need to compare the hourly and per-drive costs to determine the total price. Also, consider what you will be charged to restore a forensic image to a new drive, because this may have to be completed before the custodian files can be processed.
Copying corporate data and using it at a competing company (intellectual property/corporate asset theft) is a common and serious concern for companies and their legal counsel. When employees leave companies, there are often questions about the security of the information they previously accessed. Will they use the contacts, forms, or product details as a competitive advantage in their new job?
I had previously written about how to use the file activity records located in the index.dat file to identify when files were accessed. This can help determine if files were copied from a corporate file server. I want to expand on a couple of additional artifacts that can be used and then provide an illustration. There are three primary artifacts, along with a hash-based scanning technique, that can be used to help determine if someone accessed and copied specific files using an external drive, CD/DVD, flash device, or other storage media.
1) USBStor Registry Entry – Microsoft Windows uses its registry to track information about the computer’s users, operating system, hardware, applications, security, and other relevant information. When USB devices are plugged into a computer, several key artifacts are captured including the make, model, serial number (if available), and when the device was plugged in.
2) Index.dat Access Record – Microsoft Windows uses the index.dat file to track website activity in Internet Explorer. It also contains when and from where files were accessed. We often have to recover deleted or purged activity using programs like NetAnalysis to do a thorough analysis. NetAnalysis can often recover hundreds of thousands of records that are no longer available in the index.dat files on the system.
3) Link File (.lnk shortcut) – Shortcuts can be created by a user and are commonly stored on the desktop. Microsoft Windows also automatically creates shortcuts for files that are accessed in .lnk files. These files store a wealth of information about the source document, including the path, date and time created, written, last accessed, size, volume serial, and several others. This information is encoded and requires special software to display it in a format that is useful.
4) “File Sniper” - Use a product like Harvester from Pinpoint Labs to create a hash list of the suspect files and scan all locations where the files could be in use. It isn’t uncommon for a computer forensic examiner to be asked if there is a way to create a list of files from a corporate network or an employee’s system and check if they are in use by a competitor.
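To give a sense of what "encoded" means for link files, the sketch below reads the target file's timestamps and size from the fixed 76-byte header at the start of a .lnk file, following the published Shell Link Binary File Format layout. It is a partial illustration only: it does not check the class identifier, and the path, volume serial, and other details live in later sections of the file that are not parsed here. The sample header is built by hand rather than taken from a real shortcut.

```python
import struct
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a Windows FILETIME (100-nanosecond ticks since 1601) to UTC."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

def parse_lnk_header(data: bytes) -> dict:
    """Read the target's timestamps and size from a shell link header."""
    if len(data) < 76 or struct.unpack_from("<I", data, 0)[0] != 0x4C:
        raise ValueError("not a shell link header")
    # Creation, access, and write times sit at offsets 28, 36, and 44;
    # the 32-bit target file size follows at offset 52.
    created, accessed, written, size = struct.unpack_from("<QQQI", data, 28)
    return {
        "created": filetime_to_datetime(created),
        "accessed": filetime_to_datetime(accessed),
        "written": filetime_to_datetime(written),
        "size": size,
    }

# Hand-built 76-byte header for illustration; the 16 zero bytes are a
# placeholder where a real file carries the shell link class identifier.
UNIX_EPOCH_FILETIME = 116444736000000000  # 1970-01-01 00:00:00 UTC
sample = struct.pack(
    "<I16sIIQQQIIIHHII",
    0x4C, bytes(16), 0, 0x20,
    UNIX_EPOCH_FILETIME, UNIX_EPOCH_FILETIME, UNIX_EPOCH_FILETIME,
    1234, 0, 1, 0, 0, 0, 0,
)

info = parse_lnk_header(sample)
print(info)
```

Dedicated forensic tools decode the rest of the structure and handle damaged or recovered link files, which this sketch does not attempt.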
By using the above artifacts, it is possible to determine that files located on a company server or client machine were copied or accessed after a specific date and time. Note that these artifacts don’t provide the contents of the file, and a thorough review would be necessary to make sure it is the same file. However, if the file name and other relevant metadata match, it does appear suspicious and may be enough to construct a solid argument that the employee copied or burned files, accessed the contents, or used the information. This may lead to criminal and civil charges if the information was used to benefit a future employer or a new company that the employee decided to start.
Changes are underway in how electronically stored information (ESI) is processed and reviewed. These changes are due to the huge size of repositories – hundreds of gigabytes or multiple terabyte sizes – identified for collection and processing. Corporations and their legal counsel realize that it may not be feasible or affordable to collect and produce all the information identified in larger cases.
Several new software applications have been introduced that offer many of the same features included in popular electronic discovery software (indexing, file and email “de-duplication”, online review, searching and culling). The difference is they are designed to run as an “appliance” application on a corporate network.
What does this mean? It means collections that would have been sent out to a processing vendor are now being deduped, filtered, and produced internally. The culled native files may still be sent out for tiffing, endorsing, and building load files. But it is a significantly reduced subset.
However, this is not to say that outsourcing will cease. But in the years ahead, there will probably be a reduction in the amount of EED/ESI processing that is outsourced. Additionally, once a corporation has invested in the “appliance” software and training to collect, filter and produce their collections, they will probably use it on smaller cases as well that were previously outsourced.
Systems that require a computer forensic investigation, or need to be collected by a third party, will still require individuals with the appropriate skills and credentials to image or clone media, and then analyze the contents as we do now. However, an increasing amount of electronic discovery processing will be performed at the client site with automated assistance to save time and money and to handle larger projects.
You have requested a hard drive clone or image and discover that the contents cannot be culled or reviewed. One reason may be hard drive encryption. Encryption involves “scrambling” the contents of a file or hard drive so that they cannot be viewed without the appropriate key or password.
To secure data, companies and individuals are increasingly encrypting the contents of their hard drives or USB flash drives. Manufacturers are also building hard drives that automatically encrypt the contents. BitLocker encryption, for example, is available in Windows Vista. Hard drive encryption often requires a “live” acquisition, which takes place when the system is running and the decrypted contents of the drive can be accessed and copied. Employing best practices for handling hard drive encryption is important and will become more so in the months and years to come as encrypted hard drives become more common. Here are a few pointers:
Encrypted hard drives pose a challenge and potential delays for both computer investigations and electronic discovery processing. Work with a vendor who is capable of handling encrypted hard drive collections.
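One quick check that can flag an encrypted volume before processing begins is to look at the first sector of the volume. The sketch below assumes the commonly documented BitLocker volume signature, the text `-FVE-FS-` stored in the 8-byte field at offset 3 of the first sector. It covers only that one case: other encryption products use different markers, the sector must come from the start of the volume rather than the start of the whole disk, and a missing signature does not prove the data is unencrypted. The sample sectors are built by hand for illustration.

```python
BITLOCKER_SIGNATURE = b"-FVE-FS-"

def looks_like_bitlocker(first_sector: bytes) -> bool:
    """Check the 8-byte field at offset 3 of a volume's first sector."""
    return first_sector[3:11] == BITLOCKER_SIGNATURE

# Hypothetical 512-byte first sectors; only the leading bytes matter here.
encrypted_sector = b"\xeb\x58\x90" + BITLOCKER_SIGNATURE + bytes(501)
ntfs_sector = b"\xeb\x52\x90" + b"NTFS    " + bytes(501)

print(looks_like_bitlocker(encrypted_sector))
print(looks_like_bitlocker(ntfs_sector))
```

A positive result is a prompt to arrange for the key, password, or a live acquisition, not a substitute for an examiner's review.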