Preservation

ESI Self Collection Drives and Kits

Electronically Stored Information (ESI) self collection drives and kits have become popular in the last few years because they offer an affordable means of collecting electronic data for a legal matter without the need to hire expensive forensic experts. This article covers what should be included in an ESI collection drive kit as well as some tips to ensure the collections are completed properly.

ESI Self Collection Tips and Resources

Here are a few tips to help ensure a successful ESI self collection:

1) IT Assistance – Have someone on hand who knows the products, how they work, and how to overcome any issues encountered. This could be an individual in the legal department, corporate IT, a forensic computer examiner, or a competent vendor.

2) Hard Drives – If the ESI self collection drive is being connected directly to a custodian PC or server, consider the 2.5-inch enclosed external hard drives that are powered from a USB port. If collecting data across a network, consider a Network Attached Storage (NAS) device.

3) Software – Require these key features from active file collection software (like SafeCopy 2 or Harvester from Pinpoint Labs):

  1. Preserves file timestamps and metadata – Using Windows Explorer to “drag and drop” files does not preserve critical metadata or confirm that the contents were copied exactly.
  2. Creates electronic chain of custody – Reports detailing what happened, the source and destination hash values, MAC times, where files were copied from and to, and the results form the audit trail required for defensibility.
  3. Hash verifies files – File hashes of the source and destination are verifiable proof of a valid copy.
  4. No local installation – Ideally the software should run from an external device or from the network without installing anything on the host computer.
  5. Automated job tickets – Human involvement opens the risk of human error. Products like Harvester from Pinpoint Labs include features to automate the process with predefined work tickets.
  6. Filtering (Optional) – Filtering at the point of collection reduces the cost of processing the collected data. Some of the filters that can be applied at the point of collection are file types/headers, date ranges, folder names, key words, deduplication, and deNISTing.
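
The first three features in this list (preserving timestamps, recording a chain of custody, and hash verification) can be illustrated with a short Python sketch. This is not how SafeCopy 2 or Harvester work internally. It is a minimal example that assumes only Python's standard library, and the names sha256_of and verified_copy are made up for illustration. Be aware that shutil.copy2 only attempts to preserve metadata and may not carry over every timestamp on every platform, which is one reason purpose-built collection software is recommended.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source, destination):
    """Copy one file, keep its modified time, and return an audit record."""
    source_hash = sha256_of(source)
    source_stat = os.stat(source)
    # copy2 copies the contents and then tries to copy the file times as well.
    shutil.copy2(source, destination)
    destination_hash = sha256_of(destination)
    return {
        "source": source,
        "destination": destination,
        "source_sha256": source_hash,
        "destination_sha256": destination_hash,
        "source_modified": source_stat.st_mtime,
        "verified": source_hash == destination_hash,
    }
```

Each returned record could be written to a log file to serve as one line of an electronic chain of custody report.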

4) Evidence Bags – Tamper-proof evidence bags provide additional security and defensibility. The following antistatic bags from Packaging Horizons (http://www.alertsecurityproducts.com/antistaticsecuritybag/index.shtml) are designed for hard drives.

5) Paper Chain of Custody – Most firms are familiar with transferring evidence and have forms already created. Include this form with the drives used in an ESI collection kit.

Larger Collection Alternatives

Putting together ESI self collection kits can save money and eliminate delay and additional costs. Harvester from Pinpoint Labs is offered at a flat rate (you own it) or per collection.

Unease with ESI Self Collections

There has been some concern over custodian self collections. Relying on untrained employees to find and then properly collect the relevant data may present a defensibility problem. This problem is overcome easily with the automation features of data collection software. These features reduce human error by minimizing the amount of employee interaction with the collection process.

What you should know

ESI self collections and kits are here to stay. They significantly reduce discovery costs, perform targeted collections, and are the modern equivalent of boxing up relevant files. However, it is critical to ensure that the process is defensible by preserving the original content with the correct process, products, and procedures. For further assistance designing an ESI self collection kit for specific project needs, contact one of the project leaders at Pinpoint Labs.

What is a Hash Value?

A hash value is the result of a calculation (hash algorithm) that can be performed on a string of text, an electronic file or an entire hard drive’s contents. The result is also referred to as a checksum, hash code or hash. Hash values are used to identify and filter duplicate files (e.g. email, attachments, and loose files) from an ESI collection or to verify that a forensic image or clone was captured successfully.

Each hashing algorithm uses a specific number of bytes to store a “thumbprint” of the contents. Regardless of the amount of data fed into a specific hash algorithm or checksum, it will return the same number of characters. For example, an MD5 hash uses 32 hexadecimal characters for the thumbprint whether the input is a single character in a text file or an entire hard drive. The following is a list of hash values for the same text file.

HASH

MD5: 464668D58274A7840E264E8739884247

SHA-1: 4698215F643BECFF6C6F3D2BF447ACE0C067149E

SHA-256: F2ADD4D612E23C9B18B0166BBDE1DB839BFB8A376ED01E32FADB03A0D1B720C7

SHA-384: 2707F06FE57800134129D8E10BBE08E2FEB622B76537A7C4295802FBB94755BBEE814B101ED18CC2D0126BD66E5D77B6

SHA-512: C526BC709E2C771F9EC039C25965C91EAA3451A8CB43651EA4CD813F338235F495D37891DD25FE456FE2A8CA89457629378BE63FB3A9A5AD54D9E11E4272D60C

RIPEMD-128: A868B98EAEC84891A7B7BA620EDDE621

TIGER: F31A22CEED5848E69316649D4BAFBE8F9274DED53E25C02D

PANAMA: 7E703B1798A26A0AF21ECD661CBADB9C72B419455814CA7B82E29EE0C03FA493

CHECKSUM

CRC16: 117C

CRC32: FA2D47D4

ADLER32: CF7D65FF

As you can see, there are also hashes of various lengths within a family (SHA-1, SHA-256, etc.). The most common hash algorithms are MD5, SHA-1 and SHA-256. The longer hash values require more time to calculate and are designed to reduce the probability of a collision.
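
The fixed-length behavior described above is easy to confirm. The following is a minimal sketch, assuming Python's standard hashlib module; the function name digest_lengths is only illustrative. It hashes whatever bytes it is given with five common algorithms and reports how many hexadecimal characters each result contains.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256", "sha384", "sha512")


def digest_lengths(data):
    """Return the number of hexadecimal characters each algorithm produces."""
    return {name: len(hashlib.new(name, data).hexdigest()) for name in ALGORITHMS}


one_character = digest_lengths(b"a")
one_megabyte = digest_lengths(b"a" * 1_000_000)
# The lengths depend on the algorithm, not on how much data was hashed.
print(one_character == one_megabyte)
```

An MD5 result is 32 characters in both cases, and a SHA-512 result is 128 characters in both cases.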

A few other ways that hash values are used:

-  Verify a downloaded file was created by the publisher (as opposed to a virus-infected version)

-   Identify and filter files on the NSRL/NIST list (“deNISTing”)

-   Locate known contraband (illegal images and videos)

Here are a few reasons why hash values are so widely used as a means to validate and compare content:

1)  Privileged Data – There would be obvious issues with storing and providing multiple copies of the contents of a company’s files, or an entire hard drive’s data, in a database to perform a byte comparison. In addition, illegal images and videos (child pornography) would have to be stored and used in each system scan. These scenarios are unacceptable.

2)  Speed – Comparing an indexed hash value is much quicker than comparing what could be billions or trillions of bytes of source data. Optimized hash engines (Pinpoint Harvester) can compare thousands of hash values in a second.

3)  Security – Hashing data is a one-way trip. The original data can’t be recreated or reverse engineered from the hash value. This provides additional security because a person can’t determine the source data from the hash.
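
To show why comparing hash values is both fast and private, here is a minimal deduplication sketch in Python. It is an illustration only, not the method used by any particular product, and the function name unique_items is hypothetical. Only hash values are kept in the lookup set, so the original contents never have to be stored for the comparison.

```python
import hashlib


def unique_items(items):
    """Return the names of items whose content has not been seen before.

    items is an iterable of (name, content_bytes) pairs.
    """
    seen_hashes = set()
    kept = []
    for name, content in items:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_hashes:
            continue  # same content as an earlier item, so skip it
        seen_hashes.add(digest)
        kept.append(name)
    return kept


sample = [
    ("memo.txt", b"Quarterly numbers attached."),
    ("memo_copy.txt", b"Quarterly numbers attached."),
    ("notes.txt", b"Call the client on Friday."),
]
print(unique_items(sample))
```

The same set lookup works for filtering against a list of known hash values, such as the NSRL/NIST list, by loading those values into the set before scanning.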

The argument that different data sources could have the same hash value has raised a lot of concern. There are countless threads related to this issue on the litigation support and computer forensic forums. The bottom line is that the only way to do an exact comparison of the original data is to store it everywhere you need to deduplicate or verify the information. However, as mentioned above, this isn’t a practical alternative.

More complex hashing functions have been introduced (SHA-256, SHA-512, etc.) which further reduce the likelihood of a collision. It is also worth noting that even in those cases where scientists have created collisions, it was a result of exploiting the weaknesses in a specific hash algorithm. The same alterations would not create a collision in a different hashing algorithm.

So, if you still aren’t satisfied with the incredibly remote possibility that a collision could happen using a single hash value, the easiest way to implement an extra precaution is to have your processes calculate hash values from two separate algorithms (e.g. MD5 and SHA-256) for each item. Unfortunately, most EED applications and forensic imaging tools don’t support this option, especially in a single pass.
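
Calculating two hash values in a single pass is straightforward in a script. The sketch below assumes Python's standard hashlib module, and the function name dual_hash is made up for this example. Each chunk of the file is read from disk once and fed to both algorithms.

```python
import hashlib


def dual_hash(path, chunk_size=1024 * 1024):
    """Read a file once and return its (MD5, SHA-256) hash values."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # The same bytes feed both algorithms, so the file is read only once.
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()
```

Two files would then be treated as identical only when both values match.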

What to Remember

Hash values are a reliable, fast, and secure way to compare the contents of individual files and media. Whether it’s a single text file containing a phone number or five terabytes of data on a server, calculating hash values is an invaluable process for deduplication and evidence verification in electronic discovery and computer forensics.

How Much is a Petabyte, Exabyte, or Zettabyte?

As our electronically stored information (ESI) data universe continues to grow, we are hearing about increasing storage capacities. The size of a project in terabytes (TB – 1024 Gigabytes) comes up frequently and is often the amount of data that has to be collected, culled or processed on a corporate server. However, now you can purchase a 1TB drive that will fit in a laptop computer.

Have you heard of a job that will reach or exceed a petabyte? If not, you most likely will in the near future and the following will help if you aren’t familiar with the larger capacities.

Equivalent Storage in Terabytes

Petabyte = 1,024 TB

Exabyte = 1,048,576 TB

Zettabyte = 1,073,741,824 TB

Yottabyte = 1,099,511,627,776 TB
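
Each unit in this list is 1,024 times the one before it, so the figures can be reproduced with a few lines of Python. This is a small sketch; the names UNITS and terabytes_in are only illustrative.

```python
# Each unit is 1,024 times larger than the one before it.
UNITS = ["Terabyte", "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def terabytes_in(unit):
    """Return how many terabytes make up one of the named unit."""
    return 1024 ** UNITS.index(unit)


for unit in UNITS[1:]:
    print(f"{unit} = {terabytes_in(unit):,} TB")
```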

As the size of electronic data at client sites increases so will the need for refined, targeted ESI collections. Many litigation support and computer forensic professionals have encountered collection jobs that are several terabytes and are provided keyword search terms and other criteria to help identify relevant data and decrease the amount being collected, processed and hosted.

Email Collection

Email Collection refers to the identification and isolation of electronic mail (email) messages that pertain to a specific legal matter in civil litigation cases.

What gets collected

What is actually being collected during email collections can be one of two things:

1. Files representing the contents of the transmitted email messages themselves (usually in MSG, HTML, EML or RTF format).

2. Container (or store) files that hold the contents and data associated with multiple email messages, usually all of the emails for a specific custodian.

Whether files for individual emails or container files are collected depends mostly on the type of email system being used by the custodian. If the custodian is a user of Microsoft Outlook, for instance, then either container files or individual email files may be produced. If the custodian is a user of a webmail service, such as Gmail or Yahoo!, then it is likely only individual email files can be collected.

How it’s done

Software such as Harvester from Pinpoint Labs can search the PST store files produced by Microsoft Outlook and Exchange email systems for individual emails containing specific criteria, such as who sent the email, who received it, when these actions occurred and whether the subject, body, or attachments contain specified key words. It can also produce the result to either individual email files or whole, reconstructed container files, known as PST regeneration.

With other email systems, either the whole container file can be copied and sorted through manually, or the individual emails can be manually identified and exported as individual email files.
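
The kind of criteria described above (sender, date range, key words) can be sketched in Python for individual EML messages. This is an assumption-laden illustration, not how Harvester works: Python's standard library cannot read PST files, the function name email_matches is hypothetical, the message is assumed to have a Date header, and only the subject and plain text body are searched (attachments are ignored).

```python
from datetime import datetime, timezone
from email import message_from_string, policy
from email.utils import parseaddr, parsedate_to_datetime


def email_matches(raw_message, sender=None, keywords=(), start=None, end=None):
    """Return True when one EML message meets every criterion supplied."""
    message = message_from_string(raw_message, policy=policy.default)

    if sender is not None:
        address = parseaddr(str(message["From"]))[1].lower()
        if address != sender.lower():
            return False

    if start is not None or end is not None:
        sent = parsedate_to_datetime(str(message["Date"]))
        if start is not None and sent < start:
            return False
        if end is not None and sent > end:
            return False

    if keywords:
        text = str(message["Subject"] or "")
        body = message.get_body(preferencelist=("plain",))
        if body is not None:
            text += "\n" + body.get_content()
        text = text.lower()
        if not any(word.lower() in text for word in keywords):
            return False
    return True


# Hypothetical sample message used only to demonstrate the filter.
SAMPLE = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com\n"
    "Subject: Contract draft\n"
    "Date: Mon, 02 Mar 2009 10:00:00 +0000\n"
    "\n"
    "Please review the merger terms.\n"
)
year_start = datetime(2009, 1, 1, tzinfo=timezone.utc)
year_end = datetime(2009, 12, 31, tzinfo=timezone.utc)
print(email_matches(SAMPLE, keywords=["merger"], start=year_start, end=year_end))
```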

What to remember

As with any data being collected, the two concepts to remember are preservation and validation.

Preservation refers to keeping the metadata about the individual messages as well as the metadata contained within each of the messages intact so as to maintain their admissibility. PST regeneration is especially desirable in this case because it maintains both the email data and the data that linked it to contact data, task list data and other data integrated with these types of email messages.

Validation refers to the policy of ensuring, either by hash value comparison (analogous to fingerprints for data) or bitwise comparison, that the contents of the copy are the same as the contents of the original.

Software such as Harvester and SafeCopy 2, both from Pinpoint Labs, has built-in preservation and validation systems to certify that both of these conditions are met.

What is ESI (Electronically Stored Information)?

ESI (Electronically Stored Information) is the general term for all of the data stored on hard drives, camera cards, cell phones, GPS devices, digital video recorders, digital answering systems, thumb drives, RAID arrays and any other form of electronic media capable of storing data.

Types of Electronically Stored Information:

Files – Files are by far the most common arrangement for ESI data. Files (also referred to as loose files or active files) can be thought of as data containers similar to files in the real world. They can be copied, moved, and distributed freely on a variety of different media from DVDs to hard disk drives.

Emails - Emails are messages sent from one user to another. In their raw form, they are simply a stream of data that contains everything needed to get the message from one user to another user. Since emails are a form of documented communication, they comprise highly sought-after data when it comes to legal matters. Emails themselves may be contained in databases, files, or unallocated space.

Database Entries - Database entries are data stored in a database. This type of data is usually context-specific and may be information pertaining to financial records, personnel entries or other data that is interrelated. Single entries in a database require export to another format in order to be useful or even readable by humans. Most databases include this ability.

Log Entries – Log entries are lines in files or entries in databases that contain information about activity on a particular computer. The more commonly useful log entries pertain to users logging into and out of a computer, accessing specific internet sites, the sending or receiving of email or other messages and the moving, copying or accessing of files on the computer. Log entries may require conversion into human-readable form before they can be processed.

Raw or Unallocated Data - Raw or unallocated data is data that resides in segments of the storage media (hard drive, camera card, etc.) that are not being used by files. This data can contain all or part of files that were once referenced in the file allocation table but were subsequently deleted. It can also contain deleted internet history, old information from the computer’s RAM (Random Access Memory) or even old configuration data about the computer itself. Much of this data can even survive a reformatting of the disk itself. Since this data can come from any number of sources that had once been active on the drive, it can make or break a case where it is suspected that deletions may have occurred.

Tools for Collecting ESI

With the exception of unallocated space, tools such as One Click Collect Harvester from Pinpoint Labs have the ability to collect loose files, emails and whole databases with the added benefits of being able to specify key words, date ranges, domains and email addresses among other very useful filters.

Tools for collecting the unallocated space on a drive usually require an experienced forensic examiner in order to get useful interpretations of the data collected. In cases where this is necessary, it is recommended that a certified examiner be hired for the collection and analysis of the data.

What is an Active File Collection?

Active File Collection refers to the collection of files that are active (not deleted) and pertain to a legal matter or legal hold. In most civil litigation cases, extensive forensic investigations that look at deleted files are unnecessary or too expensive. Thus, most ESI collections are active file collections and/or email collections.

How active file collections are performed

Active files are those that can be seen by normal users. They may include hidden or system files, but they do not include the computer’s Random Access Memory or any deleted files. Files in the Windows Recycle Bin are considered active files and are subject to collection using active file collection methods.

The first step is defining which files need to be collected. This definition can range from “everything” to files of a few specific types containing only certain key words. Since the cost of processing is usually related to the size of the data being processed, it is generally more economical to be as specific as possible without leaving out relevant files.
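
As an illustration of defining which files to collect, the sketch below walks a folder tree and keeps only files with chosen extensions that were last modified inside a date range. It is a minimal example that assumes Python's standard library and uses the operating system's last modified time; the function name select_files and its parameters are hypothetical.

```python
import os
from datetime import datetime


def select_files(root, extensions, modified_after=None, modified_before=None):
    """Return paths under root that match the extension and date criteria."""
    selected = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            # Compare extensions in lower case so ".DOCX" matches ".docx".
            if os.path.splitext(name)[1].lower() not in extensions:
                continue
            path = os.path.join(folder, name)
            modified = datetime.fromtimestamp(os.stat(path).st_mtime)
            if modified_after is not None and modified < modified_after:
                continue
            if modified_before is not None and modified > modified_before:
                continue
            selected.append(path)
    return sorted(selected)
```

For example, select_files(folder, {".docx", ".xlsx"}, modified_after=datetime(2008, 1, 1)) would return only the Word and Excel files changed since the start of 2008, which is a much smaller set to process than "everything".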

Once the files have been identified, it is mostly a matter of copying them in a manner that both avoids spoliation and provides a means of certifying the contents of the copies.

What to remember

The one thing to remember about active file collections is that they can be a potential minefield of spoliation. To avoid this, use software that is designed to preserve the metadata, the timestamps, and the data within the copied files. Some products, such as SafeCopy 2 from Pinpoint Labs, are designed specifically for this purpose. Others, like Harvester, also from Pinpoint Labs, offer this feature as well as the ability to cull data by key word search, and they also support deduplication, email collection, and deNISTing.

The most important aspects of active file collections are preservation and validation.

Preservation refers to the preservation of the file data, its timestamps (when the file was created, last modified, and last accessed), and any other metadata contained within the file. If any of this data is compromised, the usefulness and admissibility of the file comes into question.

Validation refers to the ability to certify that the contents of the copy are the same as the contents of the original. This is usually done using a hash (analogous to a fingerprint of the file’s data). It may also be done using a bitwise comparison of the data in both the file and the copy, but since this method requires the same amount of storage as the files themselves and offers no means of independent verification, it is not in common use.
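
For comparison, here is what a bitwise check looks like in practice. This is a minimal sketch using Python's standard filecmp module; the function name same_bytes is only illustrative. Note that both the original and the copy must be available at the same time, which is the practical limitation described above.

```python
import filecmp


def same_bytes(original, copy):
    """Compare two files by their actual contents rather than by a hash."""
    # shallow=False forces a read of both files instead of trusting size and time.
    return filecmp.cmp(original, copy, shallow=False)
```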

Electronically Stored Information (ESI) Collection Software

Each day, corporate IT managers, computer forensic examiners, and litigation support professionals are tasked with performing ESI collections for relevant files which reside in file shares, on client systems, and other popular data sources. The content may include Microsoft Exchange mailboxes, departmental data, individual custodian files, internet logs, telephone logs, or other critical corporate content.

Over 4 years ago, Pinpoint Labs released SafeCopy version 2.0 (SafeCopy 2) which alleviated several common problems encountered when using alternative copy utilities to collect client files. Here are a few of those problems that the SafeCopy 2 upgrade addressed:

  • DOS-based utilities can be difficult to customize and replicate across multiple users
  • Files located in paths with more than 255 characters are missed
  • Unicode file and folder names may create verification issues
  • Copied file contents are not hash verified (required to confirm that the entire contents were copied)
  • Incomplete copy logs do not support accurate recording and validation
  • Network outages halt file collections and can be difficult to resume

In September 2009, Pinpoint Labs released One Click Collect – Harvester (Portable/Server), which was a new product that included the proven SafeCopy 2 engine. The Pinpoint Harvester 2.0 ESI collection software includes:

  • Keyword cull loose files, PST emails, archives, and attachments
  • Dedupe and Filter a Single PST or Multiple PSTs
  • Regenerate new PSTs, or export email to 8 different message formats
  • Separate email verification (chain of custody) and exclusion logs
  • Database tracking
  • Several targeted collection speed improvements
  • Known non-searchable file types included
  • Dedupe and DeNIST
  • Automated, Remote, and Portable Forensically sound collections
  • No Per Gig, Per Custodian, or Per Collection fees
  • Activate and Deactivate the license to move around for collections or in-house processing
  • Great for Legal Holds
  • Preserve Metadata and Time Stamps
  • Filter by Extension and Date Range
  • Select from multiple data sources
  • Compatible with all electronic and litigation platforms
  • 100% File copy verification
  • Extensive chain of custody report
  • Process file lists
  • Resume easily
  • Supports path lengths greater than 255 characters
  • Transfer licenses quickly to another location
  • Create and deploy remote collections
  • Keyword Filter MS Outlook PSTs
  • Keyword Filter Loose Files
  • Keyword Filter Attachments
  • Keyword Filter Archives
  • Dedupe and Filter Multiple PSTs
  • Regenerate New PSTs
  • Export Emails to 8 Different Message Formats
  • Remove System Files Listed in NSRL (deNISTing)
  • Filter by Header Signature
  • Create Portable and Automated Collection Jobs
  • Preconfigured Work Orders
  • Can Be Used for In-House, Production-Level Culling (deNIST/dedupe)
  • Scriptable Profiles and Collection Jobs
  • Easily Save and Reuse Job Settings

Pinpoint Labs has a proven record of developing defensible, affordable ESI collection software. Many Fortune 500 companies, government agencies, and computer forensic professionals rely on SafeCopy 2 and One Click Collect – Harvester every day.

ESI (Electronically Stored Information) Winds of Change

Changes are underway in how electronically stored information (ESI) is processed and reviewed. These changes are due to the huge size of repositories – hundreds of gigabytes or multiple terabyte sizes – identified for collection and processing. Corporations and their legal counsel realize that it may not be feasible or affordable to collect and produce all the information identified in larger cases.

Several new software applications have been introduced that offer many of the same features included in popular electronic discovery software (indexing, file and email “de-duplication”, online review, searching and culling). The difference is they are designed to run as an “appliance” application on a corporate network.

What does this mean? It means collections that would have been sent out to a processing vendor are now being deduped, filtered, and produced internally. The culled native files may still be sent out for tiffing, endorsing, and building load files, but they are a significantly reduced subset.

This is not to say that outsourcing will cease, but in the years ahead there will probably be a reduction in the amount of EED/ESI processing that is outsourced. Additionally, once a corporation has invested in the “appliance” software and the training to collect, filter and produce its collections, it will probably also use it on smaller cases that were previously outsourced.

Systems that require a computer forensic investigation, or need to be collected by a third party, will still require individuals with the appropriate skills and credentials to image or clone media, and then analyze the contents as we do now. However, an increasing amount of electronic discovery processing will be performed at the client site with automated assistance to save time and money and to handle larger projects.

Understanding File Timestamps

The terms ‘file timestamps’ and ‘file metadata’ are often used interchangeably. However, they can have two completely different meanings. I trust the following will help clarify the differences.

1) There are two separate sets of ‘timestamps’ for office documents and several other file types. The first set is stored by the operating system (Windows, Linux, MacOS) and is different from the set stored in the file.

2) The metadata stored in a file (Date Created, Date Last Saved etc.) may also be referred to as the file’s timestamps and confused with what’s stored by the operating system.

3) The two sets of dates are often very different because the operating system timestamps are easily altered through copying files and automated software processes (virus scanners, indexing). The timestamps in the file metadata are altered when files are saved or edited by the native application.

For example, if a custodian copies a file from their system to a network folder, the created and last accessed times displayed in Microsoft Windows would be changed to the date and time of the copy. However, if you view the internal metadata (Date Created, Date Last Saved) in the document properties, these values would remain unaltered. If you are looking for the most reliable created or last saved time for a document, make sure you use the internal file metadata timestamps.
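
The two sets of dates can be read separately in a script. The sketch below is an illustration that assumes an Office Open XML document (for example, a .docx file), which is a ZIP package that stores its internal created and modified dates in docProps/core.xml. The function names internal_timestamps and filesystem_timestamps are made up for this example, and older binary formats such as .doc would need a different approach.

```python
import os
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used for the created/modified elements in docProps/core.xml.
DCTERMS = "{http://purl.org/dc/terms/}"


def internal_timestamps(document_path):
    """Read the dates stored inside an Office Open XML file (e.g. .docx)."""
    with zipfile.ZipFile(document_path) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "created": root.findtext(DCTERMS + "created"),
        "modified": root.findtext(DCTERMS + "modified"),
    }


def filesystem_timestamps(path):
    """Read the dates the operating system keeps about the file."""
    info = os.stat(path)
    return {"modified": info.st_mtime, "accessed": info.st_atime}
```

Running both functions on a document and on an ordinary copy of it would typically show different operating system dates but identical internal dates.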