What is deNISTing?

Keeping electronic discovery processing costs down is a challenge for attorneys, service bureaus, and their clients. Because of the amount of data collected when imaging custodian hard drives, the resulting processing and labor costs can be significant and potentially prohibitive.

Reduction of 30%+ Through DeNISTing
Many firms have discovered that deNISTing is a relatively easy way to reduce the overall electronic evidence discovery (EED) processing costs for imaged custodian drives by an average of 30%. How do they accomplish this reduction without missing potential evidence? Removing ‘known’ files for Microsoft Windows, Linux, Mac OS and other systems substantially reduces the overall production.

The NIST (National Institute of Standards and Technology) NSRL (National Software Reference Library) list contains more than 115 million known files. Using this list to filter custodian hard drive files prior to EED processing can yield a significant reduction.
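The filtering step described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product works: it assumes the known hash values have already been loaded into a set of uppercase SHA-1 strings, and the names `sha1_of_file` and `denist` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the uppercase hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def denist(root, known_hashes):
    """Split the files under root into (kept, removed) lists.

    known_hashes is assumed to be a set of uppercase hex SHA-1 strings;
    loading it from a reference list is left out of this sketch.
    """
    kept, removed = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if sha1_of_file(path) in known_hashes:
            removed.append(path)   # known file, excluded from processing
        else:
            kept.append(path)      # unknown file, passed on for processing
    return kept, removed
```

Because the lookup is a set membership test on a short string, the size of the known list has little effect on the time spent per file; most of the work is reading and hashing the custodian data.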

What Brought on DeNISTing’s Recent Popularity?
‘DeNISTing’ has become a requested service in just the last few years. Until recently, there haven’t been tools available to handle the processing without significantly increasing turnaround time or investing in expensive computer forensic software.

Pinpoint Labs’ Harvester Software Makes deNISTing a Reality
Harvester from Pinpoint Labs is an affordable and easy-to-use application which leverages the more than 115 million known hash values in the NIST list to filter custodian data and dramatically reduce the costs and processing time associated with imaged hard drives. Harvester can also dedupe while creating a chain of custody and safely copy filtered files while deNISTing. By performing these multiple processes simultaneously, Pinpoint Harvester reduces electronic discovery processing costs and labor.

ESI Self Collection Drives and Kits

Electronically Stored Information (ESI) self collection drives and kits have become popular in the last few years because they offer an affordable means of collecting electronic data for a legal matter without the need to hire expensive forensic experts. This article covers what should be included in an ESI collection drive kit as well as some tips to ensure the collections are completed properly.

ESI Self Collection Tips and Resources

Here are a few tips to help ensure a successful ESI self collection:

1) IT Assistance – Have someone on hand with knowledge of the products, how they work and how to overcome any issues encountered. This could be an individual in the legal department, corporate IT, a forensic computer examiner, or a competent vendor.

2) Hard Drives – If the ESI self collection drive is being connected directly to a custodian PC or server, take a look at the 2.5 inch enclosed external hard drives that are powered from a USB port. If collecting data across a network, a Network Attached Storage (NAS) device should be considered.

3) Software – Require these key features from active file collection software (like SafeCopy 2 or Harvester from Pinpoint Labs):

  1. Preserves file timestamps and metadata – Using Windows Explorer to “drag and drop” files does not preserve critical metadata or confirm that the contents were copied exactly.
  2. Creates electronic chain of custody – Report(s) containing details of what happened, source and destination hash values, MAC times, where files were copied from/to and results are the audit trail required for defensibility.
  3. Hash verifies files – File hashes of the source and destination are verifiable proof of a valid copy.
  4. No local installation – Ideally the software should run from an external device or from the network without installing anything on the host computer.
  5. Automated job tickets – Human involvement opens the risk of human error. Products like Harvester from Pinpoint Labs include features to automate the process with predefined work tickets.
  6. Filtering (Optional) – Filtering at the point of collection reduces the cost of processing the collected data. Some of the filters that can be applied at the point of collection are file types/headers, date ranges, folder names, key words, deduplication, and deNISTing.

4) Evidence Bags – Tamper-proof evidence bags provide additional security and defensibility. The following antistatic bags from Packaging Horizons (http://www.alertsecurityproducts.com/antistaticsecuritybag/index.shtml) are designed for hard drives.

5) Paper Chain of Custody – Most firms are familiar with transferring evidence and have forms already created. Include this form with the drives used in an ESI collection kit.
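The software features listed under tip 3 (timestamp preservation, hash verification, and an electronic chain of custody) can be illustrated with a short Python sketch. It is a simplified assumption of how such a copy might work, not a description of SafeCopy 2 or Harvester. The function names and record fields are hypothetical, and `shutil.copy2` only carries over the metadata the platform allows, so creation times in particular may not survive.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source, destination):
    """Copy one file and return a custody record describing the result."""
    source, destination = Path(source), Path(destination)
    source_hash = file_sha256(source)
    stat = source.stat()
    destination.parent.mkdir(parents=True, exist_ok=True)
    # copy2 copies the data and then tries to carry over file times
    shutil.copy2(source, destination)
    destination_hash = file_sha256(destination)
    modified = datetime.fromtimestamp(stat.st_mtime, timezone.utc)
    return {
        "source": str(source),
        "destination": str(destination),
        "source_sha256": source_hash,
        "destination_sha256": destination_hash,
        "verified": source_hash == destination_hash,
        "source_modified_utc": modified.isoformat(),
        "copied_at_utc": datetime.now(timezone.utc).isoformat(),
    }
```

Writing one such record per file to a report gives the audit trail described above: what was copied, from where, to where, and whether the source and destination hashes matched.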

Larger Collection Alternatives

Putting together ESI self collection kits can save money and eliminate delay and additional costs. Harvester from Pinpoint Labs is offered at a flat rate (you own it) or per collection.

Unease with ESI Self Collections

There has been some concern over custodian self collections. Relying on untrained employees to find and then properly collect the relevant data may present a defensibility problem. The automation features of data collection software address this problem by minimizing employee interaction with the collection process, which in turn minimizes the number of human errors that can occur.

What you should know

ESI self collections and kits are here to stay. They significantly reduce discovery costs, perform targeted collections, and are the modern equivalent of boxing up relevant files. However, it is critical to ensure that the process is defensible by preserving the original content, with the correct process, products, and procedures. For further assistance designing an ESI self collection kit for specific project needs, contact one of the project leaders at Pinpoint Labs.

What is a Hash Value?

A hash value is the result of a calculation (hash algorithm) that can be performed on a string of text, an electronic file, or an entire hard drive’s contents. The result is also referred to as a checksum, hash code, or hash. Hash values are used to identify and filter duplicate files (e.g. email, attachments, and loose files) from an ESI collection or to verify that a forensic image or clone was captured successfully.

Each hashing algorithm uses a specific number of bytes to store a “thumbprint” of the contents. Regardless of the amount of data fed into a specific hash algorithm or checksum, it will return the same number of characters. For example, an MD5 hash uses 32 hexadecimal characters for the thumbprint whether the input is a single character in a text file or an entire hard drive. The following is a list of hash values for the same text file.

HASH

MD5: 464668D58274A7840E264E8739884247

SHA-1: 4698215F643BECFF6C6F3D2BF447ACE0C067149E

SHA-256: F2ADD4D612E23C9B18B0166BBDE1DB839BFB8A376ED01E32FADB03A0D1B720C7

SHA-384:

2707F06FE57800134129D8E10BBE08E2FEB622B76537A7C4295802FBB94755BBEE814B101ED18CC2D0126BD66E5D77B6

SHA-512:

C526BC709E2C771F9EC039C25965C91EAA3451A8CB43651EA4CD813F338235F495D37891DD25FE456FE2A8CA89457629378BE63FB3A9A5AD54D9E11E4272D60C

RIPEMD-128: A868B98EAEC84891A7B7BA620EDDE621

TIGER: F31A22CEED5848E69316649D4BAFBE8F9274DED53E25C02D

PANAMA: 7E703B1798A26A0AF21ECD661CBADB9C72B419455814CA7B82E29EE0C03FA493

CHECKSUM

CRC16: 117C

CRC32: FA2D47D4

ADLER32: CF7D65FF

As you can see, hash lengths also vary within a family (SHA-1, SHA-256, etc.). The most common hash values are MD5, SHA-1 and SHA-256. The longer hash values require more time to calculate and are designed to reduce the probability of a collision.
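The fixed-length property is easy to check with Python’s standard `hashlib` module. The short sketch below hashes a single character and a few megabytes of data with several common algorithms and reports the length of each result in hexadecimal characters. The helper name `hex_lengths` is just an illustration.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256", "sha384", "sha512")


def hex_lengths(data):
    """Map each algorithm name to the length of its hex digest for data."""
    return {name: len(hashlib.new(name, data).hexdigest()) for name in ALGORITHMS}


small = hex_lengths(b"a")               # a single character
large = hex_lengths(b"a" * 5_000_000)   # roughly five megabytes

for name in ALGORITHMS:
    print(f"{name}: {small[name]} characters (small), {large[name]} characters (large)")
```

For each algorithm the two lengths are identical; only the algorithm, never the input size, determines how long the hash value is.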


A few other ways that hash values are used:

-  Verify a downloaded file was created by the publisher (as opposed to a virus-infected version)

-   Identify and filter files on the NSRL/NIST list (“deNISTing”)

-   Locate known contraband (illegal images and videos)
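As an illustration of the first use in the list above, the following Python sketch checks a downloaded file against a hash value published by its author. It assumes the publisher provides a SHA-256 value as a hexadecimal string; the function name is hypothetical.

```python
import hashlib
import hmac


def matches_published_hash(path, published_hex):
    """Return True when the file's SHA-256 equals the publisher's hex value."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    # Normalize case and whitespace so copy-and-paste differences do not matter
    expected = published_hex.strip().lower()
    return hmac.compare_digest(digest.hexdigest(), expected)
```

A mismatch means the file differs from what the publisher hashed; it does not say how or why it differs.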

Here are a few reasons why hash values are so widely used as a means to validate and compare content:

1)  Privileged Data – There would be obvious issues with storing and providing multiple copies of the contents of a company’s files, or an entire hard drive’s data, in a database to perform a byte comparison. Illegal images and videos (child pornography) would also have to be stored and used in each system scan. These scenarios are unacceptable.

2)  Speed – Comparing an indexed hash value is much quicker than comparing what could be billions or trillions of bytes of source data. Optimized hash engines (Pinpoint Harvester) can compare thousands of hash values in a second.

3)  Security  – Hashing data is a one way trip. The original data can’t be recreated or reverse engineered from the hash value. This provides additional security that a person can’t determine the source data from the hash.

The argument that different data sources could have the same hash value has raised a lot of concern. There are countless threads related to this issue on the litigation support and computer forensic forums. The bottom line is that the only way to do an exact comparison of the original data is to store it everywhere you need to deduplicate or verify the information. As mentioned above, this isn’t a practical alternative.

More complex hashing functions have been introduced (SHA-256, SHA-512 etc.), which further reduce the likelihood of a collision. It is also worth noting that even in those cases where scientists have created collisions, it was the result of exploiting weaknesses in a specific hash algorithm. The same alterations would not create a collision in a different hashing algorithm.

So, if you still aren’t satisfied with the incredibly remote possibility that a collision could happen using a single hash value, the easiest extra precaution is to have your processes calculate hash values from two separate algorithms (e.g. MD5 and SHA-256) for each item. Unfortunately, most EED applications and forensic imaging tools don’t support this option, especially in a single pass.
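For readers who script their own verification, computing two hash values in one pass is straightforward. The sketch below, using Python’s standard `hashlib`, reads each file once and feeds every chunk to both MD5 and SHA-256. The function names are hypothetical.

```python
import hashlib


def dual_hash(path, chunk_size=1024 * 1024):
    """Read the file once and return its (MD5, SHA-256) hex digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Each chunk read from disk is fed to both algorithms
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def same_content(first_path, second_path):
    """Treat two files as identical only if both hash values agree."""
    return dual_hash(first_path) == dual_hash(second_path)
```

Since disk reads usually dominate the cost, the second algorithm adds computation but no additional pass over the data.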

What to Remember

Hash values are a reliable, fast, and secure way to compare the contents of individual files and media. Whether it’s a single text file containing a phone number or five terabytes of data on a server, calculating hash values is an invaluable process for deduplication and evidence verification in electronic discovery and computer forensics.

E-Discovery Collection

E-Discovery Collections, also known as Electronic Evidence Discovery (EED) or Electronic Data Discovery (EDD), can include a review of all the data stored on employee desktop or laptop computers, company servers, camera cards, cell phones, smart phones, GPS devices, digital video recorders, digital answering systems, thumb drives, RAID arrays and any other form of electronic media capable of storing data.

Types of Electronic Discovery Content

Employee Work Product – Computer Files are by far the most common arrangement for a forensic e-discovery collection. Files (also referred to as loose files or active files) are similar to their paper equivalent. They can be copied, moved, and even “shredded”. Work product could include sales reports, QA reports, product or service information, client lists, engineering designs and much more.

Employee Correspondence - Email has practically replaced letters and interoffice memos. A forensic e-discovery collection of correspondence is often a critical piece and can often contain the “smoking gun”. What someone said, to whom, and when are some of the first questions asked in a legal matter. Since emails are a form of documented communication, they comprise highly sought-after data when it comes to legal matters. Emails themselves may be contained in databases, files, or unallocated space.

Customer Relations and Accounting Data – Customer lists, internal notes, and financial records are also a critical component in forensic e-discovery collection or computer forensic investigations. Properly collecting the live database files that store this information can be a challenge. Single entries in a database often require export to another format in order to be useful or even readable by humans. Most databases include this ability.

User Logs – Collecting user logs isn’t always as relevant in an e-discovery collection/review as it is in computer forensic analysis. However, they can be relevant and are worth mentioning. User logs will contain entries about the activities performed on a computer and different user accounts. Attorneys may want to know when emails were sent or received between accounts in case the emails were deleted. Log entries may require conversion into human-readable form before they can be processed.

Raw or Unallocated Data – Unless a forensic image of the source data has been requested, a forensically sound e-discovery collection will focus on “active” files. However, it is helpful to understand the difference between “unallocated” and “active” data. Raw or unallocated data is data that resides in segments of the storage media (hard drive, camera card, etc) that are not being used by files. This data can contain all or part of files that were once referenced in the file allocation table but were subsequently deleted. Much of this data can even survive a reformatting of the disk itself. Since this data can come from any number of sources that had once been active on the drive, it can make or break a case where it is suspected that deletions may have occurred.

Tools for Forensic E-Discovery Collection

With the exception of unallocated space, tools such as One Click Collect Harvester from Pinpoint Labs have the ability to collect loose files, emails and whole databases with the added benefits of being able to specify key words, date ranges, domains and email addresses among other very useful filters.

Tools for collecting the unallocated space on a drive usually require an experienced forensic examiner in order to get useful interpretations of the data collected. In cases where this is necessary, it is recommended that a certified computer examiner be hired for the collection and analysis of the data.

How Much is a Petabyte, Exabyte, or Zettabyte?

As our electronically stored information (ESI) data universe continues to grow, we are hearing about increasing storage capacities. The size of a project in terabytes (TB – 1024 Gigabytes) comes up frequently and is often the amount of data that has to be collected, culled or processed on a corporate server. However, now you can purchase a 1TB drive that will fit in a laptop computer.

Have you heard of a job that will reach or exceed a petabyte? If not, you most likely will in the near future and the following will help if you aren’t familiar with the larger capacities.

Equivalent Storage in Terabytes

Petabyte = 1,024 TB

Exabyte = 1,048,576 TB

Zettabyte = 1,073,741,824 TB

Yottabyte = 1,099,511,627,776 TB
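The figures above follow from each unit being 1,024 times the previous one. A few lines of Python reproduce the arithmetic, which is handy when estimating job sizes in other units. The names used here are only for illustration.

```python
UNITS = ["Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def terabytes_per_unit():
    """Return the number of terabytes in each larger unit (binary multiples)."""
    return {name: 1024 ** power for power, name in enumerate(UNITS, start=1)}


for name, terabytes in terabytes_per_unit().items():
    print(f"{name} = {terabytes:,} TB")
```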

As the size of electronic data at client sites increases so will the need for refined, targeted ESI collections. Many litigation support and computer forensic professionals have encountered collection jobs that are several terabytes and are provided keyword search terms and other criteria to help identify relevant data and decrease the amount being collected, processed and hosted.

Email Collection

Email Collection refers to the identification and isolation of electronic mail (email) messages that pertain to a specific legal matter in civil litigation cases.

What gets collected

What is actually being collected during email collections can be one of two things:

1. Files representing the contents of the transmitted email messages themselves (usually in MSG, HTML, EML or RTF format).

2. Container (or store) files that hold the contents and data associated with multiple email messages, usually all of the emails for a specific custodian.

Whether files for individual emails or container files are collected depends mostly on the type of email system being used by the custodian. If the custodian is a user of Microsoft Outlook, for instance, then either container files or individual email files may be produced. If the custodian is a user of a webmail service, such as Gmail or Yahoo!, then it is likely only individual email files can be collected.

How it’s done

Software such as Harvester from Pinpoint Labs can search the PST store files produced by Microsoft Outlook and Exchange email systems for individual emails containing specific criteria, such as who sent the email, who received it, when these actions occurred and whether the subject, body, or attachments contain specified key words. It can also produce the result to either individual email files or whole, reconstructed container files, known as PST regeneration.

With other email systems, either the whole container file can be copied and sorted through manually, or the individual emails can be manually identified and exported as individual email files.
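To make the filtering criteria concrete, here is a minimal Python sketch that applies sender, date range, and keyword tests to a single message in EML (RFC 822 text) form using the standard `email` package. It is an assumption-laden illustration only: it does not read PST files, it ignores attachments, and the function name `matches` and its parameters are hypothetical.

```python
from datetime import timezone
from email import message_from_string, policy
from email.utils import parsedate_to_datetime


def matches(raw_message, sender=None, start=None, end=None, keywords=()):
    """Return True if an EML-format message meets every supplied criterion.

    start and end are assumed to be timezone-aware datetimes.
    """
    msg = message_from_string(raw_message, policy=policy.default)
    if sender and sender.lower() not in str(msg.get("From", "")).lower():
        return False
    if start or end:
        if msg["Date"] is None:
            return False
        sent = parsedate_to_datetime(str(msg["Date"]))
        if sent.tzinfo is None:
            # Treat dates without a usable offset as UTC (an assumption)
            sent = sent.replace(tzinfo=timezone.utc)
        if start and sent < start:
            return False
        if end and sent > end:
            return False
    if keywords:
        body_part = msg.get_body(preferencelist=("plain", "html"))
        body = body_part.get_content() if body_part is not None else ""
        text = (str(msg.get("Subject", "")) + "\n" + body).lower()
        if not any(word.lower() in text for word in keywords):
            return False
    return True
```

A message must pass every criterion that is supplied; criteria left at their defaults are skipped, so the same function covers sender-only, date-only, or combined filters.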

What to remember

As with any data being collected, the two concepts to remember are preservation and validation.

Preservation refers to keeping the metadata about the individual messages as well as the metadata contained within each of the messages intact so as to maintain their admissibility. PST regeneration is especially desirable in this case because it maintains both the email data and the data that linked it to contact data, task list data and other data integrated with these types of email messages.

Validation refers to the policy of ensuring, either by hash value comparison (analogous to fingerprints for data) or bit-wise comparison, that the contents of the copy are the same as the contents of the original.

Harvester and SafeCopy 2, both from Pinpoint Labs, have built-in preservation and validation systems to certify that both of these conditions are always met.

What is PST Regeneration?

PST Regeneration is used during electronic discovery processing or even during an ESI collection. A Personal Folder File (PST) is a container file created by Microsoft Outlook which stores email messages and other data (e.g. contacts, calendar entries, tasks, to-do lists etc.).

How it’s done

Regenerating PSTs refers to the identification, isolation and often deduplication of electronic mail (email) messages that pertain to a specific legal matter in civil litigation cases. The filtered email messages are copied to a new “regenerated” PST file. The resulting PST can be considerably smaller than the original and results in the following benefits:

1)      Quicker attorney review

2)      Electronic Discovery processing and hosting cost reduction

3)      Significantly smaller ESI collection

Practical application

PST regeneration is commonly used when there are dozens of archive (backup) PST files that contain many duplicate messages. It is a common practice for companies to set up Microsoft Outlook or Exchange servers to create daily, weekly or monthly PST backups of employee email messages.

The result is potentially dozens of employee backup PST files which contain duplicate messages. Why? Each backup will contain many of the same messages as the last. Only new emails sent or received (that have not been deleted) since the last backup will be considered “unique” to each PST. Regenerating PSTs with only one copy of each email (deduplication) significantly reduces the number of messages and the size of the PST data to be processed or produced.
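The deduplication idea can be sketched in Python without touching the PST format itself. The example below assumes each message has already been extracted into a dictionary, and it hashes a handful of fields to decide whether two messages are the same. Both the field list and the function names are assumptions for illustration; real tools choose their own fields and normalization rules.

```python
import hashlib

# Fields used to decide whether two messages are duplicates (an assumption)
KEY_FIELDS = ("sender", "recipients", "sent", "subject", "body")


def message_key(message):
    """Hash the chosen fields of one message into a single comparison key."""
    joined = "\x1f".join(str(message.get(field, "")).strip() for field in KEY_FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(messages):
    """Return (unique, duplicates), keeping the first copy of each message."""
    seen = {}
    duplicates = []
    for message in messages:
        key = message_key(message)
        if key in seen:
            duplicates.append(message)
        else:
            seen[key] = message
    return list(seen.values()), duplicates
```

Feeding the messages from every backup through `dedupe` in order leaves one copy of each email for the regenerated set, while the duplicates list can be logged for the chain of custody.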

Maintaining defensibility

Significant cost reductions related to electronic discovery processing and hosting are gained by deduping, performing key word, date range, and email/domain filtering on the emails in PST files. However, it is critical to use an application that is designed to regenerate PSTs in a defensible manner and maintains the chain of custody.

Software such as Harvester from Pinpoint Labs, designed by Certified Computer Examiners (CCEs), can regenerate PST files at the point of collection or during in-house processing. Harvester also creates an extensive verification log (chain of custody) for all copied and duplicate messages.

What to remember

Creating deduped, targeted PSTs is common practice in the electronic discovery lifecycle because it saves clients a considerable amount of money as well as reducing attorney review time. PST regeneration may be performed onsite (during an ESI collection) or in-house to cull down responsive data.

What is ESI (Electronically Stored Information)?

ESI (Electronically Stored Information) is the general term for all of the data stored on the hard drives, camera cards, cell phones, GPS devices, digital video recorders, digital answering systems, thumb drives, RAID arrays and any other form of electronic media capable of storing data.

Types of Electronically Stored Information:

Files – Files are by far the most common arrangement for ESI data. Files (also referred to as loose files or active files) can be thought of as data containers similar to files in the real world. They can be copied, moved, and distributed freely on a variety of different media from DVDs to hard disk drives.

Emails - Emails are messages sent from one user to another. In their raw form, they are simply a stream of data that contains everything needed to get the message from one user to another user. Since emails are a form of documented communication, they comprise highly sought-after data when it comes to legal matters. Emails themselves may be contained in databases, files, or unallocated space.

Database Entries - Database entries are data stored in a database. This type of data is usually context-specific and may be information pertaining to financial records, personnel entries or other data that is interrelated. Single entries in a database require export to another format in order to be useful or even readable by humans. Most databases include this ability.

Log Entries – Log entries are lines in files or entries in databases that contain information about activity on a particular computer. The more commonly useful log entries pertain to users logging into and out of a computer, accessing specific internet sites, the sending or receiving of email or other messages and the moving, copying or accessing of files on the computer. Log entries may require conversion into human-readable form before they can be processed.

Raw or Unallocated Data - Raw or unallocated data is data that resides in segments of the storage media (hard drive, camera card, etc) that are not being used by files. This data can contain all or part of files that were once referenced in the file allocation table but were subsequently deleted. It can also contain deleted internet history, old information from the computer’s RAM (Random Access Memory) or even old configuration data about the computer itself. Much of this data can even survive a reformatting of the disk itself. Since this data can come from any number of sources that had once been active on the drive, it can make or break a case where it is suspected that deletions may have occurred.

Tools for Collecting ESI

With the exception of unallocated space, tools such as One Click Collect Harvester from Pinpoint Labs have the ability to collect loose files, emails and whole databases with the added benefits of being able to specify key words, date ranges, domains and email addresses among other very useful filters.

Tools for collecting the unallocated space on a drive usually require an experienced forensic examiner in order to get useful interpretations of the data collected. In cases where this is necessary, it is recommended that a certified examiner be hired for the collection and analysis of the data.

What is an Active File Collection?

Active File Collection refers to the collection of files that are active (not deleted) and pertain to a legal matter or legal hold. In most civil litigation cases, extensive forensic investigations that look at deleted files are unnecessary or too expensive. Thus, most ESI collections are active file collections and/or email collections.

How active file collections are performed

Active files are those that can be seen by normal users. They may include hidden or system files, but they do not include the computer’s Random Access Memory or any deleted files. Files in the Windows Recycle Bin are considered active files and are subject to collection using active file collection methods.

The first step is defining which files need to be collected. This definition can range from “everything” to files of a few specific types containing only certain key words. Since the cost of processing is usually related to the size of the data being processed, it is generally more economical to be as specific as possible without leaving out relevant files.

Once the files have been identified, it is mostly a matter of copying them in a manner that both avoids spoliation and provides a means of certifying the contents of the copies.
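The identification step can be pictured as a set of filters applied while walking a folder tree. The Python sketch below is a simplified assumption of that idea: it selects files by extension, last-modified date, and keyword. The function name and parameters are hypothetical, keyword matching only works for plain-text content, and reading files this way may update last-accessed times, which purpose-built collection tools take care to avoid.

```python
from datetime import datetime
from pathlib import Path


def select_files(root, extensions=None, modified_after=None,
                 modified_before=None, keywords=()):
    """Return files under root that satisfy every supplied criterion."""
    wanted = {ext.lower() for ext in extensions} if extensions else None
    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if wanted is not None and path.suffix.lower() not in wanted:
            continue
        # Compare against naive local datetimes for simplicity
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified_after and modified < modified_after:
            continue
        if modified_before and modified > modified_before:
            continue
        if keywords:
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            if not any(word.lower() in text for word in keywords):
                continue
        selected.append(path)
    return selected
```

Calling it with no criteria returns everything, which corresponds to the “everything” end of the range; each added criterion narrows the set and therefore the amount of data to process.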

What to remember

The one thing to remember about active file collections is that they can be a potential minefield of spoliation. To avoid this, use software that is designed to preserve the metadata, the timestamps, and the data within the copied files. Some products, such as SafeCopy 2 from Pinpoint Labs, are designed specifically for this purpose. Others, like Harvester, also from Pinpoint Labs, offer this feature as well as the ability to cull data by key word search and also support deduplication, email, and deNISTing.

The most important aspects of active file collections are preservation and validation.

Preservation refers to the preservation of the file data, its timestamps (when the file was created, last modified, and last accessed), and any other metadata contained within the file. If any of this data is compromised, the usefulness and admissibility of the file comes into question.

Validation refers to the ability to certify that the contents of the copy are the same as the contents of the original. This is usually done using a hash (analogous to a fingerprint of the file’s data). It may also be done using a bitwise comparison of the data in both the file and the copy, but since this method requires the same amount of storage as the files themselves and offers no means of independent verification, it is not in common use.
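The difference between the two validation methods shows up clearly in code. In the Python sketch below, the hash check needs only the copy and a previously recorded hash value, while the bitwise check needs the original file to be present as well. The function names are hypothetical, and SHA-256 is used purely as an example algorithm.

```python
import filecmp
import hashlib


def verify_against_recorded_hash(copy_path, recorded_sha256):
    """Validate a copy using only a hash value recorded at collection time."""
    digest = hashlib.sha256()
    with open(copy_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == recorded_sha256.strip().lower()


def verify_bytewise(original_path, copy_path):
    """Validate a copy by comparing its contents with the original's.

    shallow=False makes filecmp compare file contents, not just file stats.
    """
    return filecmp.cmp(original_path, copy_path, shallow=False)
```

The recorded hash can travel with a report and be rechecked by anyone later, which is why the hash approach lends itself to independent verification.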

ESI (Electronically Stored Information) Software Challenges

A couple of weeks ago, I outlined what computer forensics and electronic discovery have in common and how they differ. I’d like to expand on this topic by identifying some common obstacles encountered when using popular computer forensic software for typical electronic discovery projects.

A typical computer forensic case may involve:

  1. A small quantity of email and/or attachments
  2. Recovered files, internet history, and user activity
  3. Registry entries
  4. Pre-fetch files
  5. Portions of unallocated space

A typical electronic discovery project may involve:

  1. Processing dozens or hundreds of custodian mailstores that results in thousands of potentially relevant emails and/or attachments
  2. Indexing hundreds of gigabytes or multiple terabytes of data
  3. Hosting data online so multiple parties can easily review, identify, and produce files
  4. Converting relevant files to TIFF, endorsing them, and building load files compatible with common litigation support applications
  5. Deduping emails, attachments, and files across dozens of custodians

Generally speaking, the primary obstacles encountered when using off-the-shelf computer forensic software for electronic discovery are:

  1. Inability to create load files from tagged emails, attachments, and other relevant data
  2. No support for tiffing, endorsing, and assigning docIDs
  3. Missing/incomplete links between email and attachments
  4. No clear way to produce carved or partial files recovered from unallocated space

If you anticipate reviewing a large ESI collection using one of the common litigation support review tools, make sure that your service provider can process and produce compatible output files for production sets. Don’t assume that all computer forensic examiners are equipped to handle large scale ESI projects. On the other hand, not all EED service providers have the appropriate tools to complete a thorough computer investigation.

