
Exports for Standard Electronic Discovery Review Tools


  • Exports for Standard Electronic Discovery Review Tools

    First, I am enjoying using your tool and have just completed a successful Windows 8 analysis project using OSForensics - very helpful and thanks. Keep up the good work.

    In my past life I co-developed electronic discovery processing software (full-text, metadata extraction, TIFF image creation, load file creation for applications such as Concordance, Relativity, Summation) and wrote some notes memorializing some of the lessons I learned. I am going to post it at the end of this message.

    If your future development roadmap includes exporting files from OSForensics for attorney review in eDiscovery review platforms (e.g. with an accompanying .DAT file containing extracted full text, extracted metadata, the folder path to the native file, etc.), I can and will help your developers on a completely pro bono basis if you would like, because I would benefit personally and professionally if your tool could do just that.

    In my last two development experiences, the lead software developer had difficulty understanding the importance of maintaining document families (e.g. parent email to child attachments) from a control number assignment standpoint. My understanding is that all indexing software, such as dtSearch and whatever engine OSForensics is using, assigns a control number to each and every file as it ingests them into whatever database table the engine is creating:

    CONTROL_ID_00000001 Parent Email (ABC.eml)
    CONTROL_ID_00000002 Child attachment (invoices.xlsx)
    CONTROL_ID_00000003 Child attachment (minecraft.docx)

    - Family Range for above three documents: CONTROL_ID_00000001 through CONTROL_ID_00000003

    I have a spreadsheet with a larger, more complex example you will encounter, such as when a PST file contains emails with ZIP file attachments that contain emails with PST file attachments that contain emails with more attachments, and so on.
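
    To make the numbering idea concrete, here is a minimal Python sketch, not how dtSearch or OSForensics actually works. It assumes each item is numbered in depth-first order as containers are unpacked, so every family occupies one contiguous range no matter how deeply it nests. The `Doc` class, the `family_begin` / `family_end` field names and the `assign_control_ids` function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical record for one ingested item (email, attachment, ...)."""
    name: str
    children: List["Doc"] = field(default_factory=list)
    control_id: Optional[str] = None
    family_begin: Optional[str] = None
    family_end: Optional[str] = None


def format_id(n: int) -> str:
    # Same shape as the example above: CONTROL_ID_00000001
    return f"CONTROL_ID_{n:08d}"


def assign_control_ids(top_level_docs: List[Doc], start: int = 1) -> int:
    """Number every document depth-first and stamp each family's range.

    Returns the next unused number so a later batch can continue from it.
    """
    counter = start

    def walk(doc: Doc, family: List[Doc]) -> None:
        nonlocal counter
        doc.control_id = format_id(counter)
        counter += 1
        family.append(doc)
        for child in doc.children:  # nested attachments stay in the family
            walk(child, family)

    for parent in top_level_docs:
        family: List[Doc] = []
        walk(parent, family)
        begin, end = family[0].control_id, family[-1].control_id
        for member in family:
            member.family_begin, member.family_end = begin, end
    return counter


email = Doc("ABC.eml", [Doc("invoices.xlsx"), Doc("minecraft.docx")])
next_free = assign_control_ids([email])
for d in [email, *email.children]:
    print(d.control_id, d.name, d.family_begin, d.family_end)
```

    Because a parent is always numbered immediately before its descendants, the family range is simply the first and last number handed out while walking that parent.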

    Both times the developers had to make multiple changes to how their indexing engines assigned control numbers, both to individual files and to family attachment ranges, so that searches in the eDiscovery review platforms returned correct results.

    Email me and I will send you my complex model parent-child document relationship spreadsheet if you are interested.

    I can also help you shortcut the development process for message threading, near-duplicate identification and concept clustering in your OSForensics exports.

    Some of my historical development notes:

    g. Date Fields

    It is common for impossible and missing date values to occur when ingesting, indexing or exporting ESI.

    An example of an impossible date is an email sent date of 01/27/1932. There was no email in 1932, but such values nonetheless occur for a variety of reasons. Each database date field must be checked for them, and they must be corrected automatically by a quality control algorithm.

    Occasionally date values will be missing entirely because files were exported improperly from archive container files. Some document review database applications require that a value exist for every record and date field, e.g. "00/00/0000". So if a field has no value at all, e.g. "", the following logic must be applied when inserting a date value where none exists:

    Never substitute the dates on which the file was ingested and/or exported by the tool. A common mistake is for electronic discovery processing tools to create load files with the dates that the ESI was ingested and/or processed in place of the original date.

    For example, the original file creation date of a Word file was 01/27/2004, but when the ESI tool processes the file, the creation date is changed to the date of processing. This is unacceptable and would only open the attorneys to attack by opposing counsel.

    Therefore, quality control algorithms must be in place to check for impossible, missing or incorrect dates and automatically insert the correct value.

    The correct value to insert would be the next most meaningful date for a file. For example, the LastAccessed date value shows the last time the custodian accessed the file and would therefore be typically more meaningful than the file creation date.

    The external Microsoft metadata values should be ranked in the following manner when the algorithm determines what value to enter into a date field missing a value entirely or containing an impossible value:

    1. LastAccessed
    2. LastModified
    3. FileCreated

    If none of the fields above exist, then the value 00/00/0000 should be entered. There can never be a date field with no value whatsoever, as the underlying database engines of certain document review tools will not accept records with no date value at all.
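
    The ranking above can be sketched in Python as follows. This is only an illustration under stated assumptions: records are plain dictionaries of `datetime.date` values, "impossible" is approximated as falling outside a caller-supplied plausible window, and the names `resolve_date`, `is_plausible`, `FALLBACK_ORDER` and `PLACEHOLDER` are hypothetical.

```python
from datetime import date
from typing import Mapping, Optional

PLACEHOLDER = "00/00/0000"
# Ranking taken from the list above, most meaningful first.
FALLBACK_ORDER = ("LastAccessed", "LastModified", "FileCreated")


def is_plausible(value: Optional[date], earliest: date, latest: date) -> bool:
    """Assumed test for 'impossible': missing or outside the given window."""
    return value is not None and earliest <= value <= latest


def resolve_date(record: Mapping[str, Optional[date]], field_name: str,
                 earliest: date, latest: date) -> str:
    """Return the field's own value if plausible, else the best fallback.

    The fallback values must be the original file system dates captured
    before processing, never the ingestion or export date.
    """
    for name in (field_name,) + FALLBACK_ORDER:
        value = record.get(name)
        if is_plausible(value, earliest, latest):
            return value.strftime("%m/%d/%Y")
    return PLACEHOLDER  # no usable date anywhere: never leave the field empty


record = {
    "SentDate": date(1932, 1, 27),    # impossible for an email
    "LastAccessed": None,             # missing
    "LastModified": date(2004, 1, 27),
    "FileCreated": date(2003, 12, 1),
}
window = (date(1980, 1, 1), date(2013, 12, 31))  # assumed bounds for this matter
print(resolve_date(record, "SentDate", *window))
```

    The plausible window is a per-matter judgment call, so the sketch takes it as a parameter rather than hard-coding a cutoff.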

    All parent email date fields must be propagated to the date fields of their child attachments, e.g. populate a child Excel file's "SentDate" value with the parent email's "SentDate" value. If this is not done, then chronologically sorted search results will appear out of order in most if not all e-discovery document review platforms (e.g. Parent1, Parent2, Parent3, Child1, Child 1, Child2, Child3 instead of the correctly ordered Parent1, Child1, Child 1, Parent2, Child2, Parent3, Child3).
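
    A rough Python sketch of that propagation step, assuming each record is a dictionary carrying a zero-padded `control_id`, an optional `parent_id` and a `SentDate` (these key names and both function names are hypothetical). Nested attachments inherit the date of the top-level parent, and ties are broken by control number so families stay together.

```python
from datetime import date
from typing import Dict, List


def propagate_sent_date(records: List[Dict]) -> List[Dict]:
    """Copy the top-level parent's SentDate onto every descendant record."""
    by_id = {r["control_id"]: r for r in records}
    for r in records:
        top = r
        while top.get("parent_id") is not None:  # climb to the family root
            top = by_id[top["parent_id"]]
        r["SentDate"] = top["SentDate"]
    return records


def chronological(records: List[Dict]) -> List[Dict]:
    # Sort by date first, then control number, so children follow parents.
    return sorted(records, key=lambda r: (r["SentDate"], r["control_id"]))


records = [
    {"control_id": "0001", "parent_id": None, "name": "Parent1",
     "SentDate": date(2004, 1, 1)},
    {"control_id": "0002", "parent_id": "0001", "name": "Child1",
     "SentDate": date(2005, 6, 1)},   # attachment's own, misleading date
    {"control_id": "0003", "parent_id": None, "name": "Parent2",
     "SentDate": date(2004, 1, 2)},
    {"control_id": "0004", "parent_id": "0003", "name": "Child2",
     "SentDate": date(2005, 6, 2)},
]

before = [r["name"] for r in chronological(records)]
after = [r["name"] for r in chronological(propagate_sent_date(records))]
print(before)
print(after)
```

    Before propagation the children sort after both parents; afterwards each child sits directly behind its own parent.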

    Hope this helps.

    Larry Lieb

  • #2
    Thanks for the post and the offer.

    The Create Signature function in OSF can scan a drive or folder and list files and emails with a parent-child relationship, similar to what you describe above. I am not sure how well this complies with the format expected by third-party tools, however. It does deal with nesting, so you should be able to have, for example, a zipped-up PST file containing an email with another ZIP file as an attachment, and a PDF inside that attached ZIP.

    What is missing is a consistent numbering system across the entire package, so that, for example, a file name search returns a number that identifies a file and the same number is presented in the Create Signature function. At the moment you are more or less reliant on the file path and file name to identify a unique file.

    We are not aware of any instances where we are displaying bad dates, but it is not easy to test all possibilities of bad data in all file types with all international date formats. If you notice anything specific wrong, then please let us know.