Addressing the Problem of Uncategorized Legacy Data



Deploying a Knowledge Management (KM) or Information Governance system in an organization that has no previous experience with such systems is difficult, even in the best of circumstances. Beyond the often-heated philosophical debate over structured vs. free-form storage that occurs with every one of these projects, there’s also another universal difficulty: bringing existing work product into the new system.

It’s easy enough to enable (and require) document profiling for new documents created within the system, but it is a challenging and seemingly intractable problem to decide what to do for the tens of thousands, if not hundreds of thousands, of legacy documents that already exist.

What can be done to finally address this lingering and universal problem?

Here are four suggestions to think about when considering your legacy data:

1. Not All Documents Are Equal; Not All Should Survive

In thinking about the treatment of legacy documents, it’s critically important to remember that not all existing documents are equally valuable to an organization.

Generally speaking, documents recently created or accessed are likely to have greater relevance and value to an organization than old WordPerfect files tucked away in a forgotten directory. Indeed, some older documents may exist only because historically, it had been easier to migrate them en masse to the new hardware than it had been to review and make retention and disposition decisions about them. But remember, mere existence of a document does not mean that it has continuing value to the organization.

A document migration initiative is also a perfect opportunity to validate existing electronic document repositories against an organization’s policies for managing and retaining hardcopy documents. Because electronic documents take up less physical space, they are often stored longer than hardcopies would have been. Just as it is easier to push documents forward than to review them for disposition, many older documents lingering on file servers would already have been slated for destruction under established records retention schedules had they been stored in hardcopy format.

How old is too old for electronic documents? Records retention schedules aside, one methodology for assessing the business value of documents is to set a cutoff date, typically three to five years before the present: documents newer than the cutoff are automatically accorded business value, and those older than the cutoff are deemed to have reached their end of life and should be destroyed.

This cutoff date should be consistent with existing records management schedules or other criteria, but it shouldn’t necessarily be considered the final word. Organizations should review a sample of documents on either side of the proposed cutoff. Sampling individual documents close to the cutoff date, in particular, will show whether that date is appropriate and makes business sense.

Often, the cutoff date will be adjusted one way or the other, based on review results.
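
The cutoff-and-sample method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each document is represented as a hypothetical (path, last_touched) pair, where last_touched is whichever date the organization chooses to rely on (creation, modification, or last access), and the 90-day review window and sample size are placeholder values, not recommendations.

```python
import random
from datetime import datetime, timedelta


def split_by_cutoff(docs, cutoff):
    """Partition (path, last_touched) pairs into keep / dispose lists.

    Documents touched on or after the cutoff are presumed to have
    business value; older ones are candidates for disposition.
    """
    keep = [d for d in docs if d[1] >= cutoff]
    dispose = [d for d in docs if d[1] < cutoff]
    return keep, dispose


def sample_near_cutoff(docs, cutoff, window_days=90, k=25, seed=0):
    """Pick up to k documents within window_days on either side of the cutoff.

    Reviewers can inspect this sample to judge whether the cutoff
    date should move earlier or later.
    """
    window = timedelta(days=window_days)
    near = [d for d in docs if abs(d[1] - cutoff) <= window]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(near, min(k, len(near)))
```

If reviewers find many valuable documents just on the older side of the line, the cutoff can be moved back and the split rerun.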

2. Leverage Existing Organizational Systems

The absence of a holistic document management or KM system does not mean that documents reside in an unorganized pile. Users typically create their own organizational systems for storing their materials, usually in the form of named and nested folders.

While it may make little sense to review thousands of legacy documents for their relevance, it is a much more manageable task to review a list of folder names. Even at the folder level, it may be possible to identify obsolete subject matter, as well as clearly current topics, and apply triage decisions more quickly as a result.

One advantage to this approach is that folder-level analysis requires minimal additional technology investment. Microsoft Windows itself has built-in utilities (in PowerShell or the legacy command prompt environment) to identify all the directories and sub-directories on a given network drive or file share. If the legacy documents have been stored in a SharePoint repository, the process is even simpler: SharePoint contains an “export to Excel” function that will create a spreadsheet containing all folder and sub-folder names, plus basic metadata (creator, creation date, etc.) that is typically helpful in assessing the age and relevance of the folder content.

A major disadvantage of this approach, of course, is that decisions based on folder name and metadata will miss unrelated materials that someone may have stored in a named folder. Further, old and new folder names alike may lack enough detail to reveal their contents, forcing greater reliance on folder-based metadata to make relevance decisions. However, given that the only alternative is to conduct a time-consuming file-by-file analysis of each document within a folder, the efficiency of folder-based analysis may still outweigh these potential downsides.

Even if it is not a complete solution, folder analysis still identifies groups of folders that should be migrated and others that should be deleted, leaving a smaller population that will require further analysis.
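
For organizations that prefer a script to the built-in Windows or SharePoint tools, the folder-level inventory discussed in this section can be approximated with Python's standard library. The sketch below is one hypothetical way to do it, not a description of any particular product: the function names and the choice of columns (relative folder path, file count, and the modification time of the newest file) are assumptions made for illustration.

```python
import csv
import os
from datetime import datetime

FIELDS = ["folder", "file_count", "newest_file_modified"]


def folder_inventory(root):
    """Walk root and return one row of summary metadata per folder."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        newest = None
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that cannot be read
            if newest is None or mtime > newest:
                newest = mtime
        rows.append({
            "folder": os.path.relpath(dirpath, root),
            "file_count": len(filenames),
            # Left blank for folders that contain no readable files
            "newest_file_modified": (
                datetime.fromtimestamp(newest).isoformat(timespec="seconds")
                if newest is not None else ""
            ),
        })
    return rows


def write_inventory(rows, csv_path):
    """Save the inventory as a CSV that reviewers can open in a spreadsheet."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Reviewers can then sort the resulting spreadsheet by the newest-file date to see which folders have been dormant the longest, keeping in mind the caveat above that a folder's name and dates say nothing certain about every file inside it.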

3. Bring Them All In — With A Disclaimer

A third approach to managing legacy data is to embrace the problem by migrating all existing documents into the new system. In such an approach, older materials would become full-text searchable, but their document profile information would include prominent disclaimers like “legacy” or “imported” to show that these documents are unvalidated as to their continuing relevance or accuracy.
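
As a rough illustration of what such a disclaimer might look like in practice, the Python sketch below models a document profile with a status flag. Everything here is hypothetical: the DocumentProfile fields, the "legacy" and "imported" values, and the idea of sorting validated documents ahead of unvalidated ones are assumptions for the example, and a real KM product will have its own profile schema and search ranking.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentProfile:
    """Hypothetical, minimal profile record for a document in the new system."""
    path: str
    status: str = "current"      # "current" or "legacy"
    validated: bool = True       # False until someone reviews the document
    tags: List[str] = field(default_factory=list)


def legacy_profile(path):
    """Build the profile applied to every bulk-imported legacy document."""
    return DocumentProfile(path=path, status="legacy",
                           validated=False, tags=["imported"])


def rank_results(profiles):
    """List validated documents first; original order is kept within each group."""
    return sorted(profiles, key=lambda p: not p.validated)
```

A flag like this does not remove low-value documents from the repository; it only makes their unreviewed status visible to the person searching.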

Importing all materials, with or without disclaimers, has the advantage of requiring no complex planning and will appease those who are concerned about the potential loss of historical organizational documents. However, these advantages must be balanced against multiple, powerful downsides, including the most critical: bringing in everything will noticeably reduce the power of the KM solution.

First, this approach almost always imports a significant amount of duplicative and draft material of limited value that clogs up search results. Running a search against every document ever created by any user in an organization will return many low-value results, calling into question the value of the entire search.

Second, bringing in so many similar or duplicate documents, from multiple locations, will provide competing drafts without offering versioning to understand the priority of one version over another, a key feature of modern KM solutions.

Third, importing all existing documents will take up considerable storage space, something that might make it much more expensive to host and back up the new document repository, especially if there are no policies to automatically age-out documents pursuant to some records retention schedule.

4. Leave Them In Place

A final approach to managing legacy documents is to set existing repositories to read-only mode and leave them where they are as permanent archives.

Typically, this approach is also paired with the ability of users to import individual files or groups of files from these archives into the new system as needed. This approach requires the least immediate work by IT professionals but offers the fewest benefits to both system users and IT.

With this approach, end users would need to consult two entirely separate systems to find previous work product, and the legacy files will not be searchable to the same degree or by using the same tools as the new files. As a practical matter, the legacy files will remain largely hidden until they become wholly irrelevant or forgotten.

For IT, maintaining existing file repositories will require maintaining the physical infrastructure on which they are stored. This may prove both expensive and difficult: hardware fails over time, and the cost of migrating data from one storage system to another as systems become obsolete may be substantial and may duplicate the cost of the KM system.

In the end, this approach delays making hard decisions, but at an increasing financial cost.


Knowledge Management, Information Governance, or Document Management systems are excellent tools for helping an organization manage its institutional information. However, these systems must be set up to offer genuine value to end users, and ideally the organization will find a way to consolidate all of its document-based (i.e., unstructured) work in the new system.

Such an approach will be more effective and cheaper to maintain over the long run than those approaches that include long-term access to any part of the pre-existing storage system, even in read-only mode.