Data Quality in the Age of Artificial Intelligence
An Operations Manager at an oil & gas company had a simple question:
“When was the catalyst last changed in Unit 2 of our facility?”
A seemingly innocent query, yet because the information was never properly stored or structured, the answer lived somewhere in:
- Shared drive documents
- Temporary repositories
- Third-party databases
- A pile of emails
Looking to remedy this endemic problem, the operations team invested in an AI-powered search and discovery tool.
Once the model was trained, the purpose-built solution scoured all these sources and generated a succinct narrative: the catalyst was last changed on April 2, 2016.
Given the sheer magnitude of the reviewed content, finding this was an incredible feat.
The only problem: the date was absolutely wrong. *
Questions About AI’s Capabilities
With the increased availability of AI-augmented chatbots and search tools, companies have understandably piloted new tools to take advantage of these disruptive solutions.
Although scopes vary, the general goal of these initiatives is to achieve data intelligence: to enable decision-makers at all levels to leverage existing data to make decisions quickly and accurately.
Potential Disruption
Taking this concept further, could an AI search tool surface accurate insights about assets regardless of how the underlying content is stored?
Does the quality of information across repositories even matter, or will the model learn to separate good and bad quality results?
Could AI replace the need for proper document management altogether?
The upside, if true, is substantial – users can work whichever way they want, storing information whenever and wherever. It’s understandable why companies are excited to pilot these emerging solutions. Imagine a workplace with:
- No data entry
- No filing
- Less document governance
- Less training required
- Less administration
Even pain points like duplication would be resolved, since the system could traverse the “haystack” itself rather than consuming countless hours of operator and engineer time to resolve them.
This could disrupt the software world and the management consultants who tirelessly work to align the people, processes, and content stored on these systems.
Of course, it’s all built on the assumption that the AI answers questions correctly. As the anecdote above illustrates, data quality must be a consideration. Despite the advances in potential capability, there’s still a need to build on a foundation of quality. While these pilot AI programs can deliver some of those benefits, it’s far from plug-and-play.
So Far, ‘Garbage In, Garbage Out’ Still Holds True
Was the failure from the opening anecdote the result of poor underlying data? Or was it that the AI was improperly configured and trained? The answer is likely a bit of both.
In the above case, the AI solution had expertly interrogated turnaround planning documentation and found a procedure that provided instructions on how to change the catalyst.
This information was all dated around when the turnaround was planned, but it was well known that the maintenance activity did not happen on schedule at that facility.
In fact, upon further human investigation, it was determined that there was no documented proof of when the catalyst was last changed. The true answer likely lives in an operator’s head, or in a hand-written record stored on a shelf.
The AI system couldn’t possibly know this. It interpreted the procedure/schedule combination as an executed task. The incorrect result introduced avoidable risks and costs to the organization.
Solving these Challenges
To prevent similar problems in the future, organizations can address four software-agnostic issues that hurt data intelligence.
1) Missing Information
Although search tools will continue to evolve and perhaps “imply” a result, the best way to ensure accuracy is to have the information itself in the first place.
Efforts should be undertaken to:
- Ensure quality turnover from EPCs to the in-house project team, handover from project to operations, and proper filing of day-to-day maintenance activities.
- Encourage individuals to store high-value information on non-personal devices and outside their inboxes.
- Maintain a centralized database for critical information as a single source of truth.
- Digitize and enrich high-value paper and physical media, so that legacy information can help train AI tools and valuable or important information isn’t missing from your databases. As a side benefit, migrating information stored in warehouse boxes to cloud databases reduces storage costs.
2) Missing Asset (Tag) Lists
Companies in process-heavy industries rely on the naming conventions of their assets to find and retrieve relevant information.
The lack of a standard list will hinder search and cleanup efforts, as ambiguity can cause confusion (e.g., the tank that holds the catalyst may be referred to as “T-1”, “V-1”, or “TNK01”).
Having a transformation rule at the outset that unifies or connects tag lists is invaluable.
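To make that concrete, here is a minimal sketch in Python of what such a transformation rule could look like. The alias map, canonical tag name, and normalization steps are illustrative assumptions, not an industry standard.

```python
# Minimal sketch of a tag-normalization rule. The alias map, tag names,
# and normalization steps are illustrative assumptions.
import re

# Hypothetical alias map: every known variant points to one canonical tag.
CANONICAL_TAGS = {
    "T-1": "TNK-001",
    "V-1": "TNK-001",
    "TNK01": "TNK-001",
}

def normalize_tag(raw: str) -> str:
    """Return the canonical tag for a raw reference, or the cleaned input."""
    cleaned = re.sub(r"\s+", "", raw).upper()  # strip whitespace, unify case
    return CANONICAL_TAGS.get(cleaned, cleaned)

for ref in ["t-1", "V-1 ", "TNK01"]:
    print(ref, "->", normalize_tag(ref))  # all three resolve to TNK-001
```

Applying a rule like this consistently at intake, rather than retroactively, keeps every repository speaking the same language about the same asset.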
3) Duplication
Although storing three extra copies of the same isometric drawing will not lead to worse insights, the “noise” these copies produce slows down the search process and increases costs for IT organizations (major cloud storage providers have increased pricing in recent years).
As employees and contractors within an organization transition to new roles, knowledge of which information is current leaves with them. A future project team could mistake an outdated backup for the source of truth, slowing later software implementations.
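Exact duplicates are often straightforward to surface before they mislead anyone. The sketch below, which assumes byte-identical copies and hypothetical file paths, hashes file contents and groups matching digests; near-duplicates such as revised drawings would need content-aware comparison instead.

```python
# Minimal sketch: find byte-identical duplicates by hashing file contents.
# The repository root and paths are hypothetical examples.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep only groups with copies."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}

for digest, paths in find_exact_duplicates("/data/drawings").items():
    print(f"{len(paths)} copies: {[str(p) for p in paths]}")
```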
4) Incorrect Information
The quality of stored information is critical to future retrieval and AI model training. The information that describes a document (metadata, folder path, format) assists with categorization and the quality of results.
A document with conflicting or ambiguous terms disrupts the search for that specific information and corrupts the training and configuration of AI-powered search tools in the future.
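One lightweight safeguard is to audit descriptive fields before documents are fed to a search index or used for training. The sketch below is illustrative only; the field names, sample record, and consistency rule are assumptions.

```python
# Minimal sketch of a pre-indexing metadata audit. Field names, the sample
# record, and the validation rule are illustrative assumptions.
REQUIRED_FIELDS = ("title", "asset_tag", "doc_type", "revision_date")

def audit_metadata(doc: dict) -> list[str]:
    """Return a list of quality issues found in one document record."""
    issues = [f"missing field: {f}" for f in REQUIRED_FIELDS if not doc.get(f)]
    # Example consistency rule: the folder path should mention the asset tag.
    tag, path = doc.get("asset_tag", ""), doc.get("folder_path", "")
    if tag and path and tag not in path:
        issues.append(f"asset tag {tag!r} not reflected in path {path!r}")
    return issues

record = {"title": "Catalyst change procedure", "asset_tag": "TNK-001",
          "doc_type": "procedure", "folder_path": "/unit2/V-1/procedures"}
print(audit_metadata(record))  # flags missing revision_date, tag/path mismatch
```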
Working With AI
Bottom line: innovative AI search solutions require accurate information to provide correct results.
Feeding software low-quality information and expecting correct answers is to invite inaccuracy and hallucinations.
For the best results, the underlying information should be cleaned, enriched, and/or excluded.
If you would like to discuss how your asset information can benefit from AI, our team at ReVisionz is ready to help.
The ReVisionz Intelligent Data & Insights practice has spent over twenty years enriching data for intelligent use. Our team of data scientists leverages proprietary machine learning, artificial intelligence, and natural language processing programs to help clients achieve their goals.
Reach out today to start a consultation.
*Disclaimer: Although the underlying premise of the article is true, company-specific elements were dramatized for anonymity and impact.