Advanced Data Classification in DLP: EDM, IDM, OCR

Advanced Data Identification in DLP: What Is EDM, IDM, OCR and AI-Powered Classification?


August 14, 2023 Itir Clarke and Alin Mutu

The shift to remote work, an evolving cybersecurity landscape, and higher volumes of data all add more burden on information security teams, which are already overstretched. Data loss prevention (DLP) can help teams keep an eye out for data policy violations and quickly respond to incidents where there’s a potential for data loss. DLP alerts can provide early warnings of cyber attacks or insider threats and prevent sensitive data from being lost or stolen.

When teams constantly get alerts, however, they can become desensitized to them. That’s why it’s important to improve the accuracy of alerts. The way to do that is to deploy advanced content-matching methods, such as Exact Data Matching (EDM) and Indexed Document Matching (IDM). These methods help organizations detect and protect the sensitive datasets and documents that are unique to their businesses. Organizations can further augment DLP with artificial intelligence (AI) to classify files in the cloud and in on-premises file repositories, and with Optical Character Recognition (OCR) to identify sensitive data in images.

In this blog, we’ll explain what you should know about advanced data identification and how Proofpoint can help.

What are EDM and IDM?

Over the past two years, insider threats have grown 44%. Organizations are very concerned about protecting customer data, intellectual property, business-critical information and other sensitive data. Advanced content-matching methods, like EDM and IDM, are an important aspect of data loss prevention.

EDM detects structured data in a file or message. For example, healthcare organizations collect patient data, including the patient’s name, address, medical record number and credit card number. Put together, this data is Protected Health Information (PHI) and Personally Identifiable Information (PII). PHI and PII are subject to data privacy laws and must be protected. Proofpoint Enterprise DLP’s EDM service helps organizations increase detection accuracy for this type of structured data. The service allows us to detect confidential information that is unique to an organization by matching text to values in a dataset or specified column when scanning content across email, cloud and web.

IDM compares the overall content of an unstructured file or message to a known source based on a full or partial match—such as 40% of a file’s content. For example, if a malicious insider tries to exfiltrate a new product data sheet prior to launch using the approach that Figure 1 illustrates, Proofpoint Enterprise DLP can compare the file to the indexed document, confirm the match and remediate the policy violation by removing the file-sharing permissions of other users.


Figure 1. Malicious insider in a manufacturing company exfiltrating confidential new product data sheet via cloud sharing prior to launch.

How do Proofpoint EDM and IDM services work?

In the case of EDM, structured data in a CSV file format is secured as a hash, statistically analyzed and uploaded to the Proofpoint EDM service in the cloud. The EDM service then uses this hashed dataset as a source to identify sensitive data matches. EDM-based DLP detectors can match a single value or combination of values from the uploaded dataset as well as more complex expressions, such as analyzing proximity to other sensitive data described in dictionaries or smart identifiers, for maximum efficacy.

Organizations can identify potential violations for this highly sensitive data using customer-defined policies. Proofpoint’s EDM service is highly scalable and supports datasets as large as tens of millions of rows with more than 10 columns.
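The EDM flow described above (hash the dataset, then match scanned content against the hashes) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Proofpoint's implementation: the salted SHA-256 scheme, the normalization rule, the single-token matching and every name here (`SALT`, `build_index`, `find_matches`, `min_columns`) are hypothetical.

```python
import csv
import hashlib
import io
import re

SALT = b"example-salt"  # hypothetical; a real deployment would keep this secret


def normalize(value: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def hash_value(value: str) -> str:
    """Salted SHA-256 of the normalized value."""
    return hashlib.sha256(SALT + normalize(value).encode("utf-8")).hexdigest()


def build_index(csv_text: str, columns: list[str]) -> list[dict[str, str]]:
    """Hash the selected columns of each CSV row; plaintext values are not kept."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: hash_value(row[col]) for col in columns} for row in reader]


def find_matches(text: str, index: list[dict[str, str]], min_columns: int = 2) -> list[int]:
    """Return the row numbers with at least min_columns hashed values present in text.

    Assumption: each dataset value is a single token (no embedded spaces).
    """
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)
    token_hashes = {hash_value(token) for token in tokens}
    hits = []
    for row_number, row in enumerate(index):
        matched = sum(1 for digest in row.values() if digest in token_hashes)
        if matched >= min_columns:
            hits.append(row_number)
    return hits


# Made-up sample data for illustration only.
DATASET = (
    "name,mrn,card\n"
    "Ada,MRN-1001,4111-1111-1111-1111\n"
    "Bob,MRN-2002,5500-0000-0000-0004\n"
)
index = build_index(DATASET, ["mrn", "card"])
message = "Please bill card 4111-1111-1111-1111 for patient MRN-1001."
print(find_matches(message, index))
```

Requiring a combination of columns (`min_columns=2` here) rather than a single value is one way to reduce false positives, in the spirit of matching "a single value or combination of values" from the dataset.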

IDM indexes sensitive unstructured text files as a rolling set of hashes and uploads them to our cloud service. Similar files are grouped in file sets, which are used in DLP detector expressions to define the desired matching percentage applied against any file in the respective set. When a file is shared in the cloud or uploaded to the web, the IDM service compares this file with the indexed file sets to arrive at a match percentage between 0 and 100. The DLP rule will generate an alert if the match percentage satisfies the threshold logic defined in the DLP detector.

When indexing files, organizations can also specify exclusion files that contain non-sensitive text, such as disclaimers, which need not be considered for matching.
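The IDM steps above (index a document as a set of hashes, subtract exclusion text, compute a match percentage and compare it to a threshold) can be sketched as follows. This is a rough illustration, assuming hashed word shingles stand in for the "rolling set of hashes" and that the match percentage is the share of the indexed document's shingles found in the candidate file. The shingle length `K`, the function names and the sample text are all hypothetical.

```python
import hashlib
import re

K = 4  # shingle length in words; a hypothetical choice


def shingle_hashes(text: str, k: int = K) -> set[str]:
    """Hash every run of k consecutive words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    }


def index_document(text: str, exclusion_text: str = "") -> set[str]:
    """Index a sensitive document, ignoring shingles that occur in the exclusion text."""
    return shingle_hashes(text) - shingle_hashes(exclusion_text)


def match_percentage(candidate: str, indexed: set[str]) -> float:
    """Percentage (0 to 100) of the indexed shingles that also appear in the candidate."""
    if not indexed:
        return 0.0
    return 100.0 * len(indexed & shingle_hashes(candidate)) / len(indexed)


def violates(candidate: str, file_set: list[set[str]], threshold: float = 40.0) -> bool:
    """True if the candidate meets the threshold against any indexed file in the set."""
    return any(match_percentage(candidate, indexed) >= threshold for indexed in file_set)


# Made-up sample text for illustration only.
DISCLAIMER = "this document is confidential and intended for internal use only"
SHEET = (
    "the falcon x9 drone flies ninety minutes per charge "
    "and carries a two kilogram payload " + DISCLAIMER
)
file_set = [index_document(SHEET, exclusion_text=DISCLAIMER)]
excerpt = "the falcon x9 drone flies ninety minutes per charge"
print(match_percentage(excerpt, file_set[0]), violates(excerpt, file_set))
```

Because the disclaimer is supplied as exclusion text, a file that contains only the disclaimer scores 0 against the indexed data sheet, while a partial copy of the sheet's body can still cross the threshold.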

What is OCR and why does it matter?

OCR extracts text from scanned forms, medical images, screenshots of sensitive content, PDFs and more. Once the text is extracted, you can use DLP detectors, dictionaries and rules to identify and prevent exfiltration of that sensitive data, whether it’s in standalone images or in images embedded in documents such as Microsoft Word or PowerPoint files. OCR also helps you identify behavior patterns, broken business processes and unsanctioned use of image-capturing techniques.
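To illustrate the second half of that pipeline, the sketch below runs a simple detector and a dictionary over text that is assumed to have already been extracted by an OCR engine (the OCR step itself is out of scope here). The detector, a card-number pattern confirmed with the Luhn checksum, and the names `CARD_PATTERN`, `luhn_valid` and `detect` are illustrative assumptions, not Proofpoint's detectors.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, then sum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def detect(ocr_text: str, dictionary: list[str]) -> list[tuple[str, str]]:
    """Return (detector, matched text) pairs found in OCR-extracted text."""
    findings = []
    for match in CARD_PATTERN.finditer(ocr_text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            findings.append(("card_number", match.group()))
    lowered = ocr_text.lower()
    for term in dictionary:
        if term.lower() in lowered:
            findings.append(("dictionary", term))
    return findings


# Made-up text standing in for the output of an OCR engine.
ocr_text = (
    "Patient intake form\n"
    "MRN 1001\n"
    "Card: 4111 1111 1111 1111\n"
    "Diagnosis: confidential"
)
print(detect(ocr_text, ["Diagnosis", "patient intake"]))
```

The checksum step matters for alert accuracy: a random 16-digit string in a screenshot will usually fail it, so it is not reported as a card number.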

Detecting sensitive data in images matters because the use of images in the workplace is expanding. For example, unstructured data, mostly medical images and notes scanned to PDF, can account for up to 80% of a patient's medical record. A good portion of intellectual property resides in design drawings and screen captures. And more users are capturing data in screenshots to take notes and collaborate quickly.

What is AI-powered classification?

Many organizations use manual processes to identify, classify and protect their data. As data volumes continue to increase, however, this approach is less viable and more prone to human error. AI-based technologies can help.

The Proofpoint data classification solution is built on a pre-trained AI engine to classify data at petabyte scale and with speed. Your data can be in the cloud or on-premises. The solution scans Microsoft 365, SharePoint on-premises and network-shared drives.

Using proprietary AI models, algorithms and a process that lets you validate results generated by our AI engine on a sample of documents, Proofpoint can achieve up to 99% accuracy in our automated data classification. After classifying data, the solution can recommend how best to prioritize which content to protect. It can also generate custom dictionaries and data detectors to augment your DLP platform.

These advanced data extracting, matching and classifying techniques help you cast a larger net and pull that net faster. You can capture more types of sensitive data, in more file formats and with better accuracy and efficiency.

Learn more

Watch our on-demand webinar series, Break The Attack Chain. This series can help you gain a deeper understanding of the evolving threat landscape and give you proactive strategies that you can use to break the attack chain at every stage.

Find out more about Proofpoint Enterprise DLP.
