Engineering Insights is an ongoing blog series that gives a behind-the-scenes look into the technical challenges, lessons and advances that help our customers protect people and defend data every day. Our engineers write each post, explaining the process that led up to a Proofpoint innovation.
When analysts investigate security incidents, they must navigate complex, ever-expanding schemas that contain thousands of nested fields and enum values. Querying this large data set efficiently demands intimate knowledge of the schema and the investigation workflow, as well as the ability to refine an investigation iteratively.
We built an AI-powered security investigation assistant that lets analysts interact with security data using natural language. It simplifies query formulation and guides investigations with context-aware recommendations. This blog post explores how we built it.
Translating natural language into a DSL query
Our assistant dynamically suggests the right questions to ask based on the tenant’s current state and previous investigation steps. It speeds up security workflows and ensures comprehensive analysis.
One of its key components translates natural-language queries into a domain-specific language (DSL), which in turn queries our backend data stores.
When an analyst asks a question in natural language, the tool translates it into a structured DSL query. The generated query is then rendered in an exploration window as a sequence of connected steps, including filters, time ranges and logical conditions.
Figure 1 below shows a snippet from the overall workflow.
Figure 1: AI-powered exploration query.
This intuitive representation enhances visibility into the data exploration process. It also helps analysts refine their investigation with minimal effort.
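To make this concrete, here is a minimal sketch of what that translation step can look like, assuming a generic chat-completion client and a simple JSON-shaped DSL. The function names, prompt and query format below are illustrative stand-ins, not our production code.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DslStep:
    """One step in the rendered exploration: a filter, time range or condition."""
    kind: str                       # e.g. "filter", "time_range", "condition"
    params: dict = field(default_factory=dict)


SYSTEM_PROMPT = (
    "You translate analyst questions into a JSON query.\n"
    "Use only fields and enum values from the schema below.\n"
    'Respond with: {"steps": [{"kind": "...", "params": {}}]}\n'
    "Schema:\n"
)


def translate_to_dsl(question: str, schema_snippet: str, llm_complete) -> list[DslStep]:
    """Ask the LLM for a structured query, then parse it into typed steps.

    `llm_complete` stands in for whatever chat-completion client is used;
    in practice the raw output is validated and repaired if malformed.
    """
    raw = llm_complete(system=SYSTEM_PROMPT + schema_snippet, user=question)
    parsed = json.loads(raw)
    return [DslStep(kind=s["kind"], params=s.get("params", {})) for s in parsed["steps"]]


# For "show risky file downloads from the last 7 days", the parsed steps might be:
#   [DslStep("time_range", {"last": "7d"}),
#    DslStep("filter", {"field": "activity.type", "eq": "file_download"})]
```

Rendering these typed steps, rather than raw query text, is what lets the exploration window display the query as an editable chain of filters and conditions.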
Key challenges
One of our key challenges was grounding the large language model (LLM) in facts with high-quality labeled data. To overcome it, we used existing investigations and the database schema as our primary data sources. However, both presented significant data quality issues: many past investigations were poorly labeled and lacked clear titles and descriptions.
Another challenge was that the database schema lacked intuitive descriptions for many fields and enum values, which limited the LLM's ability to generate insightful queries.
Improving our seed data
In addition to human annotations, we used LLMs to improve the quality of our seed data in two key ways:
- Schema enrichment. We fed the raw database schema into an LLM and iteratively generated descriptions for each field and enum value (see the sketch after this list).
- Exploration labeling. We then used an LLM to analyze past investigations with the help of the schema descriptions we had generated, producing meaningful titles and descriptions for each one. This structured labeling improved the quality of our seed data and built a robust contextual foundation for processing free-form user queries.
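For illustration, a schema-enrichment pass might look like the sketch below; the schema layout and the `llm_complete` client are assumptions, not our actual pipeline.

```python
def enrich_schema(raw_schema: dict, llm_complete) -> dict:
    """Walk a raw schema and attach LLM-generated descriptions.

    `raw_schema` maps field names to specs, e.g.
    {"threat.status": {"type": "string", "enum": ["blocked", "delivered"]}};
    this layout and the `llm_complete` client are illustrative assumptions.
    """
    enriched = {}
    for field_name, spec in raw_schema.items():
        prompt = (
            f"In one sentence, describe the security log field '{field_name}' "
            f"(type: {spec.get('type', 'unknown')})."
        )
        spec = dict(spec, description=llm_complete(prompt))
        # Enum values get their own descriptions so the model can map analyst
        # phrasing ("was it blocked?") onto exact values ("blocked").
        if "enum" in spec:
            spec["enum_descriptions"] = {
                value: llm_complete(
                    f"In one sentence, describe the value '{value}' of the "
                    f"field '{field_name}'."
                )
                for value in spec["enum"]
            }
        enriched[field_name] = spec
    return enriched
```

The enriched schema then feeds the exploration-labeling pass, giving the LLM enough context to title and describe past investigations accurately.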
Architecture
The graphic below illustrates the training and inference pipelines at a high level.
Figure 2: Architecture illustrating training and inference pipelines at a high level.
Throughout, we adhered strictly to privacy and compliance requirements: no PII or user details are shared with any LLMs or used for training. We also fine-tuned a custom text embedding model to ensure domain relevance for Proofpoint-specific vocabulary, such as PSAT and TAP.
We trained the text embedding model using SBERT, a contrastive loss function and manually annotated data sets. We evaluated it with metrics like BLEU and ROUGE, plus a custom accuracy metric, derived from a gold-standard data set, that allows partial credit for filter matches. Finally, to enhance inference efficiency, we applied model distillation, with Claude Sonnet as the teacher model and Haiku as the student.
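As a rough sketch of the contrastive fine-tuning and scoring steps, the snippet below uses the open-source sentence-transformers (SBERT) library. The base checkpoint, the toy training pairs and the `filter_match_score` function are our illustrative assumptions, not the production model, data set or exact metric.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Base checkpoint and training pairs are stand-ins; the production model,
# annotated data and hyperparameters are internal.
model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    # label 1.0 marks a semantically similar pair, 0.0 a dissimilar one
    InputExample(texts=["PSAT completion rates",
                        "security awareness training progress"], label=1.0),
    InputExample(texts=["TAP permitted clicks",
                        "employee vacation calendar"], label=0.0),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.ContrastiveLoss(model=model)

# Contrastive fine-tuning: pull similar pairs together, push dissimilar apart.
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)


def filter_match_score(predicted_filters: set, gold_filters: set) -> float:
    """One plausible shape for the partial-credit metric: the fraction of
    gold-standard filters that the generated query recovered."""
    if not gold_filters:
        return 1.0 if not predicted_filters else 0.0
    return len(predicted_filters & gold_filters) / len(gold_filters)
```

A partial-credit score like this rewards a generated query that recovers three of four gold-standard filters instead of marking it entirely wrong, which gives a smoother signal for comparing model variants.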
Conclusion
Our security assistant enables analysts to ask questions in natural language. In response, they receive actionable insights and recommendations that are derived from trillions of data points across our Nexus platforms.
It’s available in the Proofpoint Information Protection platform as a technology preview. In the future, it will be available in Proofpoint Threat Protection and Identity Threat Defense.
Join the team
At Proofpoint, our people—and the diversity of their lived experiences and backgrounds—are the driving force behind our success. We have a passion for protecting people, data and brands from today’s advanced threats and compliance risks.
We hire the best people in the business to:
- Build and enhance our proven security platform
- Blend innovation and speed in a constantly evolving cloud architecture
- Analyze new threats and offer deep insight through data-driven intelligence
- Collaborate with our customers to help solve their toughest cybersecurity challenges
If you’re interested in learning more about career opportunities at Proofpoint, visit the careers page.
About the authors
Ram Kulathumani is a senior manager at Proofpoint, leading machine learning (ML) research and development for impactful security use cases. Recently, his primary focus has been on fine-tuning large transformer models, prompt engineering, evaluating LLMs and optimizing ML inference performance. Outside of work, he enjoys playing badminton and teaching chess to his son.
Addison Beall is a software engineer at Proofpoint and is part of a team working to add useful AI integrations to the existing platform. He has spent the past few years becoming familiar with the LLM landscape and how they can best be used to enhance the analyst experience.
Chris Covney is a staff software engineer at Proofpoint where he focuses primarily on important elements of the Information Protection platform, such as distributed systems, microservice design and data persistence. Currently, he spends most of his time thoughtfully studying and carefully applying AI/ML practices to grow the AI/ML pipeline and its capabilities. When not at work, he is usually found jogging or playing jazz guitar.
Khurram Ghafoor is a senior director at Proofpoint who specializes in data platform architecture and AI/ML pipelines. With over 20 years of experience, he brings expertise in software development, scalable architecture, and the application of ML and AI technologies.