This blog is authored by Mohamed Afifi Ibrahim, Principal Machine Learning Engineer at Barracuda Networks.
74% of organizations globally have fallen victim to a phishing attack. Barracuda Networks is a global leader in security, application delivery and data protection solutions, helping customers fight phishing attacks at scale. Barracuda has built a powerful artificial intelligence engine that uses behavioral analysis to detect attacks and keep malicious actors at bay.
Handling phishing emails is difficult because attackers now craft malicious emails with considerable sophistication. Barracuda Networks uses machine learning to assess and identify malicious messages and protect its customers. Using ML on the Databricks Lakehouse Platform, the Barracuda team has been able to move much faster and is now blocking tens of thousands of malicious emails daily from reaching millions of mailboxes across thousands of customers.
Providing Comprehensive Email Security Protection
The Barracuda team is dedicated to detecting phishing attacks and providing customer security. They achieve this by working on top of Microsoft Office 365 and analyzing the email stream for any possible threats. If an attack is detected, it is immediately removed from the mailbox before users can see it.
One of the key products that Barracuda offers is impersonation protection. Impersonation occurs when malicious actors disguise their messages as coming from an official source, such as a known executive or service. Attackers can use this technique to access confidential information, posing a significant risk to individuals and organizations alike.
Impersonation protection is focused on deterring targeted phishing attacks. Such attempts are not sent in vast quantities, unlike spam emails. To send a targeted attack, the attacker must have personal details about the recipient to customize it, such as their profession or field of work. To identify and block impersonation phishing attacks, the team had to build a set of classification models and deploy them into production for our users.
Difficulties with Feature Engineering
To properly train our AI models to detect phishing and impersonation attacks, Barracuda needed to use the right data and do feature engineering on top of that data. The data included email text, which could signal a phishing attack, and statistical data, such as details about the email sender. For example, if a user receives an invoice email from someone who hasn’t sent a similar email over the last few months, this could signal a risk of a phishing attack. Before the Databricks integration, building features was more difficult because the labeled data was spread over multiple months, particularly for the statistical features. Additionally, keeping track of the features as our data set grew in size was challenging.
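The invoice example above can be sketched as a small statistical feature. This is a minimal illustration in plain Python, not Barracuda's actual feature code: the record layout, the `days_since_similar` and `is_unusual` helpers, and the 90-day threshold are all assumptions made for the example.

```python
from datetime import date

# Hypothetical labeled history: (sender, recipient, email_type, date_received).
history = [
    ("vendor@example.com", "amy@example.org", "invoice", date(2023, 1, 10)),
    ("vendor@example.com", "amy@example.org", "invoice", date(2023, 2, 9)),
    ("ceo@example.org", "amy@example.org", "request", date(2023, 3, 1)),
]


def days_since_similar(history, sender, recipient, email_type, today):
    """Days since this sender last sent this recipient this type of email.

    Returns None when no earlier similar email exists.
    """
    previous = [
        received
        for s, r, t, received in history
        if (s, r, t) == (sender, recipient, email_type) and received < today
    ]
    if not previous:
        return None
    return (today - max(previous)).days


def is_unusual(gap_days, threshold_days=90):
    """Flag senders with no similar email inside the lookback window (assumed 90 days)."""
    return gap_days is None or gap_days > threshold_days
```

A feature like this only becomes one input to a classifier. On its own it does not decide whether an email is an attack.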
Our team kept the code and model separate and had to duplicate research code for the production environment, which took time and energy. We would first pass each incoming email through the preprocessing code and then pass the preprocessed emails to the model for inferencing.
Barracuda Finds Success Using Databricks
The Barracuda team leveraged machine learning on the Databricks Lakehouse Platform, specifically using the Databricks Feature Store and Managed MLflow, to improve the ML process and deploy better quality models faster.
The Databricks Feature Store serves as the single repository for all of the features used by the Barracuda team. Labeled data is used in feature engineering to create and maintain statistical features that are continually updated with fresh batches of incoming emails. Because Feature Store is built on top of Delta, no extra processing is required to convert labeled data to features, and the features remain current. Features are kept in an offline repository, and snapshots of them are then published online for use in online inference. Additionally, because Databricks Feature Store is integrated with MLflow, models in MLflow can readily call these features, and a model can retrieve the features it needs at the moment an email comes through for inference.
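The offline-to-online flow described above can be pictured with a small sketch. It uses plain Python dictionaries rather than the Databricks Feature Store API. The `offline_rows` table, the `emails_90d` feature, and both helper functions are hypothetical names chosen for illustration.

```python
from datetime import date

# Hypothetical offline feature table: one row per sender per daily batch.
offline_rows = [
    {"sender": "vendor@example.com", "as_of": date(2023, 3, 1), "emails_90d": 4},
    {"sender": "vendor@example.com", "as_of": date(2023, 3, 2), "emails_90d": 5},
    {"sender": "new@example.net", "as_of": date(2023, 3, 2), "emails_90d": 0},
]


def publish_snapshot(rows):
    """Keep only the most recent row per sender, keyed for fast lookup."""
    latest = {}
    for row in rows:
        current = latest.get(row["sender"])
        if current is None or row["as_of"] > current["as_of"]:
            latest[row["sender"]] = row
    return latest


def lookup_features(online_store, sender):
    """Fetch features for an incoming email; unseen senders get a default of 0."""
    row = online_store.get(sender)
    return {"emails_90d": row["emails_90d"] if row else 0}


online_store = publish_snapshot(offline_rows)
```

The offline table keeps the full history for training, while the online snapshot holds only the latest value per key, which is what an inference request needs.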
Faster Machine Learning Operations
The other advantage is managing all the machine learning models in MLflow. With MLflow, the team can move all the code inside the model, so an email can simply pass through the model for inference instead of first going through separate preprocessing code, as was done before. This makes inference simpler and faster. By using MLflow, the Barracuda team is able to build fully self-packaged models. This capability greatly reduces the time the team spends developing ML models.
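A self-packaged model can be sketched as a class that carries its own preprocessing. The example below is plain Python and makes several assumptions: the `SelfPackagedModel` class and its keyword weights are hypothetical stand-ins for a trained classifier. With MLflow, one common way to express the same pattern is to subclass `mlflow.pyfunc.PythonModel`, though the post does not say which approach the team used.

```python
import re


class SelfPackagedModel:
    """Bundles preprocessing and scoring behind a single predict() call."""

    # Hypothetical keyword weights standing in for a trained classifier.
    WEIGHTS = {"invoice": 0.4, "urgent": 0.3, "wire": 0.3}

    def _preprocess(self, raw_email):
        # Lowercase and tokenize; this logic ships with the model itself.
        return re.findall(r"[a-z]+", raw_email.lower())

    def predict(self, raw_emails):
        """Score raw email text directly; callers never preprocess."""
        scores = []
        for raw in raw_emails:
            tokens = set(self._preprocess(raw))
            score = sum(w for word, w in self.WEIGHTS.items() if word in tokens)
            scores.append(round(min(score, 1.0), 2))
        return scores


model = SelfPackagedModel()
```

Because the caller passes raw text straight to `predict`, research and production share one code path and there is no separate preprocessing step to duplicate.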
Higher Detection Rate
With Databricks, the team has more time and more compute, enabling them to publish a new table frequently in Delta, update the features every day, and use these features to tell whether an incoming email is an attack. This results in higher accuracy in detecting phishing attacks and improves customer protection and satisfaction.
With the help of Databricks, Barracuda protects users from email attacks worldwide. Each day the team blocks tens of thousands of malicious emails from reaching customers’ mailboxes. The team is looking forward to continuing to implement new Databricks features to enhance our customers’ experience further.