Detecting structured data loss

admin
August 21, 2009

Loss of large numbers of credit cards is no longer news. DLP (data loss prevention) technologies are an excellent way of obtaining real-time monitoring capability without changing your network or enterprise application systems.
Typically, when companies consider a DLP solution, they start by looking at the offerings from security vendors like Fidelis Security, Verdasys, McAfee, Symantec, InfoWatch or Websense.

As tempting as it may seem to lean back, listen to vendor pitches and learn from them (it is, after all, their specialty), I’ve found that when this happens you become preoccupied with evaluating security technology instead of evaluating business information value.

By starting an evaluation of security countermeasures with an assessment of asset value, and by focusing on mitigating threats to the highest-value assets in the business process, we dramatically reduce the number of data loss signals we need to detect and process.

By focusing on a small number of important signals (for example, a file transfer of over 500 credit card records over FTP channels) we reduce the number of signals the security team needs to process and dramatically improve the signal-to-noise ratio.
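To make that concrete, here is a minimal Python sketch of such a rule. It is purely illustrative (the regular expression, threshold and function names are assumptions for this post, not any DLP product's API); a real engine reassembles the FTP data channel and validates card data much more carefully.

import re

# Naive check for the "over 500 credit card records over FTP" signal above.
CARD_PATTERN = re.compile(r"(?<!\d)(?:\d[ -]?){15,16}(?!\d)")
RECORD_THRESHOLD = 500  # alert only above this count to keep the noise down

def luhn_ok(candidate):
    """Luhn checksum, used to weed out random digit strings."""
    digits = [int(d) for d in re.sub(r"\D", "", candidate)][::-1]
    total = sum(digits[0::2])
    total += sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def bulk_card_transfer(payload):
    """Return True if the payload looks like a bulk credit card export."""
    hits = [m.group() for m in CARD_PATTERN.finditer(payload) if luhn_ok(m.group())]
    return len(hits) > RECORD_THRESHOLD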

With fewer data loss signals to process, the data security team can focus on continuous improvement and refinement of the DLP signatures and the data loss incident response process.
As we will see later in this post, it is important to select appropriate methods for data loss signal detection in order to obtain a high signal-to-noise ratio.
A common data security use case is protecting Microsoft Office documents on personal workstations from being leaked to competitors. In 2003 Gartner estimated that business users spend 30 to 40 percent of their time managing documents. In a related vein, Merrill Lynch estimated that over 85 percent of all business information exists as unstructured data.
The key question for enterprise information protection is value – not quantity.
Ask yourself – what is your most valuable asset and where is it stored?
For a company developing automated vision algorithms, the most valuable assets would be inside unstructured files stored in engineers’ workstations – working design documents and software code. For a customer service business the most valuable assets are in structured datasets stored in database servers and data warehouses.
The key asset for a customer service business (retail, e-Commerce sites, insurance companies, banks, cellular providers, telecommunications service providers and government agencies) is customer data.
Customer data stored in large structured databases includes billing information, customer contract information, CDRs (call detail records), payment transactions and more. Customer data stored in operational databases is vulnerable due to the large number of users who access and handle it: not only salaried employees but also contractors and business partners.
Due to the high level of external network connectivity to agents and customers using on-line insurance portals, one of the most important requirements for an insurance company is the ability to protect customer data in different formats and across multiple inbound/outbound network channels.
This is important both from a privacy compliance perspective (complying with EU and American privacy regulation) and from a business security perspective (protecting the data from being stolen by competitors).
Fidelis XPS Smart Identity Profiling provides a powerful way to automatically identify and protect policy holders’ information without having to scan databases and files in order to generate fingerprints.
Fidelis XPS operates on real-time network traffic (up to 2.5 Gbps) and implements multiple layers of content interception and decoding that “peel off” common compression, aggregation, file formats and encoding schemes, and extract the actual content in a form suitable for detection and prevention of data leakage.
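As a rough illustration of that idea (this is not Fidelis code, and the choice of decoders is an assumption for the example), the following Python sketch keeps unwrapping common encodings until the payload stops changing, then hands the plain content to the detection rules.

import base64
import gzip
import zlib

def peel_layers(payload, max_depth=5):
    """Repeatedly strip base64/gzip/zlib wrappers from a byte payload."""
    for _ in range(max_depth):
        for decode in (lambda b: base64.b64decode(b, validate=True),
                       gzip.decompress,
                       zlib.decompress):
            try:
                unwrapped = decode(payload)
            except Exception:
                continue
            if unwrapped and unwrapped != payload:
                payload = unwrapped
                break
        else:
            return payload  # nothing left to peel
    return payload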
Smart Identity Profiling
Unlike keyword scanning and digital fingerprinting, Smart Identity Profiling captures the essential characteristics of a document or a structured data set while tolerating the significant variance that is common across database updates and a document’s lifetime: editing, branching into several independent versions, sets of similar documents, etc. It can be considered the successor to both keyword scanning and fingerprinting, combining the power of both techniques.
Keyword Scanning is a simple, relatively effective and user-friendly method of document classification. It is based on a set of very specific words, matched literally in the text. Dictionaries used for scanning include words inappropriate in communication, code words for confidential projects, products or processes, and other words that raise suspicion independently of the context of their use. Matching can be performed by a single-pass matcher based on a setwise string matching algorithm.

As anybody familiar with Google can attest, the signal-to-noise ratio of keyword searches varies from good to unacceptable, depending on the uniqueness of the keywords themselves and the exactness of the mapping between the keywords and concepts they are supposed to capture.
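A toy version of such a scanner is sketched below in Python. Production scanners use a true setwise algorithm such as Aho-Corasick; for a small dictionary, a compiled alternation gives the same single-pass behaviour. The code words are invented for the example.

import re

DICTIONARY = ["project aurora", "codename falcon", "policy master file"]
MATCHER = re.compile("|".join(re.escape(word) for word in DICTIONARY), re.IGNORECASE)

def keyword_hits(text):
    """Return every dictionary keyword found in the text, in order of appearance."""
    return [match.group().lower() for match in MATCHER.finditer(text)]

For instance, keyword_hits("status of Project Aurora attached") returns ["project aurora"].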

Digital Fingerprinting (DF) is a technique designed to pinpoint the exact replica of a certain document or data file with a rate of false positives approaching zero. The method is calculation of message digests with a secure hash algorithm (SHA-1 and MD5 are popular choices). Websense uses PreciseID, a sliding-hash variation on the DF technique that is more robust than plain DF for unstructured data, but it still requires frequent updates of the signatures and is unsuitable for protecting information in very large customer databases: the computation required is heavy, and the need to access customer data and store the signatures creates an additional data security vulnerability.
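The plain DF approach fits in a few lines of Python (this is a sketch of generic fingerprinting, not PreciseID, and the protected content is a made-up placeholder): store a digest per protected file and flag any outbound payload whose digest matches. A single-byte edit defeats the exact match, which is precisely why the sliding-hash variants were introduced.

import hashlib

def fingerprint(data):
    """SHA-1 message digest of a byte string."""
    return hashlib.sha1(data).hexdigest()

# In practice this set is built by walking the repositories that hold the
# sensitive documents; here it holds a single illustrative entry.
PROTECTED_DIGESTS = {fingerprint(b"confidential design document, revision 3")}

def is_protected(outbound_payload):
    return fingerprint(outbound_payload) in PROTECTED_DIGESTS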
Here is an example of a Fidelis XPS Smart Identity Profile that illustrates the simplicity and power of XPS.

# MCP3.0 Profile
# name: InsurancePolicyHolders
# comments: Policy Holders
# threshold: 0
pattern:    MemoNo    P[A-Z][A-Z]
pattern:    BusinessUnitName    PZUInternational
pattern:    ControlNo    \d{9}
pattern:    PolicyNo    4\d{7}
use:    DateOfPolicy(PolicyNo,Date,Name,Phone,e_mail):Medium
use:    Medication(PolicyNo,Drug_Name,Name,Phone):Medium
use:    NamePhonePolicyNo(BusinessUnitName,PolicyNo,Name,Phone):Medium
------------------------------
prob: DateOfPolicy 0.200 0.200 0.200 0.200 0.200
prob: Medication 0.201 0.398 0.201 0.201
prob: NamePhonePolicyNo 0.000 0.333 0.333 0.333

As you can see in the above example, Smart Identity Profiling uses tuples of data fields; for example, the DateOfPolicy tuple contains 5 fields (PolicyNo, Date, Name, Phone and e_mail address). Although the probability of not detecting a single field might be fairly high, the probability of not detecting a given tuple of 5 fields is the product of the 5 individual miss probabilities. For example, if the miss probability of a single field is 70%, the probability of missing the entire tuple is only 0.7^5, about 16.8%.
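A quick way to check that arithmetic in Python:

# If a single field slips past detection 70% of the time, the whole
# 5-field tuple slips past only 0.7 ** 5 of the time.
single_field_miss = 0.70
tuple_miss = single_field_miss ** 5
print("tuple miss probability: {:.1%}".format(tuple_miss))  # ~16.8%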
SIP (Smart Identity Profiling) is used successfully in Fidelis XPS appliances at gigabit deployments at large insurance companies like PBGC and telecommunication service providers like 013 and Netia.
