Unstructured Data Fingerprinting is used to protect data such as Word, PDF, PowerPoint, or CAD documents. It is also known as Indexed Document Matching or IDM. Potentially sensitive or confidential data is first identified by the organization and then fingerprinted by the Skyhigh CASB Fingerprinting process. This process includes extracting the text and data, normalizing it, and securing it using multiple overlapping hashes. By employing multiple hashes, the policy can be set to look for the document as a whole or just a percentage of its content.
Unstructured fingerprints, also called Indexed Document Matching (IDM) fingerprints, work by matching content against indexed documents and images.
With Unified IDM, you can now fingerprint non-text-based files such as images. Because these files contain no text to extract, a digest match is performed during policy evaluation, which detects exact matches only.
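The process described above (extract, normalize, hash with overlapping windows, then match on a percentage of content) and the digest match for non-text files can be illustrated with a minimal sketch. This is not the Skyhigh CASB implementation: the function names, the choice of SHA-256, and the 5-word window size are all assumptions made for illustration.

```python
import hashlib
import re

WINDOW = 5  # words per overlapping window (assumed value, not a product setting)


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(text: str, window: int = WINDOW) -> set[str]:
    """Hash every overlapping run of `window` words (hypothetical scheme)."""
    words = normalize(text)
    if not words:
        return set()
    if len(words) < window:
        # Too short for a full window: hash the whole text once.
        return {hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()}
    return {
        hashlib.sha256(" ".join(words[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(words) - window + 1)
    }


def match_percentage(indexed: set[str], candidate_text: str) -> float:
    """Percentage of the indexed document's hashes found in the candidate."""
    if not indexed:
        return 0.0
    found = indexed & fingerprint(candidate_text)
    return 100.0 * len(found) / len(indexed)


def digest(data: bytes) -> str:
    """Whole-file digest for non-text files; only identical bytes match."""
    return hashlib.sha256(data).hexdigest()
```

In this sketch, a policy threshold of 100 would require the whole indexed document to appear in the candidate, while a lower threshold would trigger on a partial copy. Because each window overlaps its neighbors, an excerpt lifted from the middle of a document still reproduces some of the indexed hashes.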
Benefits of Unstructured Fingerprint
The following benefits of Unstructured fingerprints are applicable only for Enhanced IDM.
- Supported data includes the file types listed in Supported File Formats.
- IDM can reduce false positives with ignored text.
- API support to automate fingerprint workloads for real-time production.
- Archive files such as ZIP files are not automatically excluded.
- Each individual input file is limited to 500 MB, but multiple input files can be indexed together. In addition, files smaller than 64 bytes are ignored.
- Support for 200 million signatures. The following are the findings for various file formats and the number of signatures (sigs).
The first column shows how many GB of input data results in 1 million signatures.
- For XLSX, you would be able to fingerprint 200 x 0.24 GB = 48 GB
- For PPTX, you would be able to fingerprint 200 x 20.22 GB = 4,044 GB (about 4 TB)
- For DOCX, you would be able to fingerprint 200 x 3.78 GB = 756 GB
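The limits above can be checked with a short calculation. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that "500 MB" means 500 x 1024 x 1024 bytes, which the documentation does not specify.

```python
MAX_SIGNATURES_MILLIONS = 200  # 200 million signatures supported

# GB of input data that produces 1 million signatures, per file format
GB_PER_MILLION_SIGNATURES = {"XLSX": 0.24, "PPTX": 20.22, "DOCX": 3.78}

MIN_FILE_BYTES = 64                 # smaller files are ignored
MAX_FILE_BYTES = 500 * 1024 * 1024  # assumed interpretation of "500 MB"


def max_input_gb(file_format: str) -> float:
    """Approximate GB of one format that fits within the signature limit."""
    return MAX_SIGNATURES_MILLIONS * GB_PER_MILLION_SIGNATURES[file_format]


def is_indexable(size_bytes: int) -> bool:
    """True if a single input file falls within the per-file size limits."""
    return MIN_FILE_BYTES <= size_bytes <= MAX_FILE_BYTES


if __name__ == "__main__":
    for fmt in GB_PER_MILLION_SIGNATURES:
        print(f"{fmt}: {max_input_gb(fmt):,.0f} GB")
```

These figures assume that the whole signature budget is spent on a single file format. A mixed corpus would land somewhere between the per-format extremes.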