Skyhigh Security

Prepare the EDM (Enhanced) Fingerprint File

Fingerprint files are created automatically when you train the data source file using the DLP Integrator, which includes the EDMTrain tool. As a prerequisite for exact data matching, the data source file must be trained using the EDMTrain tool to generate the .props file and .dis file.

We recommend that you avoid common values in the data source file, because EDM gives the best results when the values are unique. Analyze the data that you want to protect (match) and make sure each value is a meaningful match value. All values in the data source file are normalized and hashed in the fingerprint file, regardless of the definition you use in classifications.
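As a rough pre-check before training, the sketch below counts values that repeat within a column of a tab-separated data source file. The function name `repeated_values` and the sample columns (`ssn`, `city`) are hypothetical and are not part of the DLP Integrator; the sketch only illustrates how you might spot common values before you run EDMTrain.

```python
import csv
import io
from collections import Counter


def repeated_values(rows, column):
    """Return {value: count} for non-blank values that appear more than once in `column`."""
    counts = Counter(row[column].strip() for row in rows if row[column].strip())
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample data; a real data source file would be read from disk.
sample = io.StringIO(
    "ssn\tcity\n"
    "123-45-6789\tSpringfield\n"
    "987-65-4321\tSpringfield\n"
    "555-12-3456\tShelbyville\n"
)
rows = list(csv.DictReader(sample, delimiter="\t"))

print(repeated_values(rows, "ssn"))   # no value repeats in this sample
print(repeated_values(rows, "city"))  # "Springfield" appears twice
```

A column such as `city` in this sample, where the same value recurs across records, is a weak candidate for a match value on its own.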

NOTE: The delimiter (separator) between columns must be a character that doesn't occur in the data, so we recommend using the .tsv file format rather than the .csv file format.
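If your data is currently in .csv format, a conversion along the following lines may help. This is a minimal sketch that uses the Python standard library, not a tool shipped with the DLP Integrator; the function name `csv_to_tsv` and the sample record are hypothetical. It rejects any cell that already contains a tab or a newline, since those characters would clash with the tab delimiter and the line terminator.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy CSV rows from `src` to `dst` as tab-separated lines.

    Raises ValueError if a cell contains a tab or a newline, because the
    delimiter and the row terminator must not occur in the data.
    """
    for line_no, row in enumerate(csv.reader(src), start=1):
        for cell in row:
            if "\t" in cell or "\n" in cell:
                raise ValueError(f"tab or newline found in row {line_no}")
        dst.write("\t".join(row) + "\n")


# Hypothetical sample: quoted CSV cells that contain commas.
src = io.StringIO('name,address\n"Doe, Jane","12 Main St, Apt 4"\n')
dst = io.StringIO()
csv_to_tsv(src, dst)
print(dst.getvalue())
```

The commas inside the quoted cells survive as ordinary data because the tab, not the comma, now separates the columns.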

Install the DLP Integrator

The DLP Integrator, which includes the EDMTrain tool, is supported on both Windows and Linux platforms.

For more information, see:

Generate the metadata (.props) and fingerprint (.dis) files using the EDMTrain tool

The data source file must be trained to generate the .props and .dis files, which are required for exact data matching and for scanning the user records. The .props file contains metadata, such as the number of columns detected, the column headers, fingerprint details, the file size, the database name, and the checksum of the .dis file. The .dis file contains the fingerprinted details of the user records in normalized and hashed format.

To train the data source file, run the edmtrain command with the following options. You can use the command line interface (CLI) directly or a third-party terminal client, such as PuTTY.

 
| CLI Option - Short form | CLI Option - Full form | Description |
| --- | --- | --- |
| -? | --help | Shows the EDMTrain tool help |
| -h | --header | Indicates that the first row is a header row |
| -s | --split-ids | Allows space-separated IDs, which can be split on any delimiter |
| -v | --verbose | Shows the verbose output |
| -u | --utf8 | Specifies that the input is UTF-8 encoded (automatic with a UTF-8 BOM) |
| -R | --redact | Removes input data from all output and messages |
| -E | --no-errors | Specifies not to generate error messages or enforce thresholds |
| -W | --no-warnings | Specifies not to generate warning messages |
| -e | --errors % (=5) | Specifies the error threshold as a percentage |
| -m | --memory gb (=0) | Allocates memory; you can specify between 1 GB and 8 GB of RAM |
| -D | --db-name name | Specifies the database name (default is based on the output file name) |
| -c | --charset name | Specifies the charset encoding of the input file (if there is no UTF-8 BOM) |
| -n | --newline byte (=\n) | Specifies the newline terminator in bytes (before decoding) |
| -p | --progress [=secs(=2)] | Shows the progress at the specified interval, in seconds |
| -d | --delimiter char (=\t) | Specifies the cell delimiter character (up to U+FFFF) |
| -b | --non-blank n (=0) | Specifies the minimum number of non-blank cells per row |
| -P | --primary column_name (or column_number) | Ensures that the specified columns are never secondary. You can specify column names or column numbers (1–32). Specifying primary columns makes all other columns secondary with no maximum size. |

Positional parameters

 
| CLI Option - Short form | CLI Option - Full form | Description |
| --- | --- | --- |
| -i | --input file | Specifies the input file name (or the first positional option) |
| -o | --output file | Specifies the output file name (or the second positional option) |

 

When these options aren't specified explicitly on the command line, edmtrain treats the first file as the input file and the second file as the output file.

Example: edmtrain --header --progress=2 1M.csv out

1M.csv is the input file and out is the output file that is generated.
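If you script the training step, the command from the example can be assembled as an argument list. The sketch below is only an illustration: the helper name `edmtrain_command` is hypothetical, it uses only the --header and --progress options plus the two positional file names from the example, and it assumes that edmtrain is on the PATH of the machine where the DLP Integrator is installed.

```python
import shlex


def edmtrain_command(input_file, output_file, header=False, progress_secs=None):
    """Build an edmtrain argument list from options shown in this article."""
    args = ["edmtrain"]
    if header:
        args.append("--header")
    if progress_secs is not None:
        args.append(f"--progress={progress_secs}")
    # Input and output are passed positionally: input first, then output.
    args += [input_file, output_file]
    return args


cmd = edmtrain_command("1M.csv", "out", header=True, progress_secs=2)
print(shlex.join(cmd))

# To run the command (assumes edmtrain is installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.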
