Prepare the EDM (Enhanced) Fingerprint File


Fingerprint files are created automatically when you train the data source file using the DLP Integrator, which includes the EDMTrain tool. As a prerequisite for exact data matching, the data source file must be trained using the EDMTrain tool to generate the .props file and .dis file.

We recommend that you avoid common values in a data source file. For the best results with EDM, the values should be unique. Analyze the data that you want to protect (match) and make sure that each value is a meaningful match value. All values in the data source file are normalized and hashed in the fingerprint file, regardless of the definition you use in classifications.
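One way to analyze a data source before training is to count how often each value appears in a column. The following is a minimal Python sketch, not part of the DLP Integrator; the function name `common_values`, the sample columns, and the `min_count` threshold are hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter


def common_values(tsv_text, column, min_count=2):
    """Return {value: count} for values in `column` seen at least `min_count` times."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip blank or missing cells; count everything else after trimming whitespace.
    counts = Counter(row[column].strip() for row in reader if row[column])
    return {value: n for value, n in counts.items() if n >= min_count}


# Hypothetical sample data: the "city" column repeats, the "ssn" column does not.
sample = (
    "ssn\tcity\n"
    "123-45-6789\tSpringfield\n"
    "987-65-4321\tSpringfield\n"
    "555-12-3456\tShelbyville\n"
)
print(common_values(sample, "city"))  # {'Springfield': 2}
print(common_values(sample, "ssn"))   # {}
```

A column that reports many repeated values, such as `city` here, is probably a weak candidate for matching on its own.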

NOTE: The delimiter (separator) between the columns must be a character that doesn't occur in the data, so we recommend using the .tsv file format over the .csv file format.
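If the data currently exists as a .csv file, a small conversion step can confirm that the tab character is safe to use as the delimiter. This is a minimal Python sketch under that assumption; `csv_to_tsv` is a hypothetical helper, not an EDMTrain feature, and it rejects any cell that already contains a tab or line break.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to TSV text, refusing cells that contain a tab or line break."""
    out = io.StringIO()
    for row_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        for cell in row:
            if "\t" in cell or "\n" in cell or "\r" in cell:
                raise ValueError(f"row {row_no}: cell {cell!r} contains a tab or line break")
        out.write("\t".join(row) + "\n")
    return out.getvalue()


# Commas inside quoted CSV cells survive as ordinary data once tabs separate the columns.
print(csv_to_tsv('name,address\n"Doe, Jane","12 Main St, Apt 4"\n'), end="")
```

The check matters because a delimiter character inside a cell would shift the remaining cells of that row into the wrong columns.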

Install the DLP Integrator

The DLP Integrator, which includes the EDMTrain tool, is supported on both Windows and Linux platforms.

For more information, see:

Generate the metadata (.props) and fingerprint (.dis) files using the EDMTrain tool

The data source file must be trained to generate the .props and .dis files that are used for exact data matching and for scanning the user records. The .props file contains metadata, such as the number of columns detected, column headers, fingerprint details, file size, database name, and checksum of the .dis file. The .dis file contains the fingerprints of the user records in normalized and hashed form.
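To make "normalized and hashed" concrete, the following Python sketch shows the general idea only. It does not reproduce EDMTrain's behavior: the normalization rules (Unicode NFKC, case folding, whitespace removal), the use of SHA-256, and the way the salt is combined are all assumptions made for illustration, and the actual .dis format is not described here.

```python
import hashlib
import unicodedata


def normalize(value):
    """Illustrative normalization (assumed): NFKC, case folding, whitespace removal."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return "".join(folded.split())


def fingerprint(value, salt=""):
    """Illustrative fingerprint (assumed): salted SHA-256 of the normalized value."""
    return hashlib.sha256((salt + normalize(value)).encode("utf-8")).hexdigest()


# Two spellings of the same value give the same fingerprint after normalization.
print(fingerprint(" Jane  DOE ") == fingerprint("jane doe"))  # True
```

The point of the sketch is that only hashes of normalized values are stored, so formatting differences in scanned content do not prevent a match and the original values are not stored in readable form.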

You can use the command line interface (CLI), either directly or through a third-party terminal client such as PuTTY, to run the edmtrain command with the following options to train the data source file.


CLI Option - Short form	CLI Option - Full form	Description
`-?`	`--help`	Shows the EDMTrain tool help
`-h`	`--header`	Indicates that the first row is a header row
`-s`	`--split-ids`	Allows space-separated IDs to be split on any delimiter
`-v`	`[ --verbose ]`	Shows the verbose output
`-V`	`[ --version ]`	Displays version information and exits
`-u`	`[ --utf8 ]`	Specifies the input in UTF-8 encoded format (automatic with a UTF-8 BOM)
`-M`	`[ --merge ]`	Merges columns that have the same name (duplicate names are an error otherwise)
`-R`	`[ --redact ]`	Removes input data from all output and messages
`-E`	`[ --no-errors ]`	Specifies not to generate error messages or enforce thresholds
`-W`	`[ --no-warnings ]`	Specifies not to generate warning messages
`-j`	`[ --json ] file`	Outputs progress and exit status to the specified file as JSON
`-e`	`[ --errors ] % (=5)`	Specifies the error threshold in percentage
`-m`	`[ --memory ] gb (=0)`	Specifies the memory to allocate, in GB; you can specify between 1 GB and 8 GB of RAM
`-D`	`[ --db-name ] name`	Specifies the database name (default based on output file name)
`-c`	`[ --charset ] name`	Specifies the charset encoding of input file (if no UTF-8 BOM)
`-n`	`[ --newline ] byte (=\n)`	Specifies the newline terminator in bytes (before decoding)
`-P`	`[ --primary ] column_name (or column_number)`	Ensures that the specified columns are never secondary. You can specify column names or column numbers (1–32). Specifying primary columns makes all other columns secondary, with no maximum size.
`-p`	`[ --progress ] [=secs(=2)]`	Shows progress at the specified interval, in seconds
`-d`	`[ --delimiter ] char (=\t)`	Specifies the cell delimiter character (up to U+FFFF)
`-b`	`[ --non-blank ] n (=0)`	Specifies the minimum number of non-blank cells per row
`-f`	`[ --salt-file ] file`	Specifies a file that contains a single-line hashing salt
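The `-f` option expects a file that holds the salt on a single line. The following Python sketch shows one possible way to create such a file. The helper name `write_salt_file`, the hex encoding, and the 32-byte length are assumptions; this page does not state the salt length or character set that EDMTrain requires.

```python
import os
import secrets
import tempfile
from pathlib import Path


def write_salt_file(path, num_bytes=32):
    """Write a random hex salt as a single line and return the salt string."""
    salt = secrets.token_hex(num_bytes)  # 2 hex characters per random byte
    Path(path).write_text(salt + "\n", encoding="ascii")
    return salt


# Demonstration in a temporary directory so that no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    salt_path = os.path.join(tmp, "salt.txt")
    salt = write_salt_file(salt_path)
    print(len(Path(salt_path).read_text(encoding="ascii").splitlines()))  # 1
```

Keep the salt file as protected as the data source itself, because anyone who holds the salt can test guessed values against the fingerprints more easily.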

Positional parameters


CLI Option - Short form	CLI Option - Full form	Description
`-i`	`[ --input ] file`	Specifies the input file name; can also be given as the first positional parameter
`-o`	`[ --output ] file`	Specifies the output file name; can also be given as the second positional parameter

When these options aren't given explicitly on the command line, edmtrain treats the first file name as the input file and the second file name as the output file.

Example: `edmtrain --header --progress=2 1M.csv out -f salt.txt`

In this example, `1M.csv` is the input file and `out` is the name of the output file that is generated.
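When training is scripted, the same command line can be assembled programmatically. The following Python sketch only builds and prints the argument list for the example above. The function `build_edmtrain_command` and its parameters are hypothetical; it uses only options that appear in the tables on this page, and it assumes that edmtrain is available on the PATH if the command is later executed.

```python
import shlex


def build_edmtrain_command(input_file, output_file, header=True,
                           progress_secs=2, salt_file=None):
    """Assemble an edmtrain argument list from options listed on this page."""
    args = ["edmtrain"]
    if header:
        args.append("--header")
    if progress_secs is not None:
        args.append(f"--progress={progress_secs}")
    # First and second positional parameters: input file, then output file.
    args += [input_file, output_file]
    if salt_file:
        args += ["-f", salt_file]
    return args


cmd = build_edmtrain_command("1M.csv", "out", salt_file="salt.txt")
print(shlex.join(cmd))  # edmtrain --header --progress=2 1M.csv out -f salt.txt
```

Passing the list to `subprocess.run(cmd, check=True)` would run the tool without going through a shell, which avoids quoting problems with file names that contain spaces.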