Options to Generate the Fingerprint File using the EDMTrain tool
You can use these examples to understand how to specify different options with the edmtrain command to generate the fingerprint file.
Create a salt file
Create a file containing the salt used to protect your database:
Example: echo "8d8dbed7-f2a6-480f-bac9-5e590e3e4691" > salt.txt
Export the database records using the pipe command and generate the fingerprint file
Recommended: Use the pipe (|) command to export the database records directly to the EDMTrain tool and generate the fingerprint file, instead of exporting the database to a temporary .tsv file first.
Example: db_export | edmtrain --header --output pipe_out -f salt.txt
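The benefit of the pipe is that the records never touch the disk between the export and the training step. The pipeline shape can be sketched with standard tools; here db_export is a stand-in shell function (an assumption, since the real export command depends on your database), and wc -l stands in for edmtrain so the sketch runs anywhere:

```shell
# Stand-in for db_export: emits records on stdout (illustrative only).
db_export() {
  printf '%s\n' 'id,surname' '1,smith' '2,jones'
}
# The records flow straight into the next command with no temporary file;
# in the real pipeline the consumer is edmtrain, wc -l stands in here.
db_export | wc -l
```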
Generate the fingerprint file with column headers
Consider an input data source file in CSV format named 1M.csv, a large file that contains 1 million cells. Run this command to generate the .dis and .props files. The --progress option reports the status of the file generation every 2 seconds, and the --delimiter option specifies the delimiter character.
Example: edmtrain --header --progress=2 --delimiter=, --input 1M.csv -f salt.txt
After 1M.csv is trained, the output files 1M.dis and 1M.props are generated, because the output file name option is not specified on the command line. The processed files are saved in the same location as 1M.csv.
The .props file looks similar to the following:
{
  "Records": "200000",
  "Columns": [
    "id",
    "accountnumber",
    "surname",
    "firstname",
    "sortcode"
  ],
  "Fingerprints": "1000000",
  "FingerprintsSize": "12282699",
  "CSVFileSize": "15333357",
  "Revision": "1623685866",
  "SkippedCells": "0",
  "DbName": "1M"
}
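Because the .props file is plain JSON, you can script a quick sanity check of a training run from the shell. A minimal sketch, where sample.props is a miniature stand-in for the generated 1M.props:

```shell
# Write a miniature .props stand-in for illustration.
cat > sample.props <<'EOF'
{
  "Records": "200000",
  "Fingerprints": "1000000"
}
EOF
# Extract the value of the "Fingerprints" field.
sed -n 's/.*"Fingerprints": "\([0-9]*\)".*/\1/p' sample.props
```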
Consider the data to be trained as shown:
id,accountnumber,surname,firstname,sortcode
1092831,78192110,smith,john,08-12-29
0921231,23424334,flint,graham,01-22-33
When you use the following command without the --header option, the tool generates fingerprints for the content of the .csv file and also for the words in the header.
edmtrain --input 1M.csv --progress=2 --delimiter=, -f salt.txt
The fingerprints are also created for "id", "accountnumber", "surname", "firstname", and "sortcode", and are considered for matching. The generated headers are named "column 1" through "column 5" because --header isn't specified on the command line. In such a scenario, you can prevent the header values from being fingerprinted by specifying the --header option on the command line.
Only when you specify --header are the column names shown in the UI when creating classifications. This enables you to easily specify the mandatory columns, exceptions, and other columns, because you can see the column names rather than "column 1", "column 2", and so on.
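If you are unsure whether an export contains a header row, you can inspect the first line before deciding whether to pass --header. A small sketch using standard shell tools, with sample.csv as an illustrative file name:

```shell
# Build a small CSV like the one above.
printf '%s\n' \
  'id,accountnumber,surname,firstname,sortcode' \
  '1092831,78192110,smith,john,08-12-29' > sample.csv
# Inspect the first line: if it holds column names, pass --header to edmtrain.
head -1 sample.csv
# Alternatively, strip the header row from the export itself before training.
tail -n +2 sample.csv
```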
Generate the fingerprint file with an output file option
Run the edmtrain command with the --output option to generate the output files with the specified file name.
Example: edmtrain --header --progress=2 --delimiter=, --input 1M.csv --output out -f salt.txt
After 1M.csv is trained, the output files out.dis and out.props are generated. The processed files are saved in the same location as 1M.csv.
Generate the fingerprint file for a data source file with tab-separated values
Consider an input data source file (1M.tsv) with tab-separated values. By default, tab is used as the delimiter if the --delimiter option isn't specified.
Example: edmtrain --input 1M.tsv --header --progress=2 -f salt.txt
Alternatively, you can specify the delimiter for tab-separated values using one of these options:
Option 1: edmtrain --header --progress=2 --delimiter="\t" --input 1M.tsv --output out_tests -f salt.txt
Option 2: edmtrain --header --progress=2 -d"\t" --input 1M.tsv --output out_tests_2 -f salt.txt
Note the use of double quotes with "\t" so that the command recognizes the tab-separated values in the file.
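The quotes matter because an unquoted backslash is consumed by the shell before edmtrain ever sees the argument. A quick demonstration of what the tool actually receives in each case:

```shell
# Unquoted: the shell removes the backslash before the tool sees the
# argument, so the delimiter would be the letter "t".
printf '%s\n' -d\t
# Quoted: the two-character sequence \t reaches the tool, which can then
# interpret it as a tab.
printf '%s\n' -d"\t"
```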
If an incorrect delimiter is specified, the columns aren't separated properly, and the .props file looks similar to the following:
"Columns": [
"ssn,account_number,credit_card_number"
],
Generate the fingerprint file with verbose output
Run the edmtrain command with the --verbose option to see all statistics of the fingerprint file.
Example: edmtrain --header --progress=2 --verbose --input tab_sep_1M.csv -f salt.txt
Output:
Number of columns detected: 3
> 1: ssn
> 2: account_number
> 3: credit_card_number
Capacity for 10,572,227,072 entries
> Records read: 333,333
> Cells written: 999,999
> Bytes written: 12,282,699
Total number of columns: 3
Total number of records: 333,333
> Total number of bad records: 0
> Total number of empty records: 0
Total number of fingerprints: 999,999
> Total number of skipped cells: 0
> Total number of discarded cells: 0
Database created successfully: tab_sep_1M.dis
Build time: 5 seconds
Generate the fingerprint file where identifier field values use separators
When EDM (Enhanced) scans numeric identifiers, the separator characters present in between the numeric identifiers are normalized (for example, 1234-5678-9012 and 1234.5678.9012 are both normalized as 123456789012). This means that a numeric identifier is matched regardless of whether separators are used or not.
The same normalization applies to alphabetical and alphanumeric identifiers, except for ".". The "." character isn't treated as a separator between a number and a letter, or between a letter and a number. For example, "12.34" and "AB.CD" are single-word identifiers, but "12.CD" is considered two words.
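The normalization described above can be pictured as stripping the separator characters before hashing. A rough shell analogue of the behavior, not the tool's actual implementation:

```shell
# Both separator styles collapse to the same canonical value when "-" and
# "." are stripped.
echo '1234-5678-9012' | tr -d '.-'
echo '1234.5678.9012' | tr -d '.-'
```

Both commands print the same normalized identifier, which is why the scan matches the value regardless of separator style.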
The --split-ids option also supports a space separator by training the identifier as if the "-" and "." separators were replaced by spaces. This only works for single-word identifiers, so "1234-5678" is trained as "1234 5678", but "Flat 12-2, The Green" is unaffected. The hash for "1234 5678" is the same as for "1234-5678", so it extends what is picked up at scan time through the multiword logic.
If the numeric identifier being trained is already space-separated, the --split-ids option isn't needed and must not be used, as it slows down training.
If you want to split identifiers on other separators, process your database export before fingerprinting by adjusting the data. If the data has no separators, you need to add spaces at the appropriate places in the data source file.
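Such preprocessing can be done with standard text tools before the export reaches edmtrain. A sketch, where "/" as an extra separator and the 4+4 digit layout are illustrative assumptions about your data:

```shell
# Replace a separator that the tool doesn't split on (here "/") with a
# space before fingerprinting.
echo '1234/5678' | sed 's,/, ,g'
# Insert a space into fixed-width identifiers that have no separators
# (split an 8-digit identifier into two 4-digit halves).
echo '12345678' | sed 's/^\([0-9]\{4\}\)\([0-9]\{4\}\)$/\1 \2/'
```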
In this table, the first two columns describe conditions at fingerprint time, and the last two describe the result at scan time.

| Fingerprinted data | --split-ids option provided? | Does 1234-5678 match? | Does 1234 5678 match? |
|---|---|---|---|
| 12345678 | No | Yes | No |
| 12345678 | Yes | Yes | No |
| 1234-5678 | No | Yes | No |
| 1234-5678 | Yes | Yes | Yes |
| 1234 5678 | No | Yes | Yes |
| 1234 5678 | Yes | Yes | Yes |
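For a single-word identifier, the extra hash that --split-ids trains is the one you would get by replacing the "-" and "." separators with spaces. A rough analogue, not the tool's implementation, and deliberately ignoring the multiword case such as "Flat 12-2, The Green":

```shell
# Replacing "-" and "." with spaces yields the space-separated form whose
# hash --split-ids adds for a single-word identifier.
echo '1234-5678' | tr '.-' '  '
```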
Performance considerations for EDM data
Single word cell values versus multiword cell values: Single word cell values provide better performance than multiword cell values. Having a large number of multiword cell values also increases the size of the fingerprint database, which has to be loaded into memory for scanning.
What are secondary matches? For best performance, fingerprinted cell values must not contain large numbers of repeated values, because all possible row matches have to be examined. For example, if a two-column data source has one column with 100,000 cell values of "Smith", then all 100,000 possible matches would have to be considered when looking for a row match. To improve performance, when a cell value occurs more than 99 times, the value is marked as "secondary". When scanning more than one column, secondary hashes alone never trigger a match; they need to be in proximity to a primary hash to trigger a match.
Truncation of duplicate cell values: When a cell value occurs more than 999 times, subsequent occurrences are not recorded; these cells are recorded as errors. Some errors are permitted (by default, 5%), but if the error limit is exceeded, the training of data fails.
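Before training, you can check an export for heavily repeated values with standard shell tools, to see whether truncation or the --primary option (described next) is likely to matter. A sketch, where check.csv and the column number are illustrative:

```shell
# Build a tiny two-column dataset where the surname column repeats.
printf '%s\n' 'a1,smith' 'a2,smith' 'a3,jones' > check.csv
# Count occurrences of each value in column 2, most frequent first; values
# with very high counts are candidates for truncation at training time.
cut -d, -f2 check.csv | sort | uniq -c | sort -rn
```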
Command options for a dataset with large numbers of repeated values: To avoid large numbers of cells with common values (such as city or surname) being discarded, you can mark other columns as "primary". In this case, all columns that aren't marked as primary are treated as secondary, but duplicate cell values from secondary columns are not discarded.
For example, in a two-column dataset of surnames and account numbers, designating the account number column as primary resolves the problem of lost triggers, because surnames are then not discarded.
Example: edmtrain --primary <column_name> 1M.tsv -f salt.txt
NOTE: This only works for fields that are almost unique. In this example, if an account number occurs more than 99 times, the extra occurrences are discarded instead, reintroducing lost triggers.
To designate more than one column as primary, use the --primary <column_name> option multiple times.
Example: edmtrain --primary <column_name_a> --primary <column_name_b> --primary <column_name_x> 1M.tsv -f salt.txt
You can also use column numbers instead of column names with --primary.
If the data doesn't have any values that occur more than 1000 times, the --primary option is not required. We recommend using --primary if the training tool produces an error similar to:
ERROR: Truncating number of entries to 1000 for hash 0123456789abcdef
Although some errors are permitted (by default, 5%), if the error limit is exceeded, the training of data fails. You can increase the error threshold, but doing so increases the amount of data that is discarded, so we recommend using --primary to avoid discarding values.