Options to Generate the Fingerprint File using the EDMTrain tool
You can use these examples to understand how to specify different options with the edmtrain command to generate the fingerprint file.
Create a salt file
Create a file containing the salt used to protect your database:
Example: echo "8d8dbed7-f2a6-480f-bac9-5e590e3e4691" > salt.txt
Export the database records using the pipe command and generate the fingerprint file
Recommended: Use the pipe (|) command to export the database records directly to the EDMTrain tool and generate the fingerprint file, instead of exporting the database to a temporary .tsv file first.
Example: db_export | edmtrain --header --output pipe_out -f salt.txt
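The benefit of the pipe is that the records never touch the disk between the export and the training step. The pipeline shape can be sketched with standard tools; here db_export is a stand-in shell function (an assumption, since the real export command depends on your database), and wc -l stands in for edmtrain so the sketch runs anywhere:

```shell
# Stand-in for db_export: emits records on stdout (illustrative only).
db_export() {
  printf '%s\n' 'id,surname' '1,smith' '2,jones'
}
# The records flow straight into the next command with no temporary file;
# in the real pipeline the consumer is edmtrain, wc -l stands in here.
db_export | wc -l
```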
Generate the fingerprint file with column headers
Consider an input data source file in CSV format named 1M.csv, a large file that contains 1 million cells. Run this command to generate the .dis and .props files. The --progress option reports the status of the file generation every 2 seconds, and the --delimiter option specifies the delimiter character.
Example: edmtrain --header --progress=2 --delimiter=, --input 1M.csv -f salt.txt
After 1M.csv is trained, the output files 1M.dis and 1M.props are generated, because the output file name option is not specified on the command line. The processed files are saved in the same location as 1M.csv.
The .props file looks similar to the following:
{
  "Records": "200000",
  "Columns": [
    "id",
    "accountnumber",
    "surname",
    "firstname",
    "sortcode"
  ],
  "Fingerprints": "1000000",
  "FingerprintsSize": "12282699",
  "CSVFileSize": "15333357",
  "Revision": "1623685866",
  "SkippedCells": "0",
  "DbName": "1M"
}
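Because the .props file is plain JSON, you can script a quick sanity check of a training run from the shell. A minimal sketch, where sample.props is a miniature stand-in for the generated 1M.props:

```shell
# Write a miniature .props stand-in for illustration.
cat > sample.props <<'EOF'
{
  "Records": "200000",
  "Fingerprints": "1000000"
}
EOF
# Extract the value of the "Fingerprints" field.
sed -n 's/.*"Fingerprints": "\([0-9]*\)".*/\1/p' sample.props
```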
Consider the data to be trained as shown:
id,accountnumber,surname,firstname,sortcode
1092831,78192110,smith,john,08-12-29
0921231,23424334,flint,graham,01-22-33
When you use the following command without the --header option, the tool generates fingerprints for the content of the .csv file and also for the words in the header.
edmtrain --input 1M.csv --progress=2 --delimiter=, -f salt.txt
The fingerprints are also created for "id", "accountnumber", "surname", "firstname", and "sortcode", and are considered for matching. The generated headers are named "column 1" through "column 5" because --header isn't specified on the command line. In such a scenario, you can prevent the header values from being fingerprinted by specifying the --header option on the command line.
Only when you specify --header are the column names shown in the UI when creating classifications. This enables you to easily specify the mandatory columns, exceptions, and other columns, because you can see the column names rather than "column 1", "column 2", and so on.
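If you are unsure whether an export contains a header row, you can inspect the first line before deciding whether to pass --header. A small sketch using standard shell tools, with sample.csv as an illustrative file name:

```shell
# Build a small CSV like the one above.
printf '%s\n' \
  'id,accountnumber,surname,firstname,sortcode' \
  '1092831,78192110,smith,john,08-12-29' > sample.csv
# Inspect the first line: if it holds column names, pass --header to edmtrain.
head -1 sample.csv
# Alternatively, strip the header row from the export itself before training.
tail -n +2 sample.csv
```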
Generate the fingerprint file with an output file option
Run the edmtrain command with the --output option to generate the output files with the specified file name.
Example: edmtrain --header --progress=2 --delimiter=, --input 1M.csv --output out -f salt.txt
After 1M.csv is trained, the output files out.dis and out.props are generated. The processed files are saved in the same location as 1M.csv.
Generate the fingerprint file for a data source file with tab-separated values
Consider an input data source file (1M.tsv) with tab-separated values. By default, tab is used as the delimiter if the --delimiter option isn't specified.
Example: edmtrain --input 1M.tsv --header --progress=2 -f salt.txt
Alternatively, you can specify the delimiter for tab-separated values using one of these options:
Option 1: edmtrain --header --progress=2 --delimiter="\t" --input 1M.tsv --output out_tests -f salt.txt
Option 2: edmtrain --header --progress=2 -d"\t" --input 1M.tsv --output out_tests_2 -f salt.txt
Note the use of double quotes with "\t" so that the command recognizes the tab-separated values in the file.
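The quotes matter because an unquoted backslash is consumed by the shell before edmtrain ever sees the argument. A quick demonstration of what the tool actually receives in each case:

```shell
# Unquoted: the shell removes the backslash before the tool sees the
# argument, so the delimiter would be the letter "t".
printf '%s\n' -d\t
# Quoted: the two-character sequence \t reaches the tool, which can then
# interpret it as a tab.
printf '%s\n' -d"\t"
```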
If an incorrect delimiter is specified, the columns aren't separated properly, and the .props file looks similar to the following:
"Columns": [
"ssn,account_number,credit_card_number"
],
Generate the fingerprint file with verbose output
Run the edmtrain command with the --verbose option to see all statistics of the fingerprint file.
Example: edmtrain --header --progress=2 --verbose --input tab_sep_1M.csv -f salt.txt
Output:
Number of columns detected: 3
> 1: ssn
> 2: account_number
> 3: credit_card_number
Capacity for 10,572,227,072 entries
> Records read: 333,333
> Cells written: 999,999
> Bytes written: 12,282,699
Total number of columns: 3
Total number of records: 333,333
> Total number of bad records: 0
> Total number of empty records: 0
Total number of fingerprints: 999,999
> Total number of skipped cells: 0
> Total number of discarded cells: 0
Database created successfully: tab_sep_1M.dis
Build time: 5 seconds
Generate the fingerprint file where identifier field values use separators
When EDM (Enhanced) scans numeric identifiers, the separator characters present in between the numeric identifiers are normalized (for example, 1234-5678-9012 and 1234.5678.9012 are both normalized as 123456789012). This means that a numeric identifier is matched regardless of whether separators are used or not.
The same normalization applies to alphabetical and alphanumeric identifiers, except for ".". The "." character isn't treated as a separator between a number and a letter, or between a letter and a number. For example, "12.34" and "AB.CD" are single-word identifiers, but "12.CD" is considered two words.
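The normalization described above can be pictured as stripping the separator characters before hashing. A rough shell analogue of the behavior, not the tool's actual implementation:

```shell
# Both separator styles collapse to the same canonical value when "-" and
# "." are stripped.
echo '1234-5678-9012' | tr -d '.-'
echo '1234.5678.9012' | tr -d '.-'
```

Both commands print the same normalized identifier, which is why the scan matches the value regardless of separator style.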
The --split-ids option also supports a space separator by training the identifier as if the "-" and "." separators were replaced by spaces. This only works for single-word identifiers, so "1234-5678" is trained as "1234 5678", but "Flat 12-2, The Green" is unaffected. The hash for "1234 5678" is the same as for "1234-5678", so it extends what is picked up at scan time through the multiword logic.
If the numeric identifier being trained is already space-separated, the --split-ids option isn't needed and must not be used, as it slows down training.
If you want to split identifiers on other separators, process your database export before fingerprinting by adjusting the data. If the data has no separators, you need to add spaces at the appropriate places in the data source file.
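Such preprocessing can be done with standard text tools before the export reaches edmtrain. A sketch, where "/" as an extra separator and the 4+4 digit layout are illustrative assumptions about your data:

```shell
# Replace a separator that the tool doesn't split on (here "/") with a
# space before fingerprinting.
echo '1234/5678' | sed 's,/, ,g'
# Insert a space into fixed-width identifiers that have no separators
# (split an 8-digit identifier into two 4-digit halves).
echo '12345678' | sed 's/^\([0-9]\{4\}\)\([0-9]\{4\}\)$/\1 \2/'
```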
In this table, the first two columns describe conditions at fingerprint time, and the last two describe the result at scan time.

| Fingerprinted data | --split-ids option provided? | Does 1234-5678 match? | Does 1234 5678 match? |
|---|---|---|---|
| 12345678 | No | Yes | No |
| 12345678 | Yes | Yes | No |
| 1234-5678 | No | Yes | No |
| 1234-5678 | Yes | Yes | Yes |
| 1234 5678 | No | Yes | Yes |
| 1234 5678 | Yes | Yes | Yes |
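For a single-word identifier, the extra hash that --split-ids trains is the one you would get by replacing the "-" and "." separators with spaces. A rough analogue, not the tool's implementation, and deliberately ignoring the multiword case such as "Flat 12-2, The Green":

```shell
# Replacing "-" and "." with spaces yields the space-separated form whose
# hash --split-ids adds for a single-word identifier.
echo '1234-5678' | tr '.-' '  '
```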
Performance considerations for EDM data
Single word cell values versus multiword cell values: Single word cell values provide better performance than multiword cell values. Having a large number of multiword cell values also increases the size of the fingerprint database, which has to be loaded into memory for scanning.
What are secondary matches? For best performance, fingerprinted cell values must not contain large numbers of repeated values, because all possible row matches have to be examined. For example, if a two-column data source has one column with 100,000 cell values of "Smith", then all 100,000 possible matches would have to be considered when looking for a row match. To improve performance, when a cell value occurs more than 99 times, the value is marked as "secondary". When scanning more than one column, secondary hashes alone never trigger a match; they need to be in proximity to a primary hash to trigger a match.
Truncation of duplicate cell values: When a cell value occurs more than 999 times, subsequent occurrences are not recorded; these cells are recorded as errors. Some errors are permitted (by default, 5%), but if the error limit is exceeded, the training of data fails.
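Before training, you can check an export for heavily repeated values with standard shell tools, to see whether truncation or the --primary option (described next) is likely to matter. A sketch, where check.csv and the column number are illustrative:

```shell
# Build a tiny two-column dataset where the surname column repeats.
printf '%s\n' 'a1,smith' 'a2,smith' 'a3,jones' > check.csv
# Count occurrences of each value in column 2, most frequent first; values
# with very high counts are candidates for truncation at training time.
cut -d, -f2 check.csv | sort | uniq -c | sort -rn
```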
Command options for a dataset with large numbers of repeated values: To avoid large numbers of cells with common values (such as city or surname) being discarded, you can mark other columns as "primary". In this case, all columns that aren't marked as primary are treated as secondary, but duplicate cell values from secondary columns are not discarded.
For example, in a two-column dataset of surnames and account numbers, designating the account number column as primary resolves the problem of lost triggers, because surnames are then not discarded.
Example: edmtrain --primary <column_name> 1M.tsv -f salt.txt
NOTE: This only works for fields that are almost unique. In this example, if an account number occurs more than 99 times, the extra occurrences are discarded instead, reintroducing lost triggers.
To designate more than one column as primary, use the --primary <column_name> option multiple times.
Example: edmtrain --primary <column_name_a> --primary <column_name_b> --primary <column_name_x> 1M.tsv -f salt.txt
You can also use column numbers instead of column names with --primary.
If the data doesn't have any values that occur more than 1000 times, the --primary option is not required. We recommend using --primary if the training tool produces an error similar to:
ERROR: Truncating number of entries to 1000 for hash 0123456789abcdef
Although some errors are permitted (by default, 5%), if the error limit is exceeded, the training of data fails. You can increase the error threshold, but doing so increases the amount of data that is discarded, so we recommend using --primary to avoid discarding values.