Digest::ssdeep(3pm) | User Contributed Perl Documentation | Digest::ssdeep(3pm) |
Digest::ssdeep - Pure Perl ssdeep (CTPH) fuzzy hashing
This document describes Digest::ssdeep version 0.9.0
use Digest::ssdeep qw/ssdeep_hash ssdeep_hash_file/;
$hash = ssdeep_hash( $string );
# or in array context:
@hash = ssdeep_hash( $string );
$hash = ssdeep_hash_file( "data.txt" );
@details = ssdeep_dump_last();

use Digest::ssdeep qw/ssdeep_compare/;
$match = ssdeep_compare( $hashA, $hashB );
$match = ssdeep_compare( \@hashA, \@hashB );
This module provides a simple implementation of ssdeep fuzzy hashing, also known as Context Triggered Piecewise Hashing (CTPH).
Please refer to Jesse Kornblum's paper for a detailed discussion (see "SEE ALSO").
To calculate the CTPH, we first choose a maximum signature length. We then divide the file into as many chunks as this length, calculate a hash or checksum for each chunk, and map it to a character. The fuzzy hash is the concatenation of all these characters.
We cannot use fixed-length blocks to separate the file, because if we add or remove a character, all of the following blocks also change. So we must divide the file using the "context", i.e. a block starts and ends at one of a set of predefined character sequences. The problem then becomes: which contexts (sequences) do we define to separate the file into N parts?
This is the role of the rolling hash. It is a function of the last N inputs, in this case the last 7 characters. The result of the rolling hash function is uniformly spread over all valid output values. This makes the rolling hash a kind of pseudo-random function whose output depends only on the last N characters. Since the output is assumed to be uniform, we can reduce it modulo BS, and the values 0 to BS-1 are expected with equal probability.
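The key property described above, that the output depends only on the last 7 characters, can be illustrated with a small sketch. It is modeled on the rolling hash commonly described for ssdeep (a plain sum, a position-weighted sum, and a shift/XOR term); the names make_roller and roll_string are hypothetical, and this is not necessarily how the module implements it internally.

```perl
use strict;
use warnings;

my $WINDOW = 7;    # number of trailing bytes the hash depends on

# Returns a closure that consumes one byte value at a time and
# returns the current rolling hash as an unsigned 32-bit number.
sub make_roller {
    my @window = (0) x $WINDOW;
    my ( $h1, $h2, $h3, $n ) = ( 0, 0, 0, 0 );
    return sub {
        my ($c) = @_;
        $h2 -= $h1;                        # age the weighted sum
        $h2 += $WINDOW * $c;               # newest byte gets the top weight
        $h1 += $c;                         # plain sum of the window ...
        $h1 -= $window[ $n % $WINDOW ];    # ... minus the byte leaving it
        $window[ $n % $WINDOW ] = $c;
        $n++;
        $h3 = ( $h3 << 5 ) & 0xFFFFFFFF;   # older bytes are shifted out
        $h3 ^= $c;
        return ( $h1 + $h2 + $h3 ) & 0xFFFFFFFF;
    };
}

# Rolling hash value after feeding a whole string.
sub roll_string {
    my ($string) = @_;
    my $roll  = make_roller();
    my $value = 0;
    $value = $roll->($_) for unpack 'C*', $string;
    return $value;
}
```

Two inputs that end in the same 7 bytes give the same value, whatever precedes them, which is what lets a trigger sequence be recognised again after an insertion or deletion earlier in the file.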
Let the blocksize (BS) be the length of the file divided by the maximum signature length (i.e. 64). If we split the file each time the rolling hash mod BS gives BS-1, we expect about 64 blocks. This is not a good approach, because if the length changes, the blocksize changes too, so we cannot compare files of dissimilar sizes. A better approach is to take some predefined blocksizes and choose the one that fits the file size. The blocksizes in ssdeep are "3, 6, 12, ..., 3 * 2^i".
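As a rough sketch of how a predefined blocksize might be chosen, the following assumes the rule "take the smallest value of 3 * 2^i for which 64 blocks cover the input". That rule and the names initial_blocksize and is_trigger are assumptions made for illustration; the module may select the blocksize differently.

```perl
use strict;
use warnings;

my $MIN_BLOCKSIZE    = 3;     # first value of the series 3, 6, 12, ...
my $SIGNATURE_LENGTH = 64;    # maximum signature length

# Smallest blocksize 3 * 2^i such that 64 blocks cover $length bytes.
sub initial_blocksize {
    my ($length) = @_;
    my $bs = $MIN_BLOCKSIZE;
    $bs *= 2 while $bs * $SIGNATURE_LENGTH < $length;
    return $bs;
}

# A block boundary is triggered when the rolling hash mod BS gives BS-1.
sub is_trigger {
    my ( $rolling_hash, $bs ) = @_;
    return $rolling_hash % $bs == $bs - 1 ? 1 : 0;
}
```

Under this rule a 1000-byte input gets blocksize 24, and so does a 1500-byte input, so the two signatures stay comparable even though the lengths differ.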
So this is the algorithm:
The pitfall is that the rolling hash is statistically uniform, but that does not mean it will give us exactly 64 blocks.
The traditional hash is an ordinary hash or checksum function. We use the 32-bit FNV-1a hash (see "SEE ALSO"). Its output is 32 bits, so we need to map it to a 64-character (base-64) alphabet. That is, we only use the 6 least significant bits of the FNV-1a hash.
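The mapping from a 32-bit hash to one signature character can be sketched as follows. The FNV-1a constants used here are the standard published 32-bit parameters, and the alphabet is assumed to be the usual base-64 ordering; the module's own initial value may differ, so this is an illustration rather than a reimplementation. The names fnv1a_32 and hash_to_char are hypothetical, and a perl with 64-bit integers is assumed.

```perl
use strict;
use warnings;

my $FNV_OFFSET = 2166136261;    # standard 32-bit FNV offset basis
my $FNV_PRIME  = 16777619;      # standard 32-bit FNV prime
my $ALPHABET   = join '', 'A' .. 'Z', 'a' .. 'z', '0' .. '9', '+', '/';

# Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime.
sub fnv1a_32 {
    my ($chunk) = @_;
    my $h = $FNV_OFFSET;
    for my $byte ( unpack 'C*', $chunk ) {
        $h ^= $byte;
        $h = ( $h * $FNV_PRIME ) & 0xFFFFFFFF;
    }
    return $h;
}

# Keep only the 6 least significant bits and look up the character.
sub hash_to_char {
    my ($hash) = @_;
    return substr $ALPHABET, $hash & 63, 1;
}
```

With this alphabet, the hash value 1393131242 from the ssdeep_dump_last sample output maps to "q", and 4009217630 maps to "e", in agreement with that sample.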
The ssdeep hash has this shape: "BS:hash1:hash2"
There are several algorithms for comparing two strings. For compatibility reasons, I have used the same one that ssdeep uses. Only in certain cases does the result from this module differ from that of the compiled ssdeep version. Please see DIFFERENCES below for details.
These are the steps for matching calculation:
This section describes the recommended interface for generating and comparing ssdeep fuzzy hashes.
Usage:
$hash = ssdeep_hash( $string );
or in array context
@hash = ssdeep_hash( $string );
In scalar context it returns a hash in the format "bs:hash1:hash2", where "bs" is the blocksize, "hash1" is the fuzzy hash for this blocksize and "hash2" is the hash for double the blocksize. The maximum length of each hash is 64 characters.
In array context it returns the same components as a three-element array.
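The two return forms carry the same information, so one can be converted into the other. The helpers below are a minimal sketch with hypothetical names (split_ssdeep_hash, join_ssdeep_hash); they are not part of the module. They rely on the fact that ":" does not occur in the base-64 alphabet used for the two hashes.

```perl
use strict;
use warnings;

# "bs:hash1:hash2" -> ( bs, hash1, hash2 )
sub split_ssdeep_hash {
    my ($hash) = @_;
    my ( $bs, $hash1, $hash2 ) = split /:/, $hash, 3;
    return ( $bs, $hash1, $hash2 );
}

# ( bs, hash1, hash2 ) -> "bs:hash1:hash2"
sub join_ssdeep_hash {
    my ( $bs, $hash1, $hash2 ) = @_;
    return join ':', $bs, $hash1, $hash2;
}

# The hash string here is made up for the example, not real output.
my @parts = split_ssdeep_hash('24:abcDEF+/:xyz012');
my $again = join_ssdeep_hash(@parts);
```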
Usage:
$hash = ssdeep_hash_file( "/tmp/malware1.exe" );
This is a convenience function. It returns the same values as ssdeep_hash, in scalar or array context.
Since this function slurps the whole file into memory, you should not use it on big files. For big files, use a libfuzzy wrapper instead (see "BUGS AND LIMITATIONS").
Returns undef on errors.
Usage. To compare two scalar hashes:
$match = ssdeep_compare( $hashA, $hashB );
To compare two hashes in array format:
$match = ssdeep_compare( \@hashA, \@hashB );
The default is to discard hashes whose longest common substring is shorter than 7 characters. To override this default, pass the limit as a third argument:
$match = ssdeep_compare( $hashA, $hashB, 4 );
The result is a matching score between 0 and 100. See Comparison for algorithm details.
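The common-substring limit can be pictured with a small standalone check. This is only a sketch of the idea, with a hypothetical function name (has_common_substring); the module's own test may be implemented differently.

```perl
use strict;
use warnings;

# True if the two strings share a substring of at least $min_len
# characters (default 7, matching the default limit described above).
sub has_common_substring {
    my ( $first, $second, $min_len ) = @_;
    $min_len = 7 unless defined $min_len;
    return 0 if length($first) < $min_len || length($second) < $min_len;

    # Index every $min_len-character window of the first string.
    my %seen;
    $seen{ substr $first, $_, $min_len } = 1
        for 0 .. length($first) - $min_len;

    # Any window of the second string found in the index is a match.
    for my $i ( 0 .. length($second) - $min_len ) {
        return 1 if $seen{ substr $second, $i, $min_len };
    }
    return 0;
}
```

Lowering the limit, as in the three-argument call to ssdeep_compare, lets shorter shared runs count as candidates, at the cost of more coincidental matches.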
Usage after a calculation:
$hash = ssdeep_hash_file( "/tmp/malware1.exe" );
@details = ssdeep_dump_last();
The output is an array of CSV values.
...
2,125870,187|245|110|27|190|66|97,1393131242,q
1,210575,13|216|13|115|29|52|208,4009217630,e
2,210575,13|216|13|115|29|52|208,4009217630,e
1,210730,61|231|220|179|40|89|210,1069791891,T
1,237707,45|66|251|98|56|138|91,4014305026,C
....
Meaning of the output array:
So we can read it this way:
At byte 125870 of the input file, there is a sequence of these 7 characters: "187 245 110 27 190 66 97". That sequence triggered the second part of the hash. The FNV hash value of the current chunk is 1393131242, which maps to the character "q".
Or this way:
From the 4th row we know that the letter "T" in the first hash comes from the chunk that starts at 210575+1 (the offset in the previous row that begins with 1) and ends at 210730. The full FNV hash of this block was 1069791891.
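Each row can be split into its fields for further processing. The sketch below infers the field order from the sample output and the readings above (hash number, byte offset, the 7 trigger bytes separated by "|", FNV hash value, character); the function name parse_dump_row and the hash keys are hypothetical.

```perl
use strict;
use warnings;

# Turn one CSV row from ssdeep_dump_last() into a hash reference.
sub parse_dump_row {
    my ($row) = @_;
    my ( $hash_number, $offset, $window, $fnv, $char ) = split /,/, $row, 5;
    return {
        hash_number => $hash_number,             # 1 = first hash, 2 = second
        offset      => $offset,                  # byte position of the trigger
        window      => [ split /\|/, $window ],  # the 7 trigger bytes
        fnv         => $fnv,                     # FNV hash of the chunk
        char        => $char,                    # resulting signature character
    };
}

# Row taken from the sample output above.
my $row = parse_dump_row('1,210730,61|231|220|179|40|89|210,1069791891,T');
```

Applied to the full @details array, for example with map, this gives the chunk boundaries for each signature character.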
Please report any bugs or feature requests to "bug-digest-ssdeep@rt.cpan.org", or through the web interface at <http://rt.cpan.org>.
Reinoso Guzman "<reinoso.guzman@gmail.com>"
Copyright (c) 2013, Reinoso Guzman "<reinoso.guzman@gmail.com>". All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.
BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
2021-01-01 | perl v5.32.0 |