News
Introduction
DCE (Dense Cluster Enumeration) is a C++ software for enumerating all subtensor patterns that exceed a user-defined minimum density threshold.
Given an n-dimensional data array (tensor), a subtensor pattern is defined by specifying a non-empty subset for each of the n dimension-specific index sets. The density of a subtensor is defined as the average value of its elements. The input tensor can contain binary or real values. However, the current implementation assumes non-negativity. If you would like to analyze mixed-sign tensors, please contact the email address given below. The output consists of all locally maximal patterns, which can be ranked according to exact probability values. Furthermore, the method optionally takes into account inherent symmetry constraints. Detailed information about the algorithm can be found in the paper Multi-Way Set Enumeration in Real-Valued Tensors appeared in the Proceedings of DMMT'09 (SIGKDD 2009 Workshop on Data Mining using Matrices and Tensors).
Download
- DCE.zip: DCE code with example files and a README file.
License: DCE is free software under the GNU General Public License, version 2.
Manual
Compilation:
>> make
Usage:
>> ./dce <input_file> <output_file> <density_threshold> [options]
-
Input file:
The input file contains the weights of non-zero tensor entries in tab-delimited format. Each line corresponds to a tensor entry. For an n-way tensor, each of the first n columns contains an index, and column n+1 contains the value of the corresponding tensor entry. Indices start from zero. The values are automatically normalized such that the maximum weight is 1. The current implementation assumes that the values are non-negative. However, the underlying algorithm is applicable to mixed-sign tensors as well (please contact the email address given below).
Example:0 0 0 0 1.0
0 0 0 1 0.9
0 0 0 2 0.5
0 0 1 0 1.0
0 0 1 3 1.0
... -
Output file:
The output file lists the obtained subtensor patterns in a tab-delimited format. Each line corresponds to one solution pattern. The columns contain the following information:- First column: Ranking score (negative log10 of the probability value)
(note: this column is missing if the --ranking=0 option is used) - Second column: Density
- Columns 3 to n+2: Pattern size with respect to each dimension
- Last n columns: Subtensor pattern (for each dimension an index subset, given as comma-delimited list)
- First column: Ranking score (negative log10 of the probability value)
-
Density threshold:
Minimum density threshold for subtensor patterns (between 0 and 1). The density of a subtensor is defined as the average value of its elements. -
Options:
--ranking=0 Deactivate probability-based ranking
--minsize=%d Minimum set size per dimension
--maxsize=%d Maximum set size per dimension
--mindegree=%f Minimum degree threshold for each instance within a pattern (default: 0)
--symmetry=%d-%d[-%d,%d-%d] Inherent tensor symmetries; dimensions are denoted by indices from 0 to n-1. Dimensions that form a symmetry group are delimited by -; different symmetry groups are delimited by ,. For example, --symmetry=0-1,2-3 means that the tensor is symmetric with respect to the first two dimensions as well as with respect to the third and the fourth dimension. Thus, the entries (0,1,0,1), (1,0,0,1), (0,1,1,0), and (1,0,1,0) are equivalent, so only one of them has to appear in the input file. Solution patterns will contain the same index set for symmetric dimensions (in the output file, the index set is only given for the first of the symmetric dimensions).
--balance=1 Additional constraint: density threshold must be satisfied for each slice of the subtensor pattern
--maxbranches=%d Heuristic restriction of the number of considered children per search node (then, the output can also contain solutions that are not locally maximal; the maximality check for a pattern is only performed with respect to own descendants)
- The current implementation is designed for sparse tensors and assumes non-negativity (for information on extensions, please contact the email address given below).
- The density threshold can be set to positive values smaller or equal to 1. Lower density thresholds yield larger solution sets.
- If you do not need the pattern ranking based on exact probabilities, please deactivate the ranking with the --ranking=0 option. This makes the method faster.
- There exists an option to heuristically speed up the search procedure. It limits the number of considered children per search node (see --maxbranches option), that means the generation of overlapping patterns is restricted. This leads to a loss of the completeness guarantee for the solution set, but it can be helpful for fast detection of significant patterns.
Examples:
>> ./dce testdata/test2way.txt testres/test2way_75 0.75
>> ./dce testdata/test4way.txt testres/test4way_100 1.0 --minsize=2
>> ./dce testdata/test4way.txt testres/test4way_100_b 1.0 --minsize=2 --balance=1
>> ./dce testdata/test3waySymm.txt testres/test3waySymm_90_2 0.9 --symmetry=0-1 --maxbranches=2 --ranking=0
Publications
- Elisabeth Georgii, Koji Tsuda, Bernhard Schölkopf (2009) Multi-Way Set Enumeration in Real-Valued Tensors. Proceedings of DMMT'09 (SIGKDD 2009 Workshop on Data Mining using Matrices and Tensors). Link
Contact