Structure learning software

Please refer to the JMLR paper for further details regarding the algorithm. Latest version of the software is from January/2012. There were important updates with respect to previous versions.

@article{decampos11aJMLR,
  title={Efficient Structure Learning of Bayesian Networks using Constraints},
  authors={Cassio P. de Campos and Qiang Ji},
  volume={12},
  pages={663--689},
  year={2011},
  journal={Journal of Machine Learning Research},
  url={http://www.jmlr.org/papers/volume12/decampos11a/decampos11a.pdf},
}

The algorithm presented here also appeared in the paper Structure Learning of Bayesian Networks using Constraints, presented at the 26th International Conference on Machine Learning (2009), by Cassio P. de Campos, Zhi Zeng, and Qiang Ji, however the JMLR version is more complete and accurate, covering also BD scores (besides BIC/AIC). Please cite this paper (or preferable the JMLR version) in case you make use of the software in other publications. Please read the comments below before using it, as there is important information here (even if you want to make your implementation, I suggest you to read all this document). A possible bibtex:

@inproceedings{1553389,
 author = {de Campos, Cassio P. and Zeng, Zhi and Ji, Qiang},
 title = {Structure learning of {B}ayesian networks using constraints},
 booktitle = {ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning},
 year = {2009},
 isbn = {978-1-60558-516-1},
 pages = {113--120},
 location = {Montreal, Quebec, Canada},
 doi = {http://doi.acm.org/10.1145/1553374.1553389},
 publisher = {ACM},
 address = {New York, NY, USA},
 }

This paper addresses exact learning of Bayesian network structure from data and expert's knowledge based on score functions that are decomposable. First, it describes useful properties that strongly reduce the time and memory costs of many known methods such as hill-climbing, dynamic programming and sampling variable orderings. Secondly, a branch and bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality with respect to the score function. It is an any-time procedure because, if stopped, it provides the best current solution and an estimation about how far it is from the global solution. We show empirically the advantages of the properties and the constraints, and the applicability of the algorithm to large data sets (up to one hundred variables) that cannot be handled by other current methods (limited to around 30 variables).

Any questions about the software, please write an email to cassiopc @ acm . org. The version available here allows the use of structural constraints, but it ignores parameter constraints. In fact one of the derivations for parameter constraints that appears in the ICML paper is incorrect (we sorry for that), and rethinking about it, parameters constraints as explained in the paper are not very appealing (at least we did not find yet a good compromise for them). Hence, we have postponed to explore further ideas related to parameter constraints. If you want to discuss on that or have ideas to make the parameter constraints more sensible, please write an email. Structural constraints are more interesting (at least with the applications we have tried so far).

The input file to run the code has the following specification: first line contains five integer numbers m, n, k, t, s separated by spaces. m is the number of data samples in the file, n is the number of variables, k is the number of parameter constraints, t is the number of structural constraints, and s is a boolean (0 or 1) to indicate if a starting solution is given in the file. Following the first line, it comes the constraints, one per line. After the constraints, a matrix n by n must contain the arcs that are mandatory or denied. A 1 entry in the position (i,j) means that the arc from variable j to variable i is prohibited. A -1 entry means that such arc is mandatory. Note that these restrictions might also be enforced by the constraints, but we adopted also this matrix to facilitate the input in some cases. A zero entry means that the arc may or may not be present (that is, no constraint for that arc). If s is one, then another n by n matrix must contain the starting solution, which must satisfy all the constraints. This last matrix is boolean (contains zeros and ones). The ones of line i indicate which are the parents of the variable i. Note that all indices start from zero. Finally, the data set must be given. Each line must have n integers representing a sample of all the variables. Again, the states of the variables are supposed to start from zero, and the maximum integer plus one in a given variable is used as the total number of states (e.g. if a variable is ternary, then its states must be 0, 1 and 2). Some examples are found here: test1.txt, test2.txt, and test3.txt (this is an example of an infeasible case, because constraints cannot be satisfied). There are more two test files in this folder.

The constraints allow two predicates: a(X,Y) and d(X,l,op). The predicate a(X,Y) means the arc from Y to X, and d(X,l,op) means the in-degree of X being equal to l (if op is eq) or less than l (when op is lt; in fact we also allow gt to mean greater than, although it could be possible to obtain it from the others). Each constraint is a clause in disjunctive form. The elements may be just separated by spaces, which gives the meaning of the logical operator or. A negative sign exactly preceding the predicate means its negation. Variables are numbered from 0 to n-1 (constraints with invalid variable numbers, that is, less than zero or greater than n-1, are ignored). Two final remarks: each constraint line in the file must begin with the number of the variable that it is about (followed by colon), and all the elements of the constraint must be local to this variable, that is, predicates must be about the in-degree and incoming arcs of that variable.

More details on the command-line options and how to use them are available in this text file.

Downloading the software

The software is available free of charge for academic non-commercial use. Please fill in the following form with your information. The data will only be used for statistics and will not be made publicly available. If you want to use it commercially, please send an email.

Last modified: May, 2010

Name:
Email:
Operating System:
Agree to receive updates about this code?