I am working on crash log clustering, to automatically categorize/cluster logs that contain the same or a similar crash. After log cleanup, I am left with only a few lines (50) from the logs where crashes are found.
I have already used TF-IDF and k-NN to find the m most similar logs.
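
For reference, the TF-IDF plus nearest-neighbour step could look roughly like the sketch below. It is dependency-free and only illustrative: the tokenizer, the smoothed IDF formula, the helper names and the four toy log lines are all assumptions, not my actual pipeline.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric/underscore runs (assumed tokenizer;
    # real crash logs may need something smarter for stack frames).
    return re.findall(r"[a-z0-9_]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse, unit-length TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def most_similar(vectors, query_index, m):
    """Indices of the m logs most similar to vectors[query_index], best first."""
    q = vectors[query_index]
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors) if i != query_index]
    # Sort by descending similarity, breaking ties by index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:m]]

# Made-up toy log lines, for illustration only.
logs = [
    "NullPointerException at com.example.Foo.bar",
    "NullPointerException at com.example.Foo.baz",
    "Segmentation fault in libfoo.so",
    "OutOfMemoryError Java heap space",
]
vectors = tfidf_vectors(logs)
neighbours = most_similar(vectors, 0, 2)
print(neighbours)
```

Here `most_similar(vectors, 0, 2)` ranks the second log first, since it shares almost every token with the query.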
For clustering, I am thinking of using infinite mixture models with a Dirichlet process prior (but I am stuck on choosing the right priors). I am also evaluating LDA (latent Dirichlet allocation), trying different values of K until I reach the K where perplexity levels off.
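
On the prior question, one thing that may help build intuition is the Chinese restaurant process view of the Dirichlet process: with concentration parameter alpha, the expected number of clusters among n items is the sum of alpha / (alpha + i) for i = 0 .. n-1. The sketch below just evaluates that sum; the function name, n = 1000 and the three alpha values are arbitrary choices for illustration.

```python
def expected_clusters(n, alpha):
    """Expected number of clusters among n items under a Chinese
    restaurant process with concentration parameter alpha."""
    # Item i (0-based) opens a new cluster with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(n))

# Compare a few candidate concentration values for a hypothetical 1000 logs.
for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(expected_clusters(1000, alpha), 1))
```

If I have a rough idea of how many distinct crash types to expect, I could pick alpha so that this expectation lands in that range, rather than treating the prior as a blind guess.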
Once this algorithm is written, I want to plug it into Splunk, where it will keep clustering new logs in a nightly run. The nature of the logs won't change much, so once the model is built we will rarely change it (enhancements may happen, but not the usual explore-and-remodel data science cycle every time new data arrives).
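
To make the "model once, reuse nightly" idea concrete, here is a minimal sketch, assuming the fitted model can be reduced to a frozen IDF table plus one unit-length centroid per cluster. The JSON layout, the file name, the `min_similarity` threshold and the toy model values are all hypothetical, and how Splunk would actually invoke the script is left out.

```python
import json
import math
import os
import re
import tempfile
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())

def vectorize(text, idf):
    """Unit-length TF-IDF vector using a frozen IDF table; unseen terms are ignored."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def assign(text, model, min_similarity=0.3):
    """Index of the most similar stored centroid, or -1 if no centroid
    is more similar than min_similarity (a possible new crash type)."""
    vec = vectorize(text, model["idf"])
    best, best_sim = -1, min_similarity
    for k, centroid in enumerate(model["centroids"]):
        # Centroids are assumed to be unit length, so this is a cosine similarity.
        sim = sum(w * centroid.get(t, 0.0) for t, w in vec.items())
        if sim > best_sim:
            best, best_sim = k, sim
    return best

# Toy stand-in for a model fitted once, offline (values are made up).
model = {
    "idf": {"nullpointerexception": 1.0, "segmentation": 1.0, "fault": 1.0},
    "centroids": [
        {"nullpointerexception": 1.0},
        {"segmentation": 0.7071, "fault": 0.7071},
    ],
}

# Fit time: persist the model. Nightly run: load it and label the new logs.
path = os.path.join(tempfile.mkdtemp(), "crash_model.json")
with open(path, "w") as f:
    json.dump(model, f)
with open(path) as f:
    frozen = json.load(f)

new_logs = [
    "NullPointerException at Foo.bar",
    "Segmentation fault in libfoo.so",
    "disk full",
]
labels = [assign(line, frozen) for line in new_logs]
print(labels)
```

Logs that come back as -1 would be the ones worth reviewing by hand, and a growing pile of them would be the signal that the model needs one of those rare enhancements.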