Feature selection for big data

machine learning, data mining, knowledge discovery, expert systems, big data

High-dimensional datasets frequently occur in social media, text processing, image recognition and bioinformatics. These datasets can contain tens of thousands of features while only hundreds of samples (and often fewer than a hundred) are available. High dimensionality can degrade the performance of a classifier by increasing the risk of over-fitting and prolonging computation time. Feature selection is therefore a significant part of many machine learning applications that deal with small-sample, high-dimensional data. Choosing the most important features is also an essential step for knowledge discovery in many areas.

We propose new feature selection methods and evaluate existing methods from several points of view, such as stability, the ability to correctly identify significant features, and influence on classification accuracy.
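As a rough illustration of what evaluating stability and the ability to identify significant features can mean in practice, the sketch below runs a simple filter on synthetic small-sample, high-dimensional data. Everything in it is an assumption made for illustration: the filter (ranking features by absolute Pearson correlation with the class label) is a generic stand-in and not one of the methods from the publications listed on this page, the stability measure (average pairwise Jaccard similarity over bootstrap resamples) is only one of several possible choices, and all function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between one feature column and the labels."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column (or constant labels) carries no signal
    return abs(cov) / (var_x * var_y) ** 0.5


def select_top_k(X, y, k):
    """Return the indices of the k features most correlated with y (stand-in filter)."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two non-empty feature subsets."""
    return len(a & b) / len(a | b)


def selection_stability(X, y, k, n_rounds=20, seed=0):
    """Average pairwise Jaccard similarity of subsets selected on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        subsets.append(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    return mean(jaccard(a, b) for a, b in combinations(subsets, 2))


def make_toy_data(n_samples=40, n_features=200, n_informative=3, seed=1):
    """Synthetic two-class data: only the first n_informative features depend on the label."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]
    X = []
    for label in y:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 3.0 * label  # shift informative features for class 1
        X.append(row)
    return X, y


X, y = make_toy_data()
selected = select_top_k(X, y, k=3)          # which features the filter picks
stability = selection_stability(X, y, k=3)  # how consistent the picks are
print(sorted(selected), round(stability, 3))
```

Because the toy data has known informative features, the selected indices can be compared directly with the ground truth, while the stability score (between 0 and 1) indicates how much the selected subset changes when the training samples are perturbed.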

Feel free to try our Weighted nearest neighbors method for feature selection available at .

Related publications:

Bugata, P., Drotár, P. (in press). On some aspects of minimum redundancy - maximum relevance feature selection. Science China Information Sciences.

Drotár, P., Gazda, M., Vokorokos, L. (2019). Ensemble feature selection using election methods and ranker clustering. Information Sciences, Vol. 480, pp. 365-380, ISSN 0020-0255.

Bugata, P., Drotár, P. (2019). Weighted nearest neighbors feature selection. Knowledge-Based Systems.

Drotár, P., Gazda, M., Gazda, J. (2017). Heterogeneous ensemble feature selection based on weighted Borda count. 2017 9th International Conference on Information Technology and Electrical Engineering (ICITEE), Phuket, pp. 1-4.

Drotár, P., Gazda, J., Smékal, Z. (2015). An experimental comparison of feature selection methods on two-class biomedical datasets. Computers in Biology and Medicine, Vol. 66, pp. 1-10.

Research supported by:


VEGA 1/0327/20