Big data and machine learning framework for clouds and its usage for text classification.

The paper “Big data and machine learning framework for clouds and its usage for text classification”, by the NEANIAS team, has been published.

The article was published on December 21, 2020, and it is available at Wiley Online Library.

  • Authors: István Pintye, Eszter Kail, Péter Kacsuk and Róbert Lovas [1].
  • Affiliations: [1] Institute for Computer Science and Control (SZTAKI), Hungary.


Reference architectures for big data and machine learning include not only interconnected building blocks but also important considerations for scalability, manageability, and usability, among others. Even when building on such reference architectures, the automated deployment of distributed toolsets and frameworks on various clouds remains challenging due to the diversity of technologies and protocols. The paper focuses on the widely used Apache Spark cluster with Jupyter as the addressed framework, and on the Occopus cloud‐agnostic orchestrator tool for automating its deployment and maintenance stages. The presented approach has been demonstrated and validated with a new, promising text classification application on the Hungarian academic research infrastructure, the OpenStack‐based MTA Cloud. The paper explains the concept and the applied components, and illustrates their usage with real use‐case measurements.


We are grateful for the use of MTA Cloud, which significantly helped us achieve the results published in this paper. We would also like to acknowledge the support of the Text Mining of Political and Legal Texts (POLTEXT) Incubator Project, MTA Centre for Social Sciences. The presented work was partially funded by the European H2020 NEANIAS project under grant No. 863448, by the Hungarian Scientific Research Fund (OTKA) under project No. 132838, and by the Bolyai+ Scholarship for Young Higher Education Teachers and Researchers under grant No. ÚNKP-20-5-OE-73.

You can get the whole article at



NEANIAS is a Research and Innovation Action funded by the European Union under the Horizon 2020 research and innovation programme via grant agreement No. 863448.