Tackling the lack of annotated data in machine learning-based computer vision.

Suppose you want to teach a concept to someone. Instead of trying to provide a formal definition (which can be hard to formulate or ineffective to communicate), one option is to give examples and counterexamples. This is essentially the approach adopted by supervised machine learning: depending on the complexity of the objects to be classified, the number of examples required can be very high and, in general, choosing how many examples to draw from each class is not simple. Producing a good dataset for training supervised machine learning models is therefore difficult and time consuming.
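To make the idea of "teaching by examples and counterexamples" concrete, here is a minimal sketch of supervised learning: a model is fitted on pairs of inputs and human-provided labels, then asked about a new object. The toy features, labels and the use of scikit-learn are illustrative assumptions, not part of the original article.

```python
# Minimal sketch of "teaching by examples and counterexamples"
# with supervised learning. The toy data are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Each example is a feature vector; the label says whether it is an
# instance of the concept (1) or a counterexample (0).
examples = [
    [0.9, 0.8],   # positive example
    [0.8, 0.9],   # positive example
    [0.1, 0.2],   # counterexample
    [0.2, 0.1],   # counterexample
]
labels = [1, 1, 0, 0]

# "Teaching" the concept: the model generalises from the labelled examples.
model = LogisticRegression().fit(examples, labels)

# Asking about a new, unseen object.
print(model.predict([[0.85, 0.75]]))  # expected: [1]
```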

When describing the amount of data produced by astronomical imaging, adjectives like huge, titanic, or gargantuan are not exaggerations. Sadly, the same cannot be said of the amount of annotated images, which are the fuel of supervised machine learning. Manual labelling is, in the best-case scenario, trivial but time consuming: imagine tracing the borders of a ball or a window in an everyday picture (crowdsourcing was instrumental in producing several relevant training datasets, and tools like Amazon’s Mechanical Turk played a key role in the success of this approach). In the worst-case scenario, however, we might not even be completely certain how to define the borders of the entities in the frame, or how to classify the delimited areas. This is the case in several situations in astronomical research.
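As a concrete illustration of what "defining the borders" of an object means in practice, an annotation is often a polygon traced by a human and then rasterised into a per-pixel mask that a model can be trained on. The snippet below is a minimal sketch; the image size, polygon coordinates and the choice of Pillow/NumPy are assumptions made for illustration only.

```python
# Minimal sketch: turning a hand-drawn outline (a polygon) into a
# per-pixel binary mask, the typical product of manual labelling.
# Image size and coordinates are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw

WIDTH, HEIGHT = 64, 64

# Outline of the object as clicked by a human annotator, as (x, y) pairs.
outline = [(10, 12), (50, 10), (55, 48), (15, 52)]

# Rasterise the polygon: pixels inside the outline get label 1.
mask_img = Image.new("L", (WIDTH, HEIGHT), 0)
ImageDraw.Draw(mask_img).polygon(outline, outline=1, fill=1)
mask = np.array(mask_img, dtype=np.uint8)

print(mask.shape, int(mask.sum()), "pixels labelled as the object")
```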

NEANIAS is a Research and Innovation Action funded by the European Union under the Horizon 2020 research and innovation programme via grant agreement No. 863448.