Free and open AI-based cloud mask for Sentinel-2

If you asked a person on the street ‘what is a cloud mask and why would one need it’, you would probably receive more questions than answers. Yet people who have worked with EO data for at least a year, or, even better, who have tried to program an automatic classifier, know exactly what we are talking about.

Sentinel-2 is a beautiful European data factory, producing tons of valuable imagery every day. Its full potential is yet to be exploited. The first Sentinel-2 satellite was launched in 2015, and after the launch of the second satellite (2B) the system reached its full data production capacity in the second half of 2017. One of the factors limiting the usage of Sentinel-2 data is clouds. There is a need for an accurate mask that separates the clear pixels from the corrupted ones. Figure 1 illustrates this well: not only are the cloud-covered areas unusable for the great majority of EO applications, but the cloud shadows cause trouble too. If you run, for example, a crop classification algorithm on these regions, you cannot expect an adequate result.

Figure 1. Cumulus clouds and cloud shadows on Sentinel-2 satellite image

While the Sentinel-2 system, functioning as a data factory, is beautiful and undeniably a huge game changer, the official Sen2cor cloud mask incorporated into the L2A data products can be insufficiently accurate for fully automatic processing chains. The issues arise from underestimating cloud shadows and small fragmented clouds. When this data is fed into an operational processing chain (such as the grassland mowing detection system of KZ), a lot of corrupted pixels pass through, deteriorating the accuracy of the end result. In practice we have ended up with aggressive outer buffering and a few other post-processing steps to reduce the errors. But obviously these are all just workarounds that do not solve the underlying problem. The cloud mask itself should be accurate enough out of the box that one does not have to think about it during the work processes.

When digging deeper, you find that the Sen2cor cloud mask processor is a rule-based thresholding decision tree with some post-processing steps (e.g. morphological filtering to reduce the noise). On one hand, it is impressive how accurate the results of this decision tree are on a global scale, but after the revolution of AI and deep learning one knows that the same task can be solved much better with a different, more modern design.

Leaving the Sen2cor cloud mask as it is, the proposal of KappaZeta was convincing enough that we were given a chance to develop an AI-based Sentinel-2 cloud mask for ESA and we are very grateful for it.

Which other cloud masks are out there?

Firstly, we would like to outline how important it is to develop open source cloud masks. There are a few privately developed cloud masks, and the first question they raise is about accuracy figures. If these details are not public, it is hard to assess how good the offered product really is, and that raises many other questions. Furthermore, closed development means that time is spent unnecessarily on something that could be shared openly, reducing duplication and contributing to improved products. Everybody would win, both in time and in quality, from a more open approach and sharing. This is what we believe in, and we hope that over time more and more companies will come to the same conclusion.

One of the best open source cloud masks is probably s2cloudless by Sinergise. Find more information from here, here and here.

There is just one thing we would like to question and open for discussion. They write that: “We believe that machine learning algorithms which can take into account the context such as semantic segmentation using convolutional neural networks (see this preprint for overview) will ultimately achieve the best results among single-scene cloud detection algorithms. However, due to the high computational complexity of such models and related costs they are not yet used in large scale production.” So CNNs are great, but too heavy to be practical? Let us put this claim in doubt, at least in 2020. One thing is CNN model fitting, which for a complex model can be computationally expensive, that is true. But the other thing is running a prediction with a pre-trained model. This is much cheaper – and this is what you need to do when you put a CNN into production.

One of the best research papers on using deep learning for cloud masking is probably by DLR. We are taking it as one of the starting points for our development. They claim higher accuracies than Fmask (which is roughly on the same level as Sen2Cor) at a reasonable computational cost (2.8 s/Mpix on an NVIDIA M4000 GPU).

There are also several CNN-based cloud masking research papers by the University of Valencia, for example Mateo-García and Gómez-Chova (2018) and Mateo-García, Gómez-Chova and Camps-Valls (2017).

All in all – deep learning as a universal mimicking machine has proven to be at least as accurate as human interpreters in recognizing objects in images and segmenting them semantically. Deep learning has been proven in various domains, including image interpretation, speech recognition and text translation. Computer vision, which focuses particularly on digital images and videos, has had enormous success in the medical field, where data labelling is an expensive procedure, and in the rapidly developing field of autonomous driving, where huge amounts of data must be processed in real time. There is every reason to believe that it will also excel at detecting clouds and cloud shadows in satellite imagery. What determines the success is the quality and variety of the reference data set used for model fitting.

We believe that cloud masking is such a universal pre-processing step for satellite imagery that sooner or later someone will develop an open source deep learning cloud mask and all the closed source cloud masks will become obsolete. Let us then try to be among the first and help the community further.


The goal of the project is to develop the most accurate cloud mask for Sentinel-2. We know it is going to be hard, and to avoid going crazy by trying to solve everything at once, we are limiting the scope of the project. We concentrate on Northern European terrestrial summer season conditions. By Northern Europe we mean the area north of the Alps, which has relatively similar nature and land cover. Summer season means the vegetative season, from April to October. We start from terrestrial conditions (with all due respect to the marine researchers), because we believe they have a higher impact on developing operational services that make clients happy, for example in the agricultural and forestry sectors.

Everyone who has worked on machine learning projects knows that the most critical factor for success is the quality and variety of the input data. In our case this is labelled Sentinel-2 imagery following the classification schema agreed in CMIX. Eventually each pixel should have one of four labels: 1) clear, 2) cloud, 3) semi-transparent cloud, 4) cloud shadow. For labelling we are using CVAT with a few scripts for automation, and thanks to the hard work of our intern Fariha we have already labelled more than 1500 cropped Sentinel-2 tiles of 512x512 pixels. The work goes on to build a large and accurate reference set for CNN model fitting.

To be more effective, there are several machine learning techniques we are going to apply:

1) Active learning. The idea is to select only the tiles and pixels which have the highest impact on increasing the accuracy of the model. Labelling is a time-consuming process, and it is critical to do only the work that matters.

2) Transfer learning. The idea is to use all available open source labelled Sentinel-2 images to train the network and then fine-tune it on our smaller, focused dataset.
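The post does not say which selection criterion is used for active learning. As a minimal sketch of one common option, uncertainty sampling, the snippet below ranks tiles by the mean per-pixel entropy of the model's class probabilities. The function name, the array layout and the toy data are all assumptions made for illustration.

```python
import numpy as np

def rank_tiles_by_uncertainty(probs, n_select):
    """Return indices of the n_select tiles with the highest mean pixel entropy.

    probs: array of shape (tiles, height, width, classes) holding softmax outputs.
    """
    eps = 1e-12  # avoids log(0) for fully confident pixels
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)  # (tiles, H, W)
    tile_score = entropy.mean(axis=(1, 2))                   # one score per tile
    return np.argsort(tile_score)[::-1][:n_select]

# Toy example with 4 classes: tile 0 is predicted confidently,
# tile 1 is maximally uncertain (uniform probabilities).
confident = np.zeros((4, 4, 4))
confident[..., 0] = 1.0
uncertain = np.full((4, 4, 4), 0.25)
probs = np.stack([confident, uncertain])

selected = rank_tiles_by_uncertainty(probs, n_select=1)
```

In this toy case the uncertain tile is the one sent to labelling first.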

The initial literature review is completed and we plan to start with applying U-Net on our existing labelled dataset. We still have many open questions, e.g. should we use one of the rule-based masks as an input feature; is the improvement worth the fear that the network can possibly capture the same errors; to what extent we can augment existing features in terms of brightness and angles; can certain calculated S2 coefficients help the network, such as NDVI, NDWI etc?
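To make the last question concrete, here is a small sketch of how NDVI and NDWI could be stacked with the raw bands as extra input channels. It assumes the standard Sentinel-2 band assignment (B03 green, B04 red, B08 near-infrared) and the green/NIR variant of NDWI; the reflectance values are made up.

```python
import numpy as np

def normalized_difference(a, b):
    """Compute (a - b) / (a + b), returning 0 where the denominator is 0."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

# Hypothetical reflectances for two pixels: vegetation, then water
b03_green = np.array([0.05, 0.06])
b04_red = np.array([0.04, 0.03])
b08_nir = np.array([0.40, 0.02])

ndvi = normalized_difference(b08_nir, b04_red)    # high over vegetation
ndwi = normalized_difference(b03_green, b08_nir)  # positive over water

# Channels-first feature stack that a network could take as input
features = np.stack([b03_green, b04_red, b08_nir, ndvi, ndwi], axis=0)
```

Whether such derived channels help is exactly the open question: a CNN can in principle learn these ratios itself, so the gain has to be measured.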

Last, but not least, it is an open source project. All our results, final software and source code will be freely and openly distributed on GitHub. Openness and accessibility of our software should directly translate into greater usage. We also intend to learn from the community and take advantage of existing open source projects and labelled cloud mask reference data sets.

If you have any good suggestions on how we could improve our cloud mask, or are aware of parallel developments that offer a chance for cooperation, please let us know. Our project runs from October 2020 to September 2021.

Further information:
Marharyta Domnich


Data splitting challenge

Almost any machine learning pipeline requires the input data to be split for training and validation purposes. However, ground truth collection is challenging, and the labels may come from different sources. Different sources provide different confidence levels for the labels. In general it would be beneficial to test the model on the most confident samples, while still providing some of them for training and keeping the class distributions as uniform as possible. We are facing the challenge of producing an unbiased data split with adjustable filters in different tasks, and it feels that there is a need for a more general solution or for brainstorming with the community.

Mowing detection

One of the examples where we meet the splitting challenge is the mowing detection task. The goal of mowing detection is to predict the time at which the grass on the field was cut. Thus, as part of our mowing detection project, each year we collect field books from farmers and field reports from the inspectors of the local paying agency. The received data is converted and reviewed manually, and some of the ground truth is produced from the manual labelling.

The labels differ in trustworthiness, depending on the source (farmer field books, inspector field reports, any of the former with manually adjusted dates, or fully manual labelling). Since inspector field reports are the most reliable source, we would use most of them for the validation and test sets. However, we would need at least some of them to be present in the training dataset as well. Additionally, each dataset is expected to have as well-balanced a class distribution as possible, perhaps with additional filtering to randomly drop the least trustworthy samples from the over-represented class.

Considering the aforementioned conditions, let’s say we would like to have 70% of the labelled data for training, 20% for validation and 10% for testing. For validation and testing, we would only use instances from inspector field reports and farmer field books with tweaked dates. For training, we would use data from all sources, including the ones from inspector field reports and tweaked field books which were left over from the test and validation datasets.

Crop classification

Another task we are dealing with is crop classification. We would like to detect the crop type of agricultural fields out of 28 possible classes. As with mowing detection, we have different sources for labels: some have been provided by the local Agricultural Registers and Information Board, some come from drone observations. For crop classification, class balance plays a central role. To mitigate the issue of an unbalanced dataset, undersampling and oversampling can be used. Undersampling and oversampling should be available for the training subset, while for testing we would use some of the fields with high-confidence labels. Some of the classes might be poorly represented, so that general split ratios might leave the validation or test dataset without any samples, whereas we need to ensure that all datasets have enough samples.

Image credit: Madis Ajaots

Thus, the requirements for splitting are the following. We would like to have 70% / 20% / 10% splits, ensuring that for under-represented classes at least one instance is present in every set. Additionally, for the test set we would like to have the list of high-confidence instances together with random leftover samples, adding up to 10% of the whole data.

Generic and configurable

While such processing chains can be implemented, we have found it tricky to make them generic and configurable enough to cater for all sorts of projects with different (and sometimes rapidly changing) requirements.

Current solution

Currently we have separate implementations for mowing detection and crop classification, both of which take input parameters from a config file. The config file is basically Python code and supports the definition of custom filter functions for datasets. For each dataset, the current solution invokes the custom filters (if any) and then performs random sampling of data indices, leaving the rest of the samples for the next datasets. The samples which have been filtered out are also left for the next datasets, since each dataset might have a different filter.

The reason we prefer to use data / sample indices instead of the data directly is to have a layer of abstraction. This way the splitting logic can be agnostic of data type. It does not matter whether a single sample / instance is a raster image, an image time-series or a time-series of parameter values which have been averaged over a pre-defined geometry.
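The index-based logic described above could look roughly like the sketch below. It is not the actual implementation: the function name, the `source` metadata field and the choice to fill the test and validation sets before the training set are assumptions made for illustration.

```python
import random

def split_indices(samples, datasets, seed=0):
    """Assign sample indices to datasets, processed in the given order.

    samples:  list of metadata dicts (the data itself is never touched)
    datasets: list of (name, fraction_of_all_samples, filter_fn or None)
    Indices rejected by a filter stay in the pool for the following datasets.
    """
    rng = random.Random(seed)
    pool = list(range(len(samples)))
    total = len(samples)
    result = {}
    for name, fraction, keep in datasets:
        eligible = [i for i in pool if keep is None or keep(samples[i])]
        n_wanted = min(len(eligible), round(fraction * total))
        chosen = set(rng.sample(eligible, n_wanted))
        result[name] = sorted(chosen)
        pool = [i for i in pool if i not in chosen]
    return result

def is_trusted(sample):
    """Hypothetical filter: inspector reports and date-adjusted field books."""
    return sample["source"] in {"inspector", "farmer_adjusted"}

# Hypothetical metadata for 100 labelled fields
samples = (
    [{"source": "inspector"}] * 20
    + [{"source": "farmer_adjusted"}] * 10
    + [{"source": "farmer"}] * 70
)

split = split_indices(
    samples,
    [("test", 0.1, is_trusted), ("validation", 0.2, is_trusted), ("train", 0.7, None)],
)
```

Because only indices and metadata are handled, the same function works whether a sample is a raster image or a time-series.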

For multiclass applications such as crop classification, data indices are sampled separately for each class within each dataset. The splitting also supports capping the number of samples for classes which are over-represented. However, if there are too few samples per class, a low threshold can be applied such that a different split ratio is used. For instance, in the case of 70% training, 20% validation and 10% testing with just 9 samples in one of the classes, we might end up with 7 samples in the training dataset, 2 samples in validation and 0 in the test set. To mitigate the issue, we could adjust the ratios to 40% training, 30% validation and 30% testing for classes with fewer than 100 samples.
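The arithmetic of this example can be sketched as follows. The rounding scheme is an assumption (training and validation counts are rounded up in turn and the test set receives the remainder); it is chosen because it reproduces the 7 / 2 / 0 outcome mentioned above, and the real implementation may round differently.

```python
import math

def class_split_counts(n_samples, ratios, small_ratios, threshold=100):
    """Return (train, validation, test) counts for a single class.

    Classes with fewer than `threshold` samples use `small_ratios`.
    """
    train_r, val_r, _ = small_ratios if n_samples < threshold else ratios
    # round(..., 9) guards against float noise such as 10 * 0.7 = 7.000000000000001
    n_train = min(n_samples, math.ceil(round(n_samples * train_r, 9)))
    n_val = min(n_samples - n_train, math.ceil(round(n_samples * val_r, 9)))
    n_test = n_samples - n_train - n_val  # whatever is left over
    return n_train, n_val, n_test

default = (0.7, 0.2, 0.1)
adjusted = (0.4, 0.3, 0.3)

naive = class_split_counts(9, default, default)   # no special handling
fixed = class_split_counts(9, default, adjusted)  # small-class ratios applied
```

With the default ratios the 9-sample class leaves the test set empty, while the adjusted ratios give every set at least two samples.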

Ideas for future developments

Instead of project-specific implementation of the splitting logic, we would prefer to have a generic framework for graph-based data splitting with support for cross-validation and bagging. Please let us know if there is such a framework already out there, or if there would be community interest in developing the framework.


Towards the operational service of crop classification

We are about to finish an R&D project in which we developed and tested a crop classification methodology specifically suited to Estonian agricultural, ecological and climatic conditions. We relied mostly on Sentinel-1 and Sentinel-2 data and used a neural network approach to distinguish 28 different crop types. The results are promising, and the methodology is ready for operational service to automate another part of agricultural monitoring.

Using machine learning in crop type classification is not new, and definitely not a revolutionary breakthrough: for decades, different classifiers (Support Vector Machine, Decision Trees, Random Forest and many more) have been used in land cover classification. Recently neural networks, the wunderkind of machine learning and image recognition, have also become widely used in crop discrimination. Satellite data, as the main input to classification models, has no serious alternatives, since our aim is to implement it on a worldwide scale and in applications which run in near real time. So why even get excited about another crop type classification study which exploits the same methods and datasets as tens of previous studies?

I can give you one reason. Estonia has been very successful in following European Commission (EC) guidelines and rules in modernizing the EU Common Agricultural Policy. In 2018 the EC adopted new rules that allow physical checks on farms to be completely replaced with a system of automated checks based on analysis of Earth observation data. The same year the Estonian Agricultural Registers and Information Board (ARIB) launched the first nationwide fully automated mowing detection system, which uses Sentinel-1 and Sentinel-2 data and whose prediction model was developed by KappaZeta. The system has been running for 3 years; it has significantly reduced the number of on-site checks and increased the detection of non-compliances. In short, it has saved Estonian and EU taxpayers’ money. Automated crop discrimination is the next step in pursuing the above-mentioned vision and will probably become the foundation of all agricultural monitoring. With a proven and tested methodology, it’s highly likely that Estonia will take this next step in the very near future and launch it again at the nationwide level. This is definitely a prospect to be excited about.

Now, let’s see how we tackled “the good old” crop classification task.

Input data

Although algorithms and methods are important for prediction model performance, the training data is the most valuable player in this game. In Estonia all farmers who want to be eligible for subsidies need to declare their crops online (field geometry + crop type label). This open dataset is freely accessible to everyone and may be re-used and redistributed for both commercial and non-commercial purposes. Since the crop type labels are defined by farmers and most of them are not double-checked by ARIB, there can be mistakes (according to ARIB’s estimates, less than 5%). Therefore, for additional validation we ran our own cluster analysis on the time-series to filter out obvious outliers in each class.

After we had the parcels and labels, we calculated time-series of different satellite-based, plus some ground-based features (precipitation, average temperatures, soil type). When extracting features from satellite images there are two ways to go: pixel- or parcel-based extraction. We selected the latter and averaged pixel values over the parcel to obtain one numerical feature value per statistic for each point in time (see Figure 1).
Figure 1. An example time-series of one Sentinel-1 feature (cohvh - 6-day coherence in the VH polarization) for one parcel.
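Parcel-based extraction can be illustrated with a few lines of NumPy. This is only a sketch: it assumes the parcel geometry has already been rasterized into a boolean mask on the image grid, uses the mean as the only statistic, and works on random data.

```python
import numpy as np

def parcel_time_series(stack, parcel_mask):
    """Average each time step of a (time, height, width) stack over one parcel.

    parcel_mask is a boolean (height, width) array; NaN pixels are ignored.
    """
    pixels = stack[:, parcel_mask]     # shape (time, pixels_in_parcel)
    return np.nanmean(pixels, axis=1)  # one feature value per time step

rng = np.random.default_rng(0)
stack = rng.random((5, 8, 8))          # 5 acquisitions of one feature
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 3:7] = True                  # hypothetical parcel footprint

series = parcel_time_series(stack, mask)
```

Other statistics (for example median or standard deviation) would follow the same pattern with a different reduction over the pixel axis.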

For preprocessing Sentinel-1 images we have developed our own processing chain, which produces reliable time-series for several features. It is known from previous studies that features from Sentinel-2 images (channel values and indices) combined with features from Sentinel-1 images (coherence, backscatter) give better classification results than either group of features on its own.

Figure 2. The whole dataset can be imagined as a three-dimensional tensor with the feature parameters on one axis, parameter statistics on another, and date-time on the third axis.

We used data from the 2018 and 2019 seasons (altogether more than 200 000 parcels) and aggregated all crop type labels into 28 classes, which were defined according to the needs of ARIB.

Model architecture

Figure 3. Model architecture.
Due to the very unbalanced dataset we had to under-sample some classes and over-sample others for the training data. In small classes we used the existing time-series and added noise for data augmentation.
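As a rough illustration of the augmentation step, the sketch below grows a small class by copying randomly chosen time-series and adding noise. The Gaussian noise model, its standard deviation and the array layout are assumptions; the post does not specify them.

```python
import numpy as np

def oversample_with_noise(series, n_target, noise_std=0.01, seed=0):
    """Grow a small class to n_target samples by jittering existing series.

    series: array of shape (samples, time, features)
    """
    rng = np.random.default_rng(seed)
    n_extra = n_target - len(series)
    if n_extra <= 0:
        return series  # class is already large enough
    picks = rng.integers(0, len(series), size=n_extra)  # sample with replacement
    noise = rng.normal(0.0, noise_std, size=(n_extra,) + series.shape[1:])
    return np.concatenate([series, series[picks] + noise], axis=0)

small_class = np.random.default_rng(1).random((10, 30, 4))
augmented = oversample_with_noise(small_class, n_target=50)
```

The original samples are kept unchanged at the start of the array, so the augmented copies can be restricted to the training subset.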

The model architecture was rather simple: an input layer, a flatten layer, three fully connected dense layers (two of them followed by a batch normalization layer) and an output (Figure 3). Our experiment with adding a 1D CNN layer after the input didn’t improve the results significantly. A more complicated ResNet (residual neural network) architecture increased training time by approx. 30%, but the results were similar to those of the simple network.
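To show the shape of such a network, here is a NumPy sketch of the forward pass only. Everything beyond the layer sequence is an assumption: the hidden layer sizes, the ReLU activations, batch normalization on the first two dense layers, a separate softmax output layer and the input dimensions are all hypothetical.

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-3):
    """Inference-mode batch normalization using stored statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_in, hidden_sizes, n_classes, seed=0):
    rng = np.random.default_rng(seed)
    sizes = [n_in] + list(hidden_sizes)
    hidden = []
    for i in range(len(hidden_sizes)):
        n = sizes[i + 1]
        layer = {"w": rng.normal(0, 0.05, (sizes[i], n)), "b": np.zeros(n)}
        if i < 2:  # assumption: the first two dense layers get batch norm
            layer["bn"] = {"mean": np.zeros(n), "var": np.ones(n),
                           "gamma": np.ones(n), "beta": np.zeros(n)}
        hidden.append(layer)
    output = {"w": rng.normal(0, 0.05, (sizes[-1], n_classes)),
              "b": np.zeros(n_classes)}
    return {"hidden": hidden, "output": output}

def forward(x, params):
    h = x.reshape(len(x), -1)                 # flatten layer
    for layer in params["hidden"]:
        h = h @ layer["w"] + layer["b"]       # dense layer
        if "bn" in layer:
            h = batch_norm(h, **layer["bn"])
        h = np.maximum(h, 0.0)                # ReLU (assumed activation)
    out = params["output"]
    return softmax(h @ out["w"] + out["b"])   # class probabilities

# Hypothetical input: 2 parcels, 6 features x 3 statistics x 20 dates
x = np.random.default_rng(1).random((2, 6, 3, 20))
params = init_params(6 * 3 * 20, (256, 128, 64), n_classes=28)
probs = forward(x, params)
```

In practice a deep learning framework would be used for training; the sketch only makes the data flow from the three-dimensional input to the 28 class probabilities explicit.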

Classification results

The F1 score on the validation set (9% of the whole dataset) was 0.85 and on the test set (2% of the whole dataset) 0.84. In 10 classes the recall was more than 0.9 and in 16 classes more than 0.8. See Figures 4 and 5 for details.
Figure 4. Test set results.
Figure 5. Normalized confusion matrix of the crop classification results (recall values).
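For readers unfamiliar with the normalization used in Figure 5, the sketch below builds a confusion matrix and divides each row by the number of true samples of that class, so that the diagonal holds the per-class recall. The labels are toy data and the function name is made up.

```python
import numpy as np

def recall_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix; rows are true classes."""
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(counts, (y_true, y_pred), 1)       # count (true, predicted) pairs
    row_sums = counts.sum(axis=1, keepdims=True)
    normalized = np.zeros(counts.shape, dtype=np.float64)
    np.divide(counts, row_sums, out=normalized, where=row_sums != 0)
    return normalized

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])

cm = recall_confusion_matrix(y_true, y_pred, n_classes=3)
recall_per_class = np.diag(cm)
```

Here class 0 has a recall of 2/3, class 1 is always recognized, and the single class 2 sample is misclassified.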

Some features are more important than others

In a near-real-time operational system, our model and feature extraction would have to be as efficient as possible. For an R&D project we could easily calculate 20+ features from satellite images, feed them all to the model and let the machines compute. But what if not all features are equally important?

They are not. We found that the 5 most important features are Sentinel-1 backscatter (s1_s0vh, s1_s0vv), NDVI, TC Vegetation and PSRI from Sentinel-2. To our surprise, soil type and precipitation sum before satellite image acquisition had low relevance.

The 5 most important features played different roles during the season: Sentinel-2 features were more important at the beginning and at the end of the season, while Sentinel-1 features had more effect during mid-season.
Figure 6. Importance of different features in crop classification, estimated using Random Forest.

What next?

This project was part of a much larger initiative called “National Program for Addressing Socio-Economic Challenges through R&D. Using remote sensing data in favor of the public sector services.” Several research groups all over Estonia worked on prototypes to use remote sensing in fields like detecting forest fire hazard, mapping floods and monitoring urban construction. Now it’s up to Estonia’s public sector institutions to take the initiative and turn the prototypes into operational services. With this work we have shown that satellite-based crop classification in Estonia is possible, accurate enough and ready to be implemented as the next monitoring service for ARIB.

If you would like to know more about this study, our Sentinel-1 processing pipeline or our machine learning expertise, feel free to get in touch. Our mentality is to share our experience rather than hide it, and to learn together on this exciting journey.


Open access to ALOS-2 radar satellite data?

At the end of 2019, Japan announced that the Japan Aerospace Exploration Agency (JAXA) will provide open access to information and data from a suite of its radar satellites (original statement here). More specifically, free and open access will be given to the wide-swath observation data from the L-band radar satellites ALOS (ALOS/AVNIR-2, PALSAR) and ALOS-2 (ALOS-2/ScanSAR). At the moment, a ScanSAR image costs around 700 euros.

ALOS-2 spacecraft in orbit (image credit: JAXA)

The Japanese space and satellite program consists of two series of satellites: those used mainly for Earth observation and others for communication and positioning. There are 3 Earth observation satellites in the nominal phase, 3 in the latter phase of operation and 3 more under development.

Greenhouse gases Observing SATellite-2 "IBUKI-2" (GOSAT-2) measures the global distribution of CO2 and CH4 in the lower and upper atmosphere. The climate satellite "SHIKISAI" (GCOM-C) carries an optical sensor capable of multi-channel observation at wavelengths from near-UV to thermal infrared (380 nm to 12 µm) for global, long-term observation of the Earth’s environment. Advanced Land Observing Satellite-2 "DAICHI-2" (ALOS-2) aims to monitor disaster areas and cultivated areas and to contribute to cartography.

ALOS-2, which is specifically interesting for radar enthusiasts, is a follow-on mission from the ALOS “DAICHI”. Launched in 2006, ALOS was one of the largest Earth observation satellites ever developed and had 3 different sensors aboard: PRISM (Panchromatic Remote-sensing Instrument for Stereo Mapping) for digital elevation mapping, AVNIR-2 (Advanced Visible and Near Infrared Radiometer type 2) for precise land coverage observation and PALSAR (Phased Array type L-band Synthetic Aperture Radar) for day-and-night and all-weather land observation. ALOS operations were completed in 2011, after it had been operated for over 5 years.  

ALOS-2 was launched in 2014 and carries only a radar instrument. A new optical satellite, ALOS-3, which will improve ground resolution approx. threefold compared to ALOS (from 2.5 m to 0.8 m at nadir, with a wide swath of 70 km at nadir), is already under development together with ALOS-4, which will take over from ALOS-2 with improved functionality and performance.

Let’s come back to the present day. The state-of-the-art L-band Synthetic Aperture Radar (PALSAR-2) aboard ALOS-2 has enhanced performance compared to its predecessor. It has a right-and-left looking function and can acquire data in three different observation modes:

  • Spotlight – spatial resolution 1x3 m, NESZ -24 dB, swath 25 km.
  • Stripmap – spatial resolution 3-10 m, swath 30–70 km. Consists of Ultrafine (3 m), High sensitive (6 m) and Fine (10 m) modes.
  • ScanSAR – spatial resolution 60-100 m, swath 350–490 km.  

PALSAR-2  specifications (images credit: JAXA)

Emergency observations have the highest priority for ALOS-2, but for systematic observations a Basic Observation Scenario (BOS) has been developed. It ensures spatial and temporal consistency at global scales and an adequate revisit frequency. The ALOS-2 BOS has separate plans for Japan and for the rest of the world; the success rate for these acquisitions is 70–80 %.

PALSAR-2  observation modes (images credit: JAXA)

Basic observations over Japan are mostly undertaken in Stripmap Ultrafine mode and sea ice observations during winter in ScanSAR mode.

Stripmap Fine and ScanSAR modes are used for global BOS. There are several areas of interest, where ALOS-2 is putting more focus, for example:

  • Wetlands and rapid deforestation regions in ScanSAR mode
  • Crustal deformation regions both in Stripmap Fine and ScanSAR mode
  • Polar regions both in Stripmap Fine and ScanSAR mode

In addition to those special regions global land areas are observed in Stripmap Fine mode at least once per year.

We ran a small experiment to see how many acquisitions we get over the city of Tartu per year. Here are the results (the platform for viewing and ordering data is here):

Screenshot from Earth Observation Data Utilization Promotion Platform.
Year    Number of images per year

So, compared to the Sentinel-1 radar satellite, the ALOS-2 acquisition frequency over Europe is much lower, and it’s difficult to develop agricultural monitoring services on this platform alone. For forestry and other environmental monitoring, where changes do not happen as often as in agriculture, ALOS-2 can be very useful because its spatial resolution is better than that of Sentinel-1. Being an L-band satellite, it can also penetrate deeper into vegetation and provide information about the lower layers of the canopy. JAXA is already developing ALOS-4 with PALSAR-3 aboard, which aims at a broader observation swath than its predecessor.


Overview of new RADARSAT Constellation Mission

Exciting remote sensing news from last year: the Canadian Space Agency launched a new generation of Earth observation satellites, called the RADARSAT Constellation Mission (RCM), on June 12, 2019 aboard a SpaceX Falcon 9 rocket. It became operational in December 2019 and provides data continuity to users of RADARSAT-1 (no longer operational) and RADARSAT-2 (still operational).
Illustration of the three RCM satellites on the same orbital plane. Image credit: Canadian Space Agency

RCM is a combination of three identical and equally spaced satellites, flying in the same orbital plane 32 minutes apart at an altitude of 600 km. Each spacecraft carries a Synthetic Aperture Radar (SAR), plus a secondary sensor for the Automatic Identification System (AIS) for ships. While RADARSAT-2 can look both left and right, RCM is right-looking only, because multiple satellites shorten revisit times and eliminate the need to look both ways. The SAR instrument aboard the RCM satellites is quite similar to that of RADARSAT-2: C-band antenna, 100 MHz bandwidth, 4 regular polarization modes (HH, VV, HV, VH) plus compact polarimetry. Polarization isolation is slightly better: >30 dB. See a detailed comparison of the RADARSAT satellites here.

The constellation system provides better coverage with smaller and less expensive satellites. This configuration allows for daily revisits of Canada’s territory, as well as daily access to 90% of the world’s surface. RCM can provide a four-day exact revisit (3 satellites equally phased in a 12 day repeat cycle orbit), allowing coherent change detection with InSAR. For specific applications (ship detection, maritime surveillance) data latency from acquisition to delivery can be only 10-30 minutes, but in general it will be from hours to 1 day.
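The 32-minute spacing and the four-day exact revisit can be cross-checked with a short back-of-the-envelope calculation. It assumes a circular orbit, a mean Earth radius of 6371 km and the standard gravitational parameter of the Earth; none of these values come from the mission documentation.

```python
import math

MU_EARTH = 398600.4418  # km^3/s^2, Earth's standard gravitational parameter
R_EARTH = 6371.0        # km, mean Earth radius (assumption)

def circular_orbit_period_min(altitude_km):
    """Orbital period in minutes for a circular orbit (Kepler's third law)."""
    a = R_EARTH + altitude_km
    return 2 * math.pi * math.sqrt(a ** 3 / MU_EARTH) / 60

n_satellites = 3
period_min = circular_orbit_period_min(600)  # roughly 96.5 minutes
spacing_min = period_min / n_satellites      # time gap between equally spaced satellites
exact_revisit_days = 12 / n_satellites       # 12-day repeat cycle shared by 3 satellites
```

Three satellites spread evenly over an orbit of roughly 96.5 minutes pass a given point about 32 minutes apart, and dividing the 12-day repeat cycle among them gives the four-day exact revisit.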

RCM has several observation modes, but the mission is primarily designed for medium-resolution monitoring:

  • Low resolution (100 m), swath 500 km, NESZ -22 dB
  • Medium resolution (16, 30, 50 m), swath 30-350 km, NESZ -25…-22 dB
  • High and very high resolution (3-5 m), swath 20-30 km, NESZ -17…-19 dB
  • Spotlight (1x3 m), swath 20 km, NESZ -17 dB

RADARSAT Constellation Mission observation modes. Image credit: Canadian Space Agency.


Base Camp Hackathon 2020

KZ extended team during the hackathon
At the beginning of March our team participated in the Base Camp Spring 2020 hackathon organized by Garage48 and Superangel.
Our aim was to develop a prototype of our time series API sandbox and to map customer segments. The long weekend was successful: besides great mentoring, new ideas and contacts, we won the runner-up prize "Superangel’s Support on Steroids package".
Base Camp Hackathon is an exclusive hackathon format, designed for young startups that already have a working prototype. The latest edition took place from 6th to 8th of March 2020 in Tallinn, Estonia, and 12 teams, KappaZeta among them, were invited to develop their prototypes or products even further. For this 3-day event Teet Laja and Raja Azmir joined our team to help us out with front-end development and business analysis.

The results? During the productive weekend we conducted a couple of interviews with potential users to validate the new product idea and created a landing page to test our concept and capture the contacts of interested people.
We also created a simple API Sandbox, where potential customers can test the core possibilities of our service. We are not yet able to provide a worldwide near-real-time Sentinel-1 time series API, but we are moving in that direction and improving our processing chain rapidly. The next step is to move the whole process to a cloud platform (DIAS), to be closer to the input satellite data and boost performance. Exciting times!