ESA WorldCereal - Getting Locations for Global Field-Level Monitoring in Python

Introduction

With the release of the ESA WorldCereal system, we now have (for the first time ever) the precise location of essentially every winter cereal, spring cereal and maize farm in the world.

To put the precision in perspective, the best datasets available to date were at 10km and ~5km resolution, via SPAM2010 and GEOGLAM respectively. WorldCereal is 10m resolution – that’s 1,000 times more precise. It is even finer than the 30m resolution of the USDA Cropland Data Layer (CDL).

WorldCereal is monumental because it is the missing link that can enable global monitoring of the entire world cereal situation. That said, the key to unlocking that potential is to transform WorldCereal from its cumbersome native format (.tif / GeoTIFF rasters) into point samples in Parquet or GeoParquet.

Once that is done, there is no limit to what data scientists can do with this treasure trove. For example, measuring crop health for any (or all!) of the world’s cereal farms is eminently doable. (Use NDVI as your metric and you can do this with our API in a single Python program.)

To that end, this tutorial shows you exactly how to convert WorldCereal into point samples in Parquet or GeoParquet. It will show how to:

  • Generate any number of global maize, winter cereals, and spring cereals farm point locations (longitude, latitude coordinate pairs) from the .tif classification mask files
  • Export the locations to cloud-native (Parquet & GeoParquet) and standard tabular (CSV) formats

Parquet and GeoParquet enable maximum interoperability with data science workflows and the modern stack, including cloud storage like AWS, GCP, and Azure; cloud data warehouses like Snowflake, BigQuery, Redshift, and Databricks; and data science / machine learning platforms like pandas, NumPy, scikit-learn, and more.

About WorldCereal

The ESA WorldCereal dataset consists of 10m resolution global crop type maps (classification masks) covering maize, winter cereals, and spring cereals. For maize and winter cereals, there are 106 .tif (raster) files corresponding to the 106 Agro-Ecological Zones (AEZs) across the globe. Spring cereals are only covered for 21 of the 106 AEZs. Find the article (Van Tricht et al., 2023, in preprint) describing the full methodology here. Note that, while not covered in this tutorial, the WorldCereal dataset also has classification mask products for seasonal irrigated cropland and temporary crop extent, which can be processed in the same manner as the 3 covered here.

Note: this tutorial is also available as a single Python notebook. You can download it on GitHub below:

Preconstructed Data

If you don’t want to generate your own locations by running this tutorial, you can download the output at the following links:

The output datasets consist of 6,827,917 generated point locations and have the following information for each: Crop, Country (Admin 0), State (Admin 1), Region (World Bank), Subregion, Continent, and aez_id. See below: 
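
If you take this route, the preconstructed Parquet file can be loaded directly into pandas for inspection. A minimal sketch (the filename below is hypothetical – use whatever name you saved the download under):

import pandas as pd

# Load the preconstructed point locations and inspect the schema
points = pd.read_parquet('world_cereals_scaled_final_all.parquet')  # hypothetical local filename
print(len(points))
print(points.columns.tolist())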

Requisite Data

To begin, you need to pull the WorldCereal datasets:

Download all of the above, unzip, and move the contents to a folder called ‘data’ in your working directory. (The full WorldCereal dataset is available here. For smaller, user-specified regions it can also be accessed through the OpenEO API.)
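
Before moving on, it can help to confirm that the expected inputs are in place. A quick sanity-check sketch (the folder names below are the ones used later in this tutorial; the spring cereals folder will be absent if you skipped that download):

import os

# Confirm the WorldCereal inputs are under ./data
expected = [
    'data/WorldCereal_AEZ.geojson',
    'data/WorldCereal_2021_tc-maize-main_maize_classification',
    'data/WorldCereal_2021_tc-wintercereals_wintercereals_classification',
    'data/WorldCereal_2021_tc-springcereals_springcereals_classification',
]
for path in expected:
    print(path, '->', 'found' if os.path.exists(path) else 'MISSING')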

Requisite Modules

There are a number of libraries that need to be imported. The first group is required for generating the point locations from the raster geotiff files (.tif). The second group is required for tagging the point locations with various administrative parameters. 

# Required for location generation
import os
import geopandas as gpd
import rasterio
import numpy as np
import pandas as pd

# Required for location tagging
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cf
import cartopy.io.shapereader as shpreader
from shapely.geometry import Point, Polygon

Define Functions

The .tif raster files are very large so we first define a function to read them in chunks to avoid any memory-related issues.

def read_raster_in_chunks(dataset):
    for ji, window in dataset.block_windows(1):
        band_window = dataset.read(window=window, indexes=1)
        yield band_window, ji
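
As a quick check that chunked reading works on your machine, you can run the generator on a single raster and count the pixels classified as crop (pixel value 100 in the WorldCereal masks). A sketch – the filename is hypothetical, so substitute any .tif from one of the classification folders:

# Count classified pixels (value 100) in one raster, chunk by chunk
with rasterio.open('data/WorldCereal_2021_tc-maize-main_maize_classification/example.tif') as src:  # hypothetical filename
    total = 0
    for band_chunk, ji in read_raster_in_chunks(src):
        total += int(np.count_nonzero(band_chunk == 100))
print(f'Pixels classified as maize: {total}')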

Demo vs Full Run

Next we introduce a boolean constant called “demo_mode”. In demo mode, running the program will generate maize farm locations for just 3 of the 106 AEZs (covering Mexico, Guatemala, El Salvador, Honduras, and north-western Nicaragua). It should take less than 10 minutes to run for these AEZs. If you wish to run the code on AEZs covering regions other than those specified, modify the aez_id_list.

With demo_mode set to False, running the program will generate farm locations for maize, winter cereals, and spring cereals across all 106 AEZs (where available). Note that running the notebook in full mode can take tens of hours (likely between 10 and 20).
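
If you want to target a different region, one way to build a custom aez_id_list is to intersect the AEZ footprint polygons with a bounding box of interest. A sketch (the bounding box below is arbitrary):

from shapely.geometry import box

# Find the AEZs that intersect an arbitrary lon/lat bounding box
aez = gpd.read_file('data/WorldCereal_AEZ.geojson')
bbox = box(-105.0, 12.0, -83.0, 33.0)  # roughly Mexico and Central America
aez_id_list = sorted(aez[aez.intersects(bbox)]['aez_id'].unique())
print(aez_id_list)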

# Specify if the run is a demo
demo_mode = True # Change to False for a full run

Now we can proceed with point generation. The code below iterates through the raster .tif files with rasterio for the list of specified AEZs, reading and processing each of the .tif files in chunks. For each chunk, we calculate the total number of pixels classified as the crop in question and then randomly select 0.01% of those pixels, using their coordinates as point locations. This keeps the number of generated points proportional to the crop density in the original .tif files. Note that the outputs we show in this tutorial were produced with demo_mode set to False.

# Specify if the run is a demo
demo_mode = True # Change to False for a full run

# Read the geojson containing the aez footprints as polygons
aez = gpd.read_file('data/WorldCereal_AEZ.geojson')

# List of all aez_ids
aez_ids = aez['aez_id'].unique()

# Create a list to hold dataframes
dfs = []

# Define the directories and crop_types based on whether it is a demo run or not
if demo_mode:
    directories = ['data/WorldCereal_2021_tc-maize-main_maize_classification']
    crop_types = {'data/WorldCereal_2021_tc-maize-main_maize_classification': 'Maize'}
else:
    directories = [
        'data/WorldCereal_2021_tc-maize-main_maize_classification',
        'data/WorldCereal_2021_tc-wintercereals_wintercereals_classification',
        'data/WorldCereal_2021_tc-springcereals_springcereals_classification'
    ]
    crop_types = {
        'data/WorldCereal_2021_tc-maize-main_maize_classification': 'Maize',
        'data/WorldCereal_2021_tc-wintercereals_wintercereals_classification': 'Winter Cereals',
        'data/WorldCereal_2021_tc-springcereals_springcereals_classification': 'Spring Cereals'
    }

# Determine the aez_id list for processing
aez_id_list = [15081,19093,33117] if demo_mode else aez_ids

# Iterate over all directories
for directory in directories:

    print(f"Processing: {crop_types[directory]} ({directory})")

    # Iterate over all unique aez_ids
    for aez_id in aez_id_list:

        print(f"Processing aez_id: {aez_id}")

        # Get list of all .tif files in the directory
        all_files = os.listdir(directory)

        for file in all_files:
            # Check if the file starts with the aez_id and ends with .tif
            if file.startswith(str(aez_id)) and file.endswith('.tif'):
                # Full path to the file
                filename = os.path.join(directory, file)

                # Open the raster
                dataset = rasterio.open(filename)

                # Read the band in chunks
                chunks = read_raster_in_chunks(dataset)

                # Process one chunk at a time
                for i, (band_chunk, ji) in enumerate(chunks):

                    # Find the indices where band_chunk has a value of 100 (classified as the crop)
                    indices = np.where(band_chunk == 100)

                    # Determine the total number of classified pixels
                    total_classified = len(indices[0])

                    # Determine the proportion of classified pixels to sample
                    proportion = 0.0001  # Sample 0.01% of classified pixels

                    # Determine the number of samples to draw
                    num_samples = int(np.round(total_classified * proportion))

                    # Skip this chunk if no samples would be drawn
                    if num_samples < 1:
                        continue

                    # Randomly sample num_samples points from the classified indices
                    sample_indices = np.random.choice(total_classified, size=num_samples, replace=False)

                    # Get the row and column indices within the full raster
                    rows = indices[0][sample_indices] + ji[0]*band_chunk.shape[0]
                    cols = indices[1][sample_indices] + ji[1]*band_chunk.shape[1]

                    # Convert the row and column indices to coordinates (EPSG:4326)
                    x_coords, y_coords = dataset.xy(rows, cols)

                    data = {'lon': x_coords, 'lat': y_coords, 'aez_id': aez_id, 'crop': crop_types[directory]}

                    df = pd.DataFrame(data)

                    # Append the df to dfs
                    dfs.append(df)

# Concatenate all the dataframes in dfs
final_df = pd.concat(dfs, ignore_index=True)

Executing the above generates 7M points (for all 3 crop types across all applicable AEZs):
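
A quick way to confirm the count and the per-crop breakdown (a sketch):

# Inspect the generated points
print(len(final_df))
print(final_df['crop'].value_counts())
final_df.head()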

Tag Administrative Regions

The next step is to tag each of the points by the administrative boundaries they fall inside. This makes the dataset versatile for multiple use cases.

We use the Natural Earth Features country and state-level administrative datasets to tag the farm locations generated above. For simplicity, we will access the datasets through Cartopy, but the full list of datasets may also be accessed here. Note that you can of course tag locations using any boundary file that is appropriate for your use case. 

# Paths to the Natural Earth shapefiles (Cartopy downloads them on first use)
ne_level_0 = shpreader.natural_earth(resolution='10m', category='cultural', name='admin_0_countries')
ne_level_1 = shpreader.natural_earth(resolution='10m', category='cultural', name='admin_1_states_provinces')


# Read the shapefile using GeoPandas
countries_gdf = gpd.read_file(ne_level_0)
states_gdf = gpd.read_file(ne_level_1)

Countries (Admin Level 0), Regions, Subregions, Continents

# Convert the df DataFrame to a GeoDataFrame
geometry = [Point(xy) for xy in zip(final_df['lon'], final_df['lat'])]
merged_geo = gpd.GeoDataFrame(final_df, geometry=geometry)

# Set the CRS of the merged_geo GeoDataFrame to EPSG:4326 (WGS 84)
merged_geo.crs = "EPSG:4326"

# Transform the CRS of merged_geo to match the CRS of countries_gdf
merged_geo = merged_geo.to_crs(countries_gdf.crs)

# Perform a spatial join
joined = gpd.sjoin(merged_geo, countries_gdf, predicate='within', how='left')
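
The joined GeoDataFrame now carries the country-level attributes for each point. These need to be copied back into final_df so they are available later; a minimal sketch of that step is shown below, assuming the standard Natural Earth admin-0 field names (NAME, REGION_WB, SUBREGION, CONTINENT):

# Remove duplicate index after spatial join
joined = joined.loc[~joined.index.duplicated(keep='first')]

# Copy the required columns back to the df DataFrame
# (NAME, REGION_WB, SUBREGION, CONTINENT are the assumed Natural Earth field names)
final_df['Country (Admin 0)'] = joined['NAME']
final_df['Region (World Bank)'] = joined['REGION_WB']
final_df['Subregion'] = joined['SUBREGION']
final_df['Continent'] = joined['CONTINENT']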

States (Admin Level 1)

# Convert the df DataFrame to a GeoDataFrame
geometry = [Point(xy) for xy in zip(final_df['lon'], final_df['lat'])]
merged_geo = gpd.GeoDataFrame(final_df, geometry=geometry)

# Set the CRS of the merged_geo GeoDataFrame to EPSG:4326 (WGS 84)
merged_geo.crs = "EPSG:4326"

# Transform the CRS of merged_geo to match the CRS of states_gdf
merged_geo = merged_geo.to_crs(states_gdf.crs)

# Perform a spatial join
joined = gpd.sjoin(merged_geo, states_gdf, predicate='within', how='left')

# Remove duplicate index after spatial join
joined = joined.loc[~joined.index.duplicated(keep='first')]

# Copy the required columns back to the df DataFrame
final_df['State (Admin 1)'] = joined['name']

The DataFrame, which originally contained only point locations and the corresponding crops, is now tagged with the various administrative boundaries. Inspecting it visually:

Remove Excess Points Generated in Regions with AEZ Overlap

Because AEZ regions sometimes overlap each other, our random sampling methodology yields excess points in areas of overlap. This can be seen visually in the image above. The density of points in these areas of overlap is around 40% higher than it should be. There is an easy fix: delete 40% of the points that fall in regions where AEZs overlap. To do this, we use the WorldCereal_AEZ.geojson file containing the AEZ footprint polygons.

# Convert the DataFrame to a GeoDataFrame
geometry = [Point(xy) for xy in zip(final_df['lon'], final_df['lat'])]
geo_df = gpd.GeoDataFrame(final_df, geometry=geometry)

# Set CRS for geo_df to match aez
geo_df.set_crs(epsg=4326, inplace=True)

# Read the geojson to a GeoDataFrame
aez = gpd.read_file('data/WorldCereal_AEZ.geojson')

# Create a spatial join between the points and the polygons to determine which points fall in overlapping polygons
overlaps = gpd.sjoin(geo_df, aez, how='inner', predicate='within')

# Group points by location (lon, lat) and keep those that fall within more than one AEZ polygon (overlapping areas)
overlapping_points = overlaps.groupby(['lon', 'lat']).filter(lambda x: len(x) > 1)

# Sample 40% of the rows 
overlapping_points_sampled = overlapping_points.sample(frac=0.40)

# If you want to reset index
overlapping_points_sampled.reset_index(drop=True, inplace=True)

# Create a key in both dataframes for merging
overlapping_points_sampled['key'] = overlapping_points_sampled['lon'].astype(str) + '-' + overlapping_points_sampled['lat'].astype(str)
final_df['key'] = final_df['lon'].astype(str) + '-' + final_df['lat'].astype(str)

# Perform a left merge to identify rows in final_df that are not present in overlapping_points_sampled
merged_df = pd.merge(final_df, overlapping_points_sampled[['key']], on='key', how='left', indicator=True)

# Filter rows that are only present in final_df (not present in overlapping_points_sampled)
final_df_filtered = merged_df[merged_df['_merge'] == 'left_only']

# Remove the key and _merge columns
final_df_filtered = final_df_filtered.drop(columns=['key', '_merge'])

final_df_filtered = final_df_filtered[final_df_filtered['aez_id'] != 2008]
final_df_filtered = final_df_filtered.reset_index(drop=True)

By removing the excess points we have reduced the size of the output by nearly 400,000 points:

final_df_filtered

Upon visual inspection, we can see that the excess points that were generated in areas of AEZ overlap have been eliminated: 

Filter Out Problem Areas

During the generation process, points can show up along the edges of some of the AEZs in regions that do not correspond to farms. We manually drew rectangular polygons around all areas where these points occurred, and saved the polygons in geojson format in a list.

problem_areas = [{"type":"Polygon","coordinates":[[[-91.41645262581089,54.84799699573268],[-91.41645262581089,54.40865331122363],[-88.50502911906254,54.40865331122363],[-88.50502911906254,54.84799699573268],[-91.41645262581089,54.84799699573268]]]},
{"type":"Polygon","coordinates":[[[-79.38191682251309,54.670532808406904],[-79.38191682251309,54.49882652204195],[-78.23689388953464,54.49882652204195],[-78.23689388953464,54.670532808406904],[-79.38191682251309,54.670532808406904]]]},
{"type":"Polygon","coordinates":[[[-113.05802983596446,14.41831703002663],[-113.05802983596446,10.575178190653537],[-111.7893818359246,10.575178190653537],[-111.7893818359246,14.41831703002663],[-113.05802983596446,14.41831703002663]]]},
{"type":"Polygon","coordinates":[[[-63.27121541570759,4.99991096599964],[-63.27121541570759,4.644926639828461],[-61.255111317670405,4.644926639828461],[-61.255111317670405,4.99991096599964],[-63.27121541570759,4.99991096599964]]]},
{"type":"Polygon","coordinates":[[[-55.22828268903994,4.8967529432137775],[-55.22828268903994,4.6843848378761095],[-54.05519296541787,4.6843848378761095],[-54.05519296541787,4.8967529432137775],[-55.22828268903994,4.8967529432137775]]]},
{"type":"Polygon","coordinates":[[[14.734498345968422,16.466073725257086],[14.734498345968422,16.153617663042045],[16.160038004412126,16.153617663042045],[16.160038004412126,16.466073725257086],[14.734498345968422,16.466073725257086]]]},
{"type":"Polygon","coordinates":[[[10.047319891400315,25.881031010977033],[13.892714535598202,25.881031010977033],[13.892714535598202,25.90303910294598],[10.047319891400315,25.90303910294598],[10.047319891400315,25.881031010977033]]]},
{"type":"Polygon","coordinates":[[[25.107042348440654,16.476830796786135],[25.107042348440654,16.16039393217118],[29.22283561548171,16.16039393217118],[29.22283561548171,16.476830796786135],[25.107042348440654,16.476830796786135]]]},
{"type":"Polygon","coordinates":[[[-4.002359251877331,3.201968162153812],[-4.002359251877331,0.9911969039241092],[-3.0296770780975124,0.9911969039241092],[-3.0296770780975124,3.201968162153812],[-4.002359251877331,3.201968162153812]]]},
{"type":"Polygon","coordinates":[[[31.16567670234616,6.839946329744742],[31.16567670234616,6.597515189353094],[32.33595020343498,6.597515189353094],[32.33595020343498,6.839946329744742],[31.16567670234616,6.839946329744742]]]},
{"type":"Polygon","coordinates":[[[40.89590023680697,19.097407436266657],[38.494290401310344,19.097407436266657],[38.494290401310344,19.368491481354894],[40.89590023680697,19.368491481354894],[40.89590023680697,19.097407436266657]]]},
{"type":"Polygon","coordinates":[[[32.93416524828737,31.775717980082778],[32.93416524828737,31.40013168824121],[33.136596368919584,31.40013168824121],[33.136596368919584,31.775717980082778],[32.93416524828737,31.775717980082778]]]},
{"type":"Polygon","coordinates":[[[85.98995391688953,39.165321115823744],[85.98995391688953,39.04406645824486],[91.67097903032186,39.04406645824486],[91.67097903032186,39.165321115823744],[85.98995391688953,39.165321115823744]]]},
{"type":"Polygon","coordinates":[[[57.034595092248644,59.83370125144946],[57.034595092248644,59.73558220090944],[58.68328068555554,59.73558220090944],[58.68328068555554,59.83370125144946],[57.034595092248644,59.83370125144946]]]},
{"type":"Polygon","coordinates":[[[16.901144386723477,37.11315362858931],[16.901144386723477,36.347826299638434],[17.259848338498678,36.347826299638434],[17.259848338498678,37.11315362858931],[16.901144386723477,37.11315362858931]]]},
{"type":"Polygon","coordinates":[[[-11.74332466793228,45.28783163983705],[-11.74332466793228,44.66939402069726],[-11.240611651323693,44.66939402069726],[-11.240611651323693,45.28783163983705],[-11.74332466793228,45.28783163983705]]]},
{"type":"Polygon","coordinates":[[[112.63187216718639,-34.99965785029128],[112.63187216718639,-35.981599748501544],[113.46733203935531,-35.981599748501544],[113.46733203935531,-34.99965785029128],[112.63187216718639,-34.99965785029128]]]},
{"type":"Polygon","coordinates":[[[73.69232615055289,15.376372183850695],[73.66955721328479,15.376372183850695],[73.66955721328479,15.715384183383437],[73.69232615055289,15.715384183383437],[73.69232615055289,15.376372183850695]]]},
{"type":"Polygon","coordinates":[[[73.58561245202628,15.226695719650461],[73.58561245202628,14.196424425273419],[73.80691834331373,14.196424425273419],[73.80691834331373,15.226695719650461],[73.58561245202628,15.226695719650461]]]},
{"type":"Polygon","coordinates":[[[121.77275981253091,53.3317057483241],[121.77275981253091,53.29519565546789],[124.27015494296099,53.29519565546789],[124.27015494296099,53.3317057483241],[121.77275981253091,53.3317057483241]]]},
{"type":"Polygon","coordinates":[[[100.19693890959938,40.29125609428906],[100.19693890959938,40.21829955885904],[104.88706903084042,40.21829955885904],[104.88706903084042,40.29125609428906],[100.19693890959938,40.29125609428906]]]}]

We use the list of polygons to filter out any points that may have been generated within these known problem areas.

# Create the list of Polygon objects from the geojson data
problem_polygons = [Polygon(area['coordinates'][0]) for area in problem_areas]

# Function to check if a point is in any of the problem areas
def is_in_problem_area(row):
    point = Point(row['lon'], row['lat'])
    return any(polygon.contains(point) for polygon in problem_polygons)

# Apply the function to each row in the dataframe
final_df_filtered['is_in_problem_area'] = final_df_filtered.apply(is_in_problem_area, axis=1)

# Filter the dataframe to remove points in problem areas
final_df_filtered = final_df_filtered[final_df_filtered['is_in_problem_area'] == False]

# You can drop the 'is_in_problem_area' column afterwards if you wish
final_df_filtered = final_df_filtered.drop(columns=['is_in_problem_area'])

final_df_filtered = final_df_filtered.reset_index(drop=True)

final_df_filtered
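
For the full ~7M-point dataset, the row-wise apply above can be slow. If you prefer a vectorized approach, an alternative (a sketch, not the notebook’s original method) is to run a single spatial join against the problem polygons and drop the matching rows:

# Alternative sketch: vectorized removal of points in problem areas via a spatial join
problem_gdf = gpd.GeoDataFrame(geometry=problem_polygons, crs='EPSG:4326')
points_gdf = gpd.GeoDataFrame(
    final_df_filtered,
    geometry=[Point(xy) for xy in zip(final_df_filtered['lon'], final_df_filtered['lat'])],
    crs='EPSG:4326'
)
hits = gpd.sjoin(points_gdf, problem_gdf, predicate='within', how='inner')
final_df_filtered = final_df_filtered.drop(index=hits.index.unique()).reset_index(drop=True)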

Final Processing

We’ll do some final formatting on the dataset we’ve generated. Specifically, we reorder the columns into a more sensible order and rename the ‘crop’ column to ‘Crop’.

final_df_filtered = final_df_filtered[['lon', 'lat', 'crop', 'Country (Admin 0)', 'State (Admin 1)', 'Region (World Bank)', 'Subregion', 'Continent', 'aez_id']]
final_df_filtered = final_df_filtered.rename(columns={'crop': 'Crop'})

final_df_filtered

Saving Final Locations Dataset

Finally, we will save the locations dataset in both cloud-native (Parquet & GeoParquet) and standard tabular (CSV) formats.

Parquet (.parquet)

final_df_filtered.to_parquet('world_cereals_scaled_final_all.parquet') # 101 MB

GeoParquet (.geoparquet)

# Creating the geometry column from the coordinates
geometry = [Point(xy) for xy in zip(final_df_filtered['lon'], final_df_filtered['lat'])]

# Creating a GeoDataFrame by adding the geometry column to the original DataFrame
final_gdf = gpd.GeoDataFrame(final_df_filtered, geometry=geometry, crs="EPSG:4326")

final_gdf.drop(columns=['lon','lat']).to_parquet('world_cereals_scaled_final_all.geoparquet') # 100 MB
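
To verify the GeoParquet output, you can read it straight back with GeoPandas (a quick sketch):

# Read the GeoParquet back and confirm the geometry and CRS survived the round trip
check = gpd.read_parquet('world_cereals_scaled_final_all.geoparquet')
print(check.crs)
print(len(check))
check.head()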

CSV (.csv)

For CSV, we export the full dataset we generated; to also export a random 3M-point subset, uncomment the second line below.

final_df_filtered.to_csv('world_cereals_scaled_final_all_demo.csv') 
# final_df_filtered.sample(n=3000000, random_state=1).to_csv('world_cereals_scaled_final_3M.csv')

The ESA WorldCereal dataset will have massive implications for our ability to monitor crops no matter where they are in the world. With WorldCereal available as point locations in the cloud-native Parquet and GeoParquet formats, these data are immediately unlocked for data science workflows and for anyone wishing to do time series analytics with remote sensing data. From building more accurate crop yield models than ever before to constructing higher-resolution crop calendars globally, point sampling WorldCereal opens up all kinds of analysis possibilities.