SpatiaL ETL Tutorial: How to integrate spatial data into ETL processes

We'll show you step by step how to use spatial data and spatial operations in ETL processes.

Using the Talend Open Studio open source software and the SpatiaL ETL plug-in GeoSpatial Integration for Talend, we will show you how you can manage geodata and spatial operations in your ETL processes. But first of all, please download and install a current version of Talend Open Studio for Data Integration and Disys GeoSpatial Integration for Talend plug-in. The download package from Disy contains all the data you need to join this tutorial:

 

 

For our SpatiaL ETL scenario we have chosen the topic “Wind Parks” (or wind farms), a group of wind turbines along the Elbe mouth near Cuxhaven, Germany. The individual wind turbines are available as single point objects in a shapefile.

The focus of this tutorial is generating a convex hull around the wind turbines. For this purpose we will use the dAggregateGeometryRow component, allowing grouping point features based on an argument generating a convex hull. The output is a set of polygons that represent the wind farms; these will be written into a shapefile. This file can be used to define and visualize the wind park’s expansion area. It creates the basis for further planning steps, e.g. in the fields of road construction or nature conservation.

This scenario can also be applied to many other applications where the expansion area of events/observations needs to be defined:

  1. Evacuations due to chemical or nuclear incidents: One way to determine the perimeters for immediate evacuation is the construction of a convex hull. Zones with different radioactivity levels can be defined.
  2. Outbreak of epidemic diseases: The convex hull allows for spread of diseases to be tracked. A scenario regarding the propagation of animal diseases can be found here.
  3. Creating complex elements: To calculate a set of points’ intersection, analyze the distance between the two most distant points. First create a convex hull, then identify the two points with the furthermost space from each other.
  4. Intersecting other objects: A convex hull can be used as a preparatory step to find out whether line geometries or polygons, such as transport routes or bodies of water, intersect a group of point objects.

Step 1: Define metadata

To get a clearer view, where the wird parks are, we have visualized the single tubines using Cadenza (Fig. 1).

Fig. 1: The map shows quickly which turbines together form a park

Fig. 1: The map shows quickly which turbines together form a park

In the Repository tree view, select Metadata ->  File Shapefile ->  Create file shapefile.

Fig. 2: Metadata are created

Fig. 2: Metadata are created

Specify name and purpose and select the shapefile Windfarm.shp. After unpacking the data provided with the plugin, the file is saved in <YourInstallationPath>\Disy_GeoSpatial_Sample_Data\in\DataPreparation. The coordinate reference system is read out automatically if a .prj file is available. The scheme of the shapefile is displayed; column names and data types are imported automatically. Note that by default, the SHAPE column holds the data type Dgeometry. This type is included in the plugin to handle geometry features.

By clicking "Finish" the shapefile has been added to Talend metadata.

In this step we generate a geometry function to create a convex hull, that covers all wind turbines (Fig. 3).

Fig. 3: All wind turbines of a wind farm in a convex hull

Fig. 3: All wind turbines of a wind farm in a convex hull

In the Repository click Job Designs and create a new folder Windfarm. (Fig. 4 and 5). Enter Name, Purpose and Description.

Fig. 4: Create a job under "Job Designs"
Fig. 5. Specify name, purpose and description
Fig. 5: Specify name, purpose and description

The job is ready for processing. First, extract the data. In Metadata select File Shapefile > windfarm and drag & drop it to the job’s workspace. A dialog will open automatically. Select the component you want to work with: dShapeFileInput (Fig. 6).

Fig. 6: Automatically a dialog opens when the shapefile "windpark" is dragged into the workspace

Fig. 6: Automatically a dialog opens when the shapefile "windpark" is dragged into the workspace

Add a component to create a convex hull. In the component palette, select Disy GeoSpatial > Geom Operations and drag & drop dAggregateGeometryRow to the job’s workspace (Fig. 7).

Fig. 7: The dAggregateGeometryRow component is dragged into the workspace

Fig. 7: The dAggregateGeometryRow component is dragged into the workspace

Connect dShapeFileInput component to dAggregateGeometryRow using Row > Main from the context menu. Label the connection as windfarm. (Fig. 8).

Fig. 8: A main-row connection is created

Fig. 8: A main-row connection is created

Select dAggregateGeometryRow component and double-click it to display its Basic settings view.
Click EditScheme to define the component’s output scheme. Click Add (+ symbol) to add the following attributes:

  • Wind Park Name: NAME (String Type)
  • New Geometry: CONVEXHULL (DGeometry Type)
Fig. 9: The output scheme of the component is defined. The columns "Name" and "Convexhull" are selected

Fig. 9: The output scheme of the component is defined. The columns "Name" and "Convexhull" are selected

Now the component properties need to be grouped by the wind farm name. In Group by section, select the NAME column for both Input and Output columns.

Additionally, in the Operators section, select the following settings:

  • Output column = CONVEXHULL
  • Function = Convex hull
  • Input column = SHAPE
Fig. 10: Function and input and output settings are selected

Fig. 10: Function and input and output settings are selected

It is recommended to create a unique identifier for the data records in the shapefile. For this purpose you must add a tMap component to the workspace.
Now create a main-row link between the dAggregateGeometryRow and the tMap components. Rename the link as convex_hull. (Fig. 11).

Fig. 11: The new main row connection is now called convex_hull (Main)

Fig. 11: The new main row connection is now called convex_hull (Main)

Double-click the tMap component to open the Map Editor. Click the Add (+) button to create a new output data table. (Fig. 12).

Fig. 12: Create a new output data stream

Fig. 12: Create a new output data stream

In the output data table, create a new ID column for the unique identifier. Select Integer data type and activate the Key checkbox.
Enter the following placeholder to generate the ID: Numeric.sequence("s1",1,1) (Fig. 13 and 14). Drag all the columns of the convex_hull table to the newly created output table.

Fig 13: Name and convexhull are marked….
Fig. 13: Name und convexhull are marked…
Fig. 14: ... and dragged to to output datenstream.
Fig. 14: ... and dragged to to output datenstream.

In the output data table, create a new ID column for the unique identifier. Select Integer data type and activate the Key checkbox.

Fig. 15: In the output data table, create a new ID column for the unique identifier.

Fig. 15: In the output data table, create a new ID column for the unique identifier.

Write the resulting polygons into a shapefile. Drag dShapeFileOutput component to your job. The component is available in the palette at Disy GeoSpatial > File > Shapefile. Connect the tMap component with the dShapeFileOutput using Row > out1 (the name of your tMap output, Fig. 16).

Fig. 16: Save the output as shapefile

Fig. 16: Save the output as shapefile

Display the Basic settings of the dShapefileOutput component (Fig. 17).

Set the following properties:

  • Shapefile: select the shapefile’s storage location and provide a name (out_convex_hull.shp).
  • Schema: the schema is applied from the tMap component.
  • GeometryType: by default the proper type Polygon/Multipolygon is already selected.
  • Coordinate Reference System: enter “EPSG:3857”. The output coordinate system is the same as the input data.
  • Encoding: select ISO88-59-15.

Additionally, you may rename the labels of each component.

Fig. 17: Select Coordinate Reference System

Fig. 17: Select Coordinate Reference System

Step 3: Finally run the job

Right-click on the job and select Run job from the context.´The output files will be saved to the file specified for further use. Your job should look like this after running (Fig. 18):

Fig. 18: Your Geo-ETL job should look like this.

Fig. 18: Your Geo- ETL job should look like this.

Congratulations, you have successfully finished the Tutorial!

The surfaces created can be visualized in Cadenza’s map component (Fig. 19).

Fig. 19: Visualization of single wind turbines as windparks using Cadenza.

Fig. 19: Visualization of single wind turbines as windparks using Cadenza.