Blog, COLDSURF

Handling Spatial Data from a Service Perspective

This article focuses on optimizing the handling of spatial data in MySQL from a service-oriented perspective.

While the methods described here are specific to MySQL, the basic principles can also apply to other databases (e.g., PostgreSQL).

Introduction

Handling spatial data from a service perspective can be challenging. Spatial data generally refers to two-dimensional map coordinates with latitude and longitude values.

So, why is this difficult? Here are a few reasons:

  1. Coordinate data often contains many decimal points, meaning that parsing exact data requires extensive pre- and post-processing.
  2. When working with large amounts of coordinate data, a lot of calculations are required, increasing exponentially with the number of rows.

This led me to wonder: is there a way to manage spatial data with elegance and simplicity from a service perspective?

Working with Coordinate Data from a Data Perspective

If we handle coordinate data purely from a data perspective (not from a service perspective), we might approach it as follows:

  1. Prioritizing accuracy over query speed.
  2. Extracting precise data as the highest priority.

MySQL offers several spatial data types and functions to handle coordinates.

Spatial Datatypes

MySQL provides several spatial datatypes for handling coordinates, including POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING, MULTIPOLYGON, and GEOMETRYCOLLECTION.

For instance, to define and query coordinates in MySQL, we can use the POINT type:

CREATE TABLE LOCATION (
    location_point POINT
);

SELECT
    ST_Y(location_point) AS latitude,
    ST_X(location_point) AS longitude
FROM LOCATIONS;

With the ST_Contains operator, we can check if a POINT data falls within a specific boundary:

SELECT
    NOT ST_Contains(
        (SELECT shape FROM spatial_data WHERE id = 1),
        (SELECT shape FROM spatial_data WHERE id = 2)
    ) AS is_outside;

However, this approach can be computationally expensive, especially as the complexity of polygons increases. Using spatial indexing on POINT types can help optimize operations like ST_Contains, making it better suited for applications requiring high precision in data science.

But from my perspective, handling raw spatial queries increases complexity and may reduce developer experience (DX).

Handling Spatial Data from a Service Perspective (Using Geohash)

In most service contexts, the following aspects of spatial data are typically prioritized:

  1. High precision is not necessary for location data.
  2. Filtering by approximate location is sufficient (e.g., retrieving coordinates within a city or region).

For these cases, using MySQLโ€™s spatial types may not be necessary, leading me to explore Geohash.

Geohash with Prefix

Geohashing enables us to represent a geographic location as a string of characters, where the prefix of the Geohash represents a broader area.

Hereโ€™s a sample table and query structure using Geohash for approximate location matching:

CREATE TABLE locations (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    latitude DOUBLE,
    longitude DOUBLE,
    geohash CHAR(12),
    INDEX (geohash)
);

INSERT INTO locations (name, latitude, longitude, geohash)
VALUES ('Location A', 37.7749, -122.4194, 'wydm'),
       ('Location B', 37.7750, -122.4180, 'wydm4'),
       ('Location C', 37.7800, -122.4200, 'wydms');

Using the Geohash prefix for querying allows us to filter data within a specific boundary without complex operations:

SELECT * FROM locations
WHERE geohash LIKE 'wydm%';

This approach keeps query code simple, improving DX and allowing ORM tools like Prisma to support the data types directly (e.g., using DOUBLE). Geohash also makes it easier for developers and planners to collaborate by using intuitive, human-understandable location precision levels.

For example, zoom levels on a map could correspond to specific Geohash precision levels, creating a consistent understanding across teams. The precision length of a Geohash string reflects its granularity; by zooming in on an area, only locations within a higher-precision Geohash boundary are retrieved.

Implementation with Prisma ORM and Typescript for Clustering Data

Define Prisma Schema

model Location {
  id        Int     @id @default(autoincrement())
  name      String
  latitude  Float
  longitude Float
  geohash   String  @db.VarChar(12) // Use a VARCHAR column for geohashes
  @@index([geohash]) // Index to optimize geohash queries
}

Install Geohash Utility

yarn add ngeohash

Define Utility Functions

To encode geohashes in geohashUtil.ts:

import * as ngeohash from 'ngeohash';

export function encodeGeohash(latitude: number, longitude: number, precision: number = 12): string {
  return ngeohash.encode(latitude, longitude, precision);
}

Insert and Query Data

Inserting location data:

import { PrismaClient } from '@prisma/client';
import { encodeGeohash } from './geohashUtils';

const prisma = new PrismaClient();

async function insertLocation(name: string, latitude: number, longitude: number) {
  const geohash = encodeGeohash(latitude, longitude);
  await prisma.location.create({
    data: {
      name,
      latitude,
      longitude,
      geohash
    }
  });
}

Querying locations by Geohash prefix:

export async function findLocationsByGeohashPrefix(prefix: string) {
  return await prisma.location.findMany({
    where: {
      geohash: {
        startsWith: prefix,
      },
    },
  });
}

Grouping Data by Geohash Prefix

To retrieve and group data by Geohash prefix:

async function groupLocationsByGeohashPrefix() {
  const prefixes = ['wydm', 'wydn', 'wydp']; // Example prefixes
  const groupedLocations: { [key: string]: any[] } = {};

  for (const prefix of prefixes) {
    const locations = await findLocationsByGeohashPrefix(prefix);
    groupedLocations[prefix] = locations;
  }

  return groupedLocations;
}

groupLocationsByGeohashPrefix().then(groups => {
  console.log('Grouped Locations by Geohash Prefix:', groups);
});

Calculating Cluster Centroids

If we need to compute centroids for Geohash clusters:

type Location = { latitude: number; longitude: number };
type Cluster = { [key: number]: Location[] };

function computeCentroids(clusters: Cluster): { centroidLat: number; centroidLng: number }[] {
  return Object.values(clusters).map((locations) => {
    const totalLat = locations.reduce((sum, loc) => sum + loc.latitude, 0);
    const totalLng = locations.reduce((sum, loc) => sum + loc.longitude, 0);
    const count = locations.length;

    return {
      centroidLat: totalLat / count,
      centroidLng: totalLng / count,
    };
  });
}

async function exampleUsage() {
  const locations = await findLocationsByGeohashPrefix('wydm');

  const clusterAssignments = {
    0: [{ latitude: 37.7749, longitude: -122.4194 }, { latitude: 37.7750, longitude: -122.4180 }],
    1: [{ latitude: 34.0522, longitude: -118.2437 }],
    2: [{ latitude: 40.7128, longitude: -74.0060 }]
  };

  const centroids = computeCentroids(clusterAssignments);
  console.log('Cluster Centroids:', centroids);
}

Conclusion

This guide outlines the core approach to working with spatial data from a service perspective, using Geohash to simplify and optimize data processing.

โ† Go home