Why I rolled my own gis_puller

MissMissM (she/her)
5 min read · Jan 19, 2022


Photo by Greg Rosenke on Unsplash

I needed Geographic Information System (GIS) boundary data for Australian suburbs (a.k.a. localities), as well as the Geoscape Geocoded National Address File (GNAF) data, to do forward and reverse lookups between coordinates and addresses, plus nearby() queries, for a rapid prototype I was iterating on.

Sometimes there is something…

Base maps are fairly standardised, which lets you stay relatively agnostic of service providers. But for more interactive things, like clicking a point on a map and translating it into an address at the basic level, or auto-completing an address on the other hand, you’ll need to make API calls somewhere, and things can get complicated (and expensive) fast.

So yes, normally one could just use some SaaS for the specific functionality, but for my use-case I didn’t want the anxiety of dealing with overage charges, which are easy to run into.

And often there isn’t..

And since each of these service providers has its own API, moving from SaaS to SaaS can be a painful experience that requires a balancing act: figuring out whether to bring in a dependency (or several) or just roll it yourself.

Nor did I like the fact that every request I submit gets logged in somebody else’s system, or that the provider I choose at a given time might later disappear or suddenly change its revenue/licensing model, potentially forcing me to re-tool in a rush if I am not provider agnostic.

Additionally, I like the idea that I can just spin up additional nearby() / lookup nodes, say in a Consul service mesh, and place them near wherever I need them, on a protocol (like gRPC/protobuf) I consider appropriate.

That matters because for the prototype I needed a low-latency service for things like auto-completing addresses: one that copes with the typos people make and is as helpful and fast as possible.

So I’ve started an attempt to create one..

I came to the conclusion that the real-world ecosystem had neither a library nor a usable standard way of acquiring GIS datasets from around the world in a generic manner, so I decided to iterate on a prototype to see whether we can change this.

Since there are so many different ways to obtain these datasets, I had to make some hard initial decisions about how to standardise this in library form, so I split my use-case specific libraries into:

  • Acquisition
  • Transform / Load and
  • Consumer

For the acquisition

I had a closed-source implementation that fit my use-case but I decided to make a shareable version for the benefit of myself and others.

You can find my early shareable prototype on GitHub or Crates.io.

After I have stabilised the interfaces and added a few more data sources, I’ll move it under Georust.

I implemented the Australian Geoscape datasets from data.gov.au first and ran into the difficulties below.

  • Resource URIs for the datasets are neither standardised nor enforced

Indexing difficult resource URIs

I had a go at the data.gov.au end-point, where one can apparently do some type of “full-text” search for datasets; it begs for more documentation on how to construct the query part.

I could have started from searching organisations for Geoscape but I wanted to keep things simpler at this stage.

I found quite a bit of deviation between Australian states in how the datasets were loaded into data.gov.au, so I could not rely on uniformity.

For example, if you search for locality boundaries and look at the downloadURLs:

Catalog = [
"https://data.gov.au/data/dataset/af33dd8c-0534-4e18-9245-fc64440f742e/resource/3b946968-319e-4125-8971-2a33d5bf000c/download/vic_locality_polygon_shp.zip",
"https://data.gov.au/data/dataset/af33dd8c-0534-4e18-9245-fc64440f742e/resource/4d6ec8bb-1039-4fef-aa58-6a14438f29b1/download/vic_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/data/dataset/0257a9da-b558-4d86-a987-535c775cf8d8/resource/d9100544-182d-470c-b3b2-75812322c495/download/act_locality_polygon_shp.zip",
"https://data.gov.au/data/dataset/0257a9da-b558-4d86-a987-535c775cf8d8/resource/b91e5877-5426-416d-99c6-355d15d2c461/download/act_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/dataset/09276b8e-1447-4892-99a3-0f67c421f327/resource/95be2deb-2050-4ef3-b549-4e8a83421bfc/download/ot_locality_polygon_shp.zip",
"https://data.gov.au/dataset/09276b8e-1447-4892-99a3-0f67c421f327/resource/e441ed10-fbf0-4ac5-aa69-eecce6c41a74/download/ot_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/data/dataset/6bedcb55-1b1f-457b-b092-58e88952e9f0/resource/d6d33141-3a9b-4648-a810-1c20387fdb28/download/qld_locality_polygon_shp.zip",
"https://data.gov.au/data/dataset/6bedcb55-1b1f-457b-b092-58e88952e9f0/resource/d20d0a54-7680-43c4-8c46-a08e3bc43fa0/download/qld_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/data/dataset/12eca357-6bad-4130-9c47-eaaf4c11e039/resource/0def58c2-343e-47b2-a270-a35637c2f7b9/download/nt_locality_polygon_shp.zip",
"https://data.gov.au/data/dataset/12eca357-6bad-4130-9c47-eaaf4c11e039/resource/55c87a60-19bf-4ef3-b76d-d8c274ab7ae0/download/nt_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/data/dataset/91e70237-d9d1-4719-a82f-e71b811154c6/resource/5e295412-357c-49a2-98d5-6caf099c2339/download/nsw_loc_polygon_shp.zip",
"https://data.gov.au/data/dataset/91e70237-d9d1-4719-a82f-e71b811154c6/resource/5f5ca807-0586-4b93-87dd-891691985272/download/nsw_loc_polygon_shp_gda2020.zip",
"https://data.gov.au/data/dataset/6a0ec945-c880-4882-8a81-4dbcb85e74e5/resource/141fc7bd-c75f-49b5-a116-35250eea68cd/download/wa_locality_polygon_shp.zip",
"https://data.gov.au/data/dataset/6a0ec945-c880-4882-8a81-4dbcb85e74e5/resource/9fff5439-7af5-42f4-9102-42c4199c5c1c/download/wa_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/data/dataset/8bd7b6c1-1258-4df5-a98f-b6706e87de1e/resource/eef84972-78c3-4013-957f-45fa03196e05/download/tas_locality_polygon_shp.zip",
"https://data.gov.au/data/dataset/8bd7b6c1-1258-4df5-a98f-b6706e87de1e/resource/c0e8ebd4-2be7-414e-956e-6ed25a1c11a4/download/tas_locality_polygon_shp-gda2020.zip",
"https://data.gov.au/data/dataset/bcfcfc9a-7c8d-479a-9bdf-b95ca66ad29a/resource/d8297c5a-2c92-43ab-a42e-e04392376386/download/sa_loc_polygon_shp.zip",
"https://data.gov.au/data/dataset/bcfcfc9a-7c8d-479a-9bdf-b95ca66ad29a/resource/8937ec5b-69d5-4d8c-86e0-687e61a02903/download/sa_loc_polygon_shp_gda2020.zip",
]

You see the URIs deviate starting from the base:

https://data.gov.au/data/dataset/
https://data.gov.au/dataset/

At least they all have consistent UUID-style identifiers, which may or may not change. In my closed-source implementation I had just hardcoded them.

Plus the naming convention of the actual files differs: NSW and SA use "loc" where the other states use "locality":

nsw_loc_polygon_shp.zip
ot_locality_polygon_shp.zip
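
To make the naming differences concrete, here is a minimal std-only Rust sketch that pulls the region code and datum suffix out of the file names listed above. The function name `parse_locality_file` and the `Datum` enum are hypothetical and are not part of gis_puller; files without a suffix are simply labelled `Unspecified`, since the catalogue itself does not say which datum they use.

```rust
/// Datum suffix found in the file name. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub enum Datum {
    /// No datum suffix in the file name.
    Unspecified,
    /// File name ends in "-gda2020" or "_gda2020".
    Gda2020,
}

/// Hypothetical helper: split a download file name into (region code, datum).
/// Returns None when the name does not follow either observed convention.
pub fn parse_locality_file(name: &str) -> Option<(String, Datum)> {
    let stem = name.strip_suffix(".zip")?;

    // The datum suffix is joined with '-' for most states and '_' for NSW and SA.
    let (stem, datum) = match stem
        .strip_suffix("-gda2020")
        .or_else(|| stem.strip_suffix("_gda2020"))
    {
        Some(rest) => (rest, Datum::Gda2020),
        None => (stem, Datum::Unspecified),
    };

    // Accept both the long ("locality") and the abbreviated ("loc") layer name.
    let region = stem
        .strip_suffix("_locality_polygon_shp")
        .or_else(|| stem.strip_suffix("_loc_polygon_shp"))?;

    // Region codes in the catalogue are short lowercase ASCII strings.
    if region.is_empty() || !region.chars().all(|c| c.is_ascii_lowercase()) {
        return None;
    }
    Some((region.to_string(), datum))
}
```

Calling `parse_locality_file("sa_loc_polygon_shp_gda2020.zip")` yields the region `"sa"` with `Datum::Gda2020`, while a name that follows neither convention yields `None`.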

I decided to solve this problem by using a configurable Regex:

^https://data.gov.au/data(?:(/dataset|set))/[a-f0-9\\-]+/resource/[a-f0-9\\-]+/download/[A-Za-z0-9_]+(?:([_\\-]?gda[0-9]*|)).zip$

Now this is a hack and it will break whenever they rename their files, but at least I don’t need a re-compilation, nor do I rely on some other funky data which is not under my control.
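
For readers who want to see what shape the pattern accepts, below is a rough std-only Rust sketch that checks the same structure by hand instead of using a regex engine. It is not how gis_puller does it (that uses the configurable regex); `match_download_url` and its helpers are hypothetical names, and the file-name check is slightly looser than the regex in that it allows '-' anywhere in the name, while being stricter about the literal ".zip" ending.

```rust
/// True when `s` is non-empty and contains only lowercase hex digits or '-'.
fn is_hex_id(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f' | b'-'))
}

/// True when `s` looks like "<name>.zip" with name made of [A-Za-z0-9_-].
fn is_zip_name(s: &str) -> bool {
    match s.strip_suffix(".zip") {
        Some(stem) => {
            !stem.is_empty()
                && stem
                    .bytes()
                    .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
        }
        None => false,
    }
}

/// Hypothetical check mirroring the shape of the regex.
/// Returns the download file name when the URL matches.
pub fn match_download_url(url: &str) -> Option<&str> {
    // Both base paths seen in the catalogue are accepted.
    let rest = url
        .strip_prefix("https://data.gov.au/data/dataset/")
        .or_else(|| url.strip_prefix("https://data.gov.au/dataset/"))?;

    // Expected remainder: <dataset-id>/resource/<resource-id>/download/<file>.zip
    let parts: Vec<&str> = rest.split('/').collect();
    match parts.as_slice() {
        [dataset, "resource", resource, "download", file]
            if is_hex_id(dataset) && is_hex_id(resource) && is_zip_name(file) =>
        {
            Some(*file)
        }
        _ => None,
    }
}
```

The trade-off is the one described above: a hand-written check like this needs a re-compilation whenever the naming changes, which is exactly what keeping the pattern in configuration avoids.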

So my attempt at making a library for now is really a bunch of hacks that abstract the way we acquire datasets since the world again is not perfect.

If the world were perfect we would have no hacks, but that would require all the world’s data source providers to index their datasets following the exact same patterns. Perhaps we can both standardise this and motivate all the providers to follow it.

But for now, just like git clone, we can clone a dataset into, say, a container and transform/load it into a database in our local cluster(s) near where we need it.

Ironically, as of writing, my gis_puller is still missing the actual data pull beyond pulling the index, as I went down the rabbit hole of evaluating and benchmarking different ways of pulling small vs large datasets, which one might want to do via my shareable library, as opposed to what my specific use-case needed.

You see, GNAF is a single 1.5GB zip file, whereas the AU locality boundaries are split across many much smaller files:

$ ls -hsS | awk 1
total 322M
60M nsw_loc_polygon_shp_gda2020.zip
60M nsw_loc_polygon_shp.zip
22M vic_locality_polygon_shp-gda2020.zip
22M vic_locality_polygon_shp.zip
22M sa_loc_polygon_shp_gda2020.zip
22M sa_loc_polygon_shp.zip
21M wa_locality_polygon_shp-gda2020.zip
21M wa_locality_polygon_shp.zip
20M tas_locality_polygon_shp-gda2020.zip
20M tas_locality_polygon_shp.zip
16M qld_locality_polygon_shp-gda2020.zip
16M qld_locality_polygon_shp.zip
1.3M act_locality_polygon_shp-gda2020.zip
1.3M act_locality_polygon_shp.zip
1.1M nt_locality_polygon_shp-gda2020.zip
1.1M nt_locality_polygon_shp.zip
312K ot_locality_polygon_shp-gda2020.zip
308K ot_locality_polygon_shp.zip
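
Given that spread in sizes, one way to frame the small-vs-large question is a size threshold that decides how a file gets pulled. The Rust sketch below is purely illustrative: the two strategies (buffer in memory vs stream to disk), the names `PullStrategy`, `parse_human_size` and `choose_strategy`, and the threshold parameter are all assumptions of mine, not what gis_puller does or what was benchmarked.

```rust
/// How a download could be handled. Both variants are illustrative assumptions.
#[derive(Debug, PartialEq)]
pub enum PullStrategy {
    /// Small archive: read the whole body into memory.
    BufferInMemory,
    /// Large or unknown size: write the body to a file in chunks.
    StreamToDisk,
}

/// Parse sizes in the style printed by `ls -hs`, e.g. "312K", "1.3M", "1.5G".
/// Units are treated as powers of 1024; a bare number is taken as bytes.
pub fn parse_human_size(s: &str) -> Option<u64> {
    let s = s.trim();
    let (num, mult) = match s.chars().last()? {
        'K' => (&s[..s.len() - 1], 1024.0),
        'M' => (&s[..s.len() - 1], 1024.0 * 1024.0),
        'G' => (&s[..s.len() - 1], 1024.0 * 1024.0 * 1024.0),
        _ => (s, 1.0),
    };
    let value: f64 = num.parse().ok()?;
    // Fractional bytes are truncated.
    Some((value * mult) as u64)
}

/// Pick a strategy from a known size (if any) and an in-memory limit in bytes.
pub fn choose_strategy(size: Option<u64>, in_memory_limit: u64) -> PullStrategy {
    match size {
        Some(n) if n <= in_memory_limit => PullStrategy::BufferInMemory,
        // Unknown sizes are treated like large ones to stay on the safe side.
        _ => PullStrategy::StreamToDisk,
    }
}
```

With a 64 MiB limit, every locality file in the listing would be buffered in memory, while something the size of GNAF would be streamed to disk.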
