1.1 Billion Taxi Rides using DuckDB

I examine the performance of DuckDB against my 1.1B taxi rides benchmark.


Tokyo Walking Tour Guide

I build a walking tour guide using DuckDB and QGIS.


Extracting OSM Features

I break up an OSM file into 1,087 themed GeoPackage files.


Global Flight Tracking

I explore adsb.lol's flight tracking dataset.


Mapping Estonia with LiDAR

I explore the Estonian Land Board's LiDAR scans dataset.


Natural Earth's Global Geospatial Datasets

I explore Natural Earth's freely available global geospatial datasets.


Maxar's Open Satellite Feed

I explore 1 TB of Maxar's freely available satellite imagery.


Overture's Global Geospatial Datasets

I explore Overture's three global and free-to-use mapping dataset releases.


A Review of Esri's Imagery in Action MOOC

A review of their six-week spatial imagery course.


Versatile Video Coding

I walk through setting up a research and development environment for H.266 / VVC encoding.


Segmenting Satellite Images

I identify objects in aerial and phone camera imagery using Meta AI's Segmentation Model.


A Review of Esri's Spatial Data Science MOOC

A review of their six-week course which focuses on their ArcGIS Pro offering.


Enhancing ClickHouse's Geospatial Support

I review Clickgis, a Rust-based extension that adds WKB and GeoJSON support to ClickHouse.


Asking a Large Language Model How YouTube Works

I ask Platypus2 13B questions about a PDF.


Geospatial Clustering with Uber's H3 in DuckDB & QGIS

I revisit Uber's H3 with a more concise method for producing geospatial clusters.


Popular Airline Passenger Routes Refresh

I've extracted the most popular commercial airline passenger routes from 21 GB of Wikipedia articles.


Streaming Video

I walk through hosting streaming videos using FFmpeg, Bento4, Caddy Server and HLS.


IPinfo's Free IP Address Location Database

I walk through IPinfo's free IPv4 and IPv6 location database.


DuckDB's Spatial Extension

DuckDB can now open 50+ GIS file formats. I use it to help examine the Bing Maps team's AI road detection project.


Geospatial DuckDB

I walk through basic geospatial workflows in DuckDB.


European Route Planning

I build a pan-European Bus Route Planner.


Faster PostgreSQL To BigQuery Transfers

I compare shipping data via CSV and Parquet from PostgreSQL to BigQuery.


1.1 Billion Taxi Rides in ClickHouse on DoubleCloud

I investigate how fast DoubleCloud can query 1.1 billion taxi journeys using their managed ClickHouse solution.


Awesome Isochrones

I show how you can create beautiful isochrone maps using Valhalla and QGIS.


ECharts for Python

I explore a Python wrapper for Apache ECharts.


Python Data Visualisation

I explore Altair, a concise API for charting in Python.


Hardening SSH

I walk through setting up BastionZero on an AWS EC2 instance.


Pretty Maps in Python

I show how you can create beautiful maps in Python.


Making Heatmaps

I walk through a GIS toolchain for creating heatmaps.


Minimalist Guide to Poem

A review of the Rust-based Web Framework Poem.


Minimalist Guide to Axum

I review the features and community benchmarks of the Rust-based Web Framework Axum.


File Sharing with Caddy & MinIO

Cost-effective, mobile-friendly file sharing using two Go-based offerings.


Deploying 5G Around Trees

I explain how Open5G digs through 3.5 trillion records produced by a deep learning algorithm trained on a massive cluster in Switzerland that was fed imagery of the entire earth from two satellites to decide how to roll out 5G in California.


The Streets of Monaco

I walk through a GIS toolchain for visualising the streets of Monaco and its Formula 1 circuit.


Install ClickHouse Faster

I look at the latest way to get ClickHouse running quickly.


Faster Geospatial Enrichment

I compare the time taken to bin latitude and longitude pairs into H3 cells between PostgreSQL, BigQuery and ClickHouse.


Where is every IP Address?

I describe how IPinfo finds the location of almost every IP address on earth.


Faster Top Level Domain Name Extraction with Go

I port a Python-based TLD extraction script to Go.


The Fastest FizzBuzz Implementation

I look at an implementation of FizzBuzz that can generate output at a rate of 56 GB/s.


ROAPI: An API Server for Static Datasets

I review the features and benchmark ROAPI.


Actix: A Web Framework for Rust

I review the features and community benchmarks of Actix.


Rocket: A Web Framework for Rust

I review the features and community benchmarks of Rocket.


Building PostgreSQL Extensions with Rust

I build a PostgreSQL function in Rust and use it to try to transform 1.27B records.


Faster Top Level Domain Name Extraction with Rust

I port a Python-based TLD extraction script to Rust.


Track changes in Excel, Word, PowerPoint, PDFs, Images & Videos with Git

I walk through tracking changes in rich documents using Git.


Faster Compression with Snappy's S2 Extension

I walk through installing and running Snappy's S2 extension.


MeiliSearch: A Minimalist Full-Text Search Engine

I walk through installing and running MeiliSearch.


MinIO: A Bare Metal Drop-In for AWS S3

I walk through running an AWS S3-compatible storage service on HDFS.


Monitor ClickHouse with Prometheus & Grafana

Keep an eye on ClickHouse with Prometheus and Grafana.


Data Fluent for PostgreSQL

Build a better understanding of your data in PostgreSQL.


1.1 Billion Taxi Rides using Hydrolix on AWS

I examine the performance of Hydrolix against my 1.1B taxi rides benchmark.


1.1 Billion Taxi Rides using OmniSciDB and a MacBook Pro

I investigate how fast OmniSciDB can query 1.1 billion taxi journeys using a 16" MacBook Pro.


Python Web Scraping with Virtual Private Networks

Proxy Python and curl web requests through WireGuard and OpenSSH.


Fast IPv4 to Host Lookups

I compare PostgreSQL and ClickHouse performance characteristics while performing IPv4 to hostname lookups.


Faster ZIP Decompression

I compare the decompression times of various DEFLATE implementations.


Faster ClickHouse Imports

I compare import times of various formats into ClickHouse.


YouTube's Database "Procella"

I analyse material recently published on Google's "Procella" query processing engine which powers YouTube.


Is Hadoop Dead?

I analyse and debate arguments surrounding the "demise" of Hadoop.


Minimalist Guide to Lossless Compression

I look at various aspects of lossless compression.


Faster File Distribution with HDFS and S3

I look for faster ways of transferring files between HDFS and AWS S3.


A Minimalist Guide to Flume

I take a look at Apache Flume and walk through an example using it to connect Kafka to HDFS.


A Minimalist Guide to FoundationDB

I take a short look at FoundationDB and walk through a leaderboard example using Python.


"Architecting Modern Data Platforms" Book Review

I review the Hadoop-focused book "Architecting Modern Data Platforms".


1.1 Billion Taxi Rides: 108-core ClickHouse Cluster

I investigate how fast ClickHouse 18.16.1 can query 1.1 billion taxi journeys on a 3-node, 108-core AWS EC2 cluster.


Convert CSVs to ORC Faster

I compare the ORC file construction times of Spark 2.4.0, Hive 2.3.4 and Presto 0.214.


1.1 Billion Taxi Rides: Spark 2.4.0 versus Presto 0.214

I investigate how fast Spark and Presto can query 1.1 Billion Taxi Journeys using a 21-node EMR cluster.


Working with the Hadoop Distributed File System

I explore several HDFS interfaces and compare them to the JVM-based Apache Hadoop HDFS CLI.


Systems Monitoring: top vs Htop vs Glances

An examination and comparison of top, Htop and Glances, three tools for performing ad-hoc monitoring of system and application performance.


Working with Data Feeds

This tutorial covers converting Wikipedia's XML dump of its English-language site into CSV, JSON, AVRO and ORC file formats as well as analysing the data using ClickHouse.


A Minimalist Guide to Microsoft SQL Server 2017 on Ubuntu Linux

This tutorial covers importing CSV data into SQL Server 2017, automating data pipeline tasks via Apache Airflow and visualising data using Pandas and Jupyter Notebooks.


1.1 Billion Taxi Rides with SQLite, Parquet & HDFS

I investigate how fast SQLite can query 1.1 billion taxi journeys from Parquet files off of HDFS.


Customising Airflow: Beyond Boilerplate Settings

I walk through setting up Apache Airflow to use Dask.distributed, PostgreSQL and logging to AWS S3, as well as creating user accounts and plugins.


Using SQL to query Kafka, MongoDB, MySQL, PostgreSQL and Redis with Presto

A guide to connecting to five different data stores using Presto.


Python & Big Data: Airflow & Jupyter Notebook with Hadoop 3, Spark & Presto

A guide to running Airflow and Jupyter Notebook with Hadoop 3, Spark & Presto.


1.1 Billion Taxi Rides: EC2 versus EMR

I investigate how fast Spark and Presto can query 1.1 Billion Taxi Journeys using an i3.8xlarge EC2 instance with 1.7 TB of NVMe storage versus a 21-node EMR cluster.


Hadoop 3 Single-Node Install Guide

A simple Hadoop 3 installation guide for Ubuntu 16 that includes Hive, Spark and Presto.


1.1 Billion Taxi Rides with BrytlytDB 2.1 & a 5-node IBM Minsky Cluster

I investigate how fast BrytlytDB 2.1 can query 1.1 billion taxi journeys using five IBM Minsky servers with 20 Nvidia P100 GPUs.


1.1 Billion Taxi Rides with BrytlytDB 2.0 & 2 GPU-Powered p2.16xlarge EC2 Instances

I investigate how fast BrytlytDB 2.0 can query 1.1 billion taxi journeys using two p2.16xlarge AWS EC2 instances.


A Minimalist Guide to SQLite

This tutorial covers importing CSV data into SQLite 3, manipulating data via Python and visualising data using Pandas and Jupyter Notebooks.


1.1 Billion Taxi Rides with Spark 2.2 & 3 Raspberry Pi 3 Model Bs

I investigate how fast Spark 2.2 can query 1.1 billion taxi journeys using a cluster of three Raspberry Pis.


1.1 Billion Taxi Rides with BrytlytDB & 2 GPU-Powered p2.16xlarge EC2 Instances

I investigate how fast BrytlytDB can query 1.1 billion taxi journeys using two p2.16xlarge AWS EC2 instances.


Compiling MapD's Source Code

In this tutorial, I walk through building MapD from source on an Ubuntu 16.04.2 machine.


1.1 Billion Taxi Rides with MapD 3.0 & 2 GPU-Powered p2.8xlarge EC2 Instances

I investigate how fast MapD 3.0 can query 1.1 billion taxi journeys using two p2.8xlarge AWS EC2 instances.


Detecting Bots in Apache & Nginx Logs

I explore the task of bot detection in web traffic logs.


Doom Bots in TensorFlow

I walk through using TensorFlow to train AI Bots to play Doom, a classic first-person shooter.


Analysing Petabytes of Websites

I demonstrate how to extract analytical data from petabytes worth of websites collected by Common Crawl.


A Review of "Designing Data-Intensive Applications"

I review an early release of Martin Kleppmann's book "Designing Data-Intensive Applications".


1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5

I investigate how fast ClickHouse can query 1.1 billion taxi journeys on an Intel Core i5 4670K.


1.1 Billion Taxi Rides on Vertica & an Intel Core i5

I investigate how fast Vertica Community Edition 8.0.1 can query 1.1 billion taxi journeys on an Intel Core i5 4670K.


1.1 Billion Taxi Rides on AWS EMR 5.3.0 & Spark 2.1.0

I investigate how fast an 11-node Spark 2.1.0 cluster can query over a billion records.


1.1 Billion Taxi Rides on kdb+/q & 4 Xeon Phi CPUs

I investigate how fast kdb+/q can query 1.1 billion taxi journeys on 4 Intel Xeon Phi 7210 CPUs.


1.1 Billion Taxi Rides on Amazon Athena

I investigate how fast Amazon Athena can query 1.1 billion taxi journeys.


Alenka: A GPU-Driven, Open Source Database

I walk through installing, loading in data and querying Alenka.


1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal Titan Xs

I investigate how fast MapD can query 1.1 billion taxi journeys using 8 Nvidia Pascal-based Titan X cards.


TensorFlow on a GTX 1080

I walk through setting up TensorFlow, a Deep Learning Framework, on Ubuntu 16 with an Nvidia GTX 1080 and use it to build "Deep Fizz buzz".


Building a Data Pipeline with Airflow

I walk through setting up a data pipeline for currency exchange rates using Airflow, PostgreSQL and Redis.


1.1 Billion Taxi Rides with MapD & AWS EC2

I investigate how fast MapD can query 1.1 billion taxi journeys using 4 g2.8xlarge EC2 instances.


1.1 Billion Taxi Rides with MapD & 4 Nvidia Titan Xs

I investigate how fast MapD can query 1.1 billion taxi journeys using 4 Nvidia Titan X cards.


1.1 Billion Taxi Rides with MapD & 8 Nvidia Tesla K80s

I investigate how fast MapD can query 1.1 billion taxi journeys using 8 Nvidia Tesla K80 GPU cards.


1.1 Billion Taxi Rides on AWS RDS running PostgreSQL

I investigate how fast a series of graphs generated using R can be created across 4 different types of AWS RDS instances.


1.1 Billion Taxi Rides on a Large Redshift Cluster

I investigate how fast a 6-node ds2.8xlarge Redshift Cluster can query over a billion records.


All 1.1 Billion Taxi Rides on Redshift

I investigate how fast a single Redshift ds2.xlarge instance can query over a billion records.


All 1.1 Billion Taxi Rides in Elasticsearch

I look at ways of fitting every column of the 1.1 billion taxi rides into Elasticsearch on a single, 850 GB SSD.


50-node Presto Cluster on Google Cloud's Dataproc

I investigate how fast a 50-node Dataproc cluster queries the metadata of 1.1 billion taxi trips.


Performance Impact of File Sizes on Presto Query Times

I investigate the performance impact of ORC file sizes on Presto query times using Google Cloud's Dataproc service.


Faster IPv4 WHOIS Crawling

I examine the performance and reliability increases from using Redis across a 51-node IPv4 WHOIS crawling cluster.


33x Faster Queries on Google Cloud's Dataproc

I look at speeding up Presto queries on 1.1 billion records run on a 10-node Dataproc cluster.


Mass IP Address WHOIS Collection with Django & Kafka

I investigate how fast a cluster of EC2 instances can collect WHOIS records of IPv4 addresses.


A Billion Taxi Rides: AWS S3 versus HDFS

I investigate the speed differences between S3 and HDFS when querying over a billion records using Presto on AWS EMR.


A Billion Taxi Rides on Google's Dataproc running Presto

I investigate how fast a small Dataproc cluster can query over a billion records using Presto.


50-node Presto Cluster on Amazon EMR

I investigate how fast a 50-node AWS EMR cluster can query over a billion records using Presto.


A Billion Taxi Rides on Google's BigQuery

I investigate how fast BigQuery can query the metadata of 1.1 billion NYC taxi journeys.


Bulk IP Address WHOIS Collection with Python and Hadoop

I investigate how fast a 40-node Hadoop cluster on AWS EMR can collect WHOIS records of IPv4 addresses.


A Billion Taxi Rides in PostgreSQL

I look at query speeds on 1.1 billion records on a single PostgreSQL installation running on an SSD.


A Billion Taxi Rides in Elasticsearch

I investigate how fast a single instance of Elasticsearch can query over a billion records.


A Billion Taxi Rides on Amazon EMR running Spark

I investigate how fast a small AWS EMR cluster can query over a billion records using Spark.


A Billion Taxi Rides on Amazon EMR running Presto

I investigate how fast a small AWS EMR cluster can query over a billion records using Presto.


Kafka Producer Latency with Large Topic Counts

I look at the relationship between topic counts and producer latency with Kafka.


A Billion Taxi Rides in Hive & Presto

Import the metadata of over a billion Yellow and Green Taxi and Uber rides in New York City into ORC-formatted, columnar-based files on HDFS and query them using Hive & Presto.


A Billion Taxi Rides in Redshift

Import the metadata of over a billion Yellow and Green Taxi and Uber rides in New York City into a columnar-based Data Warehouse.


Presto, Parquet & Airpal

Using Airpal to execute queries on Parquet-formatted data via Presto.


A Million Songs on AWS Redshift

Parallel imports of CSV data from AWS S3 into Redshift.


Hadoop Up and Running

I explore three ways to get Hadoop installed and running.


Faster Testing with RAM Drives

Reduce the I/O overhead of running tests in Django.


Popular Airline Passenger Routes

Scraping 29K Wikipedia pages to find the most popular commercial airline passenger routes.


Recommendation Engine built using Spark and Python

An end-to-end guide to building a film recommendation engine.


Tightening Django Admin Logins

A strategy for blocking dictionary attacks and restricting access to a white list of IP addresses.


Linting UK Postcodes

Parsing and linting UK postcodes is rife with edge cases.


Passwords in Django

A review of Django auth's password storage format and password storage upgrading capabilities.


Faster Python

Six tips for speeding up Python code.


Crushing, caching and CDN deployment in Django

A strategy for crushing, caching and deploying front-end-optimised Django sites.


Better Python Package Management

Python's most popular package management tool is pip. I explore some tools to increase its functionality.


Load balancing Django

Set up a load-balanced, two-node Django cluster with a minimal Ansible footprint.


Faster Django Testing

Run Django tests concurrently with pytest-xdist.


Django exception archaeology

How to capture, monitor and analyse exceptions raised from a Django project.


Python's killer apps for blogging: Pelican and S3cmd

I look into the steps of creating a blog using Pelican and hosting it with low-cost CDN services from Amazon with the help of S3cmd.


Collecting all IPv4 WHOIS records in Python

An exploratory effort to see how hard it is to collect all IPv4's WHOIS records.


Former PHP developer

I stopped coding in PHP in 2011; here are the thoughts that led me to that decision.


File uploads to Amazon S3 in Django

How to upload files to Amazon S3 from a form in Django as well as (very important) how to test the upload process.


IP Address lookups using Python

A comparison of four methods used to find the country of an IP address.


Django speaking JSON

django-jsonview offers a method decorator which will cause all responses (including exceptions) to return in an API-friendly JSON format.


Querying Elasticsearch from Google App Engine

GAE strips HTTP body payloads if sent via HTTP GET. Elasticsearch expects request bodies sent via HTTP GET. Rewriting the HTTP verb fixes the communication problem.

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.