fake declarative data with plaitpy

plaitpy is a program that generates fake data with patterns in it. All too often, synthetic data looks like meaningless noise, which is not much help when validating statistical techniques or benchmarking database performance.

With plaitpy, one can write declarative YAML templates that specify how fields should be generated and the relationships between them. I’ve used plaitpy to generate fake browser data, website traffic and taxi-trip datasets, for example.
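To make the idea of declared fields and relationships concrete, here is a minimal Python sketch. It is not plaitpy's template format or API: the `SPEC` dict, the `generate` function and the field names are all hypothetical, chosen only to show how a later field can depend on an earlier one.

```python
import random

# Hypothetical spec (not plaitpy's format): each field maps to a function
# that receives the partially built record and a random generator.
SPEC = {
    "age": lambda rec, rng: max(18, int(rng.gauss(35, 10))),
    # is_student depends on age: only people under 25 can be students.
    "is_student": lambda rec, rng: rec["age"] < 25 and rng.random() < 0.6,
    # income depends on is_student: students are given zero income here.
    "income": lambda rec, rng: 0 if rec["is_student"]
    else max(0, int(rng.gauss(50000, 15000))),
}


def generate(spec, n, seed=0):
    """Build n records, filling fields in the order they are declared."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        rec = {}
        for name, fn in spec.items():  # dicts keep insertion order
            rec[name] = fn(rec, rng)
        records.append(rec)
    return records


if __name__ == "__main__":
    for row in generate(SPEC, 3):
        print(row)
```

Because each field only reads fields declared before it, the spec stays a flat description of the data rather than a generation script, which is the appeal of the declarative approach.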

Part of the power of plaitpy is that it can sample from CSV files and use the populations in them to generate realistic data: for example, we can use a CSV of zipcode populations to create addresses that more accurately reflect the population of a state (or the entire US).
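The underlying technique is population-weighted sampling. Here is a rough sketch of it using only the Python standard library; it does not show how plaitpy does it internally. The CSV contents, the column names and the helper functions `load_weights` and `sample_zipcodes` are assumptions made up for illustration.

```python
import csv
import io
import random

# Toy stand-in for a zipcode CSV; the zip codes and populations are invented.
ZIP_CSV = """zipcode,population
00001,90000
00002,9000
00003,1000
"""


def load_weights(csv_text):
    """Return parallel lists of zip codes and their populations."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    zips = [row["zipcode"] for row in rows]
    pops = [int(row["population"]) for row in rows]
    return zips, pops


def sample_zipcodes(zips, pops, n, seed=0):
    """Draw n zip codes, each with probability proportional to its population."""
    rng = random.Random(seed)
    return rng.choices(zips, weights=pops, k=n)


if __name__ == "__main__":
    zips, pops = load_weights(ZIP_CSV)
    sample = sample_zipcodes(zips, pops, 10)
    print(sample)
```

With weights like these, the most populous zip code dominates the sample, so generated addresses cluster where people actually live instead of being spread uniformly.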

To date, plaitpy has been one of my most successful GitHub projects, in that it has garnered 400+ stars. I don’t think that means much, though. I am more proud of its use in my own projects: I have used it to test the performance of query caching in sybil with realistic data.
