page segmentation with xycuts

If you are familiar with competitive programming, you may recall the Prefix Sum Table - a way to quickly calculate cumulative sums in an array. My first introduction to the prefix sum table was in the recursive xycut algorithm - it’s a page segmentation technique that can chop a page up based on whitespace.

The problem that xycut solves is something like the following: we have an image that we want to OCR and we need to segment the different portions of the page into self-contained areas: for example, maybe we have a two column layout - the xycut algorithm can recognize the two columns and split them. I haven’t seen many public implementations of xycut (despite being invented in 1984), but it is not too difficult to implement. I have written code for it, but just like most other people’s implementations, it is not public.

The way xycut works is relatively simple: the page is translated to 0 (white) and 1 (black) pixels using adaptive thresholding, then a prefix sum table is built in the X and Y directions. For a given column or row to be all whitespace, the sum of all the pixels in that column or row would be 0.

We start by picking a direction (let’s say X first) and find the largest contiguous section of rows with all whitespace. We then split the page by cutting that whitespace region out in the X direction. We now have two rectangles after cutting this whitespace out. We then recurse into each rectangle and do the same procedure but in the opposite axis (the Y Axis this time). We continue to do this, switching which axis we use until we can no longer find a large enough section of whitespace to cut on.

This algorithm is cool and all, but it fails on certain cases. For example: what if there is no straight line of whitespace in page, but instead it has an L shape? The xycut algo can’t handle this and would recognize the whole page as a single rectangle.

To that end, there is a modified xycut algorithm for solving block ordering problems

okay