Hybrid Page Layout Analysis via Tab-Stop Detection
Frequently Asked Questions (14)
Q2. What are the future works mentioned in the paper "Hybrid page layout analysis via tab-stop detection" ?
The algorithm described has no table detection or analysis, but the tab-stops make particularly useful features for both, so table analysis will be added in the future.
Q3. What is the purpose of the polygon edges?
The polygon edges are chosen to minimize the number of vertices, while satisfying the constraint that all CPs are contained within their region polygon, and no CP from another region intersects.
Q4. What is the way to test page layout analysis?
Combining the top-down concept of column structure with bottom-up classification methods enables page layout analysis to easily handle the complex nonrectangular layouts of modern magazine pages without losing sight of the “bigger picture” that often happens when bottom-up methods are used alone.
Q5. What is the stroke width of a CC?
On “stressed” fonts, the stroke width is greater on vertical lines than on horizontal lines, so stroke width is calculated separately in both directions.
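The answer above does not say how the stroke width is measured. As a rough illustration only, the sketch below assumes a CC is given as a binary bitmap and estimates each direction as the median length of foreground pixel runs. The function names and the run-length approach are assumptions, not the paper's method.

```python
from statistics import median


def run_lengths(line):
    """Lengths of consecutive runs of foreground (truthy) pixels in one row or column."""
    runs, count = [], 0
    for pixel in line:
        if pixel:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def directional_stroke_widths(bitmap):
    """Return (horizontal, vertical) stroke width estimates for a binary CC bitmap.

    horizontal: median run length along rows (thickness of vertical strokes).
    vertical:   median run length along columns (thickness of horizontal strokes).
    Hypothetical helper: the median-of-runs estimate is an assumption.
    """
    row_runs = [r for row in bitmap for r in run_lengths(row)]
    col_runs = [r for col in zip(*bitmap) for r in run_lengths(col)]
    horizontal = median(row_runs) if row_runs else 0
    vertical = median(col_runs) if col_runs else 0
    return horizontal, vertical


# An "L" shape with a 3-pixel-thick stem and a 1-pixel-thick foot.
letter_l = [[1, 1, 1, 0, 0, 0, 0] for _ in range(7)] + [[1, 1, 1, 1, 1, 1, 1]]
h_width, v_width = directional_stroke_widths(letter_l)
```

For this shape the horizontal estimate (across the stem) comes out larger than the vertical one (across the foot), which is the asymmetry the answer describes for stressed fonts.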
Q6. How do you make a list of CPs?
After the initial candidates are made, they are improved by adding new CPs and widening existing CPs, using the edge of a CP in a different CPSet, as long as the widening does not cause CPs to overlap.
Q7. How do you make tab stops end at the same coordinate?
The final step attempts to make connected tab lines end at the same y coordinate, by allowing the ends to move between the last member CC whose edge was used for the tab line, and the first nonmember CC that the line intersects.
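One way to picture this step: each tab-line end has an allowed y interval, bounded by the last member CC and the first nonmember CC, and connected ends can share any y that lies in every interval. The sketch below is a minimal illustration under that reading. The function name and the choice of the midpoint of the overlap are assumptions.

```python
def align_ends(ranges):
    """Pick one y coordinate shared by all connected tab-line ends.

    ranges: list of (lo, hi) intervals, one per end, where lo and hi are the
    y limits between the last member CC and the first nonmember CC.
    Returns the midpoint of the common overlap, or None if the intervals
    do not all overlap. The midpoint choice is an assumption.
    """
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    if lo > hi:
        return None
    return (lo + hi) / 2


shared_y = align_ends([(100, 140), (120, 160)])  # overlap is [120, 140]
no_overlap = align_ends([(0, 10), (20, 30)])     # ends cannot be aligned
```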
Q8. What is the process of finding tab-stop lines?
The process of finding tab-stop line segments has several major sub-steps: candidate tab-stop CCs that look like they may be at the edge of a text region are found and then grouped into tab-stop lines, then connections between tab-stop lines are found, enabling removal of false positives.
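The grouping sub-step can be sketched as clustering candidate edges that share nearly the same x coordinate. This is a simplified illustration, not the paper's procedure: the tolerance, the minimum member count, and the use of that count to discard weak groups (in place of the connection-based removal of false positives described above) are all assumptions.

```python
def group_candidates(candidates, x_tolerance=3, min_members=3):
    """Group candidate tab-stop edges into vertical line segments.

    candidates: list of (x, y) edge coordinates of candidate tab-stop CCs.
    Candidates are sorted by x and chained together while neighbouring x
    values differ by at most x_tolerance. Groups with fewer than
    min_members are dropped as likely false positives (an assumption).
    """
    groups, current = [], []
    for x, y in sorted(candidates):
        if current and x - current[-1][0] > x_tolerance:
            if len(current) >= min_members:
                groups.append(current)
            current = []
        current.append((x, y))
    if len(current) >= min_members:
        groups.append(current)
    return [
        {
            "x": sum(p[0] for p in g) / len(g),
            "y_top": min(p[1] for p in g),
            "y_bottom": max(p[1] for p in g),
        }
        for g in groups
    ]


edges = [(50, 10), (51, 30), (49, 50), (200, 20),
         (120, 15), (121, 35), (119, 55), (122, 75)]
tab_lines = group_candidates(edges)
```

Here the lone candidate at x = 200 is discarded, and the remaining candidates form two segments near x = 50 and x = 120.5.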
Q9. What is the size of the list of registered partners?
The size of the list of registered partners is forced to become zero or one for each of upper and lower, using the following rules in order: 1. Type.
Q10. What are the disadvantages of top-down methods?
Although top-down methods have the advantage that they start by looking at the largest structures on the page, they are unable to handle the variety of formats that occur in many magazine pages, such as non-rectangular regions and cross-column headings that blend seamlessly into the columns below.
Q11. What is the main purpose of the paper?
This paper does not address logical layout analysis, which detects headers, footers, body text, numbered lists, and segmentation into articles.
Q12. What is the definition of a CPset?
A good CP either touches a tab line on both vertical edges of its bounding box, or its width is close to a frequently occurring width.
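The "good CP" test above can be written as a small predicate. The sketch below assumes a CP is reduced to the left and right x coordinates of its bounding box, that tab lines are given by their x positions, and that "touches" and "close to" are judged with tolerances. The function name, data layout, and tolerance values are hypothetical.

```python
def is_good_cp(cp, tab_xs, common_widths, edge_tol=2, width_tol=0.1):
    """Return True if a CP counts as good under the rule quoted above.

    cp: (left, right) x coordinates of the CP bounding box.
    tab_xs: x positions of detected tab lines.
    common_widths: frequently occurring CP widths on the page.
    edge_tol (pixels) and width_tol (relative) are assumed thresholds.
    """
    left, right = cp
    touches_left = any(abs(left - x) <= edge_tol for x in tab_xs)
    touches_right = any(abs(right - x) <= edge_tol for x in tab_xs)
    if touches_left and touches_right:
        return True
    width = right - left
    return any(abs(width - w) <= width_tol * w for w in common_widths)


tabs = [100, 400]
both_edges = is_good_cp((100, 400), tabs, [])        # touches tab lines on both sides
by_width = is_good_cp((100, 390), tabs, [300])       # width 290 is close to 300
neither = is_good_cp((100, 250), tabs, [300])        # one edge only, unusual width
```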
Q13. What is the way to explain headings in a column?
This allows headings that merge columns in B to be explained by A. A list of column candidates is made from the set of CPsets on the page, ordered best first, and with duplicates eliminated by the A explains B rules above.
Q14. What is the key advantage of the recursive top-down methods?
This solves some of the flaws in the recursive top-down methods, by finding gaps between columns by a bottom-up analysis of the gaps, looking explicitly for white rectangles.