helpmeinvestigate
Blog for Help Me Investigate, a project to make it easier for journalists, bloggers, and anyone else to collaborate on investigating questions of public interest.
In addition to this blog there are sub-sites on health, welfare and education, and open source tools (code available on GitHub and there's a WordPress plugin here too). For more on the project, see this 'About' page.
May 31st, 4:52am
7 ways to get data out of PDFs
Posted by
Paul Bradshaw
A frequent obstacle in data journalism is that the information you want to analyse is locked away in a PDF. Here are six ways to tackle that problem, with space for a seventh:
1) For simple PDFs: Google Docs' conversion facility
Google Docs recently added a feature that allows you to convert a PDF to a 'Google document' when you upload it. It's pretty powerful, and about the simplest way you can extract information.
It does not work, however, if the PDF was generated by scanning - in other words if it is an image, rather than a document that has been converted to PDF.
2) For scanned documents and pulling out key players: Document Cloud
Document Cloud is a tool for journalists to convert PDFs to text. It also adds 'semantic' information along the way, identifying the organisations, people and other 'entities' (such as dates and locations) mentioned in the document, and it has some useful features that let you present documents for others to comment on.
The good news is that it works very well with scanned documents, using Optical Character Recognition (OCR). The bad news is that you need to ask permission to use it, so if you don't work as a professional journalist you may not be able to use it. Still, there's no harm in asking.
3) For scanned documents: The Data Science Toolkit
The Data Science Toolkit allows you to do lots of clever things, including converting PDFs using OCR with the File2Text converter. Upload your document, and you're away. It also works on other document formats, and on PNGs, TIFFs and JPEGs.
4) For stripping out tables: PDF2XL
If you're willing to shell out around £70 then PDF2XL is recommended as a useful piece of software for stripping tables out of PDF files and into Excel.
5) For automating the process: Scrape from PDF to XML using Scraperwiki
Scraperwiki is a collaborative website for scraping all sorts of hard-to-find information into some sort of useful format, so it's no surprise that PDFs are a common problem there. They have a template scraper for converting PDF documents to XML (a more structured format) - if you can understand a little bit of programming then you can try to adapt it to your own purposes.
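To give a rough idea of what adapting a scraper like that involves, here is a minimal Python sketch. It assumes the XML looks like the output of the pdftohtml tool run with its -xml option: one text element per fragment, with top and left attributes giving its position on the page. The sample data and the function names are made up for illustration, and this is not Scraperwiki's own template.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample, in the layout assumed to come from `pdftohtml -xml`.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="60" height="12" font="0">Council</text>
<text top="100" left="300" width="40" height="12" font="0">Spend</text>
<text top="120" left="50" width="80" height="12" font="0">Birmingham</text>
<text top="120" left="300" width="40" height="12" font="0"><b>1200</b></text>
<text top="140" left="300" width="40" height="12" font="0">950</text>
<text top="140" left="50" width="80" height="12" font="0">Cardiff</text>
</page>
</pdf2xml>"""


def xml_to_rows(xml_string):
    """Group text fragments into table rows using their page position."""
    root = ET.fromstring(xml_string)
    rows = []
    for page in root.iter("page"):
        lines = defaultdict(list)  # top coordinate -> [(left, text), ...]
        for element in page.iter("text"):
            # itertext() also picks up text nested in tags such as <b>.
            content = "".join(element.itertext()).strip()
            if content:
                top = int(element.get("top"))
                left = int(element.get("left"))
                lines[top].append((left, content))
        # Read the page top to bottom, and each line left to right.
        for top in sorted(lines):
            rows.append([text for _, text in sorted(lines[top])])
    return rows


def rows_to_csv(rows):
    """Return the rows as CSV text, ready to save or paste into a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

Real PDFs are messier than this: cells on the same row are often a pixel or two out of line, so you would usually need to round the top value or allow a small tolerance before grouping.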
6) If it's held by a public body and you have time: a well-written FOI request
Do you need all the data in the PDF or just some? Is that data available elsewhere? Try an advanced search using a phrase from the data in quotes and adding filetype:xls to see if you can find the spreadsheet it comes from. Or submit an FOI request for the data, stipulating that it be provided in spreadsheet or CSV (comma separated values) format. If the PDF was supplied in response to an FOI request in the first place, go back and ask for the information to be provided in one of those formats instead.
It's a good idea to also ask how the information is stored, including any software used, as you can check with the software vendor how easily the information can be extracted and bat away any excuses the body may come back at you with.
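If you find yourself running the advanced search described above for lots of different phrases, it can be scripted. This is only a sketch: it assumes Google as the search engine (any engine that supports the filetype: operator would do), and the function name and example phrase are made up for illustration.

```python
from urllib.parse import urlencode


def spreadsheet_search_url(phrase, filetype="xls"):
    """Build a search URL for an exact phrase, limited to one file type."""
    # Quotes ask for the exact phrase; filetype: restricts the results.
    query = f'"{phrase}" filetype:{filetype}'
    # Assumption: Google's standard search URL. Swap in another engine if you prefer.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(spreadsheet_search_url("total expenditure"))
print(spreadsheet_search_url("total expenditure", filetype="csv"))
```

Paste the printed address into your browser. Trying csv as well as xls is worthwhile, since the original data may have been published in either format.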
7) Add your own here
There must be others - tell me your own tips.
UPDATE: On Twitter: Simon Rogers uses Acrobat Pro; Kevin Anderson uses Omnipage. And Jack Schofield uses Zamzar.

Comments (4)
This means you can either copy and paste into a new spreadsheet, or it opens as a CSV.
Sometimes a line break in the table can screw this up, so I often just bring in the data first, sometimes in two or more parts - and then add the headings manually afterwards.
Some of my colleagues do it by saving the PDF as an XML file but that's beyond me.
As I always say, PDFs are the devil's format and should never, EVER, be used for data. It's fine for pictures but why organisations take a spreadsheet and put it into a PDF, I will never know.