How to build your own online newspaper archive

A newspaper cuttings archive is a very handy thing.


While there is (rightly) much focus within online journalism upon sources and technologies whose purpose is to provide news as it breaks, what about all that content from a day, a week, a year, or a decade ago?


News is meaningless without context, and the best place to look for context to a piece of breaking news is past newspaper content on that theme or event.


Of course you could always Google it – but what about all the non-newspaper sources you’ll have to wade through to find what you’re after?


Equally, you could always Google News it – but its archives only go back 30 days, while Newsnow’s archives only go back a couple of months.


All major news organisations use newspaper cuttings databases, and the bigger providers in the field (Lexis Nexis, Dow Jones Factiva, etc.) charge vast sums for access.  That’s not just because it’s premium content, but because journalists save time and money searching multiple titles in the one place.  At a ‘citizen’ level, Google also has a News Archive search containing content which is mostly pay-per-use.


But it’s worth bearing in mind that coverage within proprietary services isn’t always universal.


I started investigating this area because the coverage of Northern Irish press within the service we use at the BBC lacks some key titles.  It’s not the provider’s fault – often there are technical and legal reasons why newspapers can’t pass on their content to proprietary aggregators.


But perversely some of these key titles are currently available free online, and from a journalistic point of view, it makes sense to be able to isolate just these sources for searching.


To start making my newspaper archive, I discovered Agent55.


It’s a simple (albeit not particularly easy on the eye) service which allows you to create your own metasearch engine – that is, a search engine which will search across several domains of your choosing (including news domains).


For directions on how to use Agent55, take a look at this video:

The results come back in a tabbed format, with each site’s results shown in its own window in their original layout. This gets round a common problem with building your own metasearch engine: inadvertently stripping out the advertising that the sites’ Terms and Conditions require you to leave alone.


The only problem was that it didn’t work with all the sites I was interested in, for reasons well beyond my ken.


And so I moved on, and discovered that both Google and Yahoo allow you to build your own customised search engines (once you’re registered) using their indexing.


But this wasn’t the end of the story. Different newspaper titles (just like all websites) are indexed to differing degrees by these two search engines. Content from some is indexed more heavily on Google, and content from others more heavily on Yahoo.


Having created a custom search in each service, I needed a means of bringing both together in one place – and so I went back to Agent55.


Both work well when combined in Agent55, and so the Northern Ireland News Archive was born!


So to summarise: take all the sites you are interested in searching and create custom searches of them within the Google and Yahoo search builders (which is child’s play). You can then use these two searches within Agent55. In this case the Google one is here, and the Yahoo one here – be warned though, you need your own web space in which to embed the Yahoo search.


The sources in question for this particular search (for those of you who are interested) are:


  • The Belfast Telegraph
  • The Coleraine Times    
  • The Portadown Times
  • Farming Life
  • U.TV
  • Belfast Media
  • Mid-Ulster Mail
  • Carrick Today
  • Irish News
  • Belfast Daily
  • The Dromore Leader
  • The Banbridge Leader
  • The Ballymoney Times
  • The Ballymena Times
  • Ulster Net
  • The Lurgan Mail
  • The Newsletter


(For a thorough list of UK newspapers by area, check ABYZ News.)
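If you would rather script something similar than use the search builders, the idea of limiting a search to a chosen set of titles can be approximated with the `site:` operator that Google accepts in ordinary queries. What follows is only a rough sketch: the domains are placeholders rather than the real addresses of the titles above, and `build_site_query` is a made-up helper name.

```python
from urllib.parse import quote_plus

# Placeholder domains: swap in the real addresses of the titles you want to search.
DOMAINS = [
    "news.example.com",
    "times.example.org",
    "mail.example.net",
]


def build_site_query(terms, domains):
    """Return a query string that limits `terms` to the given domains."""
    if not domains:
        raise ValueError("at least one domain is needed")
    # Join the per-site restrictions with OR so a hit on any one title counts.
    site_clause = " OR ".join(f"site:{d}" for d in domains)
    return f"{terms} ({site_clause})"


query = build_site_query("flood defences", DOMAINS)
print(query)
# URL-encoded form, ready to use as the value of a search engine's query parameter.
print(quote_plus(query))
```

A long list of titles may well run into a search engine’s query length limits, which is one reason the hosted custom search builders are the easier route.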


It’s worth pointing out that you can also include just those sections of major news sources which deal exclusively with a geographical (or subject-based) area. So if you’re only interested in Northern Ireland stories from The Guardian, you can put that section’s domain (or folder) in your custom search rather than the whole site.
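A folder restriction of this kind boils down to a prefix match on the address of each page. As a minimal sketch, assuming made-up addresses (these are not The Guardian’s or anyone else’s real URLs) and a hypothetical helper called `is_in_scope`, the check might look like this:

```python
from urllib.parse import urlparse

# Hypothetical patterns: one whole domain, and one folder within a larger site.
ALLOWED = [
    "localpaper.example.com",
    "bignews.example.com/northern-ireland/",
]


def is_in_scope(url, patterns):
    """True if `url` sits on an allowed domain or under an allowed folder."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path or "/"
    for pattern in patterns:
        domain, _, folder = pattern.partition("/")
        if host != domain.lower():
            continue
        # A bare domain matches every page; a folder matches only pages beneath it.
        if not folder or path.startswith("/" + folder):
            return True
    return False


print(is_in_scope("https://bignews.example.com/northern-ireland/story-1", ALLOWED))
print(is_in_scope("https://bignews.example.com/sport/story-2", ALLOWED))
```

The first address falls under the allowed folder and the second does not, so only the first would be searched.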




Now – Instructions:


To access the search engine I made, click on this link to Agent55, and then click on the Shared Categories link at the top (right) of the page, where you will find a Northern Ireland News archive option (June 25th 2008). Click on Import!, and you’ll see a tab for this search at the top of your personalised page.


Just run a search, and you’ll find the results returned in two panes on the results page – the Yahoo one on top, and you have to scroll down for the Google one.
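Agent55 keeps the two sets of results in separate panes. If you ever pull results from two engines yourself and want a single list instead, one possible approach (not what Agent55 does) is to interleave the two rankings and drop repeats. The sketch below assumes each engine has already given you a ranked list of URLs; `normalise` and `merge_results` are hypothetical names, and the addresses are placeholders.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to host + path so trivially different links compare equal."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(first, second):
    """Interleave two ranked lists of URLs, dropping repeats."""
    seen = set()
    merged = []
    # Take one result from each list in turn so neither engine dominates.
    for pair in zip_longest(first, second):
        for url in pair:
            if url is None:
                continue
            key = normalise(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


yahoo_hits = ["http://a.example.com/1", "http://b.example.com/2"]
google_hits = ["http://www.a.example.com/1/", "http://c.example.com/3"]
print(merge_results(yahoo_hits, google_hits))
```

Here the first Google hit is treated as a repeat of the first Yahoo hit, because the two differ only by a `www.` prefix and a trailing slash.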


If there’s any take-up with this (we are running an in-house equivalent in conjunction with what I’ve blogged here), then this process will be rolled out to other areas that have specific research needs which aren’t always reflected in what we provide at the moment.


So basically, I’ll be developing more engines for other areas (say Wales, or Education, or Religion, and so on) and posting them up here on the blog.


Any feedback (or at least any feedback beyond ‘it’s shit’) is very welcome.

