Well... I'm not a specialist like many others here, and I'm not sure I fully understood your problem, but I think what you need is a crawler to connect to the webpages and walk through their structure.
Once connected to a webpage, you will need an HTML parser (or a parser specific to the format you're facing) to clean out the tags; then, with clean and readable text, you can extract the information you need.
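For illustration only, here is a minimal Python sketch of that tag-cleaning step (not the actual parser I mentioned), using just the standard library's html.parser; a robust library would cope with broken markup much better:

```python
# Minimal sketch: strip HTML tags to get clean, readable text.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []   # collected text fragments
        self._skip = 0     # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def clean_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```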
Parsing isn't an easy task. I wrote one almost two years ago and it still isn't 100% perfect. New tags and formats keep coming (cleaning up LaTeX formulas, for example, is a very hard task).
I can't give you the crawler/parser because it belongs to the university I work for, but it works like this:
1) Connect to a given URL.
2) Make sure the extension is a known one (.html, .xhtml, .jsp, .asp, .xml, .php, etc.).
3) Find the links (a href, etc.) and put them in a queue. If a link is relative, convert it to an absolute link.
4) Parse the page at that URL.
5) Store the information you need and do the appropriate calculations.
6) Go back to step 1 using the next link in the queue.
So... With this methodology, you need to give the algorithm an initial webpage, which we call the "seed".
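To make the loop concrete, here is a hedged Python sketch of that crawl cycle (again, not the university code): start from a seed, keep a queue of links, fetch each page, collect its absolute links, and hand the text to your own extraction step. The extract_information call is a hypothetical placeholder for step 5.

```python
# Sketch of the crawl loop: seed -> fetch -> collect links -> repeat.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

KNOWN_EXTENSIONS = (".html", ".xhtml", ".jsp", ".asp", ".xml", ".php")

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links become absolute against the current page.
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed, max_pages=100):
    queue = deque([seed])   # the queue of links still to visit
    seen = {seed}
    while queue and max_pages > 0:
        url = queue.popleft()
        last = urlparse(url).path.rsplit("/", 1)[-1].lower()
        if "." in last and not last.endswith(KNOWN_EXTENSIONS):
            continue        # skip unknown file types (images, PDFs, ...)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue        # unreachable page, move on
        collector = LinkCollector(url)
        collector.feed(html)
        # extract_information(html)  # step 5: your own parsing/calculations
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        max_pages -= 1

# crawl("https://example.com/")    # the seed page
```

In practice you would also want politeness delays, robots.txt handling, and deduplication by canonical URL, but the queue-plus-seen-set structure above is the core of the idea.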