Well... I'm not a specialist like many others here, and I'm not sure I fully understood your problem, but I think what you need is a crawler to connect to the webpages and walk through their structure.
Once connected to a webpage, you will need an HTML parser (or a parser specific to the format you're facing) to clean out the tags; then, with clean and readable text, you can extract the information you need.
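For illustration only, here is a minimal Python sketch of that tag-cleaning step (not the actual parser I mentioned), using just the standard library's html.parser; a robust library would cope with broken markup much better:

```python
# Minimal sketch: strip HTML tags to get clean, readable text.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []   # collected text fragments
        self._skip = 0     # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def clean_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```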
Parsing isn't an easy task. I wrote one almost two years ago and it still isn't 100% perfect. New tags and formats keep coming (cleaning up LaTeX formulas, for example, is a very hard task).
I can't give you the crawler/parser because it belongs to the university I work for, but it works like this:
1) Connect to a given URL.
2) Make sure the extension is a known one (.html, .xhtml, .jsp, .asp, .xml, .php, etc.).
3) Find the links (a href, etc.) and put them in a queue. If a link is relative, convert it to an absolute link.
4) Parse the page at that URL.
5) Store the information you need and do the appropriate calculations.
6) Go back to step 1 using the next link in the queue.
So... With this methodology, you need to give the algorithm an initial webpage, which we call the "seed".
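To make the loop concrete, here is a hedged Python sketch of that crawl cycle (again, not the university code): start from a seed, keep a queue of links, fetch each page, collect its absolute links, and hand the text to your own extraction step. The extract_information call is a hypothetical placeholder for step 5.

```python
# Sketch of the crawl loop: seed -> fetch -> collect links -> repeat.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

KNOWN_EXTENSIONS = (".html", ".xhtml", ".jsp", ".asp", ".xml", ".php")

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links become absolute against the current page.
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed, max_pages=100):
    queue = deque([seed])   # the queue of links still to visit
    seen = {seed}
    while queue and max_pages > 0:
        url = queue.popleft()
        last = urlparse(url).path.rsplit("/", 1)[-1].lower()
        if "." in last and not last.endswith(KNOWN_EXTENSIONS):
            continue        # skip unknown file types (images, PDFs, ...)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue        # unreachable page, move on
        collector = LinkCollector(url)
        collector.feed(html)
        # extract_information(html)  # step 5: your own parsing/calculations
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        max_pages -= 1

# crawl("https://example.com/")    # the seed page
```

In practice you would also want politeness delays, robots.txt handling, and deduplication by canonical URL, but the queue-plus-seen-set structure above is the core of the idea.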