Sunday, 20 January 2013

Processing HTML with Scala as if it were XML

The controversial XML API for Scala is still useful for simple use cases. However, it falls short when dealing with HTML as found on public websites, because such markup is rarely well-formed XML.




It is quite easy, however, to overcome this limitation with the TagSoup library, which "fixes" the HTML markup so that it can be parsed like XML.

Add this dependency to your pom.xml:
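For reference, TagSoup is published on Maven Central under the coordinates below. The version shown (1.2.1) is an assumption; use whichever release suits your build.

```xml
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <version>1.2.1</version>
</dependency>
```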



This simple object can then be used to load a URL containing HTML markup as if it were XML.
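A minimal sketch of such an object, assuming the scala.xml API as shipped with Scala 2.10 (newer scala-xml releases may differ slightly). The name `HtmlAsXml` and its method names are hypothetical; the only TagSoup class used is its JAXP factory, `SAXFactoryImpl`.

```scala
import java.io.InputStream
import java.net.URL
import scala.xml.{Elem, XML}
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

// Hypothetical helper: parses HTML through TagSoup so that scala.xml sees well-formed XML.
object HtmlAsXml {
  // TagSoup's JAXP factory creates SAX parsers that tolerate malformed HTML.
  private val factory = new SAXFactoryImpl

  /** Parses an HTML stream into a scala.xml tree. */
  def parse(in: InputStream): Elem =
    XML.withSAXParser(factory.newSAXParser()).load(in)

  /** Fetches `url`, sending the optional request headers, and parses the response body. */
  def load(url: String, headers: Map[String, String] = Map.empty): Elem = {
    val connection = new URL(url).openConnection()
    headers.foreach { case (name, value) => connection.setRequestProperty(name, value) }
    val in = connection.getInputStream
    try parse(in) finally in.close()
  }
}
```

Keeping `parse` separate from `load` makes it possible to feed the parser from a file or an in-memory string as well as from the network.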



Note that it also lets you set HTTP headers on the request, for instance if you need to send a cookie or a session ID to appear logged in.
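The mechanism behind this is plain `java.net`: headers are set on the `URLConnection` before its stream is opened. A small sketch, where the object name `RequestHeaders`, the URL and the cookie value are all placeholders:

```scala
import java.net.{URL, URLConnection}

// Hypothetical helper: prepares a connection carrying custom request headers.
object RequestHeaders {
  def prepare(url: String, headers: Map[String, String]): URLConnection = {
    // openConnection() only creates the connection object; no network I/O happens yet.
    val connection = new URL(url).openConnection()
    headers.foreach { case (name, value) => connection.setRequestProperty(name, value) }
    connection
  }

  def main(args: Array[String]): Unit = {
    // Placeholder session cookie, as you would copy it from a logged-in browser session.
    val connection = prepare("http://example.com/", Map("Cookie" -> "JSESSIONID=0123456789abcdef"))
    println(connection.getURL)
  }
}
```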

Example calling code, which lists all the HTML links in the loaded page:
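A self-contained sketch of that kind of calling code. To stay runnable offline it parses an in-memory HTML fragment rather than a live page; the object name `ListLinks` and the sample markup are made up for illustration.

```scala
import java.io.StringReader
import scala.xml.XML
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

// Hypothetical example: extracts the href of every <a> element from sloppy HTML.
object ListLinks {
  def links(html: String): List[String] = {
    // TagSoup turns the HTML into SAX events that scala.xml assembles into a tree.
    val parser = new SAXFactoryImpl().newSAXParser()
    val doc = XML.withSAXParser(parser).load(new StringReader(html))
    // \\ searches all descendants; "@href" selects the attribute of each match.
    (doc \\ "a").toList.map(a => (a \ "@href").text).filter(_.nonEmpty)
  }

  def main(args: Array[String]): Unit = {
    // Deliberately malformed: no <html> or <body>, unquoted attribute, unclosed <p> and <br>.
    val html = """<p>See <a href="/first">first</a> and <a href=second.html>second</a><br><p>Unclosed"""
    links(html).foreach(println)
  }
}
```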



And as a bonus, here is an object that forces proxy settings from code (of course, you could also set them from the command line).
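One way to do this, sketched here with a hypothetical object name `ProxySettings`, is to set the standard JVM networking properties before opening any connection. The host and port in the usage line are placeholders.

```scala
// Hypothetical helper: forces HTTP and HTTPS proxy settings for the whole JVM.
object ProxySettings {
  def force(host: String, port: Int): Unit = {
    val portValue = port.toString
    // Standard java.net system properties, read when connections are opened.
    System.setProperty("http.proxyHost", host)
    System.setProperty("http.proxyPort", portValue)
    System.setProperty("https.proxyHost", host)
    System.setProperty("https.proxyPort", portValue)
  }
}
```

Calling `ProxySettings.force("proxy.example.com", 3128)` should have the same effect as starting the JVM with `-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=3128` and the matching `https.*` options.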


3 comments:

  1. How does it behave when the HTML is not formatted with pure HTML tags?

    Replies
    1. This is exactly the point of the TagSoup library! It converts the HTML, even when malformed, into SAX events, which always give you well-formed XML.

  2. Or you could use the validator.nu parser directly.
