
Accessing WWW documents

The facilities presented in the previous sections allow generating HTML documents, including forms, and handling the input coming from forms. In many applications, such as search tools and content analyzers, it is also desirable to be able to access documents on the Internet. Such access is generally done through protocols such as FTP and HTTP, which are built on top of TCP/IP. In LP/CLP systems which have TCP/IP connectivity (i.e., a sockets/ports interface), the required protocols can be easily coded in the source language using such facilities and DCG parsers. At present, PiLLoW supports only HTTP.

As with HTML code, the library uses an internal representation of Uniform Resource Locators (URLs), and provides predicates which translate between the internal representation and the textual form. The facilities provided by PiLLoW for accessing WWW documents include the predicates url_info, url_info_relative, and fetch_url, all of which are used in the examples in this section.

For example, a simple fetch of a document can be done as follows:

    url_info('http://www.foo.com',UI), fetch_url(UI,[],R), member(content(C),R).

Note that if an error occurs (for example, the document does not exist or has moved), this goal simply fails. The following call retrieves a document only if it has been modified since October 2, 1996:

    fetch_url(http('www.foo.com',80,"/doc.html"),
        [if_modified_since('Wednesday',2,'October',1996,'00:00:00')],R).

This last call retrieves only the header of a document (with a timeout of 10 seconds) to get its last modification date:

    fetch_url(http('www.foo.com',80,"/last_news.html"),[head,timeout(10)],R),
    member(last_modified(Date),R).
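
The calls above use internal URL terms of the shape http(Host,Port,Document). As a rough illustration of how a textual URL might be translated into such a term with a DCG, consider the sketch below. It is not PiLLoW's own implementation: the predicate parse_http_url/2 and all nonterminals are hypothetical names introduced here, only the plain http://host[:port][/document] form is handled, and it is assumed that a missing port defaults to 80 and a missing document to "/".

```prolog
% Sketch only: maps the text of an HTTP URL to http(Host,Port,Document).
% All names below are hypothetical and are not part of PiLLoW.
:- set_prolog_flag(double_quotes, codes).

% parse_http_url(+Atom, -Term): Atom is e.g. 'http://www.foo.com/doc.html'.
% Host is returned as an atom, Document as a list of character codes.
parse_http_url(Atom, http(Host,Port,Doc)) :-
        atom_codes(Atom, Codes),
        phrase(http_url(HostCodes,Port,Doc), Codes), !,
        atom_codes(Host, HostCodes).

http_url(Host,Port,Doc) -->
        "http://", host(Host), port(Port), document(Doc).

% Host: one or more characters other than ':' and '/'.
host([C|Cs]) --> host_char(C), host_rest(Cs).

host_rest([C|Cs]) --> host_char(C), !, host_rest(Cs).
host_rest([]) --> [].

host_char(C) --> [C], { C =\= 0': , C =\= 0'/ }.

% Port: an explicit ":digits" part, or (assumed default) 80.
port(Port) --> ":", digits(Ds), !, { number_codes(Port, Ds) }.
port(80) --> [].

digits([D|Ds]) --> digit(D), digits_rest(Ds).

digits_rest([D|Ds]) --> digit(D), !, digits_rest(Ds).
digits_rest([]) --> [].

digit(D) --> [D], { D >= 0'0, D =< 0'9 }.

% Document: everything from the first '/' on, or "/" if nothing follows.
document([0'/|Cs]) --> "/", !, rest(Cs).
document("/") --> [].

rest([C|Cs]) --> [C], !, rest(Cs).
rest([]) --> [].
```

A query such as `parse_http_url('http://www.foo.com:8080/a/b.html', T)` binds T to a term with host 'www.foo.com', port 8080, and the document path as a code list. The reverse translation (term to text) would need a separate set of rules in this sketch.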

The following is a simple application illustrating the use of fetch_url and html2terms. The example defines check_links(URL,BadLinks). The predicate fetches the HTML document pointed to by URL and scans it for links which produce errors when followed. The list BadLinks contains all the bad links found, stored as compound terms of the form badlink(Link,Error), where Link is the problematic link and Error is the error explanation given by the server.

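% Fetch URL, require an HTML document, parse it, and check its links.
check_links(URL,BadLinks) :-
        url_info(URL,URLInfo),
        fetch_url(URLInfo,[],Response),
        member(content_type(text,html,_),Response),
        member(content(Content),Response),
        html2terms(Content,Terms),
        check_source_links(Terms,URLInfo,[],BadLinks).

% Traverse a list of HTML terms, accumulating bad links (BL0 to BL).
check_source_links([],_,BL,BL).
check_source_links([E|Es],BaseURL,BL0,BL) :-
        check_source_links1(E,BaseURL,BL0,BL1),
        check_source_links(Es,BaseURL,BL1,BL).

% An anchor with an href attribute: check the link it points to.
check_source_links1(env(a,AnchorAtts,_),BaseURL,BL0,BL) :-
        member((href=URL),AnchorAtts), !,
        check_link(URL,BaseURL,BL0,BL).
% Any other environment: descend into its contents.
check_source_links1(env(_Name,_Atts,Env_html),BaseURL,BL0,BL) :- !,
        check_source_links(Env_html,BaseURL,BL0,BL).
% Anything else contains no links.
check_source_links1(_,_,BL,BL).

% Resolve URL against BaseURL and record it if its status is not success.
check_link(URL,BaseURL,BL0,BL) :-
        url_info_relative(URL,BaseURL,URLInfo), !,
        fetch_url_status(URLInfo,Status,Phrase),
        ( Status \== success ->
          name(P,Phrase),
          name(U,URL),
          BL = [badlink(U,P)|BL0]
        ; BL = BL0
        ).
% Links which cannot be resolved are skipped.
check_link(_,_,BL,BL).

% Request only the header (20-second timeout) and extract the status.
fetch_url_status(URL,Status,Phrase) :-
        fetch_url(URL,[head,timeout(20)],Response), !,
        member(status(Status,_,Phrase),Response).
% If the fetch itself fails, report the link as timed out.
fetch_url_status(_,timeout,timeout).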
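
The badlink(Link,Error) terms collected by check_links can be processed like any other Prolog list. As a small sketch of one possible use, the helper below prints one line per bad link. The predicate report_bad_links/1 is a hypothetical name introduced here for illustration and is not part of PiLLoW; it assumes that Link and Error are atoms, as constructed by check_link above.

```prolog
% Sketch only: report_bad_links/1 is a hypothetical helper, not part
% of PiLLoW. It prints each badlink(Link,Error) term as "Link: Error".
report_bad_links([]).
report_bad_links([badlink(Link,Error)|BLs]) :-
        format('~w: ~w~n', [Link,Error]),
        report_bad_links(BLs).
```

With the application above loaded, a goal such as `check_links('http://www.foo.com/index.html',BL), report_bad_links(BL)` would then fetch the page and list whatever bad links were found.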



<herme@fi.upm.es>-< webmaster@clip.dia.fi.upm.es>
Last updated on Mon Mar 31 18:18:15 MET DST 1997