
Accessing WWW documents

The facilities presented in the previous sections allow generating HTML documents, including forms, and handling the input coming from forms. In many applications such as search tools, content analyzers, etc., it is also desirable to be able to access documents on the Internet. Such access is generally done through protocols such as FTP and HTTP which are built on top of TCP/IP. In LP/CLP systems which have TCP/IP connectivity (i.e., a sockets/ports interface) the required protocols can be easily coded in the source language using such facilities and DCG parsers. At present, only the HTTP protocol is supported by PiLLoW. As with HTML code, the library uses an internal representation of Uniform Resource Locators (URLs), and provides predicates which translate between the internal representation and the textual form. The facilities provided by PiLLoW for accessing WWW documents include the following predicates:

url_info(URL,Info)
Translates a URL URL into an internal structure Info which details its various components, and vice versa. For now, non-HTTP URLs make the predicate fail. E.g.
url_info('http://www.foo.com/bar/scooby.txt',Info)
gives Info = http('www.foo.com',80,"/bar/scooby.txt"),
url_info(URL, http('www.foo.com',2000,"/bar/scooby.txt"))
gives URL = "http://www.foo.com:2000/bar/scooby.txt" (a string).
url_info_relative(URL,BaseInfo,Info)
Translates a relative URL URL which appears in the HTML page referred to by BaseInfo (given as a url_info structure) into a complete url_info structure Info. Absolute URLs are translated as with the previous predicate. E.g.
url_info_relative("/guu/intro.html", http('www.foo.com',80,"/bar/scoob.html"), Info)
gives Info = http('www.foo.com',80,"/guu/intro.html"),
url_info_relative("dadu.html", http('www.foo.com',80,"/bar/scoob.html"), Info)
gives Info = http('www.foo.com',80,"/bar/dadu.html").
url_query(Dic,Args)
Translates a list of attribute=value pairs Dic (in the same form as the dictionary returned by get_form_input/1) into a string Args for appending to a URL pointing to a form handler.
fetch_url(URL,Request,Response)
Fetches a document from the Internet. URL is the Uniform Resource Locator of the document, given as a url_info structure. Request is a list of options which specify the parameters of the request; Response is a list which includes the parameters of the response. The request parameters available are:
head
Specifies that only the header of the document is requested.
timeout(Time)
Time specifies the maximum period of time (in seconds) to wait for a response. The predicate fails on timeout.
if_modified_since(Date)
Get document only if newer than Date. An example of a structure that represents a date is date('Tuesday',15,'January',1985,'06:14:02').
user_agent(Name)
Provide a user-agent field.
authorization(Scheme,Params)
Provides an authentication field when accessing restricted sites.
name(Param)
Any other functor translates to a field of the same name (e.g. from('user@machine')).

The parameters which can be returned in the response list include (see the HTTP/1.0 definition for more information):
content(Content)
Returns in Content the actual document text, as a list of characters.

status(Type,Code,Phrase)
Gives the status of the response. Type can be any of informational, success, redirection, request_error, server_error or extension_code, Code is the status code and Phrase is a textual explanation of the status.
pragma(Data)
Miscellaneous data.
message_date(Date)
The time at which the message was sent.
location(URL)
The location to which the document has moved.
http_server(Server)
Identifies the server responding.
allow(Methods)
List of methods allowed by the server.

last_modified(Date)
Date/time at which the sender believes the resource was last modified.

expires(Date)
Date/time after which the entity should be considered stale.

content_type(Type,Subtype,Params)
Returns the MIME type/subtype of the document.
content_encoding(Type)
Encoding of the document (if any).

content_length(Length)
Length is the size of the document, in bytes.

authenticate(Challenges)
Request for authentication.
html2terms(Chars,Terms)
We have already explained how this predicate transforms HTML terms into HTML text. Used in the other direction, it can parse HTML code, for example code retrieved by fetch_url. The resulting list of HTML terms Terms is normalized: it contains only comment/1, declare/1, env/3 and $/2 structures.
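
As a brief illustration of url_query/2 as described above, the following is a minimal sketch that builds the textual URL of a form handler request. The helper handler_url/3 is hypothetical and not part of PiLLoW. Since the description says only that Args is meant "for appending to a URL", the sketch appends it unchanged; whether Args already carries the leading "?" separator is not stated here and should be checked by inspecting the result.

```prolog
:- use_module(library(lists)).   % append/3

% Hypothetical helper, not part of PiLLoW.
% Base is the URL of a form handler, given as a string (list of characters);
% Dic is a list of attribute=value pairs, as accepted by url_query/2.
% URL is Base followed by the query string produced by url_query/2.
handler_url(Base, Dic, URL) :-
        url_query(Dic, Args),
        append(Base, Args, URL).
```

A call could look like handler_url("http://www.foo.com/cgi-bin/search", [keyword=prolog], URL), where both the handler path and the attribute name are made up for the example.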

For example, a simple fetch of a document can be done as follows:

    url_info('http://www.foo.com',UI), fetch_url(UI,[],R), member(content(C),R).

Note that if an error occurs (the document does not exist or has moved, for example) this will simply fail. The following call retrieves a document if it has been modified since October 2, 1996:

    fetch_url(http('www.foo.com',80,"/doc.html"),
        [if_modified_since(date('Wednesday',2,'October',1996,'00:00:00'))],R).

This last one retrieves the header of a document (with a timeout of 10 seconds) to get its last modified date:

    fetch_url(http('www.foo.com',80,"/last_news.html"),[head,timeout(10)],R),
    member(last_modified(Date),R).
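
The first example above fails without giving a reason when no document comes back. The following is a minimal sketch of how the status/3 element of the response could be inspected instead. The predicates classify_response/2 and fetch_classified/2, the result terms ok/1 and no_document/2, and the 30-second timeout are all assumptions made for illustration; only url_info/2, fetch_url/3 and the response elements status/3 and content/1 come from the library as described above.

```prolog
:- use_module(library(lists)).   % member/2

% Hypothetical helper, not part of PiLLoW.
% Classifies a fetch_url/3 response list:
%   ok(Content)            - status type is success and a document is present
%   no_document(Type,Code) - anything else (error, redirection, head-only reply)
classify_response(Response, Result) :-
        member(status(Type,Code,_Phrase), Response), !,
        (   Type == success,
            member(content(Content), Response)
        ->  Result = ok(Content)
        ;   Result = no_document(Type,Code)
        ).

% Fetch URL (an atom, as in the url_info/2 example) and classify the reply.
% fetch_url/3 fails on timeout, so this predicate fails in that case too.
fetch_classified(URL, Result) :-
        url_info(URL, Info),
        fetch_url(Info, [timeout(30)], Response),
        classify_response(Response, Result).
```

Keeping classify_response/2 separate from the network call means it can be tried on a hand-written response list, such as [status(request_error,404,"Not Found")], without any connection.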

The following is a simple application illustrating the use of fetch_url and html2terms. The example defines check_links(URL,BadLinks). The predicate fetches the HTML document pointed to by URL and scours it to check for links which produce errors when followed. The list BadLinks contains all the bad links found, stored as compound terms of the form: badlink(Link,Error) where Link is the problematic link and Error is the error explanation given by the server.

check_links(URL,BadLinks) :-
        url_info(URL,URLInfo),
        fetch_url(URLInfo,[],Response),
        member(content_type(text,html,_),Response),
        member(content(Content),Response),
        html2terms(Content,Terms),
        check_source_links(Terms,URLInfo,[],BadLinks).

check_source_links([],_,BL,BL).
check_source_links([E|Es],BaseURL,BL0,BL) :-
        check_source_links1(E,BaseURL,BL0,BL1),
        check_source_links(Es,BaseURL,BL1,BL).

check_source_links1(env(a,AnchorAtts,_),BaseURL,BL0,BL) :-
        member((href=URL),AnchorAtts), !,
        check_link(URL,BaseURL,BL0,BL).
check_source_links1(env(_Name,_Atts,Env_html),BaseURL,BL0,BL) :- !,
        check_source_links(Env_html,BaseURL,BL0,BL).
check_source_links1(_,_,BL,BL).

check_link(URL,BaseURL,BL0,BL) :-
        url_info_relative(URL,BaseURL,URLInfo), !,
        fetch_url_status(URLInfo,Status,Phrase),
        ( Status \== success ->
          name(P,Phrase),
          name(U,URL),
          BL = [badlink(U,P)|BL0]
        ; BL = BL0
        ).
check_link(_,_,BL,BL).

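% Request only the header of URL and return the status type and phrase.
fetch_url_status(URL,Status,Phrase) :-
        fetch_url(URL,[head,timeout(20)],Response), !,
        member(status(Status,_,Phrase),Response).
% fetch_url/3 failed (e.g. on timeout): report it as a bad link.
% The phrase is given as a string because check_link/4 passes it to name/2.
fetch_url_status(_,timeout,"timeout").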
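
The program above could be used, for instance, to print a short report. The sketch below assumes the definition of check_links/2 given above is loaded; print_bad_links/1 and report_bad_links/1 are hypothetical names introduced only for this illustration.

```prolog
% Hypothetical helper: print one line per badlink(Link,Error) term.
print_bad_links([]).
print_bad_links([badlink(Link,Error)|Rest]) :-
        write(Link), write(': '), write(Error), nl,
        print_bad_links(Rest).

% Check the page at URL (an atom, as in the url_info/2 example)
% and print every bad link found in it.
report_bad_links(URL) :-
        check_links(URL, BadLinks),
        print_bad_links(BadLinks).
```

Note that report_bad_links/1 fails whenever check_links/2 does, for example when the page itself cannot be fetched or is not of type text/html.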


<herme@fi.upm.es>-< webmaster@clip.dia.fi.upm.es>
Last updated on Mon Mar 31 18:18:15 MET DST 1997