The facilities presented in the previous sections allow generating
HTML documents, including forms, and handling the input coming from
forms. In many applications such as search tools, content analyzers,
etc., it is also desirable to be able to access documents on the
Internet. Such access is generally done through protocols such as
FTP and HTTP which are built on top of TCP/IP. In LP/CLP
systems which have TCP/IP connectivity (i.e., a sockets/ports
interface) the required protocols can be easily coded in the source
language using such facilities and DCG parsers. At present, only the
HTTP protocol is supported by PiLLoW. As with HTML code, the library
uses an internal representation of Uniform Resource Locators (URLs), and
provides predicates which translate between the internal representation
and the textual form. The facilities provided by PiLLoW for accessing
WWW documents include the following predicates:
- url_info(URL,Info) Translates a URL (the argument
URL) to an internal structure Info which details its
various components, and vice versa. For now, non-HTTP URLs make the
predicate fail. E.g.
url_info('http://www.foo.com/bar/scooby.txt',Info)
gives Info = http('www.foo.com',80,"/bar/scooby.txt"), and
url_info(URL, http('www.foo.com',2000,"/bar/scooby.txt"))
gives URL = "http://www.foo.com:2000/bar/scooby.txt" (a string).
- url_info_relative(URL,BaseInfo,Info)
Translates a relative URL (the argument URL) which appears in the HTML page
referred to by BaseInfo (given as a url_info structure)
to a complete url_info structure Info. Absolute URLs are
translated as with the previous predicate. E.g.
url_info_relative("/guu/intro.html", http('www.foo.com',80,"/bar/scoob.html"), Info)
gives Info = http('www.foo.com',80,"/guu/intro.html"), and
url_info_relative("dadu.html", http('www.foo.com',80,"/bar/scoob.html"), Info)
gives Info = http('www.foo.com',80,"/bar/dadu.html").
- url_query(Dic,Args) Translates a list of
attribute=value pairs Dic (in the same form as the
dictionary returned by get_form_input/1) to a string
Args for appending to a URL pointing to a form handler.
- fetch_url(URL,Request,Response)
Fetches a document from the Internet. URL is the Uniform
Resource Locator of the document, given as a url_info structure.
Request is a list of options which specify the parameters of the
request, and Response is a list which includes the parameters of the
response. The request parameters available are:
  - head: To specify that we are only interested in the header.
  - timeout(Time): Time specifies the maximum period of time (in seconds) to wait for a response. The predicate fails on timeout.
  - if_modified_since(Date): Get document only if newer than Date. An example of a structure that represents a date is date('Tuesday',15,'January',1985,'06:14:02').
  - user_agent(Name): Provide a user-agent field.
  - authorization(Scheme,Params): Provides an authentication field when accessing restricted sites.
  - name(Param): Any other functor translates to a field of the same name (e.g. from('user@machine')).
The parameters which can be returned in the response list include (see the
HTTP/1.0 definition for more information):
  - content(Content): Returns in Content the actual document text, as a list of characters.
  - status(Type,Code,Phrase): Gives the status of the response. Type can be any of informational, success, redirection, request_error, server_error or extension_code; Code is the status code and Phrase is a textual explanation of the status.
  - pragma(Data): Miscellaneous data.
  - message_date(Date): The time at which the message was sent.
  - location(URL): Where the document has moved to.
  - http_server(Server): Identifies the server responding.
  - allow(Methods): List of methods allowed by the server.
  - last_modified(Date): Date/time at which the sender believes the resource was last modified.
  - expires(Date): Date/time after which the entity should be considered stale.
  - content_type(Type,Subtype,Params): Returns the MIME type/subtype of the document.
  - content_encoding(Type): Encoding of the document (if any).
  - content_length(Length): Length is the size of the document, in bytes.
  - authenticate(Challenges): Request for authentication.
- html2terms(Chars,Terms) We have already
explained how this predicate transforms HTML terms to HTML format.
Used in the other direction, it can parse HTML code, for example code
retrieved by fetch_url. The resulting list of HTML terms Terms is
normalized: it contains only comment/1, declare/1,
env/3 and $/2 structures.
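Since the response is an ordinary list, its fields can be picked out with member/2 or memberchk/2. The following is a minimal sketch and is not part of PiLLoW: sample_response/1 is a hand-written, hypothetical response of the shape described above (its values are illustrative only), and response_ok/1 and document_length/2 are helper names introduced here for the example.

```prolog
:- use_module(library(lists)).   % memberchk/2 (and length/2 in some systems)

% Hypothetical response list, written by hand for illustration only.
sample_response([ status(success,200,Phrase),
                  content_type(text,html,[]),
                  content(Content) ]) :-
        atom_codes('OK', Phrase),
        atom_codes('<html></html>', Content).

% Succeeds if the response reports a successful request.
response_ok(Response) :-
        memberchk(status(success,_,_), Response).

% Size of the document: use content_length/1 if the server sent it,
% otherwise count the characters of the content, if any is present.
document_length(Response, Length) :-
        (   memberchk(content_length(L), Response)
        ->  Length = L
        ;   memberchk(content(Content), Response),
            length(Content, Length)
        ).
```

A query such as sample_response(R), response_ok(R), document_length(R,N) exercises both helpers without any network access. With a real response, document_length/2 simply fails when neither field was returned, for instance after a head request to a server that omits the length.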
For example, a simple fetch of a document can be done as follows:
url_info('http://www.foo.com',UI), fetch_url(UI,[],R), member(content(C),R).
Note that if an error occurs (the document does not exist or has
moved, for example) this will simply fail. The following call
retrieves a document if it has been modified since October 2, 1996:
fetch_url(http('www.foo.com',80,"/doc.html"),
          [if_modified_since(date('Wednesday',2,'October',1996,'00:00:00'))],R).
This last one retrieves the header of a document (with a timeout of 10
seconds) to get its last modified date:
fetch_url(http('www.foo.com',80,"/last_news.html"),[head,timeout(10)],R),
member(last_modified(Date),R).
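Since a plain fetch does not return the content of a document which has moved, one possible refinement is to follow the location field of a redirection response. The following is only a sketch under stated assumptions: fetch_following/3 is a name introduced here, PiLLoW is assumed to be already loaded, and the argument of location/1 is assumed to be a textual URL that url_info_relative/3 accepts (the text above does not specify its form).

```prolog
:- use_module(library(lists)).   % memberchk/2

% fetch_following(+URLInfo, +MaxHops, -Response)
% Hypothetical helper, not part of PiLLoW. Requires the PiLLoW
% predicates fetch_url/3 and url_info_relative/3 to be loaded.
% Follows at most MaxHops redirections, then returns whatever
% response was obtained last.
fetch_following(URLInfo, MaxHops, Response) :-
        fetch_url(URLInfo, [timeout(20)], Response0),
        (   MaxHops > 0,
            memberchk(status(redirection,_,_), Response0),
            memberchk(location(NewURL), Response0)
        ->  MaxHops1 is MaxHops - 1,
            % Assumption: NewURL is textual; absolute URLs are also
            % handled by url_info_relative/3, as described earlier.
            url_info_relative(NewURL, URLInfo, NewInfo),
            fetch_following(NewInfo, MaxHops1, Response)
        ;   Response = Response0
        ).
```

The hop limit guards against redirection loops. A redirection response without a location field is returned unchanged, so the caller can still inspect its status.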
The following is a simple application illustrating the use of
fetch_url and html2terms. The example defines
check_links(URL,BadLinks). The predicate fetches the
HTML document pointed to by URL and scours it to check for links
which produce errors when followed. The list
BadLinks contains all the bad links found, stored as compound
terms of the form: badlink(Link,Error) where Link is
the problematic link and Error is the error explanation given by
the server.
check_links(URL,BadLinks) :-
        url_info(URL,URLInfo),
        fetch_url(URLInfo,[],Response),
        member(content_type(text,html,_),Response),
        member(content(Content),Response),
        html2terms(Content,Terms),
        check_source_links(Terms,URLInfo,[],BadLinks).

% Traverse the list of HTML terms, accumulating the bad links found.
check_source_links([],_,BL,BL).
check_source_links([E|Es],BaseURL,BL0,BL) :-
        check_source_links1(E,BaseURL,BL0,BL1),
        check_source_links(Es,BaseURL,BL1,BL).

% An anchor with an href attribute is checked, any other environment
% is searched recursively, and everything else is skipped.
check_source_links1(env(a,AnchorAtts,_),BaseURL,BL0,BL) :-
        member((href=URL),AnchorAtts), !,
        check_link(URL,BaseURL,BL0,BL).
check_source_links1(env(_Name,_Atts,Env_html),BaseURL,BL0,BL) :- !,
        check_source_links(Env_html,BaseURL,BL0,BL).
check_source_links1(_,_,BL,BL).

% Links which cannot be translated to a url_info structure
% (e.g. non-HTTP URLs) are ignored by the second clause.
check_link(URL,BaseURL,BL0,BL) :-
        url_info_relative(URL,BaseURL,URLInfo), !,
        fetch_url_status(URLInfo,Status,Phrase),
        ( Status \== success ->
            name(P,Phrase),
            name(U,URL),
            BL = [badlink(U,P)|BL0]
        ; BL = BL0
        ).
check_link(_,_,BL,BL).

% If fetch_url/3 fails (e.g. on timeout), report a timeout status.
% The phrase is given as a string, since check_link/4 passes it to name/2.
fetch_url_status(URL,Status,Phrase) :-
        fetch_url(URL,[head,timeout(20)],Response), !,
        member(status(Status,_,Phrase),Response).
fetch_url_status(_,timeout,"timeout").
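The list returned by check_links/2 can then be presented to the user. The following minimal sketch prints one line per bad link. report_bad_links/1 is a name introduced here for illustration and is not part of PiLLoW. It relies only on format/2, which is built in on some systems and must be imported from a library on others.

```prolog
% report_bad_links(+BadLinks)
% Hypothetical helper: prints each badlink(Link,Error) term produced
% by check_links/2 as "Link: Error", one per line.
report_bad_links([]).
report_bad_links([badlink(Link,Error)|Rest]) :-
        format('~w: ~w~n', [Link,Error]),
        report_bad_links(Rest).
```

Assuming the program above is loaded, a call such as check_links('http://www.foo.com',BL), report_bad_links(BL) would check the page and list its problematic links.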
<herme@fi.upm.es>-< webmaster@clip.dia.fi.upm.es>
Last updated on Mon Mar 31 18:18:15 MET DST 1997