Greytree

TamWiki

For a mouse who is a packrat

Technology » Getting The Head Of A Web Page
a one line description

Summary:this is what goes at the top of the site

(redirected from Main.GettingTheHeadOfAWebPage)

Sometimes, all you want to retrieve is just the headers from a web page. This can be done in a number of ways depending on what language you are using. Here, I will show how one can do it in PHP and Perl.

PHP methods

HTTP.php

Installation

First, you need to install the HTTP.php module. This is done with the Pear? utility:

$ sudo pear install HTTP.php

Using HTTP.php

Once that is installed, you need to include it in your script:

require_once('HTTP.php');

Then, you can call the head method with the URL you are seeking information from:

$headers = HTTP::head($uri);

You can check if there are valid headers by checking to see what type of object $headers is:

if (get_class($headers) == 'PEAR_Error') {
    $errors[] = "Invalid URI.";
    return NULL;
  }

From then on, you can discover more about the headers that got returned with the call.

if ($headers['response_code'] >= 400) {
      # process error code
}

PHP cURL Library

CURL is a library that handles fetching of web items that is built in to PHP. It has many many options and can also be used to fetch just the head of the page. It's not quite as convenient as using the HTTP.php pear extension, but if you can't use that, CURL is still available.

From the basic curl example page on php.net:

The basic idea behind the cURL functions is that you initialize a cURL session using the curl_init(), then you can set all your options for the transfer via the curl_setopt(), then you can execute the session with the curl_exec() and then you finish off your session using the curl_close().

The option we're most interested in here is the CURLOPT_HEADER option, which should be set to 1 to retrieve the header.

  1. $ch = curl_init("http://www.tamaratemple.com/"); // sets the URL to retrieve
  2. $fp = fopen("example_homepage.txt", "w"); // opens a file to receive the body of the page
  3. $hp = fopen("example_homepage_header.txt", "w"); // opens of file to receive the header of the page
  4.  
  5. curl_setopt($ch, CURLOPT_FILE, $fp); // writes page contents to file
  6. curl_setopt($ch, CURLOPT_HEADER, false); // do not include the header with the page contents
  7. curl_setopt($ch, CURLOPT_WRITEHEADER, $hp); // file to hold the header
  8.  
  9. curl_exec($ch); // execute the curl operation
  10.  
  11. // close curl and file handles
  12. curl_close($ch);
  13. fclose($fp);
  14. fclose($hp);

Then, you can read in the header and do something with it:

  1. $header = file_get_contents("example_homepage_header.txt");
  2. echo "Header:".PHP_EOL;
  3. echo "----".PHP_EOL.$header."----".PHP_EOL;
  4. echo PHP_EOL;
  5.  
  6. // Processing the header
  7. $header_lines = explode("\n",$header); // turn the text into lines in an array
  8. $header_lines = array_map("rtrim",$header_lines);
  9. print_r($header_lines);
  10.  
  11. // First line of header is the response text and code
  12. $response = explode(" ",$header_lines[0], 3);
  13. $protocol = explode("/",$response[0]);
  14. $header_hash['Protocol'] = $protocol[0];
  15. $header_hash['ProtocolVersion'] = $protocol[1];
  16. $header_hash['ResponseCode'] = $response[1];
  17. $header_hash['ResponseText'] = $response[2];
  18.  
  19. // Remainder of header is field: value lines
  20. for ($i=1; $i<count($header_lines); $i++) {
  21.   if (empty($header_lines[$i])) continue;
  22.   $line = explode(": ",$header_lines[$i],2);
  23.   $header_hash[$line[0]] = $line[1];
  24. }
  25.  
  26. print_r($header_hash);

The output from the combined example would be:

$ php curlexample.php 
Header:
----
HTTP/1.1 200 OK
Date: Mon, 09 Apr 2012 22:00:17 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.13
X-Pingback: http://www.tamaratemple.com/xmlrpc.php
Set-Cookie: PHPSESSID=h3h24ha5po4o1c2tjb76d4obk1; path=/
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

----

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 09 Apr 2012 22:00:17 GMT
    [2] => Server: Apache/2.2.3 (CentOS)
    [3] => X-Powered-By: PHP/5.2.13
    [4] => X-Pingback: http://www.tamaratemple.com/xmlrpc.php
    [5] => Set-Cookie: PHPSESSID=h3h24ha5po4o1c2tjb76d4obk1; path=/
    [6] => Transfer-Encoding: chunked
    [7] => Content-Type: text/html; charset=UTF-8
    [8] => 
    [9] => 
)
Array
(
    [Protocol] => HTTP
    [ProtocolVersion] => 1.1
    [ResponseCode] => 200
    [ResponseText] => OK
    [Date] => Mon, 09 Apr 2012 22:00:17 GMT
    [Server] => Apache/2.2.3 (CentOS)
    [X-Powered-By] => PHP/5.2.13
    [X-Pingback] => http://www.tamaratemple.com/xmlrpc.php
    [Set-Cookie] => PHPSESSID=h3h24ha5po4o1c2tjb76d4obk1; path=/
    [Transfer-Encoding] => chunked
    [Content-Type] => text/html; charset=UTF-8
)

Perl

The standard way of getting urls in perl is via the estimable LWP::UserAgent? module. A simple example:

  1. #!/opt/local/bin/perl -w
  2. use strict;
  3. use LWP::UserAgent;
  4. use Data::Dumper::Names;
  5.  
  6. my $ua = LWP::UserAgent->new;
  7. my $response = $ua->get("http://www.tamaratemple.com");
  8. die "$response->status_line" if ! $response->is_success;
  9. # The header is in $response->{_headers}
  10. # Access fields in header with $response->header($field)
  11. print Dumper($response->{_headers}) . "\n\n";
  12. print "Content-type is: " . $response->header("Content-type") . "\n";

Produces:

$VAR1 = bless( {
                 'x-meta-viewport' => 'width=device-width',
                 'connection' => 'close',
                 'set-cookie' => 'PHPSESSID=h15e25e00vqcg06fviallshsk5; path=/',
                 'date' => 'Mon, 09 Apr 2012 22:49:39 GMT',
                 'client-peer' => '97.74.126.203:80',
                 'client-date' => 'Mon, 09 Apr 2012 22:49:40 GMT',
                 'content-type' => 'text/html; charset=UTF-8',
                 'client-transfer-encoding' => [
                                                 'chunked'
                                               ],
                 'server' => 'Apache/2.2.3 (CentOS)',
                 'link' => [
                             '<http://gmpg.org/xfn/11>; rel="profile"',
                             '<http://www.tamaratemple.com/wp-content/themes/twentyeleven/style.css>; media="all"; rel="stylesheet"; type="text/css"',
                             '<http://www.tamaratemple.com/xmlrpc.php>; rel="pingback"',
                             '<http://www.tamaratemple.com/feed/>; rel="alternate"; title="Tamara Temple\'s Blog » Feed"; type="application/rss+xml"',
                             '<http://www.tamaratemple.com/comments/feed/>; rel="alternate"; title="Tamara Temple\'s Blog » Comments Feed"; type="application/rss+xml"',
                             '<http://www.tamaratemple.com/showcase/feed/>; rel="alternate"; title="Tamara Temple\'s Blog » Showcase Comments Feed"; type="application/rss+xml"',
                             '<http://www.tamaratemple.com/wp-content/themes/twentyeleven/colors/dark.css>; id="dark-css"; media="all"; rel="stylesheet"; type="text/css"',
                             '<http://www.tamaratemple.com/wp-content/plugins/rtsocial/styles/style.css?ver=3.3.1>; id="styleSheet-css"; media="all"; rel="stylesheet"; type="text/css"',
                             '<http://www.tamaratemple.com/wp-content/plugins/twitter-blackbird-pie/css/blackbirdpie.css?ver=20110416>; id="blackbirdpie-css-css"; media="all"; rel="stylesheet"; type="text/css"',
                             '<http://www.tamaratemple.com/xmlrpc.php?rsd>; rel="EditURI"; title="RSD"; type="application/rsd+xml"',
                             '<http://www.tamaratemple.com/wp-includes/wlwmanifest.xml>; rel="wlwmanifest"; type="application/wlwmanifest+xml"',
                             '<http://www.tamaratemple.com/visitors/>; rel="prev"; title="Visitors"',
                             '<http://www.tamaratemple.com/posts/>; rel="next"; title="Posts"',
                             '<http://www.tamaratemple.com/>; rel="canonical"'
                           ],
                 'x-powered-by' => 'PHP/5.2.13',
                 'x-meta-generator' => 'WordPress 3.3.1',
                 'client-response-num' => 1,
                 'x-meta-charset' => 'UTF-8',
                 'x-pingback' => 'http://www.tamaratemple.com/xmlrpc.php',
                 'title' => 'Tamara Temple\'s Blog | Rumblings and Ramblings of a Well-Travelled Mouse',
                 'x-meta-google-site-verification' => 'X8LsSEQsjd45g6RMdTR0CCVkl-3pgVAgOb4oG7s5JH0'
               }, 'HTTP::Headers' );


Content-type is: text/html; charset=UTF-8


Tags: Categories: Articles

Recent Changes | Printable View | Page History | Edit Page
Page last modified on April 09, 2012, at 11:00 PM by tamara