I have been working on some very challenging projects recently that require retrieving remote pages, manipulating them and then sending them back to the user. There are several problems with this scenario:
- Because of abuse by greedy and malicious hackers, many of the obvious, easier techniques are now blocked by necessary security measures.
- Whenever you deal with outside code you don’t know what you’re going to get. It could be half-formed, have all kinds of JavaScript workarounds, rely on frames within frames, and so on.
- Relative URLs. If you retrieve a remote page and display it locally, any relative URLs will resolve against your server instead of the original one, so links will be broken and images will not show. This also applies to stylesheet and script references.
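To make the relative URL problem concrete, here is a minimal PHP sketch of resolving a reference against the URL of the page it came from. The helper name resolve_url is hypothetical (PHP has no built-in for this), and the logic is deliberately simplified: it assumes an http or https base URL and does not handle query-only or fragment-only references.

```php
<?php
// Hypothetical helper (not a PHP built-in): turn a relative URL found in a
// remote page into an absolute one, using the page's own URL as the base.
// Simplified sketch: ignores "?query" and "#fragment" only references.
function resolve_url(string $base, string $relative): string
{
    if ($relative === '') {
        return $base;
    }
    // Anything that already has a scheme (http:, mailto:, ...) is left alone.
    if (is_string(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative reference such as //cdn.example.com/lib.js
    if (substr($relative, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $relative;
    }

    if ($relative[0] === '/') {
        // Root-relative: replaces the whole path of the base URL.
        $path = $relative;
    } else {
        // Path-relative: appended to the directory of the base URL.
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Collapse "." and ".." segments, never climbing above the root.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $origin . implode('/', $out);
}
```

With a base of http://example.com/articles/page.html, the reference images/logo.png becomes http://example.com/articles/images/logo.png, ../css/site.css becomes http://example.com/css/site.css, and /js/app.js becomes http://example.com/js/app.js.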
Here’s the solution in a nutshell:
- Use cURL (http://us3.php.net/manual/en/ref.curl.php) to retrieve the remote page.
- Use Tidy (http://us3.php.net/manual/en/ref.tidy.php) to put the code into a proper XHTML format. This will close tags, give you predictable formatting, make the source very easy to read and, most importantly, put it in a format that can be parsed as XML.
- Parse the XML using PHP’s XML parser functions (http://us3.php.net/manual/en/function.xml-parse.php). The one that does the work here is xml_parse_into_struct, and it is a wonderful function! I have spent hours coding XML parsing functions that recurse, save parents etc. Depending on your task, this is a much more efficient method. It fills two arrays, one for an index and one for values. You can use the index to find the position you need in the values array and get any info you need.
- Manipulate the cleaned-up HTML. Since you have XHTML to work with now, you can manipulate it as XML or you can use string replacement. This would be the time to replace all of the relative URLs with absolute ones.
- Display it back to the user! I don’t want to know how you apply this method; I just like coming up with the solutions.
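The first three steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact code behind this post: the function names fetch_page, clean_to_xhtml and find_url_attributes are hypothetical, the curl, tidy and xml extensions are assumed to be installed, and the Tidy options shown are just one reasonable choice.

```php
<?php
// Step 1 (hypothetical helper): retrieve the remote page with cURL.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $html;
}

// Step 2 (hypothetical helper): let Tidy repair the markup into XHTML.
function clean_to_xhtml(string $html): string
{
    $config = [
        'output-xhtml'     => true, // close tags and emit well-formed XHTML
        'numeric-entities' => true, // &nbsp; and friends would trip the XML parser
    ];
    $xhtml = tidy_repair_string($html, $config, 'utf8');
    if ($xhtml === false) {
        throw new RuntimeException('Tidy could not repair the document');
    }
    return $xhtml;
}

// Step 3 (hypothetical helper): parse into the values and index arrays,
// then use the index to jump straight to tags that carry URLs.
function find_url_attributes(string $xhtml): array
{
    $parser = xml_parser_create();
    // Keep tag and attribute names lowercase instead of folding to uppercase.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $values = [];
    $index  = [];
    if (!xml_parse_into_struct($parser, $xhtml, $values, $index)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    $found = [];
    $wanted = ['a' => 'href', 'link' => 'href', 'img' => 'src', 'script' => 'src'];
    foreach ($wanted as $tag => $attribute) {
        // $index[$tag] lists every position in $values where that tag occurs.
        foreach ($index[$tag] ?? [] as $position) {
            $node = $values[$position];
            if (isset($node['attributes'][$attribute])) {
                $found[] = $node['attributes'][$attribute];
            }
        }
    }
    return $found;
}
```

Chaining them as find_url_attributes(clean_to_xhtml(fetch_page($url))) gives the list of URL values, grouped by tag rather than in document order, that would need rewriting in the manipulation step before the page is echoed back to the user.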