Wednesday, September 2, 2009

Getting the proper tease - get links from html

One of the issues with the ONTE site is the hassle of downloading the highest-quality version of each picture set. Some sets come in plain size only, others in both plain and large sizes, and some (luckily most these days) come in ultra high quality too.

This article will give you one way to filter out all the unwanted versions of the zipped picture sets.

For this to work, you have to visit the page for every month you wish to download.

Save each of those month pages to a .htm file. There is no need to save the graphics along with it; HTML only is enough. Then remove all newlines, carriage returns and tabs from the .htm files.

You can accomplish that by running the tr command on the command line of the OS of your choice.

The tr command should look like the one below.

tr -d "\n\r\t" < onteXYZ.htm > onteXYZ.html
Do that for every page you saved. Once each one has been written to its .html counterpart, run the command below.
grep -Po "http://[\w.]+/members/(.....)?zips/[-\w]+\.zip(?=..[\w ]+.?\([\d]+.[\d]+ MB\)./div../div.)" onte*.html
That will extract the links to all the picture zip files in the highest quality available. No need to filter anything manually, and no need to clean up after the download.
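
If there are many months to process, the two steps above can be wrapped in a small script. This is only a sketch: the function names flatten_pages and list_zip_links are made up for illustration, the saved pages are assumed to be named onte*.htm, and grep must be GNU grep with -P support. Unlike the command above, it adds -h to hide the file names and pipes through sort -u to drop duplicate links.

```shell
#!/bin/sh
# Sketch only: flatten_pages and list_zip_links are hypothetical helper names.

# Strip newlines, carriage returns and tabs from every onte*.htm file in
# directory $1, writing the result next to it with a .html extension.
flatten_pages() {
    for page in "$1"/onte*.htm; do
        [ -f "$page" ] || continue      # the glob matched nothing
        tr -d '\n\r\t' < "$page" > "${page%.htm}.html"
    done
}

# Print every matching zip URL found in the flattened pages in directory $1.
# -h hides file names, sort -u removes duplicates (both are additions).
list_zip_links() {
    grep -Pho 'http://[\w.]+/members/(.....)?zips/[-\w]+\.zip(?=..[\w ]+.?\([\d]+.[\d]+ MB\)./div../div.)' "$1"/onte*.html | sort -u
}
```

With the pages saved in the current directory, running flatten_pages . followed by list_zip_links . > links.txt would leave one link per line in links.txt, ready to hand to a download tool.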

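What the grep actually does is look for any http:// link that points to a zip file and is followed by a trailing part that ends in two closing div tags. The trailing part sits inside a lookahead, so it has to be present but is not printed; -o outputs only the URL itself.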
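
To see the lookahead at work, the pattern can be tried on a single line. The markup below is invented purely to fit the pattern (it is not copied from the site), and example.com stands in for the real host. Only the second entry is followed directly by two closing divs, so only its URL is printed. Requires GNU grep with -P support.

```shell
#!/bin/sh
# Invented sample markup, already flattened to one line; not the site's real HTML.
sample='<div class="sets"><div data-zip="http://example.com/members/zips/set1.zip">Plain (10.5 MB)</div><div data-zip="http://example.com/members/uhq__zips/set1_uhq.zip">Ultra High Quality (123.4 MB)</div></div>'

# The part that gets printed: the URL of a zip file.
url='http://[\w.]+/members/(.....)?zips/[-\w]+\.zip'
# The part that is only checked: a label, a size in MB, then two closing divs.
lookahead='(?=..[\w ]+.?\([\d]+.[\d]+ MB\)./div../div.)'

matched=$(printf '%s\n' "$sample" | grep -Po "${url}${lookahead}")
printf '%s\n' "$matched"
# prints: http://example.com/members/uhq__zips/set1_uhq.zip
```

The first entry is rejected because its closing div is followed by another opening div rather than a second closing one.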

Have a look below to better understand the result.
