Showing posts with label scripting. Show all posts

Wednesday, September 2, 2009

Finding Luba - creating custom csvs with your fav

Luba, one of the fine wonders of the world, as quite a bunch of people would claim. And they are not all wrong.

Luba is a model and a photographer, now married to Petter Hegre, the owner of the fine art site Hegre-Art.

Luba has perfectly shaped b(.Y.)bs and an hourglass figure like a 50's pin-up. Pretty admirable. And she was/is still young.

But this article is not as much about Luba as it is about finding her. Finding, in CSV-specific terms. So MG has prepared a small video of easy ways to find Luba, and even a way to bring it all together.

Think of this as ways to make your own custom collection of everything you like.




Download the small video clip. It is 6:38min long and is meant as inspiration.

Get the tools used at http://gnuwin32.sourceforge.net/packages.html

PS: A note from ash1: "Luba is not all that good in bed!"








Luba and Petter

Hard labour of maintaining - auto schedule download and organizing






Well, maintaining can be a bitch sometimes, and it most often demands quite a bit of your time. And you have to look out for pitfalls and every other nasty thing a webmaster can pull on you.

But sometimes it can also be easy. Take a look at the screenshot.


What you see is 2 scripted update/maintenance scripts, which are hooked into Windows' Task Scheduler (Start -> Accessories -> System Tools). One of the scripts even unzips the contents after download (if you look closely).

This is just for inspiration and not a production run.

Making a directory list

Here's a small video example of how anyone can create a list of directories. The example uses some lines from some html pages. It is raw html, which we turn into a nice list.




Download video: UltraEdit.DirectoryLists.XViD.zip

One pickup line for Bang Bros


Alternative - DOS / commandline scripting


The Alternative - Scripting the Bad Jojo

There are various ways to download.

A) You can download by using your browser and choose Save.

B) You can also use download managers like Down Them All, ReGet or GetRight, and even extend them with download lists.

C) You can also use an advanced Offline Browser like Offline Explorer EE, BlackWidow, HTTrack, HTTP Weazel or Teleport Pro. And you can even spice it up with filters, depth, HAM and macros while the program crawls like a spider over the website of choice.

State-of-the-art tools, all of them, doing what they do best. But they all have overhead. Either you have to click through too many pages, you have to hand-pick every download with the download managers, or your "spider" traverses too many pages to download only what you want.

What you need is an alternative

The alternative to choose is scripting.

Scripting is only something for you if you have skills. You need skills in programming, and you need knowledge about HTTP/URLs and HTML. Knowledge about JavaScript and Ajax is an advantage. And knowledge about regular expressions in particular is very beneficial to have.


These skills are all learnable. Here on MG you can read about regular expressions and learn them in articles 1, 2, 3 and 4, and you can also gain knowledge about URLs.

But before you go all cold and say that this is way above your head, or maybe even too tedious, you should take some time to watch a small 3:39min video of a tiny script (62 lines) which makes some things easier. The script is not flashy or anything, but it may inspire you.



Download Scripting.badjojo.XViD.zip (6.26Mb)


Thanks to _Store_ for keeping Perl in the focus area and thanks to lobos for script inspiration.

Getting the proper tease - get links from html

One of the issues with the ONTE site is the hassle of downloading the proper highest-quality version of the picture sets. Some sets come in plain size, others in both plain and large sizes, and some, luckily most these days, come in ultra high quality too.

This article will give you one way to filter out all the unwanted versions of the zipped picture sets.

In order for this to work you have to go to every month you wish to download.

Then you should save every month on the site to a .htm file; no need to save the graphics along with it, just keep it to HTML only. Secondly, you have to remove all newlines, carriage returns and tabs from the .htm files.

You can accomplish that by running a tr command on the command line in the OS of your choice.

The tr command should look like the one below.

tr -d "\n\r\t" < onteXYZ.htm > onteXYZ.html
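If you saved several months, that tr step can be wrapped in a small loop. This is just a sketch; the onte*.htm names follow the hypothetical naming used above:

```shell
# Make a throwaway sample page (hypothetical name) so the loop has input.
printf 'line one\nline two\ttabbed\r\n' > onte01.htm

# Strip newlines, carriage returns and tabs from every saved page in one go,
# writing each onteXYZ.htm to its onteXYZ.html counterpart.
for f in onte*.htm; do
  tr -d "\n\r\t" < "$f" > "${f%.htm}.html"
done

cat onte01.html
# -> line oneline twotabbed
```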
Do that on every page you saved. Once they are all saved to their .html counterparts, run the command below.
grep -Po "http://[\w.]+/members/(.....)?zips/[-\w]+\.zip(?=..[\w ]+.?\([\d]+.[\d]+ MB\)./div../div.)" onte*.html
That will extract all the picture zip files in the highest quality available. No need to filter stuff manually. And no need to clean up after the download.

What the grep actually does is look for any http:// link which points to a zip file and is followed by a trailing part consisting of two closing div's.

Have a look below to better understand the result.

TXT2WJR gets graphical - New Windows GUI






New version, now with GUI, bugfixes and new features.

Program purpose:

Convert a text file with URLs and an optional text file with directory names into a proper ReGet Deluxe download queue, ready for downloading.
New features:
  • Appends to an existing ReGet queue if one is chosen
  • Set cookie (optional)
  • Set referer (optional)
  • Set username (optional)
  • Set password (optional)
  • Handles target directories as absolute, relative or plain.
You can have a quick look at the GUI for TXT2WJR v0.04. This tiny GUI is made with Microsoft's C# language and some stuff called WPF, which is hidden inside .NET framework v3.5. It's all some new Microsoft Windows Siesta (sorry, Vista) jazz. Anyway, it is my first attempt ever at writing a C# program; at the same time it is my first time ever with .NET and even my first time with WPF (Windows Presentation Foundation). So bear with me if it looks and acts like something the cat dragged in.



This package will be available as a ZIP download hosted on MediaFire sometime pretty darn soon. (When I bother to package it, hehe).

Beginners RegEx - part 4 - repeats and boundaries





This is an addition to the former article about metacharacters. And a little new feature which allows us to tell how many occurrences of a certain character we want to allow.

We have seen that * means zero or more, and that + means one or more of the preceding character.

In this case we will use dates as the target for our example. Consider some dates like the following.

30 April, 2008
14 May 2008
30. April 2008
1 - May - 2008 Description
2nd May 2008
Date: 12TH - 4 - 2008
02-05-2008
All valid dates to my eye, so we want to make a regular expression which assures us that we get such dates when we look for them. Let's outline some of the rules for dates. I will skip real validation regarding invalid dates like the 65th of February, and merely consider the format.

  • Day comes first
  • Month comes second
  • Year comes last
  • Day is 1 or 2 digits long
  • Month is letters, or a number which is 1 or 2 digits long
  • Year is 4 digits long
  • We have some kind of separator between day, month and year


Repeats - minimum and maximum

Those are the basic rules for the date formats we wish to accept and match. The first thing we need to assure is that the digit versions of day, month and year do not exceed their natural limitations of 1-2 digits, 1-2 digits and 4 digits. This can be done by setting minimum and maximum.
[0-9]{1,2}

The character class [0-9] must be repeated at least 1 and at most 2 times in sequence. So every number between 0 and 99 is good. {minimum,maximum} applies to the preceding character or character class. There is a flavour of the {minimum,maximum} specifier, used in a case like this:
[0-9]{4}

Repeats - exact count

Using just {exact count} can be useful when we want to ensure that the year in our dates is exactly 4 digits.
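You can try both repeat styles with a quick grep (GNU grep assumed; -E enables the {m,n} syntax and -x makes the whole line have to match):

```shell
# 1-2 digit numbers pass, longer ones don't.
printf '5\n42\n123\n' | grep -xE '[0-9]{1,2}'
# -> 5
# -> 42

# Exactly 4 digits, as for our year.
printf '2008\n89124\n' | grep -xE '[0-9]{4}'
# -> 2008
```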

One other thing which could be useful is to find the boundary of the dates we wish to match. What we want is to check from the starting boundary of the date to the ending boundary of the date. Let's consider a simple version which could match one of our dates.
[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}


Boundaries

That one could match 02-05-2008 with success. But what if the date wasn't actually a date, but part of a calculation like 2002-05-20089124? Then we would still get a match on part of it, namely 02-05-2008. Which is not something we want. For this purpose regular expressions have a metacharacter called \b (boundary). What \b actually does is distinguish between what is considered a word character ( \w = [a-zA-Z0-9_] ) and anything else. So we re-write the previous regular expression into:
\b[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}\b
Then we get the desired effect. We are saying that we first have to have some kind of boundary, then digits and - (hyphens), and finally a boundary again. This will allow us to not confuse 2002-05-20089124 with a date.



Multiple choices

There is also the issue of the month being spelled out instead of written with digits. This means that we will have to adjust our month part a bit to allow either digits (as now, [0-9]{1,2}) or letters ( [a-zA-Z]+ ). The subexpression [a-zA-Z]+ says letters a-z and A-Z, with a minimum of 1 letter and no maximum; but only letters. So we adjust our regex a little again. This time we use the logical OR construction ( thispart | orthispart ). The expression will look like this:
\b[0-9]{1,2}-([a-zA-Z]+|[0-9]{1,2})-[0-9]{4}\b


Metacharacters for digits

Now all the 0-9 becomes a bit confusing with all the 1,2 numbers in there, so we re-write it using the \d (digit) metacharacter.

\b\d{1,2}-([a-zA-Z]+|\d{1,2})-\d{4}\b


Separators between day, month and year
We're getting there. But we still lack a few things, namely the separating characters. We have to accept , (comma), . (dot), - (hyphen), space, tab etc. as separating characters. If you read the previous article you know that \s is space, tab, newline etc., so a character class like [\s,.-] will cover our possible separators. And since we need at least 1 separator between our day, month and year, we tweak the character class to [\s,.-]+ and modify the expression to look like below.

\b\d{1,2}[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b


Extensions to day
It is almost complete; we can now have any number of separators between our day, month and year. So all we need is to allow days like 1st, 2nd, 3rd, 4th etc. We basically need to allow some text extensions to the day member of our expression. We could do this with a simple subexpression like (st|nd|rd|th). The trick is that the dates may contain the extensions or they may not, so we have to make the subexpression optional. So we add a ? to the subexpression and get (st|nd|rd|th)?. Our expression looks like this:

\b\d{1,2}(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b
So what do we still need to handle? Basically the only thing left is that the dates MAY have some kind of separation between the day and the extension, such as 1 st or 2 nd. So let's inject our separator subexpression [\s,.-] in between, and make it optional as in zero or more times. We do that with a *, so we get:

\b\d{1,2}[\s,.-]*(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b


Case insensitivity
What else? What if the day was written like 1 St? There are some case sensitivity issues. We could extend our subexpression (st|nd|rd|th)? into (st|nd|rd|th|St|Nd|Rd|Th)?, or we could make the whole thing case-insensitive. That is done with (?i), so now the expression looks like:

\b\d{1,2}[\s,.-]*(?i)(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b

And I believe that's as good as it gets today. Take time to elaborate on the regular expression. There are a few nitty-gritty things one can do to make it really tight.
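If your grep supports Perl-style regex (GNU grep's -P flag), you can verify the final expression against a couple of the sample dates and the calculation trap from earlier:

```shell
# The two real dates match; the 2002-05-20089124 calculation does not,
# thanks to the \b boundaries and the repeat counts.
printf '30 April, 2008\n2nd May 2008\n2002-05-20089124\n' \
  | grep -P '\b\d{1,2}[\s,.-]*(?i)(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b'
# -> 30 April, 2008
# -> 2nd May 2008
```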

Beginners RegEx - part 3 - metacharacters





In "Beginners RegEx - part 1" we saw some character classes (the [a-z] style) to match any character between a and z. We also saw that you can mix character classes to include more characters by using styles like [a-z0-9] to match every letter from a to z and every digit from 0 to 9. And if we wanted to allow the uppercase versions of the letters we'd have to write something like [a-zA-Z0-9].

Today we'll look at some metacharacters \char, which can make life easier when using character classes [ characters ].

Here is a short list of some character class metacharacters

\t = Tabulator
\n = Newline
\r = Carriage return
\s = Any whitespace character, i.e. space, tab, newline, formfeed etc.
\S = Any character which is NOT whitespace
\w = Word character, which is basically [a-zA-Z0-9_]
\W = Any character which is NOT a word character
\d = Digit, which is basically [0-9]
\D = Any character which is NOT a digit

So now we can construct some tighter regular expressions, which will eventually be easier to read once you grow accustomed to it.

So if we have a situation in a html file which looks like this:


Then we can construct a regular expression which matches the url part of the html by writing the following regex.
http://[\w\s/.-]+
This will give us a result like:

We see a new feature of regular expressions in the regex above, one we haven't discussed yet. It is the + (plus), which matches one or more occurrences of the preceding character class. As opposed to the *, which matches zero or more occurrences of the preceding character or character class.

Let's take the regex apart and explain what is going on.


One of the things you may wonder about, hopefully, is that inside the character class we wrote . (dot) to allow matching of the dots between the URL name parts w4nd0rn.blogspot.com and the final . just before the type (.html).

Inside a character class [ ] the . (dot) does not mean any character whatsoever. In a character class it means just what it is, i.e. a literal . (dot).
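With GNU grep's -o and -P flags you can pull just the matching URL out of a line of raw HTML. The sample line below is hypothetical, but in the style this blog's URLs would take:

```shell
# -o prints only the matching part, -P enables the \w and \s metacharacters.
echo '<a href="http://w4nd0rn.blogspot.com/2008/some-post.html">link</a>' \
  | grep -oP 'http://[\w\s/.-]+'
# -> http://w4nd0rn.blogspot.com/2008/some-post.html
```

The match stops at the closing quote, because " is not in the character class.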

Beginners RegEx - part 2






We're going to concentrate on negating ranges and a regex feature called zero-width positive lookbehind. Don't let the words scare you. It is harmless.

Negation is used to tell the regex that it should match everything which is NOT in a range []. Zero-width positive lookbehind is used to get the regex to match characters which will be used to find the stuff we are looking for, but to avoid getting the stuff as the result from our regex. The latter sounds really cryptic, but is quite useful.

Zero-width positive lookbehind. Have a look at the regex below

(?<=.H1.)[^<]*

The first part of the regex is the zero-width positive lookbehind (?<=text). The second is a range [characters] and the last is *, which matches zero or more of the preceding range. Let's take it from the top.

(?<=text) tells that we are looking for text, but if we find a match then we do not want the text "printed" as part of the result. In this case we are looking for .H1. which, if we were searching an HTML file, would match <H1> (header size 1). Remember that . (dot) matches any character, so the . (dot) before H1 matches the < and the one after matches the >. The only real restriction when using (?<=text) is that the text has to be of fixed length. So it is not possible to put ? or * inside a (?<=text), because these could change the length of the zero-width lookbehind subexpression.

[^<]* is a little tricky to understand. First off, it is a range we specify. So anything inside the [ and ] is what we are looking for. The * after the range tells us that whatever we find that matches the range can be repeated zero or more times. Inside the range we use 1 metacharacter, the ^.

^ as the very first character in a range means "anything but". So [^a] means that we want to match anything but the letter a. With [^<] we look for anything but the character <, and as we have a * succeeding the range, [^<]* looks for zero or more occurrences of characters which are NOT <. In practice we get a match on an HTML string like this:


<H1>here is some header text</H1>


The first part of the full regex finds <H1> but omits it from the result; the second part matches everything after the > until it reaches a <. So the result is that we get the text inside the HTML tags <H1> and </H1>.
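You can see the lookbehind in action with grep -oP, here writing it as the explicit (?<=<H1>) rather than the dotted version:

```shell
# The lookbehind requires <H1> before the match but keeps it out of the output.
echo '<H1>here is some header text</H1>' | grep -oP '(?<=<H1>)[^<]*'
# -> here is some header text
```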

You can negate more than 1 character in a range. A regex like [^<=@] tells us that we want anything besides the characters <, = and @.

Beginners RegEx - part 1





Here's a short article which could be called "RegEx for dummies" or "RegEx for beginners". The most important thing about all of this is to read the primer (below).

PRIMER: Regular Expressions (RegEx) are used to find text which matches certain criteria. RegEx does NOT find words or sentences; it finds single characters. So a regular expression like "gin" does not find the word "gin", it finds the letter "g" followed by the letter "i" followed by the letter "n". Make a note of this, because it will make everything much easier.

So here are a few examples, followed by an explanation of some of the metacharacters in RegEx.

RegEx: "gin" matches Beginners.

RegEx: "gin?" matches the "gi" in Begixxers. Because the n is optional (? = preceding character is optional), "gi" alone is enough for a match.

RegEx: "(ht|f)tp://" matches both http:// and ftp:// (| = a logical OR between the enclosed character sequences)

RegEx: "[1-2][0-9]" matches any number sequence where the 1st digit is 1 or 2 and the following digit is between 0 and 9. So numbers between 10-29. But it also matches inside 10000 or 294.51 or 14degrees.

RegEx: "/[a-z0-9]/" matches any sequence where a / is followed by a letter from a-z or a digit from 0 to 9, and then followed by a /

RegEx: "s.x" matches any sequence where an s is followed by ANY character and then followed by an x. So sex, six, sux, sax are all good matches.

RegEx: "http:.*\.zip" matches the letters http: followed by any character repeated 0 or more times, and then finally .zip. The * means repetition 0 or more times, . means any character and \. means the literal dot itself.


So here is a tiny explanation:

? = The preceding character is optional. Can be there or not
(x|y) = A sub group where either x or y is required. One could make them both optional with (x|y)?, see ?
[a-z0-9/] = A range for 1 character, which can be any letter between a-z, a digit 0-9 or a /.
. = Means any character whatsoever.
* = Like ? it tells that the preceding character can be repeated, here 0 or an unlimited number of times.
\. = The literal dot itself. It is "escaped" by the \ to tell that we mean a literal . and not the . (any char)
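If you have grep at hand you can also try a couple of the examples right away (-E enables the (x|y) syntax):

```shell
# Matches lines containing http:// or ftp://, but not gopher://
printf 'http://a.com\nftp://b.com\ngopher://c.com\n' | grep -E '(ht|f)tp://'
# -> http://a.com
# -> ftp://b.com

# "s.x": an s, any character, then an x
printf 'sex\nsix\nsoap\n' | grep 's.x'
# -> sex
# -> six
```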

That's all. Download EditPad Pro and load/write some text, use the search (Ctrl+F) and write your first regular expressions to try searching.


















In later articles we shall construct more "game on" kinds of regular expressions, and explore tools like GREP, which has a slightly different and less featured regex flavour than EditPad Pro.

Making the URL / link list






If you've bothered to watch the video tutorial on how to easily transform a URL list file into a ReGet queue with all the necessary information, then you must face the job of creating a URL list file.

Now, this can be done manually by copying the necessary information from the HTML files in your browser... OR... you may lean on regular expressions and GREP.

NOOOOOOOO!, I can hear you scream...too complex!

Yeeees!, I shout, and tell you to watch the video tutorial on how to produce a URL list file using regex. I will write some articles on the regular expression language. But for now... watch the video clip and decide for yourself if it is worth using.



Download videoclip here: urllist.avi.zip

Now, you may say "I could have done this with ReGet and IE integration, or FireFox and FlashGot." and you're right. But when the contents span 25 pages, or hundreds, then using ReGet integration or FlashGot becomes a hassle. That is why I introduce this way of doing things: it is easier when you have to produce URL lists for several pages. Using only 1 html file here is just to keep the complexity to a minimum.

Practical TXT2WJR





Here you have a video tutorial with a practical example of how to use the tool. The video takes advantage of the newest version of TXT2WJR, which is v0.03. You can download it here.

Be advised that it is better if you download the video (XViD format) and view it locally, as the image may seem blurred here on the blog.




Download video file for better view: http://www.mediafire.com/?zjzwsymflwe

From text list to download queue






I've done some programming, mainly to make it easier to use ReGet Deluxe as a download manager. The tool is called TXT2WJR because it converts a text file (txt) into a .wjr (ReGet Deluxe queue) file.

Now, that may sound like kiddie stuff, and it almost is. Except that with the tool you can go from a plain text file looking like this:


hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4276/streaming&fname=ms4276500k.wmv
hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4242/streaming&fname=ms4242500k.wmv
hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4159/streaming&fname=ms4159500k.wmv
hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4121/streaming&fname=ms4121500k.wmv

To a ReGet download queue where the following is taken care of (on all entries):

  • The download folder for the files is the folder where you ran TXT2WJR.
  • You can custom set the filename from the URL if needed.
  • You can custom set a referer to every download file if needed.
  • You can custom set a cookie to the download queue entries if needed.

Another nice thing is that it is fairly easy to produce a plain text file holding download links by using GREP and some regex, possibly a column-capable editor and maybe a few SED commands. It all depends on how you work.
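As a sketch of that GREP step (the URL pattern and file name here are hypothetical, not the exact ones for the members page above):

```shell
# Hypothetical saved member page with one link in it.
echo '<p><a href="http://members.example.com/clip500k.wmv">download</a></p>' > page.html

# Pull every .wmv link out of the page into a plain list, ready for TXT2WJR.
grep -oP 'http://[\w./?&=-]+\.wmv' page.html
# -> http://members.example.com/clip500k.wmv
```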

I might enhance the tool slightly by adding the capability to custom set every single download path for every file in the queue. Possibly also a migration from the command-line console to a Windows GUI with drag'n'drop (but do not bet on that, though).

Bangbros to directories

Go to the Bang Brothers member page after you click 'Next page of XYZ updates>>'. Save the page to your HDD, and do the same with additional pages if you need them.

Let's say you saved the page as bangbros.html. Here are a few commands you can fire against your html file to get a nice and clean directory structure for your clips and pics.


tr -d "\n\r\t" < bangbros.html | sed -e "s/ //g" > bb2.txt

grep -Po "(..\d{4}).html.\>\<b\>(.*?)\</b\>\</a\>\</p\>(.*?)Added: (\w*?) (\d{2}), (\d{4})" bb2.txt >bb3.txt

sed -e "s/January/01/" -e "s/February/02/" -e "s/March/03/" -e "s/April/04/" -e "s/May/05/" -e "s/June/06/" -e "s/July/07/" -e "s/August/08/" -e "s/September/09/" -e "s/October/10/" -e "s/November/11/" -e "s/December/12/" bb3.txt >bb4.txt

sed -e "s/.html.>//" bb4.txt>bb5.txt

sed -e "s/<b>/-/" -e "s/<\/b><\/a><\/p>Added: /-/" -e "s/,//" bb5.txt>bb6.txt

sed "s/\(......\)-\(.*\)-\(..\) \(..\) \(....\)/mkdir \"\5-\3-\4 - (\1) - \2\"/" bb6.txt
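To see what that final sed does, here it is applied to a single line in the bb6.txt format (a sample line reconstructed to match the six-char id, title and date layout the earlier commands produce):

```shell
# Rearranges "id-title-MM DD YYYY" into a quoted mkdir command.
echo "bb4222-Spring Break Hottie-03 12 2008" \
  | sed "s/\(......\)-\(.*\)-\(..\) \(..\) \(....\)/mkdir \"\5-\3-\4 - (\1) - \2\"/"
# -> mkdir "2008-03-12 - (bb4222) - Spring Break Hottie"
```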

Those few commands make you go from some html which looks like this:

...
<td align="left" valign="top" width="24%">
<p><a href="http://members.bangbrosnetwork.com/bangbus/intro/bb4222.html"><b>Spring Break Hottie</b></a></p>




Added: March 12, 2008<br>
<p>Website: <a href="http://members.bangbrosnetwork.com/bangbus/main-1.html">bangbus.com</a></p>




<div><img src="bangbus_files/small_7.gif" alt="bar 7" border="0" height="12" width="58"></div>
<p><small>Rating: 6.78 (674 votes)</small></p></td>
...

into a nice clean command list which you can run, that looks like this:

mkdir "2008-03-12 - (bb4222) - Spring Break Hottie"
mkdir "2008-03-05 - (bb4197) - Kangaroo spotting"
mkdir "2008-02-27 - (bb4173) - Cock Hungry SaraJay"
...