wget goes blind?

Trying to download a bunch of ogg files from a server directory containing lots of other things and NOT wanting to download them one by one, I turned to wget.

Seemed to remember something about wildcards not working with wget in what I would think of as an intuitive way, so I googled wget and wildcards. Soon found reference to the use of an “-A” option that looked like it would do the trick.

Ran:
wget -r -l1 --no-parent -A.ogg http://domain/directory/subdirectory/

and got a copy of the directory tree and an index.html file that wget immediately removed again “since it should be rejected”. What the heck? Where are the oggs?

Had a look at the subdirectory using lynx and got a directory listing with various files, including the ogg files. The files really did seem to be there; no sign of a base address or other magic that might make them appear to be somewhere they were not. Lynx could see them. Had wget gone blind?
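
For the record, that check was nothing fancy, just something like pointing lynx at the same URL (domain and path are placeholders, obviously):

lynx http://domain/directory/subdirectory/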

Read (ok, skimmed) the man page and tried playing with the options a bit. Same result. There was an HTTP 302 redirect involved, so I tried various options with the domain the redirect pointed to. No joy.
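
If you want to see a redirect like that for yourself, something along these lines should print the response headers without actually downloading anything (same placeholder URL):

wget -S --spider http://domain/directory/subdirectory/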

By this time I could have just downloaded the files one by one! Have to solve the mystery though. Why can’t I wget these oggs!?

Finally (and looking at my bash history gives me no clue what made the difference), one of my attempts produced output that included “Loading robots.txt; please ignore errors.”

Robots.txt! What does this robots.txt say?

User-agent: *
Disallow: /
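
Checking a site’s robots.txt yourself is easy enough; something like this should dump it straight to the terminal (placeholder domain as before):

wget -q -O - http://domain/robots.txt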

That robots.txt tells robots to stay away from the entire site, and of course wget is going to play nice. So, one more Google for wget and robots.txt, and at last the answer!

wget -r -l1 --no-parent -A.ogg -e robots=off http://domain/directory/subdirectory/

and I got the oggs. Happy day.

btw: the wget man page does mention, in its description, that wget respects the Robot Exclusion Standard. I missed that. Rarely look at the descriptions. There is no mention in the man page of a robots command, and I hadn’t thought to look at /etc/wgetrc. After finding the answer I needed on Google, I looked for where the robots command might be documented on my system and finally found it in the info file under Wgetrc Commands, with some explanation in the appendices.
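
If you always want this behaviour, the same setting should also work from a wgetrc file instead of passing -e on every run, e.g. a line like this in ~/.wgetrc or the system-wide /etc/wgetrc:

robots = off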

Also, it turned out that wget did need to be aimed at the domain the HTTP redirect pointed to, so the command that finally worked looked roughly like the one shown below. I don’t know whether that was because of the -l1 option. Didn’t play around with it for fear of downloading the entire internet. So much to learn. So little time.
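
Roughly, with redirected-domain standing in for wherever the 302 pointed:

wget -r -l1 --no-parent -A.ogg -e robots=off http://redirected-domain/directory/subdirectory/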

robot image courtesy of Dirty Bunny http://www.flickr.com/photos/angrybee/