TITLE
urlremove - bulk request that Google remove URLs from the search db
SYNOPSIS
urlremove [-v+] [-e email] [-p password_file] [-c config_file]
An email and password must be set. The email can be given on the command
line or in the config file; the password must come from a password file
or the config file.
DESCRIPTION
If you discover that a user put their social security number on a web
page, you can take it down, but you also want to request that Google
remove that page from their index and cache. The tool they have is a
form that lets you enter one URL at a time, which is fine unless you
discover that the user has pages of social security numbers.
STATUS
This is beta code. I've used it myself in production and you should be
able to use it too. It hasn't been tested/used by anyone but me, so while
it has every chance of success, don't be too surprised if it fails. I'd
love any feedback.
ARGUMENTS
--verbose, -v
More output the more times you list it. You can type '-v', '-vv', or
'-vvv'.
--email, -e gmail_address
The gmail account to use for the webmaster tools. You needn't be the
registered webmaster for the sites you're submitting.
--pwfile, -p password_file
A file containing your gmail password. Alternatively, you can
specify the password in a config file if you use one. You cannot
enter your password on the command line because that is insecure. If
no configfile is found with a password in it and no --pwfile is
specified, this script looks for '.pass' in the current working
directory.
--cfgfile, -c config_file
A configuration file to use. If not specified, this script looks for
'urlremove.cfg' in the current working directory.
--url, -u url_to_remove
A URL to add to the list of URLs to remove. You can specify this
multiple times. If you also have a urlfile, the URLs from both sources
are combined. If you specify the same URL multiple times, this script
reports the URL as already added each time it is repeated.
--urlfile file_containing_urls
A file containing one URL per line. This script does not verify that
URLs are correctly formatted and it does not add URL-encoding
(turning space into %20, etc). If a badly formatted URL is
encountered, WWW::Mechanize will crap out. You are responsible for
making the list correct.
The default URL file name is 'urlremove.urls'.
--testing
Exit after gathering up all the configuration variables and the URL
list. If you run with -vv, you get to see what that all looks like.
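Because --urlfile does no checking or URL-encoding of its own, it can
help to clean the list before handing it to urlremove. Here is a minimal
sketch of one way to do that, written in Python rather than the Perl
this script uses. It is not part of urlremove: the names clean_url and
clean_lines are made up for this example, and the set of characters left
unencoded is an assumption you may need to adjust for your URLs.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def clean_url(line):
    """Return a percent-encoded http(s) URL, or None if the line is not one."""
    url = line.strip()
    if not url:
        return None
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 hosts.
        return None
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    # '%' is kept in the safe set so already-encoded URLs are not
    # encoded a second time (assumption: a literal '%' never appears).
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~")
    query = quote(parts.query, safe="/%:@!$&'()*+,;=?~")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def clean_lines(lines):
    """Split lines into (cleaned URLs, rejected lines), skipping blanks."""
    good, bad = [], []
    for line in lines:
        if not line.strip():
            continue
        cleaned = clean_url(line)
        if cleaned is None:
            bad.append(line.strip())
        else:
            good.append(cleaned)
    return good, bad
```

One way to use it: feed it the lines of your raw list, write the "good"
URLs to urlremove.urls, and fix the "bad" lines by hand.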
DISABLED ARGUMENTS
This script will only tell Google that you've blocked/removed the
offending page. The previous version let you choose the options supplied
by Google. Google changed their page and I'm too lazy to add the
functionality back in for the new page.
If you need this, let me know via the email at the end of this file.
Here's what you're missing...
--type, -t (link|result|safesearch)
DISABLED
Google thinks of URL removal requests as falling into three
categories and you need to pick one. "result" means the summary that
shows up with a Google result contains info you want to disappear.
"link" means the link Google has for your content is bad.
"safesearch" means you found porn while doing a Google search with
their safesearch enabled. The default is "result".
--method, -m (modified|removed)
Google requires you to change the actual web page before they
re-spider it. They want to know if you modified the offending page
or completely removed it. The default is "removed".
CONFIG FILE
The config file is parsed by Config::Tiny. You can set any of the
"ARGUMENTS" by putting the long form of the argument in the config file.
This does not include the 'cfgfile' or 'testing' arguments.
For example:
email = pileofrogs
password = secret
would get you going.
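If you want to check a config file before a run, the sketch below shows
one way to confirm that the two required settings are present. It is
written in Python and is not part of urlremove. It assumes the file
holds plain "key = value" lines like the ones above, and it borrows
Python's configparser by wrapping those lines in a made-up section
called "__root__"; the function names are invented for this example.

```python
import configparser


def read_root_settings(text):
    """Parse section-less 'key = value' lines into a dict."""
    # interpolation=None so a '%' in a password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    # configparser insists on a section header, so supply a dummy one.
    parser.read_string("[__root__]\n" + text)
    return dict(parser["__root__"])


def missing_settings(text, required=("email", "password")):
    """Return the required keys that are absent or empty."""
    settings = read_root_settings(text)
    return [key for key in required if not settings.get(key)]
```

A missing "password" is only a problem if you are not supplying one
through --pwfile or a '.pass' file instead.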
EXAMPLE
Suppose you're the webmaster at www.example.com and you discover that HR
thought it would be helpful to publish everyone's home phone number on a
separate page about that person. For example,
http://www.example.com/employees/john_smith contains John's home phone
number. This violates your company's employee privacy policy, so
you whip up a script that removes the phone numbers from all 100 of
those pages. Then you discover that when you Google anyone at your
company, the Google results page still includes their home phone number.
Ack!
Just put the URLs of those bad pages into a file called urlremove.urls
like so:
http://www.example.com/employees/john_smith
http://www.example.com/employees/jack_aubrey
http://www.example.com/employees/wilber_arable
...
NOTE - you have to remove the original page if you're using this version
of this script.
And create a config file called urlremove.cfg with your gmail account
name and gmail password. Because these files are still around, just
changed, you would want to specify method = modified (not with this
version!). If you had deleted or blocked those files you wouldn't need to
specify anything because "removed" is the default (and your only choice
with this version). This would give you a config like so:
email = pileofrogs@gmail.com
password = secret
Make sure both of these files are in your current working directory and
run
urlremove -v
DEPENDENCIES
* WWW::Mechanize
* Config::Tiny
* Anything that satisfies LWP::Protocol::implementor('https'), e.g.
Crypt::SSLeay or IO::Socket::SSL
AUTHOR
This was written by Dylan Martin, a unix admin at Seattle Central
Community College. You can email me at dmartin at cpan dot org.
REPORTING BUGS
Please report any bugs to
<https://sourceforge.net/projects/urlremove/support>.
BUGS, LIMITATIONS AND OTHER SUCKAGE
* I don't handle badly formatted URLs in the urlfile in a reasonable
way. It should emit a useful warning and continue processing; instead
it says something unhelpful and croaks.
* I don't prompt for a password and/or email if they're not in a
config. This is something any reasonable person might expect of this
script, but I don't need it so I haven't written it. If anyone else
uses this script and wants this functionality, let me know and I'll
probably add it.
* This script scrapes & submits a Google form instead of using some
cool REST API that may or may not exist. If Google changes the form,
this script will not work and it will probably eat your dog.
aaand that just happened. It works, but with reduced functionality.
* The stuff to tell Google whether the page has just changed and needs
re-spidering, vs. been removed entirely, is gone. The only way to use
this at this time is to remove the offending page.
Let me know if that makes this script useless to you and I'll see
what I can do. It shouldn't be too hard to recreate that
functionality.
* I don't bother to check if the same URL has been specified twice. It
won't hurt anything if you do specify the same URL twice, as the
script will detect that you're attempting to add an already-added
URL and continue with the next URL, but it clutters the screen a
bit.
* There hasn't been much testing. I've used this on a big batch of
URLs and released it to the world in case anyone else has the same
problem I had.
* If Google thinks you're a bot, it will ask you to prove you're a
human. Don't submit the wrong password! I don't know what else
Google thinks of as bot-like behavior. If they start demanding a
captcha, you probably need to try from another machine. I probably
could write some code to download the captcha image, so if this is a
problem for you, let me know and I might be able to implement a
solution.
* The logic to determine if a URL has already been submitted will freak
out if you have more than 1000 submitted URLs. This is because Google
separates them into pages. I figured out how to tell Google to list
1000 entries per page, so that shouldn't be a problem unless you're
submitting a huge number of URLs. If you do exceed this number,
never fear, Google handles it gracefully. It just means this script
will take longer to run.
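Since this script doesn't check for URLs that are listed twice, you can
keep the screen clutter down by removing duplicates from the list
yourself before a run. Here is a minimal sketch in Python; it is not
part of urlremove, and the function name dedupe is made up for this
example. It treats two URLs as duplicates only if they match exactly
after surrounding whitespace is stripped.

```python
def dedupe(lines):
    """Return the non-blank lines with exact duplicates dropped, order kept."""
    seen = set()
    unique = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Run the lines of urlremove.urls through it and write the result back,
one URL per line.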
LICENSE
Copyright (c) 2010, Dylan Martin & Seattle Central Community College All
rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the Seattle Central Community College nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.