Name          Modified    Size     Downloads / Week
urlremove.pl  2011-07-08  16.0 kB  0
README        2011-07-08  10.6 kB  0
Totals: 2 items, 26.6 kB
TITLE
    urlremove - bulk request that Google remove URLs from the search db

SYNOPSIS
    urlremove [-v+] [-e email] [-p password_file] [-c config_file]

    An email and password must be set, either at the command line or in a
    config file.

DESCRIPTION
    If you discover that a user put their social security number on a web
    page, you can take the page down, but you also want to request that
    Google remove that page from their index and cache. The tool they
    provide is a form that lets you enter one URL at a time, which is fine
    unless you discover that the user has pages of social security numbers.

STATUS
    This is beta code. I've used it myself in production and you should be
    able to use it too. It hasn't been tested or used by anyone but me, so
    while it has every chance of success, don't be too surprised if it
    fails. I'd love any feedback.

ARGUMENTS
    --verbose, -v
        More output the more times you list it. You can type '-v', '-vv',
        '-vvv'.

    --email, -e gmail_address
        The gmail account to use for the webmaster tools. You needn't be
        the registered webmaster for the sites you're submitting.

    --pwfile, -p password_file
        A file containing your gmail password. Alternatively, you can
        specify the password in a config file if you use one. You cannot
        enter your password on the command line because that is insecure.
        If no config file is found with a password in it and no --pwfile
        is specified, this script looks for '.pass' in the current working
        directory.

    --cfgfile, -c config_file
        A configuration file to use. If not specified, this script looks
        for 'urlremove.cfg' in the current working directory.

    --url, -u url_to_remove
        A URL to add to the list of URLs to remove. You can specify this
        multiple times. If you also have a urlfile, these will all be
        added together. If you specify the same URL multiple times, this
        script will report the URL as already added for each time you
        specify it.

    --urlfile file_containing_urls
        A file containing one URL per line. This script does not verify
        that URLs are correctly formatted and it does not add url-encoding
        (turning space into %20, etc). If a badly formatted URL is
        encountered, WWW::Mechanize will crap out. You are responsible for
        making the list correct. The default url file name is
        'urlremove.urls'.

    --testing
        Exit after gathering up all the configuration variables and the
        URL list. If you run with -vv, you get to see what all that looks
        like.

DISABLED ARGUMENTS
    This script will only tell Google that you've blocked/removed the
    offending page. The previous version let you choose among the options
    supplied by Google. Google changed their page and I'm too lazy to add
    the functionality back in for the new page. If you need this, let me
    know via the email at the end of this file. Here's what you're
    missing...

    --type, -t (link|result|safesearch)   DISABLED
        Google thinks of URL removal requests as falling into three
        categories and you need to pick one. "result" means the summary
        that shows up with a Google result contains info you want to
        disappear. "link" means the link Google has for your content is
        bad. "safesearch" means you found porn while doing a Google search
        with their safesearch enabled. The default is "result".

    --method, -m (modified|removed)
        Google requires you to change the actual web page before they
        re-spider it. They want to know if you modified the offending page
        or completely removed it. The default is "removed".

CONFIG FILE
    The config file is parsed by Config::Tiny. You can set any of the
    "ARGUMENTS" by putting the long form of the argument in the config
    file. This does not include the 'cfgfile' or 'testing' arguments.
    For example:

        email = pileofrogs
        password = secret

    would get you going.
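    Purely as an illustration of how a Config::Tiny-driven setup like the
    one described above tends to fit together, here's a minimal sketch.
    The option names mirror the ARGUMENTS list, but the merge logic
    (command line wins over config file) and all variable names are my
    assumptions for illustration, not urlremove's actual code:

        #!/usr/bin/perl
        # Minimal sketch, NOT urlremove's real code. Assumes root-level
        # "key = value" lines in urlremove.cfg; Config::Tiny stores those
        # under the '_' section. Command-line options win over the config.
        use strict;
        use warnings;
        use Getopt::Long;
        use Config::Tiny;

        my %opt = ( verbose => 0 );
        GetOptions(
            'verbose|v+'  => \$opt{verbose},
            'email|e=s'   => \$opt{email},
            'pwfile|p=s'  => \$opt{pwfile},
            'cfgfile|c=s' => \$opt{cfgfile},
        ) or die "bad arguments\n";

        my $cfgfile = defined $opt{cfgfile} ? $opt{cfgfile} : 'urlremove.cfg';
        if ( -r $cfgfile ) {
            my $cfg = Config::Tiny->read($cfgfile)
                or die "can't parse $cfgfile: " . Config::Tiny->errstr . "\n";
            # Properties above any [section] header live in '_'.
            for my $key ( keys %{ $cfg->{_} } ) {
                $opt{$key} = $cfg->{_}{$key} unless defined $opt{$key};
            }
        }

        die "need an email and a password (see CONFIG FILE above)\n"
            unless defined $opt{email}
            and ( defined $opt{password} or defined $opt{pwfile} );

    Config::Tiny files with no [section] headers put everything in the
    '_' section, which is why the example config above needs no section
    header.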
EXAMPLE
    Suppose you're the webmaster at www.example.com and you discover that
    HR thought it would be helpful to publish everyone's home phone number
    on a separate page about each person. For example,
    http://www.example.com/employees/john_smith contains John's home phone
    number. This violates your company's employee privacy policy, so you
    whip up a script that removes the phone numbers from all 100 of those
    pages. Then you discover that when you Google anyone at your company,
    the Google results page still includes their home phone number. Ack!

    Just put the URLs of those bad pages into a file called urlremove.urls
    like so:

        http://www.example.com/employees/john_smith
        http://www.example.com/employees/jack_aubrey
        http://www.example.com/employees/wilber_arable
        ...

    NOTE - you have to remove the original page if you're using this
    version of this script.

    Then create a config file called urlremove.cfg with your gmail account
    name and gmail password. Because these files are still around, just
    changed, you would want to specify method = modified (not with this
    version!). If you had deleted or blocked those files you wouldn't need
    to specify anything, because "removed" is the default (and your only
    choice with this version). This would give you a config like so:

        email = pileofrogs@gmail.com
        password = secret

    Make sure both of these files are in your current working directory
    and run

        urlremove -v

DEPENDENCIES
    * WWW::Mechanize
    * Config::Tiny
    * Anything that satisfies LWP::Protocol::implementor('https'), e.g.
      Crypt::SSLeay or IO::Socket::SSL

AUTHOR
    This was written by Dylan Martin, a unix admin at Seattle Central
    Community College. You can email me at dmartin at cpan dot org.

REPORTING BUGS
    Please report any bugs to
    <https://sourceforge.net/projects/urlremove/support>.

BUGS, LIMITATIONS AND OTHER SUCKAGE
    * I don't handle badly formatted URLs in the urlfile in a reasonable
      way. It should emit a useful warning and continue processing;
      instead it says something unhelpful and croaks. (One possible
      workaround is sketched after this list.)

    * I don't prompt for a password and/or email if they're not in a
      config. This is something any reasonable person might expect of this
      script, but I don't need it so I haven't written it. If anyone else
      uses this script and wants this functionality, let me know and I'll
      probably add it.

    * This script scrapes & submits a Google form instead of using some
      cool REST API that may or may not exist. If Google changes the form,
      this script will not work and it will probably eat your dog. Aaand
      that just happened. It works, but with reduced functionality.

    * The stuff to tell Google whether the page has just changed and needs
      re-spidering, vs. been removed entirely, is gone. The only way to
      use this at this time is to remove the offending page. Let me know
      if that makes this script useless to you and I'll see what I can do.
      It shouldn't be too hard to recreate that functionality.

    * I don't bother to check if the same URL has been specified twice. It
      won't hurt anything if you do, as the script will detect that you're
      attempting to add an already-added URL and continue with the next
      URL, but it clutters the screen a bit.

    * There hasn't been much testing. I've used this on a big batch of
      URLs and released it to the world in case anyone else has the same
      problem I had.

    * If Google thinks you're a bot, it will ask you to prove you're a
      human. Don't submit the wrong password! I don't know what else
      Google thinks of as bot-like behavior. If they start demanding a
      captcha, you probably need to try from another machine. I could
      probably write some code to download the captcha image, so if this
      is a problem for you, let me know and I might be able to implement a
      solution.

    * The logic to determine if a URL has already been submitted will
      freak out if you have more than 1000 submitted URLs. This is because
      Google separates them into pages. I figured out how to tell Google
      to list 1000 entries per page, so that shouldn't be a problem unless
      you're submitting a huge number of URLs. If you do exceed that
      number, never fear, Google handles it gracefully. It just means this
      script will take longer to run.
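    About that first bullet: until the script learns to warn and continue
    on its own, you can screen the url file yourself before running
    urlremove. The sketch below is one way to do it. It leans on the URI
    module, which is not in the DEPENDENCIES list above, so treat that
    (and the .clean output name) as my assumption rather than part of
    this tool:

        #!/usr/bin/perl
        # Pre-flight filter for urlremove.urls -- a sketch, not part of
        # urlremove. Keeps absolute http/https URLs, warns about the
        # rest, and writes the survivors to urlremove.urls.clean.
        use strict;
        use warnings;
        use URI;

        open my $in,  '<', 'urlremove.urls'       or die "read: $!\n";
        open my $out, '>', 'urlremove.urls.clean' or die "write: $!\n";

        while ( my $line = <$in> ) {
            chomp $line;
            next unless length $line;
            my $uri    = URI->new($line);
            my $scheme = $uri->scheme;    # undef for relative/garbage
            if ( defined $scheme and $scheme =~ /\Ahttps?\z/ ) {
                # canonical() normalizes case and escaping, which also
                # papers over the missing url-encoding mentioned under
                # --urlfile.
                print {$out} $uri->canonical, "\n";
            }
            else {
                warn "skipping badly formatted line $.: $line\n";
            }
        }
        close $in;
        close $out or die "close: $!\n";

    Then point --urlfile at urlremove.urls.clean.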
LICENSE
    Copyright (c) 2010, Dylan Martin & Seattle Central Community College
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the Seattle Central Community College nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written
      permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Source: README, updated 2011-07-08
