TITLE
    urlremove - bulk-request that Google remove URLs from its search index

SYNOPSIS
    urlremove [-v+] [-e email] [-p password_file] [-c config_file]

    An email address and password must be set, either on the command line
    or in a config file.

DESCRIPTION
    If you discover that a user put their social security number on a web
    page, you can take the page down, but you also want to request that
    Google remove that page from its index and cache. The tool Google
    provides is a form that lets you enter one URL at a time, which is
    fine unless you discover that the user has page after page of social
    security numbers. This script submits those removal requests in bulk.

STATUS
    This is beta code. I've used it myself in production and you should be
    able to use it too. It hasn't been tested or used by anyone but me, so
    while it has every chance of success, don't be too surprised if it
    fails. I'd love any feedback.

ARGUMENTS
    --verbose, -v
        More output the more times you list it; you can pass '-v', '-vv',
        or '-vvv'.

    --email, -e gmail_address
        The gmail account to use for the webmaster tools. You needn't be the
        registered webmaster for the sites you're submitting.

    --pwfile, -p password_file
        A file containing your gmail password. Alternatively, you can
        specify the password in a config file if you use one. You cannot
        enter your password on the command line because that is insecure.
        If no config file with a password in it is found and no --pwfile
        is specified, this script looks for '.pass' in the current working
        directory.

    --cfgfile, -c config_file
        A configuration file to use. If not specified, this script looks for
        'urlremove.cfg' in the current working directory.

    --url, -u url_to_remove
        A URL to add to the list of URLs to remove. You can specify this
        option multiple times. If you also have a urlfile, these will all
        be added together. If you specify the same URL more than once,
        this script will report the URL as already added each time it
        comes up again.

    --urlfile file_containing_urls
        A file containing one URL per line. This script does not verify
        that URLs are correctly formatted and it does not add url-encoding
        (turning a space into %20, etc). If a badly formatted URL is
        encountered, WWW::Mechanize will crap out. You are responsible for
        making the list correct; a pre-cleaning sketch appears after this
        argument list.

        The default URL file name is 'urlremove.urls'.

    --testing
        Exit after gathering up all the configuration variables and the URL
        list. If you run with -vv, you get to see what that all looks like.
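
    If you want to sanity-check and url-encode your list before handing it
    to this script, a rough helper along these lines will do. This is not
    part of urlremove; it only assumes the standard URI module, and
    'clean_urls.pl' is just a made-up name:

      #!/usr/bin/perl
      # clean_urls.pl - normalize a raw URL list: keep only absolute
      # http/https URLs and let URI apply the escaping (space -> %20, etc)
      use strict;
      use warnings;
      use URI;

      while ( my $line = <> ) {
          chomp $line;
          next unless length $line;
          my $uri = URI->new($line);
          unless ( defined $uri->scheme && $uri->scheme =~ /^https?$/ ) {
              warn "skipping non-http(s) or malformed url: $line\n";
              next;
          }
          print $uri->canonical, "\n";
      }

    Run it as 'perl clean_urls.pl raw_urls.txt > urlremove.urls' and point
    urlremove at the result.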

  DISABLED ARGUMENTS
    This script will only tell Google that you've blocked/removed the
    offending page. The previous version let you choose the options
    supplied by Google. Google changed their page and I'm too lazy to add
    the functionality back in for the new page.

    If you need this, let me know via the email at the end of this file.

    Here's what you're missing...

    --type, -t (link|result|safesearch)
        DISABLED

        Google thinks of URL removal requests as falling into three
        categories and you need to pick one. "result" means the summary that
        shows up with a Google result contains info you want to disappear.
        "link" means the link Google has for your content is bad.
        "safesearch" means you found porn while doing a Google search with
        their safesearch enabled. The default is "result".

    --method, -m (modified|removed)
        Google requires you to change the actual web page before they
        re-spider it. They want to know if you modified the offending page
        or completely removed it. The default is "removed".

CONFIG FILE
    The config file is parsed by Config::Tiny. You can set any of the
    "ARGUMENTS" by putting the long form of the argument in the config file.
    This does not include the 'cfgfile' or 'testing' arguments.

    For example:

      email = pileofrogs
      password = secret

    would get you going.
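
    For reference, Config::Tiny keeps keys that appear before any
    [section] header in its '_' (root) section. A stand-alone sketch for
    checking what Config::Tiny sees in your config (not part of
    urlremove; 'check_cfg.pl' is a made-up name):

      #!/usr/bin/perl
      # check_cfg.pl - print what Config::Tiny reads from urlremove.cfg
      use strict;
      use warnings;
      use Config::Tiny;

      my $cfg = Config::Tiny->read('urlremove.cfg')
          or die "can't read urlremove.cfg: " . Config::Tiny->errstr . "\n";

      # keys with no [section] header live in the '_' section
      my $email    = $cfg->{_}{email};
      my $password = $cfg->{_}{password};

      die "no email set\n"    unless defined $email;
      die "no password set\n" unless defined $password;
      print "email is '$email' and a password is present\n";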

EXAMPLE
    Suppose you're the webmaster at www.example.com and you discover that
    HR thought it would be helpful to publish everyone's home phone number
    on a separate page about each person. For example,
    http://www.example.com/employees/john_smith contains John's home phone
    number. This violates your company's employee privacy policy, so you
    whip up a script that removes the phone numbers from all 100 of those
    pages. Then you discover that when you Google anyone at your company,
    the Google results page still includes their home phone number. Ack!

    Just put the URLs of those bad pages into a file called urlremove.urls
    like so:

      http://www.example.com/employees/john_smith
      http://www.example.com/employees/jack_aubrey
      http://www.example.com/employees/wilber_arable
      ...

    NOTE - you have to remove the original page if you're using this version
    of this script.

    Then create a config file called urlremove.cfg with your gmail account
    name and gmail password. Because these pages are still around, just
    changed, you would want to specify method = modified, but you can't
    with this version. If you had deleted or blocked those pages you
    wouldn't need to specify anything, because "removed" is the default
    (and your only choice with this version). That gives you a config like
    so:

      email = pileofrogs@gmail.com
      password = secret

    Make sure both of these files are in your current working directory and
    run

      urlremove -v
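
    If the list of employee pages is long, you can generate urlremove.urls
    instead of typing it. A throwaway sketch (the 'usernames.txt' input
    file and the 'make_urls.pl' name are made up for this example):

      #!/usr/bin/perl
      # make_urls.pl - build urlremove.urls from usernames.txt,
      # which holds one username per line
      use strict;
      use warnings;

      open my $in,  '<', 'usernames.txt'  or die "usernames.txt: $!\n";
      open my $out, '>', 'urlremove.urls' or die "urlremove.urls: $!\n";

      while ( my $user = <$in> ) {
          chomp $user;
          next unless length $user;
          print {$out} "http://www.example.com/employees/$user\n";
      }

      close $out or die "urlremove.urls: $!\n";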

DEPENDENCIES
    *   WWW::Mechanize

    *   Config::Tiny

    *   Anything that satisfies LWP::Protocol::implementor('https'), e.g.
        Crypt::SSLeay or IO::Socket::SSL
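
    To check whether your Perl already has working https support, a quick
    test along these lines (just a sketch, not part of the distribution)
    should print the name of the implementing class, or tell you there
    isn't one:

      perl -MLWP::Protocol -e 'print LWP::Protocol::implementor("https") || "no https support", "\n"'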

AUTHOR
    This was written by Dylan Martin, a unix admin at Seattle Central
    Community College. You can email me at dmartin at cpan dot org.

REPORTING BUGS
    Please report any bugs to
    <https://sourceforge.net/projects/urlremove/support>.

BUGS, LIMITATIONS AND OTHER SUCKAGE
    *   I don't handle badly formatted URLs in the urlfile in a reasonable
        way. The script should emit a useful warning and continue
        processing; instead, it says something unhelpful and croaks.

    *   I don't prompt for a password and/or email if they're not in a
        config. This is something any reasonable person might expect of
        this script, but I don't need it so I haven't written it. If
        anyone else uses this script and wants this functionality, let me
        know and I'll probably add it (a rough sketch of a password prompt
        appears after this list).

    *   This script scrapes & submits a Google form instead of using some
        cool REST API that may or may not exist. If Google changes the form,
        this script will not work and it will probably eat your dog.

        aaand that just happened. It works, but with reduced functionality.

    *   The option to tell Google whether the page has just changed and
        needs re-spidering, or has been removed entirely, is gone. The
        only way to use this at this time is to remove the offending page.

        Let me know if that makes this script useless to you and I'll see
        what I can do. It shouldn't be too hard to recreate that
        functionality.

    *   I don't bother to check if the same URL has been specified twice. It
        won't hurt anything if you do specify the same URL twice, as the
        script will detect that you're attempting to add an already-added
        URL and continue with the next URL, but it clutters the screen a
        bit.

    *   There hasn't been much testing. I've used this on a big batch of
        URLs and released it to the world in case anyone else has the same
        problem I had.

    *   If Google thinks you're a bot, it will ask you to prove you're a
        human. Don't submit the wrong password! I don't know what else
        Google thinks of as bot-like behavior. If they start demanding a
        captcha, you probably need to try from another machine. I probably
        could write some code to download the captcha image, so if this is a
        problem for you, let me know and I might be able to implement a
        solution.

    *   The logic that determines whether a URL has already been submitted
        will freak out if you have more than 1000 submitted URLs. This is
        because Google separates them into pages. I figured out how to
        tell Google to list 1000 entries per page, so that shouldn't be a
        problem unless you're submitting a huge number of URLs. If you do
        exceed this number, never fear, Google handles it gracefully. It
        just means this script will take longer to run.
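
    For what it's worth, the password prompt mentioned above would only
    take a few lines. A sketch, assuming Term::ReadKey (which urlremove
    does not currently require):

      use Term::ReadKey;

      print "gmail password: ";
      ReadMode('noecho');                  # turn off echo while typing
      chomp( my $password = ReadLine(0) );
      ReadMode('restore');                 # put the terminal back
      print "\n";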

LICENSE
    Copyright (c) 2010, Dylan Martin & Seattle Central Community College.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions are
    met:

    *   Redistributions of source code must retain the above copyright
        notice, this list of conditions and the following disclaimer.

    *   Redistributions in binary form must reproduce the above copyright
        notice, this list of conditions and the following disclaimer in the
        documentation and/or other materials provided with the distribution.

    *   Neither the name of the Seattle Central Community College nor the
        names of its contributors may be used to endorse or promote products
        derived from this software without specific prior written
        permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
    IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
    TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
    PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
    TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
    PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
    LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
    NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
    SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
