
Chinese names from Google's perspective

Kerry Choy
2009-05-30
2013-05-30
  • Kerry Choy

    Kerry Choy - 2009-05-30

    Thanks to the help of the forum, my Chinese names are coming out just dandy. Or so I thought.

    After entering the SURN and GIVN, I override the NAME field to get the family name coming out in the correct order. In the GED, it looks like this:

    1 NAME /蔡/洱熙
    2 GIVN 洱熙
    2 SURN 蔡

    Display in the various lists and pages is great. Except I didn't seem to be able to find people via Google. Searching for "蔡洱熙" was not getting the hit.

    I think what's happening is that the embedded slashes that identify the surname component of NAME are what Google sees. Hence "/蔡/洱熙" gets a hit. Not really what I want! Is there a configuration item I missed or is this another RFE? 
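
For readers unfamiliar with the convention: in a GEDCOM NAME value the slashes only mark where the surname starts and ends, and software is expected to drop them for display. A minimal sketch of that idea follows. The helper names `display_name` and `surname` are hypothetical and are not PGV functions; PGV's own formatting logic may differ.

```python
def display_name(gedcom_name: str) -> str:
    """Remove the slashes that delimit the surname in a GEDCOM NAME value."""
    return gedcom_name.replace("/", "").strip()


def surname(gedcom_name: str) -> str:
    """Return the text between the first pair of slashes, or '' if there is none."""
    parts = gedcom_name.split("/")
    return parts[1].strip() if len(parts) >= 3 else ""


name = "/蔡/洱熙"            # the NAME value from the GEDCOM record above
print(display_name(name))    # 蔡洱熙
print(surname(name))         # 蔡
```

If a search engine only matches the slashed form, something other than the displayed name is probably reaching it.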

     
    • Greg Roach

      Greg Roach - 2009-05-30

      PGV doesn't show the slashes.  If the name looks OK on screen, then this is what google sees.  You can confirm this by looking at the page source.

      Maybe it is a problem with Google?  Does a Chinese search engine (baidu???) work any better?

       
    • Kerry Choy

      Kerry Choy - 2009-05-30

      Good call but Baidu can't see him with or without the slash.

      The page source looks OK. By the same token, the page source doesn't have "蔡/文粹" in it at all. But the forward slash is definitely getting into Google somehow.

      Does PGV show a 'different' page to spiders? Perhaps recognising that the client is a 'bot? It would make sense to show a "text only" version. What's also confusing is the pages in Google's cache. The page is identified as "http://kerrychoy.id.au/phpGedView/index.php?view=preview&". This doesn't seem to be a valid URL, while the cache appears to be of the default home page with some extra rubbish - nothing actually related to the INDV in question.

       
      • Greg Roach

        Greg Roach - 2009-05-30

        Kerry,

        Assuming I have copied/pasted OK, putting this into google finds your page.

        鄺三七 site:kerrychoy.id.au

        It highlights both

        鄺三七 and 鄺, 三七

        So, I guess it is all working.

        But you don't seem to have your indi pages indexed.  There was a bug a while back that caused google not to follow certain links. Check noindex,nofollow in your settings. (Sorry - just about to leave the house, so you'll have to track these down yourself....)
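
One way to check the noindex,nofollow setting from the outside is to look for a robots meta tag in the page source. The sketch below uses only the Python standard library and assumes the directives are emitted as `<meta name="robots" content="...">`; the sample HTML is made up for illustration and is not copied from a PGV page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex,nofollow".
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html: str) -> list:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page source, for illustration only.
sample = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
print(robots_directives(sample))  # ['noindex', 'nofollow']
```

An empty list means no robots meta tag was found, so crawlers fall back to their default of indexing the page and following its links.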

         
    • Kerry Choy

      Kerry Choy - 2009-05-30

      Thanks Greg on the noindex,nofollow. I've cleared that setting (it's in config for the GEDCOM).

      However, the guy that you searched for seems to appear in the google index because he's on the home page. I have the patriarchs of the families listed as Favorites. The index does not see the INDV's who aren't on the home page. The Google site index does have their entries but I'll take a guess that the robot will need to visit again before it starts finding them. <sigh> come back in 2 weeks.

       
      • Greg Roach

        Greg Roach - 2009-05-30

        <<The index does not see the INDV's who aren't on the home page>>

        No, but the names are all printed using the same logic/formatting/etc.  If google can find this one, then it will be able to find the others (when it reindexes your site).

         
    • Kerry Choy

      Kerry Choy - 2009-06-28

      I've given it a bit of time for the googlebot to visit again to see if the status has changed. Apparently not. To restate the problem and add some more findings:

      Chinese names would typically be entered with family name first, then given names. There would be no embedded spacing or punctuation. Would kind of read like this:

      "ObamaBarack".

      Excluding the individuals who appear on my home page, Google is not indexing the names as they are apparently appearing on the Individual's page. For example, the following name is not indexing:

      "蔡際昌"

      In this case, "蔡" is the surname; the rest is his given or personal name.

      However, Google is picking up and indexing the following constructs:

      "蔡/際昌" - brings back the Individual's page and the list of surnames page. Some results have an extra leading slash.
      "蔡 際昌" -

      I've looked at the generated source for the pages and it looks right. E.g., on the Individual page I can see this:
      <span class="name_head">蔡敏高</span>. It still seems more than coincidental that the googlable string is the value that I can see in the NAME tag.

      Note that this is not a nitpick. I really am trying to make the names visible to native Chinese users. The strings that are searchable are very unlikely to be the combinations that Chinese users would use to find an ancestor.   
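
The check described above (does the page text contain the plain name, or one of the slashed/spaced variants?) can be automated against saved page source. This is a rough sketch with hypothetical helper names; it collects all text nodes, so it would also see text inside script or style elements if the page has any.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect every text node in an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def which_forms_appear(html: str, surname: str, given: str) -> dict:
    """Report which spellings of a name occur in the page text."""
    text = visible_text(html)
    forms = {
        "plain": surname + given,        # what a Chinese reader would search for
        "slash": surname + "/" + given,  # the form Google seems to have indexed
        "space": surname + " " + given,
    }
    return {label: form in text for label, form in forms.items()}


page = '<span class="name_head">蔡敏高</span>'  # the fragment quoted above
print(which_forms_appear(page, "蔡", "敏高"))
# {'plain': True, 'slash': False, 'space': False}
```

Running the same check on what Google's cache shows, rather than what the browser receives, would tell whether the two really differ.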

       
      • Lester Caine

        Lester Caine - 2009-06-28

        The thing I think you need to look at here is whether there is actually anything WE can do about this?
        Have you found Google - or any of the other search engines - actually indexing these types of string properly?
        Since what is displayed on the page *IS* correct, it is the search engine's processing that is faulty, and to be honest it's not the first time I've seen these sorts of problems even with 'ascii' character strings.
        If you are allowing bots to index your site, then have a look at some of the others - MSN, Yahoo, etc and see if they do any better ...
        English is my only spoken language, but I can appreciate the problems with some of the processing of UNICODE which is an area other 'english' sites still have to learn to handle correctly.

         
    • Greg Roach

      Greg Roach - 2009-06-28

      I tried searching google for

      "For technical support or genealogy questions, please contact Kerry Choy"

      which should show all the pages on your site that it indexes.  There are only 25.

      Picking one at random, and asking to see the cache

      http://209.85.229.132/search?q=cache:tZABQ4ofPMgJ:kerrychoy.id.au/phpGedView/indilist.php%3Fged%3Dkerrysfamily.ged%26surname%3DCollins+%22For+technical+support+or+genealogy+questions,+please+contact+Kerry+Choy%22&cd=24&hl=en&ct=clnk

      It shows that

      a) the page was cached 30 May 2009.  Maybe it hasn't visited your site since then.

      b) there is a session cache error.  Is this a current problem, or have you already seen it?

       
    • Kerry Choy

      Kerry Choy - 2009-07-02

      Didn't know about the cache error. I didn't have a value there before so have added the suggested path.

      Aside from robots, does this error actually manifest itself to users? I can't say I've ever seen it directly in my browser.

       
    • Greg Roach

      Greg Roach - 2009-07-02

      <<Aside from robots, does this error actually manifest itself to users?>>

      The error appears to occur when a page ends its session.  For most pages, this happens at the end of the page, after all processing is complete.  For some (long running) pages, we end it explicitly, before processing the page.  This prevents one slow page from holding up others.

      Now, I'm guessing that this error is preventing google from seeing certain pages, which is why it is not spidering your whole site.

      You should be able to change your user-agent string (plugins exist for FF to do this), so that you can impersonate the googlebot, and will be able to see exactly what PGV is sending to google.
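
The same test can be done without a browser plugin by sending a request with a Googlebot-style User-Agent header. The sketch below uses Python's standard `urllib.request`. The user-agent string is the commonly published Googlebot one; whether PGV's bot detection keys on this exact string is an assumption, and `build_request` and `fetch` are hypothetical helper names.

```python
import urllib.request

# Commonly published Googlebot user-agent string (assumed to be what PGV looks for).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str = GOOGLEBOT_UA) -> urllib.request.Request:
    """Create a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str = GOOGLEBOT_UA, timeout: float = 30.0) -> str:
    """Download a page as the given user agent and return its decoded text."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Example usage (performs a real network request, so it is left commented out):
# html = fetch("http://kerrychoy.id.au/phpGedView/index.php")
# print(html[:500])
```

Comparing the output with a fetch that uses a normal browser user agent would show whether spiders are served a different page, or an error such as the session one seen in the cache.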

       
    • Kerry Choy

      Kerry Choy - 2009-07-04

      Looking better. Having made the correction to the path, I think we're on the improve. A few pages seem to be searched properly now. I suppose I'll just wait for the rest.

      The FF plugin is interesting. I'd try it on my site but I don't want to break it again. That is what I was thinking about. I'll save it for next time. thx.

       
