#67 21852 NF stories without discussions

closed
None
7
2004-07-12
2004-07-12
No

I found this when I peeked at NF's slashd.log and saw numerous
stories that couldn't have their cchp data written. This is because
_print_cchp chokes and fails to print the correct data when it finds
an SQL error, in this case, a stories.discussion that points to a
missing row in the discussions table. E.g.:

Mon Jul 12 14:41:09 2004 [freshenup.pl] article.pl
virtual_user=newsforge ssi=yes sid='02/03/15/0018214'
cchp='LQi6yi5pKk' section='software' bytes=6988
Mon Jul 12 14:41:09 2004 [freshenup.pl] Commentcount/hitparade
data was not retrieved, reason unknown (cchp: '' for param '
cchp='LQi6yi5pKk'' file '/usr/local/slash/site/newsforge.com/logs/
cchp.LQi6yi5pKk' exists '1' len '0')

Partly I want this fixed because it's bad to have broken data in the
DB; mostly I want it fixed because freshenup.pl will chew many
more cycles than necessary, 24/7, until the broken data is fixed.

Turns out we have a lot of them. My first guess is that the script to
copy this data over failed, which is probably something I should
fix. Or, it may be some kind of problem with our code that was
quasi-fixed in March 2002 because there were a great many before
then, and there seem to be many fewer after March 16, 2002.

The stories and discussions tables, before my script massaged all
their data into place, are still in the newsforge DB as stories.old
and discussions.old. Can you look into this and determine whether
it's a new bug, a conversion-script bug, or an old bug?

mysql> select min(time), max(time), count(*) from stories left
join discussions on stories.discussion=discussions.id where
discussions.id is null;
+---------------------+---------------------+----------+
| min(time) | max(time) | count(*) |
+---------------------+---------------------+----------+
| 0000-00-00 00:00:00 | 2004-05-23 19:30:00 | 21852 |
+---------------------+---------------------+----------+
1 row in set (0.39 sec)

mysql> select count(*) from stories where discussion is not null
and discussion > 0; select count(*) from discussions;
+----------+
| count(*) |
+----------+
| 37410 |
+----------+
1 row in set (0.07 sec)

+----------+
| count(*) |
+----------+
| 15605 |
+----------+
1 row in set (0.02 sec)

mysql> select substring(stories.sid, 1, 3) AS sid_year, min(time),
max(time), count(*) from stories left join discussions on
stories.discussion=discussions.id where discussions.id is null group
by sid_year;
+----------+---------------------+---------------------+----------+
| sid_year | min(time) | max(time) | count(*) |
+----------+---------------------+---------------------+----------+
| 00/ | 2000-07-20 14:53:10 | 2001-01-02 18:00:21 | 6358 |
| 01/ | 2000-06-25 00:44:01 | 2002-01-04 09:20:07 | 13826 |
| 02/ | 2001-12-31 23:59:51 | 2002-03-16 14:49:27 | 1653 |
| 03/ | 0000-00-00 00:00:00 | 2003-12-05 10:00:00 | 8 |
| 04/ | 2003-02-10 15:30:00 | 2004-05-23 19:30:00 | 7 |
+----------+---------------------+---------------------+----------+
5 rows in set (0.63 sec)

Discussion

  • Chris Nandor

    Chris Nandor - 2004-07-12

    Logged In: YES
    user_id=3660

    SELECT MIN(time), MAX(time), COUNT(*) FROM stories LEFT JOIN
    discussions ON stories.discussion=discussions.id WHERE discussions.id IS
    NULL AND discussion IS NOT NULL;

    mysql> SELECT MIN(time), MAX(time), COUNT(*) FROM stories LEFT
    JOIN discussions ON stories.discussion=discussions.id WHERE discussions.id
    IS NULL;
    +---------------------+---------------------+----------+
    | MIN(time) | MAX(time) | COUNT(*) |
    +---------------------+---------------------+----------+
    | 0000-00-00 00:00:00 | 2004-05-23 19:30:00 | 21851 |
    +---------------------+---------------------+----------+
    1 row in set (0.39 sec)

    mysql> SELECT MIN(time), MAX(time), COUNT(*) FROM stories LEFT
    JOIN discussions ON stories.discussion=discussions.id WHERE discussions.id
    IS NULL AND discussion IS NOT NULL;
    +---------------------+---------------------+----------+
    | MIN(time) | MAX(time) | COUNT(*) |
    +---------------------+---------------------+----------+
    | 2000-06-25 00:44:01 | 2002-03-16 14:49:27 | 21837 |
    +---------------------+---------------------+----------+
    1 row in set (0.71 sec)

    mysql> SELECT stories.stoid, stories.sid, time, discussion FROM stories
    LEFT JOIN discussions ON stories.discussion=discussions.id WHERE
    discussions.id IS NULL ORDER BY sid DESC LIMIT 15;
    +-------+------------------+---------------------+------------+
    | stoid | sid | time | discussion |
    +-------+------------------+---------------------+------------+
    | 36239 | 04/05/23/1154229 | 2004-05-23 19:30:00 | NULL |
    | 35997 | 04/05/12/1915249 | 2004-05-12 22:38:00 | NULL |
    | 28483 | 04/03/27/0131259 | 2003-03-27 08:31:00 | NULL |
    | 28076 | 04/03/03/0121208 | 2003-03-03 01:20:00 | NULL |
    | 27741 | 04/02/10/2043205 | 2003-02-10 20:41:00 | NULL |
    | 34739 | 04/02/02/1557221 | 2004-03-09 11:00:00 | NULL |
    | 33110 | 03/12/04/1448208 | 2003-12-05 10:00:00 | NULL |
    | 32986 | 03/11/24/211230 | 2003-11-30 08:00:00 | NULL |
    | 32740 | 03/11/17/0352253 | 2003-11-17 04:00:00 | NULL |
    | 32405 | 03/10/30/0536234 | 2003-10-30 05:31:00 | NULL |
    | 31646 | 03/09/26/0653258 | 2003-09-26 06:52:00 | NULL |
    | 31474 | 03/09/17/2040254 | 2003-09-17 20:38:00 | NULL |
    | 7 | 03/02/16/1459211 | 0000-00-00 00:00:00 | NULL |
    | 6 | 03/02/16/1433240 | 0000-00-00 00:00:00 | NULL |
    | 21871 | 02/03/16/1520206 | 2002-03-16 14:49:27 | 22066 |
    +-------+------------------+---------------------+------------+
    15 rows in set (0.38 sec)

    All I know for sure is that the problem of missing discussions table rows is
    limited to March 2002 and earlier. There are just a few broken stories
    over the last two years, the rest are all earlier. Also, these sids are all
    broken in discussions_old and stories_old too:

    mysql> SELECT stories_old.sid, time, discussion FROM stories_old LEFT
    JOIN discussions_old ON stories_old.discussion=discussions_old.id WHERE
    discussions_old.id IS NULL ORDER BY sid DESC LIMIT 15;
    +------------------+---------------------+------------+
    | sid | time | discussion |
    +------------------+---------------------+------------+
    | 04/05/23/1154229 | 2004-05-23 19:30:00 | NULL |
    | 04/05/12/1915249 | 2004-05-12 22:38:00 | NULL |
    | 04/03/27/0131259 | 2003-03-27 08:31:00 | NULL |
    | 04/03/03/0121208 | 2003-03-03 01:20:00 | NULL |
    | 04/02/10/2043205 | 2003-02-10 20:41:00 | NULL |
    | 04/02/02/1557221 | 2004-03-09 11:00:00 | NULL |
    | 03/12/04/1448208 | 2003-12-05 10:00:00 | NULL |
    | 03/11/24/211230 | 2003-11-30 08:00:00 | NULL |
    | 03/11/17/0352253 | 2003-11-17 04:00:00 | NULL |
    | 03/10/30/0536234 | 2003-10-30 05:31:00 | NULL |
    | 03/09/26/0653258 | 2003-09-26 06:52:00 | NULL |
    | 03/09/17/2040254 | 2003-09-17 20:38:00 | NULL |
    | 03/02/16/1459211 | 0000-00-00 00:00:00 | NULL |
    | 03/02/16/1433240 | 0000-00-00 00:00:00 | NULL |
    | 02/03/16/1520206 | 2002-03-16 14:49:27 | 22066 |
    +------------------+---------------------+------------+
    15 rows in set (4.17 sec)

    So I am saying this is not a current problem, or a conversion problem.
    The missing discussions in the last couple years could be intentional, or
    could be mistakes by editors, or could be temporary code glitches.
    Question remaining is what to do with existing stories, if anything.

     
  • Chris Nandor

    Chris Nandor - 2004-07-12
    • assigned_to: pudge --> jamiemccarthy
     
  • Jamie McCarthy

    Jamie McCarthy - 2004-07-12

    Logged In: YES
    user_id=3889

    mysql> select min(date), max(date), min(comments.sid),
    max(comments.sid), count(*) from comments left join discussions on
    comments.sid=discussions.id where discussions.id is null;
    +---------------------+---------------------+-------------------
    +-------------------+----------+
    | min(date) | max(date) | min(comments.sid) |
    max(comments.sid) | count(*) |
    +---------------------+---------------------+-------------------
    +-------------------+----------+
    | 2000-09-22 17:25:45 | 2003-10-15 18:38:13 | 1 |
    33596 | 9232 |
    +---------------------+---------------------+-------------------
    +-------------------+----------+
    1 row in set (1.58 sec)

    So 9232 comments are not attached to any discussions.

    Best solution may be to just delete those 9232 comments, and NULL any
    stories.discussion which does not point to an existing discussions.id.
    Sound OK to you? We'd still have backups in comments_old and
    stories_old for as long as we kept those around. And freshenup.pl would
    not hang on all those old stories, every time.

     
  • Jamie McCarthy

    Jamie McCarthy - 2004-07-12

    Logged In: YES
    user_id=3889

    Backups are on 170.20 in /root/newsforge-{stories,comments}-without-
    discussions.sql.

    mysql> SELECT MIN(time), MAX(time), COUNT(*) FROM stories LEFT
    JOIN discussions ON stories.discussion=discussions.id WHERE
    discussions.id IS NULL AND discussion IS NOT NULL;
    +-----------+-----------+----------+
    | MIN(time) | MAX(time) | COUNT(*) |
    +-----------+-----------+----------+
    | NULL | NULL | 0 |
    +-----------+-----------+----------+
    1 row in set (0.14 sec)

    And freshenup.pl is no longer complaining; we should be good now.

     
  • Jamie McCarthy

    Jamie McCarthy - 2004-07-12
    • status: open --> closed
     

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks