Hi, I wouldn't expect 7-Zip to compress the SQLite database of the German images so much: 1 GB out of 9.
Do you have an idea what it is? The indexes? Empty (preallocated) records?
Hmmm... Which file shows the 9-to-1 compression? Let me know, and I'll download it and take a look. I looked at https://archive.org/details/Xowa_dewiki_latest and the file sizes look correct. You can compare them to the 2015-04 files and see that they are similar.
In general, there should be very little compression with any of the image databases. SQLite is pretty good at allocating disk space.
It's probably unfilled space inside the pages. Databases usually store their data in fixed-size pages. I'm using 4096 bytes for the XOWA databases, so there will be leftover space (since the data will rarely fill all 4096 bytes). If you're interested, you can always run sqlite3_analyzer on the db. SQLite is pretty good about accounting for its internals. See below.
Page size in bytes.................... 4096
Pages in the whole file (measured).... 140295
Pages in the whole file (calculated).. 140295
Pages that store data................. 140295 100.0%
Pages on the freelist (per header).... 0 0.0%
Pages on the freelist (calculated).... 0 0.0%
Pages of auto-vacuum overhead......... 0 0.0%
Number of tables in the database...... 9
Number of indices..................... 5
Number of named indices............... 5
Automatically generated indices....... 0
Size of the file in bytes............. 574648320
Bytes of user payload stored.......... 249570714 43.4%
*** Page counts for all tables with their indices ********************
FSDB_FIL.............................. 54524 38.9%
ORIG_REG.............................. 51078 36.4%
FSDB_THM.............................. 34685 24.7%
FSDB_DIR.............................. 2 0.001%
XOWA_CFG.............................. 2 0.001%
FSDB_DBA.............................. 1 0.0%
FSDB_DBB.............................. 1 0.0%
FSDB_MNT.............................. 1 0.0%
SQLITE_MASTER......................... 1 0.0%
*** All tables and indices *******************************************
Percentage of total database.......... 100.0%
Number of entries..................... 13882212
Bytes of storage consumed............. 574648320
Bytes of payload...................... 473247993 82.4%
Average payload per entry............. 34.09
Average unused bytes per entry........ 2.46
Average fanout........................ 378.00
Fragmentation......................... 91.6%
Maximum payload per entry............. 355
Entries that use overflow............. 0 0.0%
Index pages used...................... 192
Primary pages used.................... 140103
Overflow pages used................... 0
Total pages used...................... 140295
Unused bytes on index pages........... 105176 13.4%
Unused bytes on primary pages......... 34015645 5.9%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 34120821 5.9%
*** All tables *******************************************************
Percentage of total database.......... 51.8%
Number of entries..................... 6941128
Bytes of storage consumed............. 297697280
Bytes of payload...................... 249572931 83.8%
Average payload per entry............. 35.96
Average unused bytes per entry........ 0.36
Average fanout........................ 378.00
Fragmentation......................... 83.9%
Maximum payload per entry............. 355
Entries that use overflow............. 0 0.0%
Index pages used...................... 192
Primary pages used.................... 72488
Overflow pages used................... 0
Total pages used...................... 72680
Unused bytes on index pages........... 105176 13.4%
Unused bytes on primary pages......... 2385061 0.80%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 2490237 0.84%
*** All indices ******************************************************
Percentage of total database.......... 48.2%
Number of entries..................... 6941084
Bytes of storage consumed............. 276951040
Bytes of payload...................... 223675062 80.8%
Average payload per entry............. 32.22
Average unused bytes per entry........ 4.56
Fragmentation......................... 99.922%
Maximum payload per entry............. 267
Entries that use overflow............. 0 0.0%
Primary pages used.................... 67615
Overflow pages used................... 0
Total pages used...................... 67615
Unused bytes on primary pages......... 31630584 11.4%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 31630584 11.4%
*** Table FSDB_DBA ***************************************************
Percentage of total database.......... 0.0%
Number of entries..................... 1
Bytes of storage consumed............. 4096
Bytes of payload...................... 34 0.83%
Average payload per entry............. 34.00
Average unused bytes per entry........ 4050.00
Maximum payload per entry............. 34
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 4050 98.9%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 4050 98.9%
*** Table FSDB_DBB ***************************************************
Percentage of total database.......... 0.0%
Number of entries..................... 28
Bytes of storage consumed............. 4096
Bytes of payload...................... 1204 29.4%
Average payload per entry............. 43.00
Average unused bytes per entry........ 99.00
Maximum payload per entry............. 43
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 2772 67.7%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 2772 67.7%
*** Table FSDB_DIR and all its indices *******************************
Percentage of total database.......... 0.001%
Number of entries..................... 4
Bytes of storage consumed............. 8192
Bytes of payload...................... 96 1.2%
Average payload per entry............. 24.00
Average unused bytes per entry........ 2016.25
Fragmentation......................... 0.0%
Maximum payload per entry............. 26
Entries that use overflow............. 0 0.0%
Primary pages used.................... 2
Overflow pages used................... 0
Total pages used...................... 2
Unused bytes on primary pages......... 8065 98.4%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 8065 98.4%
*** Table FSDB_DIR w/o any indices ***********************************
Percentage of total database.......... 0.0%
Number of entries..................... 2
Bytes of storage consumed............. 4096
Bytes of payload...................... 45 1.1%
Average payload per entry............. 22.50
Average unused bytes per entry........ 2017.00
Maximum payload per entry............. 25
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 4034 98.5%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 4034 98.5%
*** Indices of table FSDB_DIR ****************************************
Percentage of total database.......... 0.0%
Number of entries..................... 2
Bytes of storage consumed............. 4096
Bytes of payload...................... 51 1.2%
Average payload per entry............. 25.50
Average unused bytes per entry........ 2015.50
Maximum payload per entry............. 26
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 4031 98.4%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 4031 98.4%
*** Table FSDB_FIL and all its indices *******************************
Percentage of total database.......... 38.9%
Number of entries..................... 4306388
Bytes of storage consumed............. 223330304
Bytes of payload...................... 187125089 83.8%
Average payload per entry............. 43.45
Average unused bytes per entry........ 3.45
Average fanout........................ 373.00
Fragmentation......................... 97.4%
Maximum payload per entry............. 269
Entries that use overflow............. 0 0.0%
Index pages used...................... 73
Primary pages used.................... 54451
Overflow pages used................... 0
Total pages used...................... 54524
Unused bytes on index pages........... 38161 12.8%
Unused bytes on primary pages......... 14802685 6.6%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 14840846 6.6%
*** Table FSDB_FIL w/o any indices ***********************************
Percentage of total database.......... 19.4%
Number of entries..................... 2153194
Bytes of storage consumed............. 111669248
Bytes of payload...................... 95690902 85.7%
Average payload per entry............. 44.44
Average unused bytes per entry........ 0.65
Average fanout........................ 373.00
Fragmentation......................... 94.9%
Maximum payload per entry............. 269
Entries that use overflow............. 0 0.0%
Index pages used...................... 73
Primary pages used.................... 27190
Overflow pages used................... 0
Total pages used...................... 27263
Unused bytes on index pages........... 38161 12.8%
Unused bytes on primary pages......... 1368359 1.2%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 1406520 1.3%
*** Indices of table FSDB_FIL ****************************************
Percentage of total database.......... 19.4%
Number of entries..................... 2153194
Bytes of storage consumed............. 111661056
Bytes of payload...................... 91434187 81.9%
Average payload per entry............. 42.46
Average unused bytes per entry........ 6.24
Fragmentation......................... 99.916%
Maximum payload per entry............. 267
Entries that use overflow............. 0 0.0%
Primary pages used.................... 27261
Overflow pages used................... 0
Total pages used...................... 27261
Unused bytes on primary pages......... 13434326 12.0%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 13434326 12.0%
*** Table FSDB_MNT ***************************************************
Percentage of total database.......... 0.0%
Number of entries..................... 2
Bytes of storage consumed............. 4096
Bytes of payload...................... 44 1.1%
Average payload per entry............. 22.00
Average unused bytes per entry........ 2018.00
Maximum payload per entry............. 22
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 4036 98.5%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 4036 98.5%
*** Table FSDB_THM and all its indices *******************************
Percentage of total database.......... 24.7%
Number of entries..................... 5254520
Bytes of storage consumed............. 142069760
Bytes of payload...................... 109849693 77.3%
Average payload per entry............. 20.91
Average unused bytes per entry........ 1.25
Average fanout........................ 364.00
Fragmentation......................... 99.66%
Maximum payload per entry............. 31
Entries that use overflow............. 0 0.0%
Index pages used...................... 51
Primary pages used.................... 34634
Overflow pages used................... 0
Total pages used...................... 34685
Unused bytes on index pages........... 30602 14.6%
Unused bytes on primary pages......... 6525692 4.6%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 6556294 4.6%
*** Table FSDB_THM w/o any indices ***********************************
Percentage of total database.......... 13.3%
Number of entries..................... 2627260
Bytes of storage consumed............. 76193792
Bytes of payload...................... 58314826 76.5%
Average payload per entry............. 22.20
Average unused bytes per entry........ 0.11
Average fanout........................ 364.00
Fragmentation......................... 99.38%
Maximum payload per entry............. 31
Entries that use overflow............. 0 0.0%
Index pages used...................... 51
Primary pages used.................... 18551
Overflow pages used................... 0
Total pages used...................... 18602
Unused bytes on index pages........... 30602 14.6%
Unused bytes on primary pages......... 259363 0.34%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 289965 0.38%
*** Indices of table FSDB_THM ****************************************
Percentage of total database.......... 11.5%
Number of entries..................... 2627260
Bytes of storage consumed............. 65875968
Bytes of payload...................... 51534867 78.2%
Average payload per entry............. 19.62
Average unused bytes per entry........ 2.39
Fragmentation......................... 99.994%
Maximum payload per entry............. 27
Entries that use overflow............. 0 0.0%
Primary pages used.................... 16083
Overflow pages used................... 0
Total pages used...................... 16083
Unused bytes on primary pages......... 6266329 9.5%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 6266329 9.5%
*** Table ORIG_REG and all its indices *******************************
Percentage of total database.......... 36.4%
Number of entries..................... 4321210
Bytes of storage consumed............. 209215488
Bytes of payload...................... 176267923 84.3%
Average payload per entry............. 40.79
Average unused bytes per entry........ 2.94
Average fanout........................ 394.00
Fragmentation......................... 80.0%
Maximum payload per entry............. 336
Entries that use overflow............. 0 0.0%
Index pages used...................... 68
Primary pages used.................... 51010
Overflow pages used................... 0
Total pages used...................... 51078
Unused bytes on index pages........... 36413 13.1%
Unused bytes on primary pages......... 12660311 6.1%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 12696724 6.1%
*** Table ORIG_REG w/o any indices ***********************************
Percentage of total database.......... 19.1%
Number of entries..................... 2160605
Bytes of storage consumed............. 109809664
Bytes of payload...................... 95562835 87.0%
Average payload per entry............. 44.23
Average unused bytes per entry........ 0.36
Average fanout........................ 394.00
Fragmentation......................... 61.9%
Maximum payload per entry............. 336
Entries that use overflow............. 0 0.0%
Index pages used...................... 68
Primary pages used.................... 26741
Overflow pages used................... 0
Total pages used...................... 26809
Unused bytes on index pages........... 36413 13.1%
Unused bytes on primary pages......... 737563 0.67%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 773976 0.70%
*** Indices of table ORIG_REG ****************************************
Percentage of total database.......... 17.3%
Number of entries..................... 2160605
Bytes of storage consumed............. 99405824
Bytes of payload...................... 80705088 81.2%
Average payload per entry............. 37.35
Average unused bytes per entry........ 5.52
Fragmentation......................... 99.90%
Maximum payload per entry............. 262
Entries that use overflow............. 0 0.0%
Primary pages used.................... 24269
Overflow pages used................... 0
Total pages used...................... 24269
Unused bytes on primary pages......... 11922748 12.0%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 11922748 12.0%
*** Table SQLITE_MASTER **********************************************
Percentage of total database.......... 0.0%
Number of entries..................... 13
Bytes of storage consumed............. 4096
Bytes of payload...................... 2217 54.1%
Average payload per entry............. 170.54
Average unused bytes per entry........ 131.69
Maximum payload per entry............. 355
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 1712 41.8%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 1712 41.8%
*** Table XOWA_CFG and all its indices *******************************
Percentage of total database.......... 0.001%
Number of entries..................... 46
Bytes of storage consumed............. 8192
Bytes of payload...................... 1693 20.7%
Average payload per entry............. 36.80
Average unused bytes per entry........ 137.43
Fragmentation......................... 0.0%
Maximum payload per entry............. 63
Entries that use overflow............. 0 0.0%
Primary pages used.................... 2
Overflow pages used................... 0
Total pages used...................... 2
Unused bytes on primary pages......... 6322 77.2%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 6322 77.2%
*** Table XOWA_CFG w/o any indices ***********************************
Percentage of total database.......... 0.0%
Number of entries..................... 23
Bytes of storage consumed............. 4096
Bytes of payload...................... 824 20.1%
Average payload per entry............. 35.83
Average unused bytes per entry........ 137.91
Maximum payload per entry............. 61
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 3172 77.4%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 3172 77.4%
*** Indices of table XOWA_CFG ****************************************
Percentage of total database.......... 0.0%
Number of entries..................... 23
Bytes of storage consumed............. 4096
Bytes of payload...................... 869 21.2%
Average payload per entry............. 37.78
Average unused bytes per entry........ 136.96
Maximum payload per entry............. 63
Entries that use overflow............. 0 0.0%
Primary pages used.................... 1
Overflow pages used................... 0
Total pages used...................... 1
Unused bytes on primary pages......... 3150 76.9%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 3150 76.9%
*** Definitions ******************************************************
Page size in bytes
The number of bytes in a single page of the database file.
Usually 1024.
Number of pages in the whole file
The number of 4096-byte pages that go into forming the complete
database
Pages that store data
The number of pages that store data, either as primary B*Tree pages or
as overflow pages. The number at the right is the data pages divided by
the total number of pages in the file.
Pages on the freelist
The number of pages that are not currently in use but are reserved for
future use. The percentage at the right is the number of freelist pages
divided by the total number of pages in the file.
Pages of auto-vacuum overhead
The number of pages that store data used by the database to facilitate
auto-vacuum. This is zero for databases that do not support auto-vacuum.
Number of tables in the database
The number of tables in the database, including the SQLITE_MASTER table
used to store schema information.
Number of indices
The total number of indices in the database.
Number of named indices
The number of indices created using an explicit CREATE INDEX statement.
Automatically generated indices
The number of indices used to implement PRIMARY KEY or UNIQUE constraints
on tables.
Size of the file in bytes
The total amount of disk space used by the entire database files.
Bytes of user payload stored
The total number of bytes of user payload stored in the database. The
schema information in the SQLITE_MASTER table is not counted when
computing this number. The percentage at the right shows the payload
divided by the total file size.
Percentage of total database
The amount of the complete database file that is devoted to storing
information described by this category.
Number of entries
The total number of B-Tree key/value pairs stored under this category.
Bytes of storage consumed
The total amount of disk space required to store all B-Tree entries
under this category. The is the total number of pages used times
the pages size.
Bytes of payload
The amount of payload stored under this category. Payload is the data
part of table entries and the key part of index entries. The percentage
at the right is the bytes of payload divided by the bytes of storage
consumed.
Average payload per entry
The average amount of payload on each entry. This is just the bytes of
payload divided by the number of entries.
Average unused bytes per entry
The average amount of free space remaining on all pages under this
category on a per-entry basis. This is the number of unused bytes on
all pages divided by the number of entries.
Fragmentation
The percentage of pages in the table or index that are not
consecutive in the disk file. Many filesystems are optimized
for sequential file access so smaller fragmentation numbers
sometimes result in faster queries, especially for larger
database files that do not fit in the disk cache.
Maximum payload per entry
The largest payload size of any entry.
Entries that use overflow
The number of entries that user one or more overflow pages.
Total pages used
This is the number of pages used to hold all information in the current
category. This is the sum of index, primary, and overflow pages.
Index pages used
This is the number of pages in a table B-tree that hold only key (rowid)
information and no data.
Primary pages used
This is the number of B-tree pages that hold both key and data.
Overflow pages used
The total number of overflow pages used for this category.
Unused bytes on index pages
The total number of bytes of unused space on all index pages. The
percentage at the right is the number of unused bytes divided by the
total number of bytes on index pages.
Unused bytes on primary pages
The total number of bytes of unused space on all primary pages. The
percentage at the right is the number of unused bytes divided by the
total number of bytes on primary pages.
Unused bytes on overflow pages
The total number of bytes of unused space on all overflow pages. The
percentage at the right is the number of unused bytes divided by the
total number of bytes on overflow pages.
Unused bytes on all pages
The total number of bytes of unused space on all primary and overflow
pages. The percentage at the right is the number of unused bytes
divided by the total number of bytes.
**********************************************************************
The entire text of this report can be sourced into any SQL database
engine for further analysis. All of the text above is an SQL comment.
The data used to generate this report follows:
*/
BEGIN;
CREATE TABLE space_used(
name clob, -- Name of a table or index in the database file
tblname clob, -- Name of associated table
is_index boolean, -- TRUE if it is an index, false for a table
nentry int, -- Number of entries in the BTree
leaf_entries int, -- Number of leaf entries
payload int, -- Total amount of data stored in this table or index
ovfl_payload int, -- Total amount of data stored on overflow pages
ovfl_cnt int, -- Number of entries that use overflow
mx_payload int, -- Maximum payload size
int_pages int, -- Number of interior pages used
leaf_pages int, -- Number of leaf pages used
ovfl_pages int, -- Number of overflow pages used
int_unused int, -- Number of unused bytes on interior pages
leaf_unused int, -- Number of unused bytes on primary pages
ovfl_unused int, -- Number of unused bytes on overflow pages
gap_cnt int, -- Number of gaps in the page layout
compressed_size int -- Total bytes stored on disk
);
INSERT INTO space_used VALUES('sqlite_master','sqlite_master',0,13,13,2217,0,0,355,0,1,0,0,1712,0,0,4096);
INSERT INTO space_used VALUES('xowa_cfg','xowa_cfg',0,23,23,824,0,0,61,0,1,0,0,3172,0,0,4096);
INSERT INTO space_used VALUES('xowa_cfg__main','xowa_cfg',1,23,23,869,0,0,63,0,1,0,0,3150,0,0,4096);
INSERT INTO space_used VALUES('orig_reg','orig_reg',0,2187345,2160605,95562835,0,0,336,68,26741,0,36413,737563,0,16600,109809664);
INSERT INTO space_used VALUES('orig_reg__main','orig_reg',1,2160605,2160605,80705088,0,0,262,0,24269,0,0,11922748,0,24243,99405824);
INSERT INTO space_used VALUES('fsdb_mnt','fsdb_mnt',0,2,2,44,0,0,22,0,1,0,0,4036,0,0,4096);
INSERT INTO space_used VALUES('fsdb_dba','fsdb_dba',0,1,1,34,0,0,34,0,1,0,0,4050,0,0,4096);
INSERT INTO space_used VALUES('fsdb_dbb','fsdb_dbb',0,28,28,1204,0,0,43,0,1,0,0,2772,0,0,4096);
INSERT INTO space_used VALUES('fsdb_dir','fsdb_dir',0,2,2,45,0,0,25,0,1,0,0,4034,0,0,4096);
INSERT INTO space_used VALUES('fsdb_dir__name','fsdb_dir',1,2,2,51,0,0,26,0,1,0,0,4031,0,0,4096);
INSERT INTO space_used VALUES('fsdb_fil','fsdb_fil',0,2180383,2153194,95690902,0,0,269,73,27190,0,38161,1368359,0,25873,111669248);
INSERT INTO space_used VALUES('fsdb_fil__owner','fsdb_fil',1,2153194,2153194,91434187,0,0,267,0,27261,0,0,13434326,0,27237,111661056);
INSERT INTO space_used VALUES('fsdb_thm','fsdb_thm',0,2645810,2627260,58314826,0,0,31,51,18551,0,30602,259363,0,18486,76193792);
INSERT INTO space_used VALUES('fsdb_thm__owner','fsdb_thm',1,2627260,2627260,51534867,0,0,27,0,16083,0,0,6266329,0,16081,65875968);
COMMIT;
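Since the report says its data can be loaded into any SQL engine, here is a minimal sketch of doing that from Python. It assumes only the standard sqlite3 module. The function name unused_by_table and the ROWS list are made up for illustration, and ROWS holds just the six largest rows copied from the INSERT statements above. It sums the unused bytes per table (table plus its index) and divides by the bytes stored on disk.

```python
import sqlite3

# Rows copied verbatim from the report's INSERT statements
# (only the three large tables and their indices are included here).
ROWS = [
    ("orig_reg", "orig_reg", 0, 2187345, 2160605, 95562835, 0, 0, 336, 68, 26741, 0, 36413, 737563, 0, 16600, 109809664),
    ("orig_reg__main", "orig_reg", 1, 2160605, 2160605, 80705088, 0, 0, 262, 0, 24269, 0, 0, 11922748, 0, 24243, 99405824),
    ("fsdb_fil", "fsdb_fil", 0, 2180383, 2153194, 95690902, 0, 0, 269, 73, 27190, 0, 38161, 1368359, 0, 25873, 111669248),
    ("fsdb_fil__owner", "fsdb_fil", 1, 2153194, 2153194, 91434187, 0, 0, 267, 0, 27261, 0, 0, 13434326, 0, 27237, 111661056),
    ("fsdb_thm", "fsdb_thm", 0, 2645810, 2627260, 58314826, 0, 0, 31, 51, 18551, 0, 30602, 259363, 0, 18486, 76193792),
    ("fsdb_thm__owner", "fsdb_thm", 1, 2627260, 2627260, 51534867, 0, 0, 27, 0, 16083, 0, 0, 6266329, 0, 16081, 65875968),
]


def unused_by_table(rows):
    """Return (table, unused_bytes, stored_bytes, unused_percent), largest first."""
    con = sqlite3.connect(":memory:")
    # Same column names as the report's space_used table; types omitted for brevity.
    con.execute(
        "CREATE TABLE space_used(name, tblname, is_index, nentry, leaf_entries,"
        " payload, ovfl_payload, ovfl_cnt, mx_payload, int_pages, leaf_pages,"
        " ovfl_pages, int_unused, leaf_unused, ovfl_unused, gap_cnt, compressed_size)"
    )
    placeholders = ",".join("?" * 17)
    con.executemany(f"INSERT INTO space_used VALUES ({placeholders})", rows)
    query = """
        SELECT tblname,
               SUM(int_unused + leaf_unused + ovfl_unused) AS unused,
               SUM(compressed_size) AS stored
        FROM space_used
        GROUP BY tblname
        ORDER BY stored DESC
    """
    result = [(t, u, s, round(100.0 * u / s, 1)) for t, u, s in con.execute(query)]
    con.close()
    return result


if __name__ == "__main__":
    for table, unused, stored, pct in unused_by_table(ROWS):
        print(f"{table:10s} unused={unused:>10d} stored={stored:>10d} ({pct}%)")
```

For these rows the totals should line up with the "Unused bytes on all pages" lines in the "and all its indices" sections of the report, so it is a quick way to double-check the numbers or to slice them differently.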
Sorry, I wasn't logged in.
Oh, my sentence may have been unclear. I am attaching a screenshot.
Cool. Thanks for the screenshot.