Wednesday, August 09, 2006

Restoration of RPM database.

This article may help you, if your RPM database has corrupted, and that’s what happened to mine. In fact, I tried to use `rpm --rebulddb' command on it in an attempt to fix it. Unfortunately, most of package’s meta-information had gone. However, all, or so I hoped, files remained on the filesystem. I have decided that this could be a my source of restoration of the RPM database.
Fortunately( to me :) ) other people have encountered this problem already, and some solutions exist. Here is the URL for this document: http://www.informatimago.com/linux/rpm-rebuilddb.html. This one actually worked for me as a starting point. It is based on comparison of actual files and rpm'ed files by names, deriving an rpm name where filenames match. It could have been a good solution, if there would be only one version of rpm packages. But in real life we have many. I have downloaded all packages from http://download.fedoralegacy.org/redhat/9/os/i386/ and http://download.fedoralegacy.org/redhat/9/updates/i386/. As a result, there were multiple versions of packages. Which one to apply? I have decided to use some more information about the file that will uniquely identify it. Sure it was MD5 check sum. The idea was to create two lists, speaking relation database language, tables, and then to join them.
Table DISK_CONTENT ( filename, bytesize, md5sum )
Table RPM_CONTENT ( rpmfullname, filename, bytesize, md5sum )

Then I planned to load the data into these tables for joining using SQL language. When joining, ideally, a package suggested to be previously installed if all of its files exist on the filesystem, and the sizes and MD5 checksums are identical. But, some files, like config files, can be modified during or after installation, and the size/md5 will not match. Thus I decided to assign a merit 0 (zero) to files that md5 match, merit 1 where just size matches, merit 3 where only name matches, and merit 4 where file exist in a package but does not exist on the filesystem.

So, I started from gathering the metadata of actual files

1. On a machine where RPM database has been severely damaged I gather information about all files, including absolute path, byte size and md5 checksum

find / -fstype ext3 -type f -printf '%s' -exec md5sum {} \; >/tmp/disk-contents

note: I have only ext3 filesystems, thus I specified -fstype ext3 to exclude /proc and network mounted filesystems. Ideally /tmp, /opt and /home should be excluded as well, but I didn't bother.

The format of a disk-contents file is "filenamesizemd5sumfilename", because md5sum just types this way. I removed the last filename using sed.

sed 's/ \//\\//' /tmp/disk-contents >/tmp/disk-contents-n

2. On a machine where I have all rpms downloaded I gathered meta-information about all files that sit inside rpms, including the rpm file name, absolute path, byte size and md5 checksum
Note: it can be the same machine, but due to space constraints I made it on a separate one. It can sound funny, but I've done it on a Windows machine using Cygwin's utilities, bash and rpm. (yes we can have rpm in windows!)

RPMDIR=g:/download/Linux/RH9/os cd $RPMDIR rm /tmp/rpm-contents
for f in *.rpm ; do rpm -q --nopgp --queryformat='[%{FILENAMES}%{FILESIZES}%{FILEMD5S}'$f'\n]' -p $f >> /tmp/rpm-contents done


I decided to introduce one more table that keeps some info about the whole RPM. Particularly, the fields NAME, PROVIDES, VERSION. That will allow me to group by NAME to group together packages of different versions like lvm-1.0.3-12.i386.rpm and lvm-1.0.3-12.1.legacy.i386.rpm. Why do this? The idea is to check all the files for packages. The package that is most closely presented by the filesystem files will win. (Later I found that the fields PROVIDES and VERSION are not necessery)

RPMDIR=g:/download/Linux/RH9/os cd $RPMDIR rm /tmp/rpm-info
for f in *.rpm ; do rpm -q --nopgp --queryformat=$f'%{NAME}%{VERSION}%{PROVIDES}%{PROVIDENAME}%{PROVIDEVERSION}\n' -p $f >> /tmp/rpm-info
done


3. Load to RDBMS. I decided to go with MySQL(5.0), although I prefer Oracle, I thought that MySQL example can be more helpful to linux people, because of its simplicity and availability.

drop table DISK_CONTENT;

create table DISK_CONTENT (bytes decimal, md5sum varchar(64), filename varchar(2000) );

load data infile 'E:/cygwin/tmp/disk-contents-n' into table DISK_CONTENT fields terminated by '';

create index DISK_CONTENT_IX01 on DISK_CONTENT( filename );

drop table RPM_CONTENT;

create table RPM_CONTENT (filename varchar(2000), bytes decimal, md5sum varchar(64), rpmfilename varchar(128));

load data infile 'E:/cygwin/tmp/rpm-contents' into table RPM_CONTENT fields terminated by '';

alter table RPM_CONTENT drop index RPM_CONTENT_IX01;

alter table RPM_CONTENT add index RPM_CONTENT_IX01 ( filename ); alter table RPM_CONTENT drop index RPM_CONTENT_IX02;

alter table RPM_CONTENT add index RPM_CONTENT_IX02 ( rpmfilename );

drop table RPM_INFO;

create table RPM_INFO (rpmfilename varchar(128), name varchar(128), version varchar(128), provides varchar(128), providename varchar(128), provideversion varchar(128));

load data infile 'E:/cygwin/tmp/rpm-info' into table RPM_INFO fields terminated by '';

alter table RPM_INFO add index RPM_INFO_IX01 ( rpmfilename );

Good. All files metadata has been loaded successfully.

4. Deriving of RPMs. Now I need to find out the RPMs that are supposedly have been previously installed. I wanted to do that in a single query:

select distinct rpmfilename from ( select rpmfilename, case when name=@N then @R:=@R+1 else @R:=1 end Rank, @N:=name r from (select @R:=1, @N:='') r, (select name, rpmfilename, sum(proximity) sumprox from (select I.name, R.filename R_filename, D.filename D_filename, case when (D.filename is not NULL and D.bytes=R.bytes and D.md5sum = R.md5sum) then 0 when (D.filename is not NULL and D.bytes=R.bytes and D.md5sum != R.md5sum) then 1 when (D.filename is not NULL and D.bytes=R.bytes and R.md5sum is NULL ) then 2 when (D.filename is not NULL and D.bytes!=R.bytes) then 3 when (D.filename is NULL) then 4 end proximity, R.rpmfilename, D.md5sum D_md5sum, R.md5sum R_md5sum from RPM_INFO I inner join RPM_CONTENT R on I.rpmfilename=R.rpmfilename left outer join DISK_CONTENT D on D.filename=R.filename /*where R.filename like '/sbin/%'*/ /*order by 1, 4*/ ) v group by name, rpmfilename having count(D_filename)>0 order by 1, 3) v2 ) v3 where Rank=1;

This query works but it requires enormous amount of space for temporary results, gigabytes. I had just 2.5 Gb of free space in my %TEMP% disk, so it has failed.
To resolve this, I’ve split this query to be run sequentially, implicitly storing temporary results in a table rpm_files_tmp.

drop table rpm_files_tmp;

/* place the join into a temp table*/
create table rpm_files_tmp as

select I.name, R.rpmfilename, D.filename D_filename, /*R.filename R_filename, */
case
when (D.filename is not NULL and D.bytes=R.bytes and D.md5sum = R.md5sum) then 0
when (D.filename is not NULL and D.bytes=R.bytes and D.md5sum != R.md5sum) then 1
when (D.filename is not NULL and D.bytes=R.bytes and R.md5sum is NULL ) then 2
when (D.filename is not NULL and D.bytes!=R.bytes) then 3 when (D.filename is NULL) then 4
end proximity
from RPM_INFO I
inner join RPM_CONTENT R on I.rpmfilename=R.rpmfilename
left outer join DISK_CONTENT D on D.filename=R.filename;

/* here I rank packages by a composite merit - sum(proximity), which ideally equals 0. But may be greater, when some files don't match exactly or absent. I selected packages with a best possible match.*/
select distinct rpmfilename into outfile 'E:/cygwin/tmp/rpm-to-install.lst' lines terminated by '\n'

from ( select rpmfilename, case when name=@N then @R:=@R+1 else @R:=1 end Rank, @N:=name r
from (select @R:=1, @N:='') r,
(select name, rpmfilename, sum(proximity) sumprox
from rpm_files_tmp v
group by name, rpmfilename
having count(D_filename)>0
order by 1, 3) v2 ) v3
where Rank=1;

drop table rpm_files_tmp;


5. Install rpms from the list produced in previous step.

# Reinitialize a new rpm database.
mv /var/lib/rpm /var/lib/rpm-damaged-beyond-hope mkdir /var/lib/rpm rpm --initdb

# install GPG keys rpm --import
http://www.redhat.com/security/db42a60e.txt rpm --import http://www.fedoralegacy.org/FEDORA-LEGACY-GPG-KEY

# install packages into RPM database only, do not touch files on filesystems.
rpm -i -v --justdb --nodeps --noscripts --notriggers `cat ~/rpm/rpm-to-install.lst`

# check
rpm -qa

Done!

0 Comments:

Post a Comment

<< Home