GEDCOM, Multi-Media, and the Macintosh
Keith Clarke
October 2002
Introduction
This document is aimed at people developing GEDCOM-capable software. I'm hoping therefore that at least the developers of GEDitCOM, Generation X and MacStammBaum will take a look. (Thanks guys.)
[A multi-media] structure provides two options in handling the GEDCOM multimedia interface. The first alternative (embedded) includes all of the data, including the multimedia object, within the transmission file. The embedded method includes pointers to GEDCOM records that contain encoded image or sound objects. Each record represents a multimedia object or object fragment. An object fragment is created by breaking the multimedia files into several multimedia object records of 32K or less. These fragments are tied together by chaining from one multimedia object fragment to the next in sequence. This procedure will help manage the size of a multimedia GEDCOM record so that existing systems which are not expecting large multimedia records may discard the records without crashing due to the size of the record. Systems which handle embedded multimedia can reconstitute the multimedia fragments by decoding the object fragments and concatenating them to the assigned multimedia file.
The second method allows the GEDCOM context to be connected to an external multimedia file. This process is only managed by GEDCOM in the sense that the appropriate file name is included in the GEDCOM file in context, but the maintenance and transfer of the multimedia files are external to GEDCOM.
The method described first is scarcely used; SeeGEDCOMX doesn't support it now and won't in future.
The method described second is widely used, but is only vaguely defined. The particular external file is referenced by a GEDCOM line containing "A complete local or remote file reference to the auxiliary data to be linked to the GEDCOM context. Remote reference would include a network address where the multimedia data may be obtained."
So we're left wondering what "a complete local file reference" might consist of. Files have names, of course ("the appropriate file name is included"), so we must store one of those; but several different files can have the same name, provided that they're in different folders.
The GEDCOM data might record the name of the volume (e.g. "Macintosh HD" or "Zip 100"); the name of a folder on that volume (e.g. "Pictures"); and the name of the file within that folder (e.g. "Joan Barrow.jpg"). Of course the folder may contain a folder that contains a folder, etc., with the picture file buried very deeply inside it. And so we're led to the idea of keeping a path name in the GEDCOM data; it might look something like "Macintosh HD:Pictures:Family:Barrow:Joan Barrow.jpg" on a Macintosh.
Such a path would mean nothing to a user of MS Windows to whom you sent a GEDCOM document. They wouldn't have a volume (or a drive) called "Macintosh HD". They might not like to organise their pictures the way that you do. If you sent them the picture as well as the GEDCOM, there'd be no hope at all unless the name in the GEDCOM matched the name of the picture file. In practice you'd probably make a new folder of the pictures and send the folder (as a Stuffit or Zip archive), or perhaps you'd send the picture as an email attachment.
It's worth remembering what the designers of GEDCOM had in mind: a communication format, not a database format. For this purpose, the name of the picture may be enough; the full path name of the picture on the sending system is almost certainly useful only if you're sending the GEDCOM to yourself, for example exporting it from one family history program and importing it into another.
In the end, it looks like a program reading a GEDCOM should try a path when it finds one in the data, and if it works, fine. If it doesn't, then the program should try to find a file with the right name. This is what SeeGEDCOMX does, and what GEDitCOM does, and no doubt many other programs. It is a de facto standard layered on the actual standard.
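The "try the path, then fall back to the name" behaviour can be sketched in portable C++17. This is only an illustration under stated assumptions, not the code of SeeGEDCOMX or GEDitCOM: the helper names leafName() and locateMedia() are hypothetical, as is the idea of searching one folder (searchRoot) chosen by the program or the user. Treating ':', '/' and '\' all as separators is a heuristic, since on some systems a file name may legitimately contain one of them.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: the last component of a stored path, whichever
// separator the writing system used (':' on classic Mac OS, '\\' on
// Windows, '/' on Unix and Mac OS X).
std::string leafName(const std::string& storedPath) {
    const std::string::size_type pos = storedPath.find_last_of(":/\\");
    return pos == std::string::npos ? storedPath : storedPath.substr(pos + 1);
}

// Hypothetical helper: use the stored path if it names a file on this
// machine; otherwise look for a file with the same leaf name anywhere
// under searchRoot. Returns std::nullopt if neither attempt succeeds.
std::optional<fs::path> locateMedia(const std::string& storedPath,
                                    const fs::path& searchRoot) {
    std::error_code ec;
    if (fs::is_regular_file(fs::path(storedPath), ec)) {
        return fs::path(storedPath);
    }
    const std::string wanted = leafName(storedPath);
    if (wanted.empty() || !fs::is_directory(searchRoot, ec)) {
        return std::nullopt;
    }
    fs::recursive_directory_iterator it(
        searchRoot, fs::directory_options::skip_permission_denied, ec);
    const fs::recursive_directory_iterator end;
    while (!ec && it != end) {
        std::error_code entryEc;  // an unreadable entry is skipped, not fatal
        if (it->is_regular_file(entryEc) &&
            it->path().filename().string() == wanted) {
            return it->path();
        }
        it.increment(ec);
    }
    return std::nullopt;
}
```

On a machine where the stored path does not exist, locateMedia("Macintosh HD:Pictures:Family:Barrow:Joan Barrow.jpg", folder) falls back to looking for a file called "Joan Barrow.jpg" somewhere under folder. If several files share that name, this sketch simply returns the first one it meets.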
But what should a genealogy application store in the GEDCOM if it is saving a document or exporting one?
Macintosh Aliases
Documentation within the More Files example code from Jim Luther, Apple Developer Technical Support Emeritus, includes the following notes -
The point about non-unique names applies only to MacOS systems before Mac OS X. In those, you can call a Zip, floppy or CD "Macintosh HD" if you like, and there's no way to refer to any particular volume if you do.
So we can't obey Macintosh custom and practice and the usual, de facto, interpretation of the GEDCOM standard at the same time.
Fortunately there's an escape: we shouldn't store references to photographs etc. using only path names, and the GEDCOM standard allows us to define additional GEDCOM line types for a particular application.
SeeGEDCOMX stores additional lines as well as, and alongside, the file name. The first of these extra lines has the custom tag _ALS, which, along with a number of continuation (CONC) lines, contains a binhex-encoded version of a standard MacOS alias to the file. Information in the alias means that the program knows which of your various disks holds the file, and which file it should use if you change the original's name or re-organise your picture collection. Another custom tag, _CKS, is used to store a checksum for the alias. Its purpose is to detect corruption in the GEDCOM file: the checksum can be recalculated when the file is opened in SeeGEDCOMX or in any other program that adopts the same scheme. If the checksum doesn't match, then the program shouldn't try to use the encoded alias data to locate the file.
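To make the encoding concrete, here is a minimal sketch in standard C++ of how raw alias bytes could be turned into the values of an _ALS line and its CONC lines, and back again. Judging by the sample code later in this document, the encoding is plain hexadecimal with two upper-case characters per byte, and 72 characters is the line length used there. The function names encodeHexLines(), decodeHexLines() and hexValue() are hypothetical and not part of SeeGEDCOMX, and rejecting odd-length or non-hex lines outright is an assumption about how a cautious reader might behave.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: hex-encode raw bytes, two characters per byte,
// split into chunks of at most maxChars characters. The first chunk would
// be the value of the _ALS line, the rest the values of its CONC lines.
std::vector<std::string> encodeHexLines(const std::vector<unsigned char>& bytes,
                                        std::size_t maxChars = 72) {
    assert(maxChars >= 2 && maxChars % 2 == 0);  // a byte never straddles lines
    static const char digits[] = "0123456789ABCDEF";
    std::vector<std::string> lines(1);
    for (unsigned char b : bytes) {
        if (lines.back().size() == maxChars) {
            lines.emplace_back();  // current line is full: start a new chunk
        }
        lines.back().push_back(digits[b >> 4]);
        lines.back().push_back(digits[b & 0x0F]);
    }
    return lines;
}

// Returns the value of one hex digit, or -1 if c is not a hex digit.
int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode the chunks back into bytes. Returns false,
// leaving out empty, on an odd-length line or a non-hex character, so
// that a damaged file is rejected rather than half-decoded.
bool decodeHexLines(const std::vector<std::string>& lines,
                    std::vector<unsigned char>& out) {
    out.clear();
    for (const std::string& line : lines) {
        if (line.size() % 2 != 0) { out.clear(); return false; }
        for (std::size_t i = 0; i < line.size(); i += 2) {
            const int hi = hexValue(line[i]);
            const int lo = hexValue(line[i + 1]);
            if (hi < 0 || lo < 0) { out.clear(); return false; }
            out.push_back(static_cast<unsigned char>((hi << 4) | lo));
        }
    }
    return true;
}
```

A 100-byte alias encodes to 200 characters, that is, one _ALS value of 72 characters followed by CONC values of 72 and 56 characters. The checksum is then computed over the raw bytes, not over the hex text.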
Acknowledgements
John Nairn persuaded me to use custom GEDCOM tags and to keep the line length well below the maximum allowed by the GEDCOM specification.
Sample Code
// A gTree is an object representing a particular line in a
// GEDCOM file, and its nested lines; so the gTree
// representing an OBJE line can be expected to have
// subsidiary objects representing the FORM and _ALS
// lines.
void checksum (const char *buf, int length, UInt32 *sum32);
// inINDI is an object representing an INDI record
// outSpec is the multimedia file, if we can resolve it.
Boolean
GEDCOM::TryObjectRecord(gTree *inINDI, FSSpec& outSpec) {
    // findoption() returns the subsidiary object, or NULL
    // if it doesn't exist.
    gTree *theObj = inINDI->findoption("OBJE");
    if(theObj==NULL) {
        return false;
    }
    gTree *theType = theObj->findoption("FORM");
    // validate FORM subrecord, somehow. If you're using
    // QuickTime, there's nothing much you'd want to do with this
    // value, so perhaps it should simply be ignored on input.
    // Probably worth writing if you succeed in opening the
    // multimedia object.
    if(theType==NULL||std::strcmp(theType->getValue(),"jpeg")!=0)
    {
        return false;
    }
    gTree *theAlias = theObj->findoption("_ALS");
    gTree *theCS = theObj->findoption("_CKS");
    if(theAlias==NULL || theCS==NULL) {
        return false;
    }
    UInt32 theASize = theAlias->CopyValueBinHex(NULL);
    // corrupt file?
    if(theASize<sizeof(AliasRecord) || theASize>512) {
        return false;
    }
    // the saved checksum must decode to exactly one UInt32,
    // otherwise decoding it below would overrun the variable
    if(theCS->CopyValueBinHex(NULL)!=sizeof(UInt32)) {
        return false;
    }
    // NB: this sample never disposes theAliasH; production code
    // should release it on every exit path.
    AliasHandle theAliasH = (AliasHandle)::NewHandle(theASize);
    if(theAliasH==NULL) {
        return false;
    }
    // PowerPlant class to lock the handle until the end of this
    // scope
    StHandleLocker lock((Handle)theAliasH);
    theASize = theAlias->CopyValueBinHex(*theAliasH);
    // don't checksum past the end of the decoded data
    if((**theAliasH).aliasSize>theASize) {
        return false;
    }
    UInt32 theCheckSum=0, theSavedCheckSum=0;
    theCS->CopyValueBinHex(&theSavedCheckSum);
    checksum( (char*)*theAliasH,
              (**theAliasH).aliasSize,
              &theCheckSum);
    if(theCheckSum!=theSavedCheckSum) {
        return false;
    }
    Boolean theChange;
    FSSpec theDocSpec;
    Boolean y = GetDocument()->GetFSSpec(theDocSpec);
    // HIG says remember if user cancels one of these (broken)
    // alias and don't annoy them by asking about every single
    // one... but there's no convenient api for it that I've
    // found; you can't ask ResolveAlias not to interact
    OSErr theErr
        = ::ResolveAlias(NULL,theAliasH,&outSpec,&theChange);
    if(theErr!=noErr || GetDocument()->IsReadOnly()) {
        return theErr==noErr;
    }
    if(theChange) {
        // ResolveAlias() updated the alias: rewrite OBJE._ALS
        // and OBJE._CKS from theAliasH
        theAlias->SetValueBinHex(*theAliasH,theASize);
        // checksum() accumulates into its last argument, so
        // start again from zero
        theCheckSum = 0;
        checksum( (char*)*theAliasH,
                  (**theAliasH).aliasSize,
                  &theCheckSum);
        theCS->SetValueBinHex(&theCheckSum,
                              sizeof(theCheckSum));
        GetDocument()->SetModified(true);
    }
    // so now it makes sense to turn the FSSpec into a path name
    // and store it in the GEDCOM, if only to help GEDitCOM to
    // find this object.
    // FSpGetFullPath() is from "More Files" sample code,
    // Apple Dev Connection
    short thePathLength = 0;
    Handle thePathH = NULL;
    theErr = FSpGetFullPath(&outSpec,&thePathLength,&thePathH);
    if(theErr==noErr) {
        {
            // lock only once the handle exists, and unlock
            // (end of this scope) before disposing of it
            StHandleLocker plock(thePathH);
            // findcreateoption() finds the requested subsidiary
            // line, or creates a new such line if none exists
            // already
            gTree *thePath = theObj->findcreateoption("FILE");
            // the path in the handle is not null-terminated,
            // so compare by length rather than with strcmp()
            const char *theOld = thePath->getValue();
            if( std::strlen(theOld)!=(size_t)thePathLength
                || std::memcmp(theOld,*thePathH,thePathLength)!=0 ) {
                // set the value of this line to be the path
                thePath->setValue(*thePathH,thePathLength);
                GetDocument()->SetModified(true);
            }
        }
        ::DisposeHandle(thePathH);
    }
    return theErr==noErr;
}
// FITS ones-complement checksum from
// http://www.adass.org/adass/proceedings/adass94/seamanr.html
/* quote from that web page:
A 1's complement checksum (as used by TCP/IP) is preferable
to a 2's complement checksum (as used by the UNIX sum
command, for example), since overflow bits are permuted back
into the sum and therefore all bit positions are sampled
evenly. A 32 bit sum is as easy to calculate as a 16 bit sum
because of this symmetry, providing greater sensitivity to
errors.
*/
void checksum(const char *buf, int length, UInt32 *sum32)
{
    const unsigned short *sbuf;
    const unsigned char *ubuf;
    unsigned int hi, lo, hicarry, locarry;
    int len, remain, i;
    // NB: reading 16-bit words through sbuf makes the result
    // depend on the byte order of the machine
    sbuf = (const unsigned short *) buf;
    // the odd bytes are read as unsigned so that values above
    // 0x7F are not sign-extended
    ubuf = (const unsigned char *) buf;
    len = 2*(length / 4); /* make sure it's even */
    remain = length % 4; /* add odd bytes below */
    // sum32 is an accumulating parameter, so you can
    // checksum things bigger than any feasible buffer
    hi = (*sum32 >> 16);
    lo = (*sum32 << 16) >> 16;
    for (i=0; i < len; i+=2) {
        hi += sbuf[i];
        lo += sbuf[i+1];
    }
    if(remain >= 1) hi += ubuf[2*len] * 0x100;
    if(remain >= 2) hi += ubuf[2*len+1];
    if(remain == 3) lo += ubuf[2*len+2] * 0x100;
    hicarry = hi >> 16; /* fold carry bits in */
    locarry = lo >> 16;
    while (hicarry || locarry) {
        hi = (hi & 0xFFFF) + locarry;
        lo = (lo & 0xFFFF) + hicarry;
        hicarry = hi >> 16;
        locarry = lo >> 16;
    }
    *sum32 = (hi << 16) + lo;
}
// caller provides space for the result (call first with NULL arg
// to get size).
// cf. SetValueBinHex() - it uses CONC lines to keep the data
// within legal limits.
// a gTree represents a GEDCOM line. GetValue() gets its value!
// GetFirstCONCLine() and GetNextCONCLine() iterate over its
// continuation (CONC) lines.
// byte zero of the alias is encoded as the first two characters
// of the _ALS line; the final byte of the alias is encoded
// as the final two characters of the last CONC line. That is,
// the CONC lines are in the natural order in the GEDCOM file.
UInt32
gTree::CopyValueBinHex(void *outBytes) {
    UInt32 theSize = std::strlen(this->GetValue())/2;
    for(Line f=GetFirstCONCLine();f!=nil;f=f->GetNextCONCLine())
    {
        theSize += std::strlen(f->GetValue())/2;
    }
    if(outBytes==NULL)
        return theSize;
    Line concs = GetFirstCONCLine();
    // unsigned char * for the output pointer, because all 8 bits
    // count in the raw data...
    unsigned char *p = (unsigned char *)outBytes;
    char *src = &(this->GetValue())[0];
    for(UInt32 i=0; i<theSize; i+=1) {
        if(*src == '\0') {
            if(concs==NULL)
                break;
            src = concs->GetValue();
            concs = concs->GetNextCONCLine();
        }
        UInt32 hi = *src++;
        UInt32 lo = *src++;
        *p++ = ((hi>='A'?hi-'A'+10:hi-'0')<<4)
             + (lo>='A'?lo-'A'+10:lo-'0');
    }
    // no terminator is written: the output is raw bytes, and the
    // caller's buffer holds exactly theSize of them
    return theSize;
}
// Keep the line length low to avoid upsetting duff applications,
// but also bearing in mind that the standard limits GEDCOM lines
// to 255 characters.
// Although SetValueBinHex() keeps the line length of the output
// file to a particular small value, programs should be prepared
// to read files containing _ALS and CONC lines containing
// anything from 2 characters up, though they could reasonably
// ignore (and not use the alias derived from) lines longer than
// 255 characters. An alias is typically about a dozen lines
// with these lengths.
void
gTree::SetValueBinHex(void *inBytes, UInt32 inSize) {
    int maxrec = 72;
    int nchars = 0; // chars written to current record
    unsigned char *bytePtr = (unsigned char*)inBytes;
    // discard old value of this line, including its CONC lines
    this->DeleteValue();
    // ask for space for maxrec chars- CONC lines allocated later
    // The pointer p points to first char of the new value
    char *p = this->NewValue(maxrec);
    for(UInt32 i=0; i<inSize; i++) {
        UInt32 hi = *bytePtr >> 4;
        Assert_(hi<16);
        *p++ = hi>=10?'A'+hi-10:'0'+hi;
        UInt32 lo = *bytePtr++ & 0xF;
        Assert_(lo<16);
        *p++ = lo>=10?'A'+lo-10:'0'+lo;
        nchars+=2;
        // line full and more data to come: start a CONC line.
        // (>= rather than >, so that no more than maxrec chars
        // are written into the space requested for maxrec)
        if(nchars>=maxrec && i+1<inSize) {
            *p = '\0'; // if using null terminated strings
            // make new conc line with room for maxrec chars
            // p will point to first char of the value of the
            // new line
            p = this->AddCONCLine(maxrec);
            nchars = 0;
        }
    }
    *p = '\0';
}