Bitzi Developer Discussion: Embedded Metadata and Hash Calculation

Message:

Embedded Metadata and Hash Calculation
This is related to the previous topic (format 'plugins') and requires clarity when computing a bitprint. Large media files (e.g. mp3) are an important class of files for which we want to use Bitzi. I noticed in the mp3.c file that the sha1 computation was being modified so that it would be independent of any id3 information that might be embedded (at least, that is what I thought I was seeing). This makes sense: mp3 files that differ only in the details of their embedded metadata will sound identical, so it would be nice if their reported sha1 digests were identical.

The question I have is about the corresponding tigertree hash to report with the 'filtered' sha1 value. Since tigertree is used to verify partial transfers, it seems like it might be a good idea to forgo any filtering while computing the tigertree hash and rely on the sha1 hash in the bitprint to establish identity, accepting that the tigertree hash may differ due to embedded metadata. Does this correspond at all to current thinking, or should I similarly 'filter' the computation of the tigertree hash?

 
-- steve_bryan, December 11, 2001 03:56 pm

Replies:

Re: Embedded Metadata and Hash Calculation
We always compute both the SHA1 and TigerTree for the entire file to create our "bitprint". So any little change in embedded metadata means it's a different file, and thus a different bitprint, in both the SHA1 and TigerTree portions.

We also calculate another *separate* SHA1 value for MP3s, so that we can discover files that are exactly alike in their audio portion. This extra value is treated as a "tag", rather than the primary identifier.

So please *don't* apply any filters which leave out any portions of a file *except* when reporting the "audio_sha1" tag. We are primarily interested in coalescing information for exactly-alike files, with recognizing similarities being a special application, for some file types, via labelling tags that make the similarities clear.

Hope this helps clear things up...

 
-- gojomo, December 11, 2001 04:12 pm
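
The scheme described in this reply can be sketched in Python as follows. This is only an illustration, not the logic in Bitzi's mp3.c: the helper names `strip_id3` and `mp3_hashes` are hypothetical, the audio portion is approximated by dropping a leading ID3v2 tag and a trailing ID3v1 tag, digests are shown in hex for brevity, and the TigerTree half of the bitprint is left out because Python's standard library has no Tiger implementation.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag.

    Hypothetical helper: an approximation of "audio only", not Bitzi's code.
    """
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer flag: 10 more bytes after the tag body
            start += 10
        start = min(start, end)  # guard against a truncated file
    # ID3v1 tag: the last 128 bytes, beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def mp3_hashes(data: bytes) -> dict:
    """Full-file SHA1 (the identifier) plus an audio-only SHA1 (a tag)."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),  # never filtered
        "audio_sha1": hashlib.sha1(strip_id3(data)).hexdigest(),
    }
```

Two copies of a track that differ only in their tags would then get different `sha1` values but the same `audio_sha1`, which is the "similar but not identical" relationship the tag is meant to expose.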

Re: Embedded Metadata and Hash Calculation
It's slightly silly to be replying to my own query, but I forgot to add some pertinent information. Mac files generally consist of two or more forks (no more than two in pre-OSX). I only compute hash values for the data fork. I never include bytes from the resource fork. This is mainly a cross-platform issue, but there is genuine ambiguity. The resource map at the end of each resource fork has several bytes just before it that are truly indeterminate. Changing them to any value makes no effective change in the file except when the resource fork is accessed while bypassing the resource manager. So there would be an entire array of hash values that would effectively all refer to a single file.

I apologize if this is mainly confusing, but the summary is that I'm excluding all but the data fork from hash calculations on the Mac.

 
-- steve_bryan, December 11, 2001 04:16 pm
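
For what it's worth, hashing only the data fork needs no special handling through ordinary file APIs, since a normal open-and-read returns data fork bytes only. A minimal sketch, assuming a hypothetical helper name `data_fork_sha1` and reading in chunks so large media files are never loaded into memory at once:

```python
import hashlib


def data_fork_sha1(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA1 of a file's ordinary contents (the data fork), read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result matches hashing the whole data fork in one call; the chunk size only affects memory use.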

Re: Embedded Metadata and Hash Calculation
You should always report both the full-file sha1 and tigertree values as components of the bitprint. Bitprints on their own are completely filetype-agnostic and shouldn't be filtered in any way.

For mp3s the 'filtered' sha1 hash is reported as the audio_sha1 field of the mp3 tag (other fields are duration, bitrate, samplerate and vbr).

So for mp3s two sha1 values are computed (1 full file, 1 audio content only) and 1 tigertree value (full file).

 
-- ml, December 11, 2001 04:18 pm

Re: Embedded Metadata and Hash Calculation
Thanks for the quick clarification. Your decision makes good sense and its simplicity allows everyone involved to avoid excessive introspection about what 'identical' means. This is what the Mac utility will follow, obviously.
 
-- steve_bryan, December 11, 2001 05:10 pm

© 2009 The Bitzi Corporation
