Algorithm for selective archival or aggregation of records

Given the absurdity of the US patent system, it comes to mind that I should publish all technical ideas, no matter how inane, when I think of them — simply to provide a prior art reference, in case anyone claims that idea in a future patent.

The ideas I publish are arrived at independently, based on publicly available information. This does not mean there are no existing and currently "valid" patents that cover parts of my ideas. It does, however, mean that any such patents are obvious, and that the patent system is legalized enabling of extortion.

Without further ado, here's an idea for how to archive old log files to:
  • maintain required storage under a manageable threshold;
  • store the latest data, as well as older data, so that some history survives in the event of undetected corruption;
  • avoid predictability to an attacker who might rely on a particular pattern of deletion.
We assume that log files are archived at regular intervals. The interval may be an hour, a day, a week, or whatever works for an application.

We will store the log files in ranks, such that rank 0 holds the most recent log files, and files get bumped to a higher rank as each rank fills. We allow for an unlimited number of ranks.

We define a maximum number of log files per rank. We call this M, and this number shall be 3 or more. It could be, for example, 4.

We determine the proportion of log files that we will keep each time as we move log files up the ranks. We call this number K, and it is a fraction between 0 and 1. A natural choice of K would be 1/2 or 1/3.

Each rank will start empty, and will with time become populated with log files.

When generated, every log file is first added to rank 0. Then, for each rank, starting with rank 0:
  • We test if the number of log files in that rank has reached M.
  • If it has:
    • We determine a subset S of log files in this rank that contains K*M of the oldest log files. If K is a fraction 1/N, e.g. 1/2, it would be natural to round the number of log files in the subset down to a multiple of N. For example: if K = 0.5 and M = 4, S would contain the 2 oldest log files.
    • To avoid predictability of which log files are kept, we randomly choose a K proportion of the log files in subset S to keep, and discard the rest. For example, if K = 1/2, we randomly discard one of the two log files. If predictability is desired, we instead discard based on a predetermined rule; for example, we could discard every other log file.
    • We graduate the log files in S that have not been discarded to the next rank. By this, I mean that we remove them from the current rank, and add them to the next.
We repeat this process with each successive rank, stopping as soon as we reach a rank that has not yet filled to M log files (at the latest, when we reach an empty one).
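
To make this concrete, here is a minimal sketch in Python. The class and method names are my own, and log files are represented by opaque identifiers; a real implementation would delete or move actual files when discarding and graduating.

    import random

    class RankedArchive:
        # M: maximum number of log files per rank (3 or more).
        # K: proportion of log files kept when moving up the ranks.
        def __init__(self, m=4, k=0.5):
            assert m >= 3 and 0 < k < 1 and int(k * m) >= 1
            self.m = m
            self.k = k
            self.ranks = [[]]  # ranks[0] holds the most recent log files

        def add(self, log_file):
            # Every new log file is first added to rank 0.
            self.ranks[0].append(log_file)
            r = 0
            # Process each rank in turn; stop at the first rank below M.
            while r < len(self.ranks) and len(self.ranks[r]) >= self.m:
                # S: the K*M oldest log files in this rank.
                s_size = int(self.k * self.m)
                subset = self.ranks[r][:s_size]
                del self.ranks[r][:s_size]
                # Randomly keep a K proportion of S; discard the rest.
                keep_count = max(1, round(self.k * s_size))
                kept = set(random.sample(range(s_size), keep_count))
                graduates = [f for i, f in enumerate(subset) if i in kept]
                if r + 1 == len(self.ranks):
                    self.ranks.append([])  # open a new, empty rank
                self.ranks[r + 1].extend(graduates)
                r += 1

With M = 4 and K = 1/2: each time rank 0 accumulates 4 log files, its 2 oldest are removed, one of the two is kept at random, and that one is promoted to rank 1. Each rank thus receives log files at half the rate of the rank below it.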

This process keeps around the most recent log files; keeps around the older ones, going all the way back to origin; randomizes which log files are discarded; and uses O(log(T)) storage, where T is the total number of log files generated. Each rank holds fewer than M log files, and since only a K proportion of files graduates from one rank to the next, only on the order of log(T) ranks are occupied after T log files.

This process works not only for log files, but for any other kind of historical data storage. Instead of discarding information when graduating records from each rank, information can also be aggregated, while reducing the number of records. For example, when selecting the subset of records S that will be graduated to the next rank, instead of discarding part of S, all of the records in S could be aggregated into a single record, which is then propagated to the next rank.
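
A minimal sketch of this variant, under the same assumptions as above; the aggregate parameter is an assumed application-supplied function, e.g. one that sums counters or merges per-period statistics:

    def graduate_with_aggregation(rank, next_rank, m, k, aggregate):
        # When a rank fills to M records, merge its K*M oldest records
        # into a single summary record, and promote that record instead
        # of discarding anything.
        if len(rank) >= m:
            s_size = int(k * m)
            subset = rank[:s_size]
            del rank[:s_size]
            next_rank.append(aggregate(subset))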

This allows information such as statistics and totals to be stored with finer granularity for recent periods and coarser granularity for older ones, yet without losing information about overall totals.

A natural extension of the above principles is to define ranks to match natural periods of time. For example, rank 0 could be days, rank 1 could be weeks, rank 2 could be months, and rank 3 could be years. The maximum number of records per rank, M, and the proportion of records kept between ranks, K, would then be rank-dependent.
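
For illustration, an entirely hypothetical parameterization of such calendar-aligned ranks might look like this:

    # Hypothetical rank-dependent parameters: (period, M, K) per rank.
    RANK_PARAMS = [
        ("days",   7,  1/2),  # rank 0: up to 7 daily records
        ("weeks",  8,  1/2),  # rank 1: up to 8 weekly records
        ("months", 12, 1/3),  # rank 2: up to 12 monthly records
        ("years",  10, 1/2),  # rank 3: up to 10 yearly records
    ]

Combined with the aggregation variant, graduating the oldest daily records of a filled rank 0 would merge them into a single weekly record, and so on up the ranks.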

