
Using local storage efficiently

In web applications with a high proportion of state-handling logic in the frontend, it sometimes happens that you have to store information in the Local Storage, at least temporarily. In a project I'm currently working on, we frequently resort to this form of temporary data storage. However, the Local Storage of a web application is a scarce commodity, and I was working on a part of the application that required frequent, pairwise similar records to be cached as JSON in the Local Storage. So I went on a quest to tie up as little of the scarce Local Storage as possible. In this article I explain how we can use the Local Storage more efficiently by using LZW encoding.

1. The problem

In the case described at the beginning, I need to record events that occur frequently to very frequently and eventually send them to the server. For the sake of simplicity, I'll abstract a bit here and say that such a record looks, for example, like this:

{
  eventType: 'fetchError',
  eventTime: '20200420150701', // 20.04.2020 at 15:07:01
  eventData: 'Any info about the event - for example, here's a trace or something.'
}
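
In TypeScript terms, such a record could be described by an interface like the following; the name AppEvent is mine and not part of the actual project:

// Hypothetical shape of an event record, matching the example above.
interface AppEvent {
  eventType: string;
  eventTime: string; // compact timestamp, e.g. '20200420150701'
  eventData: string;
}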

I want to ensure, if possible, that every event arrives at the server at some point. But it is not important that it arrives there very fast. On the contrary, a connection to the server should not be established for every new event.
From this problem definition it follows that a buffer for the recorded events is well placed somewhere in the client, from which large "chunks" are then sent to the backend. Since the buffer should survive a reload (and for other reasons), we decided to store the buffer in the Local Storage. To reduce the likelihood that this buffer, along with other data, overflows the Local Storage, I want to compress the buffer.
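
To make the buffering idea concrete, here is a minimal sketch of such a buffer. Everything in it is a hypothetical stand-in: the class name, the storage key, the flush threshold, and the sendToServer function (assumed to exist elsewhere in the application). Compression is added later, in section 3.

// Assumed to exist elsewhere in the application.
declare function sendToServer(events: Array<AppEvent>): Promise<void>;

// Minimal sketch of an event buffer backed by the Local Storage.
class EventBuffer {
  private static readonly KEY = 'eventBuffer';  // made-up key
  private static readonly FLUSH_THRESHOLD = 40; // made-up threshold

  record(event: AppEvent): void {
    const events = this.load();
    events.push(event);
    localStorage.setItem(EventBuffer.KEY, JSON.stringify(events));
    if (events.length >= EventBuffer.FLUSH_THRESHOLD) {
      this.flush(events);
    }
  }

  private load(): Array<AppEvent> {
    const raw = localStorage.getItem(EventBuffer.KEY);
    return raw ? JSON.parse(raw) : [];
  }

  private flush(events: Array<AppEvent>): void {
    // Send one large chunk to the backend; clear the buffer on success.
    sendToServer(events)
      .then(() => localStorage.removeItem(EventBuffer.KEY));
  }
}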

2. LZW (Lempel-Ziv-Welch)

At some point in my studies, I had a module in which I had to write compression software. From that module, I remembered the LZW method for dictionary-based compression. This was specified in 1983 and - yes! - it still works very well when applied to JSON.

In the next paragraph (2.1.) I explain how LZW works. If you just want to see the code, skip 2.1.

2.1. LZW Algorithm by Example

I will explain the algorithm using the word "abracadabra". Normally, character codes (between 0 and 255) are used for coding. Here, for simplicity: a=1, b=2, c=3, d=4, r=5. This is our initial dictionary.

A few introductory words: the algorithm reads the input (here a string) from beginning to end, building an implicit dictionary along the way, and thus stores recurring input (here: text fragments) more efficiently.

Let's go:
1st step:
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5}
Output: []
Last read word (LW): [].
Read letter (CW): a.
LW+CW: a is in the dictionary ==> do nothing!
2nd step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5}
Output: []
Last read word (LW): a.
Read letter (CW): b.
LW+CW: ab is not in the dictionary
==> Output = Output + code of LW = [1].
==> Dictionary = Dictionary + ab:6.
3rd step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6}
Output: [1]
Last read word (LW): b.
Read letter (CW): r.
LW+CW: br is not in the dictionary
==> Output = Output + code of LW = [1,2].
==> Dictionary = Dictionary + br:7.
4th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7}
Output: [1,2]
Last read word (LW): r.
Read letter (CW): a.
LW+CW: ra is not in the dictionary
==> Output = Output + code of LW = [1,2,5].
==> Dictionary = Dictionary + ra:8.
5th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8}
Output: [1,2,5]
Last read word (LW): a.
Read letter (CW): c.
LW+CW: ac is not in the dictionary
==> Output = Output + code of LW = [1,2,5,1].
==> Dictionary = Dictionary + ac:9.
6th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9}
Output: [1,2,5,1]
Last read word (LW): c.
Read letter (CW): a.
LW+CW: ca is not in the dictionary
==> Output = Output + code of LW = [1,2,5,1,3].
==> Dictionary = Dictionary + ca:10.
7th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9, 'ca':10}
Output: [1,2,5,1,3]
Last read word (LW): a.
Read letter (CW): d.
LW+CW: ad is not in the dictionary
==> Output = Output + code of LW = [1,2,5,1,3,1].
==> Dictionary = Dictionary + ad:11.
8th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9, 'ca':10, 'ad':11}
Output: [1,2,5,1,3,1]
Last read word (LW): d.
Read letter (CW): a.
LW+CW: da is not in the dictionary
==> Output = Output + code of LW = [1,2,5,1,3,1,4].
==> Dictionary = Dictionary + da:12.
9th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9, 'ca':10, 'ad':11, 'da':12}
Output: [1,2,5,1,3,1,4]
Last read word (LW): a.
Read letter (CW): b.
LW+CW: ab is in the dictionary!!!! FINALLY
==> Attention: The last word is now ab.
10th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9, 'ca':10, 'ad':11, 'da':12}
Output: [1,2,5,1,3,1,4]
Last read word (LW): ab.
Read letter (CW): r.
LW+CW: abr is not in the dictionary
==> Output = Output + code of LW = [1,2,5,1,3,1,4,6].
==> Dictionary = Dictionary + abr:13.
11th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9, 'ca':10, 'ad':11, 'da':12, 'abr':13}
Output: [1,2,5,1,3,1,4,6]
Last read word (LW): r.
Read letter (CW): a.
LW+CW: ra is in the dictionary!!!!
==> Attention: The last word is now ra.
12th step
Dictionary: {'a':1, 'b':2, 'c':3, 'd':4, 'r':5, 'ab':6, 'br':7, 'ra':8, 'ac':9, 'ca':10, 'ad':11, 'da':12, 'abr':13}
Output: [1,2,5,1,3,1,4,6]
Last read word (LW): ra.
Read letter (CW): EOF.
LW+CW: EOF is the end, so
==> Write code of LW into the output: [1,2,5,1,3,1,4,6,8]

End of story:

  • Input: [1,2,5,1,3,1,4,1,2,5,1] = abracadabra
  • Output: [1,2,5,1,3,1,4,6,8] = abracadxy

We have 9 elements in the output array and 11 elements in the input array. That is, out of 11 characters, 2 have been saved: 18% less memory used. But what about the whole dictionary? The really elegant thing about LZW is that the entire dictionary is implied by the output array itself (hence it is called implicit): the decoder can reconstruct it on the fly, so we do not need to store the dictionary.

I hope you now have an idea of how the algorithm works. I leave it to your ingenuity to figure out how you might decode abracadxy.

The word abracadabra, by the way, is relatively self-dissimilar. That's why we "only" saved 18%. But imagine LZW-encoding hundreds of event JSONs like the one under 1. There you already have a fair amount of self-similarity, i.e. repeating patterns: for example the three keys, which all start with "event", or the timestamps, which in our use case match in at least 60% of their characters. With something like that, you have better odds.

2.2 LZW compression benchmark

To estimate the compression performance we could expect in our use case, I tested many different constellations to see how good the compression ratios are for the various parameter assignments. Important parameters for the performance are:

  • Number of event objects in the JSON
  • Number of different event types in the JSON
  • Individuality of the eventData values
    I don't want to go into details here, but I can give the following rules:
  1. The more data is compressed, the better it is compressed. With 40 events I had an average saving of 75 %; in 95 % of the runs, the compression was between 70 % and 90 %.
    The table shows the expected relation between the number of event objects (n) and the compression rate (r):

    n (events in JSON) | expected compression r (%)
                     1 | 18
                     5 | 53
                    10 | 66
                    40 | 75
                   120 | 90
                   250 | 95

    The 95 % means that with 250 events we load only about 28 kB into the Local Storage instead of about 560 kB.
  2. The number of different event types and the order of the keys have no (big) effect on the compression rate.
  3. The individuality of the eventData values is not a great concern as long as they are not randomly generated, because the event data are always strings assembled from a fixed set of text templates. Even with 10 different templates, LZW "learns" these patterns quite fast. (A minimal sketch of such a measurement follows below.)
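
For reference, here is a minimal sketch of how such a compression rate can be computed. It uses the lzwCompress function from section 3; the event generator is a simplified, hypothetical stand-in for our real constellations, so the numbers will not match the table exactly.

// Rough sketch of the benchmark with simplified, made-up event data.
function compressionRate(eventCount: number): number {
  const events = Array.from({ length: eventCount }, (_, i) => ({
    eventType: i % 3 === 0 ? 'fetchError' : 'userAction', // few types
    eventTime: '202004201507' + String(i % 60).padStart(2, '0'),
    eventData: 'Any info about the event - for example, a trace.'
  }));
  const json = JSON.stringify(events);
  const compressed = lzwCompress(json);
  return 100 * (1 - compressed.length / json.length); // saving in percent
}

console.log(compressionRate(40)); // should land roughly near the table's 75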

3. Implementation of compression and decompression

As the second-to-last point, I would now like to write something about the implementation and explain what the method described above can look like in TypeScript.

lzwCompress(original: string): string {
    // Dictionary mapping phrases to their codes; single characters are
    // implicitly coded via their char code, new codes start at 256.
    // Note: this assumes the input characters are in the 0-255 range.
    const dictionary: { [phrase: string]: number } = {};
    if (original.length === 0) {
      return original;
    }
    const plainTextCharArray = original.split('');
    const outputDictionaryCodes: Array<number> = [];
    let currentChar: string;
    let phrase = plainTextCharArray[0];
    let code = 256;
    for (let inputIndex = 1;
         inputIndex < plainTextCharArray.length;
         inputIndex++) {
      currentChar = plainTextCharArray[inputIndex];
      if (dictionary[phrase + currentChar] != null) {
        // The extended phrase is known: keep growing it.
        phrase += currentChar;
      } else {
        // Unknown phrase: emit the code of the known prefix and
        // register the extended phrase under a new code.
        outputDictionaryCodes.push(
          phrase.length > 1
            ? dictionary[phrase]
            : phrase.charCodeAt(0));
        dictionary[phrase + currentChar] = code;
        code++;
        phrase = currentChar;
      }
    }
    // Emit the code of the last pending phrase.
    outputDictionaryCodes.push(
      phrase.length > 1
      ? dictionary[phrase]
      : phrase.charCodeAt(0));
    // Pack each code into one UTF-16 character. This works as long as
    // the dictionary does not grow past code 65535.
    return outputDictionaryCodes.reduce(
      (acc, val) => {
        return acc + String.fromCharCode(val);
      }, ''
    );
}

The decompression works analogously. Here I would like to point out (and this also applies above) that certain "weird"-looking coding choices, such as building a string as an array of strings and joining them at the end, are there not for program aesthetics but for performance.

lzwDecompress(compressedString: string): string {
    // Dictionary mapping codes back to phrases; codes below 256 are
    // decoded directly via their char code.
    const dictionary: { [code: number]: string } = {};
    const inputData = compressedString.split('');
    let currentChar = inputData[0];
    let oldPhrase = currentChar;
    // The output is collected as an array of strings and joined once
    // at the end; this is faster than repeated string concatenation.
    const clearTextCharArray: Array<string> = [currentChar];
    let code = 256;
    let phrase: string;
    for (let inputIndex = 1;
         inputIndex < inputData.length;
         inputIndex++) {
      const currentCode = inputData[inputIndex].charCodeAt(0);
      if (currentCode < 256) {
        // Codes below 256 are plain characters.
        phrase = inputData[inputIndex];
      } else {
        // Known dictionary code, or the special case where the encoder
        // used a code it had only just created.
        phrase = dictionary[currentCode]
          ? dictionary[currentCode]
          : (oldPhrase + currentChar);
      }
      clearTextCharArray.push(phrase);
      currentChar = phrase.charAt(0);
      // Rebuild the encoder's dictionary entry on the fly.
      dictionary[code] = oldPhrase + currentChar;
      code++;
      oldPhrase = phrase;
    }
    return clearTextCharArray.join('');
}
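
For illustration, here is a small round trip; the event array and the storage key are hypothetical. One caveat worth verifying for your target browsers: the packed string can contain arbitrary UTF-16 code units, which the Local Storage generally stores unchanged.

// Hypothetical example: compress a buffer of event records (see
// section 1) before putting it into the Local Storage.
const buffer = JSON.stringify(
  Array.from({ length: 40 }, (_, i) => ({
    eventType: 'fetchError',
    eventTime: '202004201507' + String(i % 60).padStart(2, '0'),
    eventData: 'Any info about the event.'
  })));

const compressed = lzwCompress(buffer);
localStorage.setItem('eventBuffer', compressed); // made-up key

const restored = lzwDecompress(localStorage.getItem('eventBuffer') as string);
console.log(restored === buffer);               // true
console.log(compressed.length / buffer.length); // fraction of original size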

4. A few thoughts

It is important to weigh up whether the hassle (computational cost) of compressing and decompressing is worth the storage space saved.
Some important factors are:

  1. Frequent updates: if the content in the Local Storage is changed frequently, every change requires a full decompression-compression round trip! This is not worth it for high-frequency changes.
  2. Self-similarity of the content: if we are dealing with very monotonous data that always shows the same patterns, then good compression rates are plausible. Examples of such content: JSON, source code, monolingual texts, uncompressed binary (image) data. Counterexamples: noise, binary (image) data that is already compressed.
  3. Storage space: such an approach is only appropriate if there is an actual storage bottleneck or shortage. (A sketch of a small wrapper that bundles the round trip from point 1 follows below.)
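
If one decides to compress, the round-trip cost from point 1 is best kept in a single place. Here is a minimal sketch of such a wrapper around the Local Storage, under the assumptions of this article; the class name is hypothetical, not the project's actual API.

// Minimal sketch of a compressing Local Storage wrapper (made-up name).
class CompressedStorage {
  // One compression per write; see point 1: avoid high-frequency writes.
  setItem(key: string, value: string): void {
    localStorage.setItem(key, lzwCompress(value));
  }

  // One decompression per read.
  getItem(key: string): string | null {
    const stored = localStorage.getItem(key);
    return stored == null ? null : lzwDecompress(stored);
  }
}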