
Author Topic: Best method to check for file changes between directories?  (Read 1831 times)

Offline CentralWare

  • Administrator
  • Hero Member
  • *****
  • Posts: 1652
Best method to check for file changes between directories?
« on: January 30, 2015, 10:10:24 PM »
Due to the excessive lag caused by keeping /home persistent on a flash device with some applications (such as Firefox), I've decided to start implementing a way to switch the /home/user cache directories over to /tmp so they operate in memory only while the machine is running, then copy them back to persistent storage during the shutdown operation.  The theory and prototype are both sound and improve launch times and operating speed tremendously, but there's a wasteful flaw in putting things back INTO /home on shutdown.

Let's say the user/.cache and user/.mozilla directories contain 1,000 files.  Over the course of a normal day's use, let's say 150 of those files actually changed and another 100 were added.  Dumping all 1,100 files back to the persistent /home is very wasteful, since 850 unchanged files are being rewritten unnecessarily.

Concept #1: Use tar-gz in a fashion similar to filetool.  The primary flaw here is decompression time at boot (the files are usually rather small, there are just a ton of them, and the count only grows), in addition to the large number of blocks written to flash which are otherwise unnecessary.

Concept #2: The idea here was a zipped backup of "changed" files only (an incremental backup), but this tends to become overly complicated for such a small task.

Concept #3: The final decision was to go with an uncompressed incremental method using simple file comparison (similar to mirroring): the files in /home and /tmp are checked for differences, only changed/added files are written back to /home, and the shutdown process discards the rest.

I was planning to script the entire process, but was wondering whether there are onboard tools (non-extension -- busybox, perhaps?) which could be used for a compare-and-copy.

i.e. something like  diff-new /path1/* /path2/*  which would produce a listing of any files in path1 which are newer than (or missing from) path2.  That listing could be piped over to a copy command and life is grand.
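
Something along these lines, using only busybox applets, is roughly the behavior I'm after (sketch only; the paths are just placeholders):

Code:
#!/bin/sh
# Rough sketch: copy from SRC any file that is new or changed in DST.
# Uses only busybox applets (find, cmp, cp, mkdir, dirname).
SRC=/tmp/home/tc/.cache
DST=/home/tc/.cache

cd "$SRC" || exit 1
find . -type f | while read -r f; do
    if [ ! -e "$DST/$f" ] || ! cmp -s "$f" "$DST/$f"; then
        mkdir -p "$DST/$(dirname "$f")"
        cp -p "$f" "$DST/$f"
    fi
done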

If there isn't such a monster within the init image I don't mind writing it; I'm just trying to not reinvent the wheel.  Thanks!

EDIT: I've tried  diff -qr /tmp/home/tc/.cache /home/tc/.cache  and don't get any results, so I have to assume I'm either doing something wrong somewhere or it's not "diffing" what I intend.
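For reference, when the trees do differ I'd expect output along these lines, which is what any parsing would have to chew on (assuming a fairly standard diff build):

Code:
# Lines busybox's diff -qr should report when differences exist:
#   Files /tmp/home/tc/.cache/FILE and /home/tc/.cache/FILE differ
#   Only in /tmp/home/tc/.cache: FILE
diff -qr /tmp/home/tc/.cache /home/tc/.cache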
« Last Edit: January 30, 2015, 10:16:22 PM by centralware »
Over 90% of all computer problems can be traced back to the interface between the keyboard and the chair

Offline Rich

  • Administrator
  • Hero Member
  • *****
  • Posts: 11178
Re: Best method to check for file changes between directories?
« Reply #1 on: January 30, 2015, 10:41:22 PM »
Hi centralware
I think something like  rsync  may be what you are looking for.
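Something like this (untested, just the general idea) would copy back only the new/changed files and drop the ones that were deleted:

Code:
# Only new/changed files are written; --delete prunes removed files.
rsync -a --delete /tmp/home/tc/.cache/ /home/tc/.cache/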

Offline CentralWare

  • Administrator
  • Hero Member
  • *****
  • Posts: 1652
Re: Best method to check for file changes between directories?
« Reply #2 on: January 30, 2015, 11:14:10 PM »
@Rich: Yes, rsync was considered, but since it's an extension and not "in the box" I was trying to avoid third-party tools if at all possible and stay dependency-free.  Since we have busybox-httpd and wget as pre-install items on all of our images, I even considered going the mirror direction using httpd as the reference point, but that just seemed horribly wasteful (resources, memory and time).

I'm putting the finishing touches right now on a working system which IS using diff; it just takes a little manipulation because of diff's verbose output.  While maintaining these constantly read/written areas we need to avoid leaving behind files which should have been deleted (i.e. expired cache items from Firefox) and to make sure new and changed items are retained.  I know there have to be more efficient ways to accomplish the task, but considering the different applications which write heavily to /home (Firefox, Thunderbird, office suites, etc.), something has to be done, as the user experience is sorely disrupted when using bootable flash drives with persistence.

Example: On a flash boot where opt, tce, home and local are persisted, I install the firefox.tcz extension.  Running Firefox takes ~4-6 seconds just to launch.  On the same system (after removing Firefox entirely) I have the cache and mozilla directories mounted in /tmp, and the launch time is now under 1.4 seconds.  Some of the time saved is paid back by the cache-to-home writes at shutdown, but considering the ongoing degradation of the user experience, I'd imagine users would happily spend a few extra seconds on shutdown/reboot rather than throughout the entire day.

When finished, I'm going to recommend this (or something similar) be implemented into TC itself in the following manner:

1) Boot device needs to be detected as hdd or removable
2) IF removable, the file /home/user/.ramcache works in a fashion similar to .filetool.lst: it lists the directories which are to be sent to the RAM-based cache.  (The file is placed in /home/user to allow for multiple users.)
3) A script (./.setcache) is launched which does nothing more than copy the directories listed in .ramcache from /home/user to /tmp/home/user, per #2.
4) rc.shutdown is modified to launch ./.unsetcache, which basically reverses the process: it updates /home with modified/new files and deletes from /home those which were removed from /tmp during the session.  (A rough sketch of the pair follows below.)
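
A very rough sketch of what I have in mind for #3 and #4 (the .ramcache list, the /tmp/home layout and the user "tc" are just the proposal above, nothing that exists yet):

Code:
#!/bin/sh
# Sketch only: setcache copies the listed dirs into RAM at boot,
# unsetcache writes back only new/changed files and prunes deletions.
LIST=/home/tc/.ramcache        # one directory per line, e.g. ".cache"

setcache() {
    while read -r d; do
        [ -d "/home/tc/$d" ] || continue
        mkdir -p "/tmp/home/tc/$d"
        cp -a "/home/tc/$d/." "/tmp/home/tc/$d/"
    done < "$LIST"
}

unsetcache() {
    while read -r d; do
        [ -d "/tmp/home/tc/$d" ] || continue
        # new/changed files go back to the flash drive
        ( cd "/tmp/home/tc/$d" && find . -type f ) | while read -r f; do
            if [ ! -e "/home/tc/$d/$f" ] || \
               ! cmp -s "/tmp/home/tc/$d/$f" "/home/tc/$d/$f"; then
                mkdir -p "/home/tc/$d/$(dirname "$f")"
                cp -p "/tmp/home/tc/$d/$f" "/home/tc/$d/$f"
            fi
        done
        # anything deleted in RAM gets deleted from the flash drive too
        ( cd "/home/tc/$d" && find . -type f ) | while read -r f; do
            [ -e "/tmp/home/tc/$d/$f" ] || rm -f "/home/tc/$d/$f"
        done
    done < "$LIST"
}

case "$1" in
    set)   setcache ;;
    unset) unsetcache ;;
esac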

* filetool.sh is recommended to behave in a similar fashion.  Making the backup automatically, such as through X's shutdown, is grand and all...  but it writes needlessly (especially to flash devices) if nothing in the list has changed.  We don't usually use it here, so it's not that big a deal for us, but I can imagine globally this would be quite a savings in time as well as flash wear-and-tear.
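
Something as simple as the guard below, wrapped around the automated backup, would avoid the needless writes (sketch only; the paths assume a stock layout and I haven't tested it):

Code:
#!/bin/sh
# Sketch: only back up if something in .filetool.lst is newer than the
# existing archive.  Assumes the usual /opt/.filetool.lst and mydata.tgz.
TCEDIR=$(readlink -f /etc/sysconfig/tcedir 2>/dev/null)
BACKUP="$TCEDIR/mydata.tgz"
[ -f "$BACKUP" ] || { filetool.sh -b; exit; }

while read -r entry; do
    [ -n "$entry" ] || continue
    # entries in .filetool.lst are relative to /
    if [ -n "$(find "/$entry" -newer "$BACKUP" 2>/dev/null | head -n 1)" ]; then
        filetool.sh -b
        exit
    fi
done < /opt/.filetool.lst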

Thanks for the input!
Over 90% of all computer problems can be traced back to the interface between the keyboard and the chair

Offline core-user

  • Full Member
  • ***
  • Posts: 191
  • Linux since 1999
Re: Best method to check for file changes between directories?
« Reply #3 on: January 31, 2015, 02:28:18 AM »
I can see you know a lot more about computers than I do :), but I used to use diff piped through uniq to find just the changes; I would expect it to be in busybox.  (I'm typing this from AntiX, my regular distro.)
AMD, ARM, & Intel.

Offline coreplayer2

  • Hero Member
  • *****
  • Posts: 3020
Re: Best method to check for file changes between directories?
« Reply #4 on: January 31, 2015, 02:21:55 PM »
I second rsync :p which is perfect for the task

Offline CentralWare

  • Administrator
  • Hero Member
  • *****
  • Posts: 1652
Re: Best method to check for file changes between directories?
« Reply #5 on: February 01, 2015, 01:06:19 AM »
@coreplayer2: I'd have to agree, but I'm "sticking to my guns" when it comes to staying dependency-free...  and I now have a completely working app called "FlashCache" which I'll be submitting once I've had a chance to do more testing.  The goal here is to have something which could be implemented directly into Core (about 4KB), could provide a multitude of time and resource savings (especially around flash writes...  which most of us take for granted), and is as slim as possible to reduce overhead and redundancy.

diff works "well enough" to do the job at hand; it just requires a little finesse when parsing the results, since its output doesn't delimit file names (i.e. the dreaded space), so I already know someone out there is going to find out the hard way when an app chooses not to live up to 'nix naming conventions.  I haven't done so yet, but baby-sitting such things before they become problems is on the to-do list.  (I could rewrite diff itself...  but that also defeats the concept of being dependency-free, as a custom diff would itself become a dependency.)
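
For the curious, the parsing boils down to something like this (sketch only; file names containing " and " or ": " can still fool these patterns, which is exactly the baby-sitting I mean):

Code:
# Pull the /tmp-side paths of changed and added files out of diff -qr.
diff -qr /tmp/home/tc/.cache /home/tc/.cache | sed -n \
    -e 's|^Files \(.*\) and .* differ$|\1|p' \
    -e 's|^Only in \(/tmp[^:]*\): \(.*\)$|\1/\2|p'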

Example outcome (I've been picking on the default firefox.tcz, as it does the most damage to the user experience on a flash-based device thus far):
I have about 128MB of combined cache items in tc/.cache, tc/.mozilla and tc/.local.  When running with a persistent /home it's "OK" at the start, though Firefox lags for several seconds setting up profiles, etc. before it even launches, with no clue to the user that it's even trying to start -- giving the user reason to "click on it again!" and spawn another process doing the same job.  These "cache" locations are moved off to RAM at boot before X launches; virtually every flash drive out there READS a lot faster than it WRITES, so copying 128MB from flash to RAM is reasonably quick.  With that done, Firefox loads in just a second or two (now writing to RAM instead of the flash drive) and I can surf some of the heavier flv or graphics sites with almost no flash-drive activity at all.

Someone's going to claim "Don't save the cache!"  As much as I'd love to agree, sometimes it's more prudent to "download-once, use many" depending on the environment needed.

Someone else is going to scream "That's what filetool is for!"  128MB of cache currently contains 9,000+ files.  mydata.tgz takes up less space and is already implemented...  but it doesn't seem to fare well with tens of thousands of files when extracting from the archive to RAM -- so far my tests show direct-to-RAM is less intensive and possibly a bit faster.  Time will tell in the end, though.  It's also going to be entirely useless if you just back up the /home directories of people who save their files there; a few MP3s later and filetool loses its intended efficiency.

The rc.shutdown script is updated by the extension to reverse the cache-to-RAM process.  During shutdown/reboot we want to copy those files back to the flash drive, but there's no sense copying files that haven't changed (thus the diff), which so far has been very painless.  Since Firefox can be told to empty its cache, the shutdown will simply delete any files on the flash drive which are no longer present in the FlashCache and voilà...  we stay clean at the same time.

Try installing TC on a flash drive and set tce=flash opt=flash home=flash and local=flash.  Install X and Firefox.  Open Firefox and, once it finally opens, open a few tabs to heavy graphics sites, animated Flash sites, etc.; by the time you've hit the fifth tab you'll already feel the system lagging as the flash drive struggles to keep up.  OpenOffice is likely to have a similar effect, since it auto-saves documents every so many seconds/minutes, and a 5MB file saved every couple of minutes adds up quickly, not to mention spends a great deal of write cycles unnecessarily.

Someone out there is going to say "Then just don't persist home/local!" :)  I'm working on a method for a hybrid home directory (i.e. simply linking things like /home/tc/Desktop to /mnt/sdXX/home2/tc/Desktop), but there's no "standard" for where documents are placed/saved, so there's a high risk of them vanishing when the device is removed.  Give it time... :)

Thanks for your feedback and take care!
Over 90% of all computer problems can be traced back to the interface between the keyboard and the chair