
Optimal number of indexing threads



    There is an option in OSForensics to select the number of CPU threads to use when indexing content.
    But what number is optimal?

    At first you might think that more threads are always better, as more threads should mean faster indexing and less indexing time.
    But there are several factors that limit the scaling. These are:
    1. The number of physical cores in your CPU (or cores allocated in your VM). Running 10 threads on a dual-core machine doesn't make sense.
    2. The speed of other components in your system. Indexing involves heavy use of the disk and memory, neither of which gets any quicker as you add CPU threads.
    3. The quantity of RAM in the machine. Each indexing thread uses large memory buffers to load documents from disk, extract the text and build the text into an index. More threads means more RAM is required. OSForensics can also create and use a RAM drive to hold temporary files (e.g. when extracting files from within a Zip file). So in an ideal world you have enough free RAM to allow the creation of a RAM drive AND run a reasonable number of threads. This may mean limiting the number of threads in use.
    4. Whether you want to leave some system resources for other tasks. OSForensics runs indexing asynchronously, which allows you to continue with other tasks while indexing runs in the background. So you might not want to fully load your hardware if you have other work to do on the machine.
    5. The system overhead of running a lot of threads. CPU data caching doesn't work as well, the disk tends to thrash more (the reads are less linear) and there is overhead in task switching.

    [Image: CPU threads while indexing]

    So what number is optimal?

    Rule 1: Always run at least 2 threads. Scaling is almost linear with 2 threads (indexing time nearly halves with 2 threads compared to 1 thread).

    Rule 2: Never run more threads than CPU cores. Hyperthreading doesn't count for much, so discount those virtual cores to some degree.

    Rule 3: Stick with 2 threads if you are on a 32-bit machine, as the available RAM address space is so limited.

    Rule 4: If the content you are indexing is on a slow drive (e.g. a spinning SATA HDD, USB drive, older SATA SSD, or network drive) then going beyond 4 threads is likely to provide only marginal benefit. And going beyond 6 will likely provide no speed increase, but still use a lot more RAM. The disk system becomes the bottleneck.

    Rule 5: If the content you are indexing is on a modern fast drive (M.2 NVMe SSD) then going beyond 8 threads is likely to provide only marginal benefit. And going beyond 12 will likely provide no speed increase, but still use a lot more RAM. More than 12 might make some sense in the future as RAM and disk systems become even faster (with PCIe 4.0 and x8 interfaces and DDR5).

    Rule 6: If you have limited RAM, don't run too many threads. There are many factors that affect ultimate RAM use, so these are only very rough guidelines.
    Total installed RAM:
    4GB or less => 2 threads
    8GB to 12GB => 4 threads
    12GB to 16GB => 6 threads
    16GB to 20GB => 8 threads
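
The rules above can be combined into a rough calculator. This is only a sketch of one way to read them, not anything built into OSForensics: the function name and parameters are made up for illustration, the "marginal benefit" points from Rules 4 and 5 (4 and 8 threads) are used as the disk caps, and where the RAM table has gaps or overlaps (4GB to 8GB, the shared 12GB and 16GB boundaries, above 20GB) the choices below are assumptions.

```python
def recommend_threads(physical_cores, ram_gb, fast_disk, is_32bit=False):
    """Rough indexing thread suggestion based on Rules 1 to 6 (hypothetical helper)."""
    # Rule 3: stick with 2 threads on a 32-bit machine.
    if is_32bit:
        return 2

    # Rules 4 and 5: beyond these counts the benefit is likely only marginal.
    disk_cap = 8 if fast_disk else 4

    # Rule 6: very rough RAM guideline.
    # Assumption: 4GB to 8GB is treated like "4GB or less", and the shared
    # boundaries (12GB, 16GB) fall into the lower row.
    if ram_gb < 8:
        ram_cap = 2
    elif ram_gb <= 12:
        ram_cap = 4
    elif ram_gb <= 16:
        ram_cap = 6
    else:
        # The table stops at 20GB; the disk cap never exceeds 8 anyway.
        ram_cap = 8

    # Rule 2: never more threads than physical cores (ignore hyperthreading).
    threads = min(physical_cores, disk_cap, ram_cap)

    # Rule 1: always run at least 2 threads.
    return max(2, threads)


# 16 physical cores, 64GB RAM, content on an NVMe SSD.
print(recommend_threads(16, 64, fast_disk=True))   # 8
# 8 physical cores, 8GB RAM, content on an NVMe SSD: RAM is the limit.
print(recommend_threads(8, 8, fast_disk=True))     # 4
```

If you want to leave resources free for other work (factor 4 above), pick a number below what this returns.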

    A graph:
    Benchmark Setup
    AMD Ryzen Threadripper 1950X (16 cores / 32 threads)
    64 GB RAM
    Files indexed 3165
    Emails indexed 285
    Unique words 145454
    [Graph: OSF indexing time vs number of threads]

    Indexing times are in HH:MM:SS, so 00:01:26 means 1 min 26 sec for the 3450 documents (3165 files plus 285 emails).
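
If you want to compare your own runs against these numbers, it helps to turn the HH:MM:SS times into seconds and work out a throughput and a speedup. The following is a small sketch; the helper names are made up for illustration and the only figures used are the ones from the benchmark setup above.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def docs_per_second(documents, hms):
    """Indexing throughput for a run that took 'hms' to index 'documents' items."""
    return documents / hms_to_seconds(hms)


def speedup(baseline_hms, hms):
    """How many times faster a run is than a baseline run (e.g. 1 thread)."""
    return hms_to_seconds(baseline_hms) / hms_to_seconds(hms)


total_documents = 3165 + 285  # files plus emails from the benchmark setup
print(total_documents)                                          # 3450
print(hms_to_seconds("00:01:26"))                               # 86
print(round(docs_per_second(total_documents, "00:01:26"), 1))   # 40.1
```

A speedup close to 2 when going from 1 thread to 2 threads is what Rule 1 describes as almost linear scaling.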