Skip to content

Latest commit

 

History

History
154 lines (131 loc) · 5.94 KB

design.org

File metadata and controls

154 lines (131 loc) · 5.94 KB

Brad@Brad-PC /c/development/qgrep/src/cpp $ ./igreplua search geobase -T distance_pt_pt >>>>> No results!

Brad@Brad-PC /c/development/qgrep/src/cpp $ ./igreplua search geobase distance_pt_pt >>>>> Results

Qgrep cleanup

  • detect colours to tty or file
  • get a test framework setup
  • purify archive.cpp (ie, move ExecuteSimpleSearch to igreplua)
  • trigram splitting
  • delete the old crap files

TASKS

Build all libs for 10.4

Libre2 :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libarchive :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Liblua :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libz, libbz :PROPERTIES: :EstDays: 1

Rethinking Instagrep

  • Compression and decompression are the high performance parts
  • Quick startup. Either server or C
  • Max out the CPUs, this probably means quickest decompress time. Still likely to be IO bound.

Dataflow

  • Single stream of compressed data from file descriptor -> file desc?
  • Decompress wide (pgzip, etc) -> file desc?
  • Single stream of uncompressed data -> per file work queue
  • Grep workers pull streams from queue -> per file work queue
    • (shm_open) Will need to abstract for Win32

Build DB & compress

  • Can be driven by non-C.
  • Compress ought to be parallel, but not required

GUI

  • Required for average users
  • Required to setup the project

Project datafile

  • Project name
  • basepaths & regex matchers

Tasks

Search engine

  • Command line tool that behaves like grep.
  • igrep project [options] expression igrep project -filenames expression
  • Responsible for actually converting a compressed stream to grep hits
  • Provides a library to access the functionality
  1. Build out project structure
    • unit test (cppunit)
    • libarchive
    • RE2
    • setup basic hello world testcase
  2. libarchive data from a tgz (time tests)
  3. RE2 match filter (timings)
  4. Stream functionality (look at libarchive)
  5. Single threaded extract/grep (timings)
  6. RE2 matching as a thread task.

Glue Code

  • Provides the GUI, file selection, file watching, archive updating etc
  1. Filename walking code
  2. Project setup
  3. Create archive from project
  4. File monitoring

New file/stale handling

Server needs to:

  1. full scan of project & compare timestamps to archive
  2. create project.stalefiles fullname for new or touched files -fullname for deletes
  3. generate new tmpfile & swap with project.stalefiles
  4. Generate new archive as tmpfile & swap in, at the same time refreshing project.stalefiles.

Searcher

  • opens project.tgz & project.stalefiles.
  • reads stalefiles into a map, deletes to their own map
  • when archive name is stale, match in that
  • when archive name is gone, skip it
  • May want staleness options - ie, only hit stalefiles at the very end, or hit them at the start.

Command line control

  • projects: list projects
  • regen project
  • search project grep-expr : searches
  • files project
  • start-service
  • stop-service

Plan [7/13]

  • [X] Build a use-case project & use daily
  • [X] hook grep commands to use-case project
  • [X] setup win dev environment (local with parallels?)
  • [X] port current code to win32 (C)
  • [X] port python code to win32
  • [X] begin use at work
  • [X] faster filename searching (look aside)
  • [ ] help
  • [ ] documentation
  • [ ] determine revenue model
  • [ ] get beta testers
  • [ ] setup website & payment options
  • [ ] more performance (pigz, lock free stream, threaded RE2), currently not very multithreaded
  • Apparently lock free stream make 0 difference!
  • Spin waiting on read block
  • TIME TO PROFILE!!
    • [X] You might want to grab the path environment var and append the igrep directory on there.
    • [ ] Quick readme on search parameters and options

Python sucks for scripting :(

Python was really good for getting up and running, but it’s not good for distributing an application. Python is too big, and I’m not actually using many of the libraries, just the language features.

Documentation & Design

igrep is a cross platform text searching utility. The application uses basic platform independant calls where possible, and relies on POSIX style support elsewhere. For example, the Windows version relies on MINGW and POSIX libraries that have been ported to Win32. The app is split into two parts, Python and C++. The C++ part is compiled into a dynamic library, any code that is performance critical is written in C/C++. Python is used as a glue language, all project management and command line handling, etc are coded here. igrep gets its speed from the realization that hard drives and file systems are the main bottle neck when searching code, instead of searching hundreds of Mb of text over thousands of files, igrep simple decompresses a single well compressed file. Search speed is almost entirely bound by how quickly decompression can happen. I currently use gzip, but may want to tune.

Search

Two threads are used when searching. The main thread decompresses source files and places whole files into a thread-safe queue. A second thread dequeues the uncompressed text and tests the block for the input expression. The regex should be run on the entire file as one block as it allows for fast early rejection of the file. If the regex matches in the file, then the file must be divided into lines & the regex run on a per-line basis. This scheme will only utilize two threads, there are other possible ways to arrange for more threads, but all are awkward. The best solution would be for the decompress and regex matching to internally use more threads. libarchive is used for compression/decompression RE2 is used for regular expression matching