Brad@Brad-PC /c/development/qgrep/src/cpp $ ./igreplua search geobase -T distance_pt_pt >>>>> No results!

Brad@Brad-PC /c/development/qgrep/src/cpp $ ./igreplua search geobase distance_pt_pt >>>>> Results

Qgrep cleanup

detect colours to tty or file
get a test framework setup
purify archive.cpp (ie, move ExecuteSimpleSearch to igreplua)
trigram splitting
delete the old crap files

TASKS

Build all libs for 10.4

Libre2 :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libarchive :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Liblua :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libz, libbz :PROPERTIES: :EstDays: 1

Rethinking Instagrep

Compression and decompression are the high performance parts
Quick startup. Either server or C
Max out the CPUs, this probably means quickest decompress time. Still likely to be IO bound.

Dataflow

Single stream of compressed data from file descriptor -> file desc?
Decompress wide (pgzip, etc) -> file desc?
Single stream of uncompressed data -> per file work queue
Grep workers pull streams from queue -> per file work queue
- (shm_open) Will need to abstract for Win32

Build DB & compress

Can be driven by non-C.
Compress ought to be parallel, but not required

GUI

Required for average users
Required to setup the project

Project datafile

Project name
basepaths & regex matchers

Tasks

Search engine

Command line tool that behaves like grep.
igrep project [options] expression igrep project -filenames expression
Responsible for actually converting a compressed stream to grep hits
Provides a library to access the functionality

Build out project structure
- unit test (cppunit)
- libarchive
- RE2
- setup basic hello world testcase
libarchive data from a tgz (time tests)
RE2 match filter (timings)
Stream functionality (look at libarchive)
Single threaded extract/grep (timings)
RE2 matching as a thread task.

Glue Code

Provides the GUI, file selection, file watching, archive updating etc

Filename walking code
Project setup
Create archive from project
File monitoring

New file/stale handling

Server needs to:

full scan of project & compare timestamps to archive
create project.stalefiles fullname for new or touched files -fullname for deletes
generate new tmpfile & swap with project.stalefiles
Generate new archive as tmpfile & swap in, at the same time refreshing project.stalefiles.

Searcher

opens project.tgz & project.stalefiles.
reads stalefiles into a map, deletes to their own map
when archive name is stale, match in that
when archive name is gone, skip it

May want staleness options - ie, only hit stalefiles at the very end, or hit them at the start.

Command line control

projects: list projects
regen project
search project grep-expr : searches
files project
start-service
stop-service

Plan [7/13]

[X] Build a use-case project & use daily
[X] hook grep commands to use-case project
[X] setup win dev environment (local with parallels?)
[X] port current code to win32 (C)
[X] port python code to win32
[X] begin use at work
[X] faster filename searching (look aside)
[ ] help
[ ] documentation
[ ] determine revenue model
[ ] get beta testers
[ ] setup website & payment options
[ ] more performance (pigz, lock free stream, threaded RE2), currently not very multithreaded

Apparently lock free stream make 0 difference!
Spin waiting on read block
TIME TO PROFILE!!
- [X] You might want to grab the path environment var and append the igrep directory on there.
- [ ] Quick readme on search parameters and options

Python sucks for scripting :(

Python was really good for getting up and running, but it’s not good for distributing an application. Python is too big, and I’m not actually using many of the libraries, just the language features.

Documentation & Design

igrep is a cross platform text searching utility. The application uses basic platform independant calls where possible, and relies on POSIX style support elsewhere. For example, the Windows version relies on MINGW and POSIX libraries that have been ported to Win32. The app is split into two parts, Python and C++. The C++ part is compiled into a dynamic library, any code that is performance critical is written in C/C++. Python is used as a glue language, all project management and command line handling, etc are coded here. igrep gets its speed from the realization that hard drives and file systems are the main bottle neck when searching code, instead of searching hundreds of Mb of text over thousands of files, igrep simple decompresses a single well compressed file. Search speed is almost entirely bound by how quickly decompression can happen. I currently use gzip, but may want to tune.

Search

Two threads are used when searching. The main thread decompresses source files and places whole files into a thread-safe queue. A second thread dequeues the uncompressed text and tests the block for the input expression. The regex should be run on the entire file as one block as it allows for fast early rejection of the file. If the regex matches in the file, then the file must be divided into lines & the regex run on a per-line basis. This scheme will only utilize two threads, there are other possible ways to arrange for more threads, but all are awkward. The best solution would be for the decompress and regex matching to internally use more threads. libarchive is used for compression/decompression RE2 is used for regular expression matching

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

design.org

design.org

TASKS

Build all libs for 10.4

Libre2 :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libarchive :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Liblua :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libz, libbz :PROPERTIES: :EstDays: 1

Rethinking Instagrep

Dataflow

Build DB & compress

GUI

Project datafile

Tasks

Search engine

Glue Code

New file/stale handling

Command line control

Plan [7/13]

Python sucks for scripting :(

Documentation & Design

Search

Files

design.org

Latest commit

History

design.org

File metadata and controls

TASKS

Build all libs for 10.4

Libre2 :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libarchive :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Liblua :PROPERTIES: :EstDays: 1 :CompletedDays: 1

Libz, libbz :PROPERTIES: :EstDays: 1

Rethinking Instagrep

Dataflow

Build DB & compress

GUI

Project datafile

Tasks

Search engine

Glue Code

New file/stale handling

Command line control

Plan [7/13]

Python sucks for scripting :(

Documentation & Design

Search