Dynamic file and folder list. #256
A lot to say here:
This time your suspicion is incorrect. When GF uses grep to search directories, there is nary a call to DIR or ADIR in sight. grep performs two functions for GF (and does both very fast):
- traversing the list of files in the directory and its sub-directories
- determining which files have possible matches to the search string
GF then searches through the list of possible matches, one file at a time ("normal" GF processing). In my test folder of 8,000 files:
- GF without grep took 11.9 seconds (less than 5% of that spent generating the list of files)
- GF with grep took 2.4 seconds
This shows that using grep is about 5 times faster (the ratio gradually decreases as the number of matches grows). It demonstrates that the path to optimization is to minimize the number of files left for "normal" GF processing. Thus I do not think there is any advantage to maintaining a separate list of files for GF to process.
Which brings up xargs.exe, which you suggested a week or so ago as a path to use when using GF on the files in a project. I had high hopes for continued success and spent considerable time implementing it within GF. However, the results were extremely disappointing. For my test project with ~2,000 files:
- "Normal" GF took 7.14 seconds
- Using xargs.exe to call grep.exe took 11.42 seconds (about 60% slower)
I believe that the underlying problem here is that grep is hardwired to rapidly search directory trees but has no native way to read a list of files. Using xargs (in "chunks" of files, about 23K bytes each) apparently adds so much overhead as to make this approach unusable. I am very satisfied with the dramatic improvements we have achieved in searching folder trees. So far, no suggestion for searching a list of files has proved profitable. |
@Jimrnelson
I'm glad we worked through getting grep incorporated. I understand that grep is getting the files. It's because you were asking about lists of files, hence xargs (which I had no time to test), that I was trying to think up a faster way to scan the set of files. Think back to my original SQL, which you said was a weak example: if we instantiate a regex object before the SQL and then refer to that object in the SQL, there will be little overhead. So I'm guessing a .012-second query of a subset of the files, which asks the regex to test each file. That is similar to the locate command that gave you a 7-times boost. As long as there is a cursor of the files to be scanned, that might do it.
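As a rough illustration of the idea, a minimal sketch (the cursor name c_Files, its FullPath column, and the sample pattern are placeholders, not GF's actual names):
* Instantiate the regex object once, before the SQL.
loRegEx = CREATEOBJECT("VBScript.RegExp")
loRegEx.Pattern = "CreateObject\("   && placeholder search expression
loRegEx.IgnoreCase = .T.
* Refer to the pre-built object inside the query, so the only
* per-row cost is FileToStr() plus one Test() call.
SELECT FullPath FROM c_Files ;
    WHERE loRegEx.Test(FILETOSTR(FullPath)) ;
    INTO CURSOR c_Matches NOFILTER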
|
Unfortunately, I have disappointing news on this front. I have tested this technique and found only negligible improvement (about 6%). What GF does "normally" is to scan a cursor of file names; for each file it performs FileToStr and then RegEx on the contents. Your suggestion moves that process inside the Select statement, but it is still necessary to perform FileToStr and RegEx on each file. The negligible savings occur because Select loops slightly more efficiently than the normal case, but the meat of the work, the FileToStr/RegEx, must still be performed for each file. It is now my belief that to achieve any substantial savings we would need to find something completely outside of VFP, some Windows utility (like the Grep.exe you found) that can work on an entire list of files.
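For reference, the "normal" per-file work described above amounts to a loop of roughly this shape (a sketch only; the cursor and field names are placeholders, not GF's actual code):
loRegEx = CREATEOBJECT("VBScript.RegExp")
loRegEx.Pattern = "some search expression"
SELECT c_Files
SCAN
    * One FileToStr() and one regex test per candidate file.
    IF loRegEx.Test(FILETOSTR(c_Files.FullPath))
        INSERT INTO c_Results (FullPath) VALUES (c_Files.FullPath)
    ENDIF
ENDSCAN
Whether this work lives in a SCAN loop or inside a SELECT, the FileToStr/RegEx cost per file is the same, which is why the gain was only about 6%. |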
Let me see if I follow. You want a utility that can either scan a hierarchy of folders to produce a cursor of files, or take an existing cursor of files and run a regex on the listed files, without FileToStr? |
Actually, not quite either. Conceptually, there are four steps within the GF search engine:
1. obtain the list of files to search
2. read the contents of each file
3. test the contents against the search expression
4. add each matching file to the result
For "normal" GF searching in a directory / sub-directory, these become:
1. recurse the folders with ADIR
2. FileToStr on each file
3. RegEx on the contents
4. insert a record into the result cursor
When searching directories and their sub-directories, the recent optimization of a few weeks ago uses grep.exe to combine the first three steps into one, and to do so much faster than VFP was able to (as noted, maybe 4-6 times faster). The unsolved problem is how to optimize this search when step 1 obtains the list of files from a different source than ADIR (from the list of files in a project or a list of projects, e.g.). So what is desirable is the following: a Windows utility that does a grep-like search on a list of files (presumably a text file, one fully qualified file name per line), with a result of one line for each file with a match. Note that earlier we tried using xargs to pipe a list of file names to grep, but this turned out to be much slower than normal GF.
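For context, the xargs approach amounts to a pipeline of roughly this shape (a sketch assuming the GNU Win32 ports of xargs and grep; the exact flags GF used are not shown in this thread):
xargs.exe --arg-file=filelist.txt grep.exe -l -E "searchPattern" > matches.txt
xargs batches the file names onto command lines of limited length, which is the "chunking" overhead described earlier. |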
What I was originally hoping we could do is include ADirEx from Christian Ehlscheid. I was going to add regex to my DirX, but it would be faster if he added it to his ADirEx, and his FLL works inside VFP. His utility gets all 20,150 files in my dev folder into a cursor in .287 seconds. If I tell it to do ".??a;.prg" recursively, it takes .241 seconds, and that is running his utility twice. I bet he could do it faster. If he supported building the cursor while recursing the folders and searching, that could be as fast as grep. If he could also scan an existing cursor while searching, that could do what grep and xargs cannot. If he adds a new feature to scan an existing cursor, then we could produce a set of results, scan that set for a different expression, and so on. That sounds pretty good to me. I asked him; let's see what he says. It would do this without FileToStr and somewhat outside of VFP. |
You've put together a lot of IFs there. I think this issue should be put on hold until we have a clear statement from him. It sounds like he would be replicating grep in some way (at least the regular-expression searching part). Interesting. Note that this is not on my radar as a high-priority project; I am not aware of any interest in the community in making project searches faster. (For me, there is no need to improve on the 7 seconds it takes to search all of my projects combined.) |
That is the problem with democracy - you get less than optimal. Nobody cared for 12 years - what you called natural. That is not acceptable to me. If he and I can make something you can use, don't look a gift horse in the mouth or offend those pushing the envelope by suggesting it's a waste of time. It's 2 IFs: one to add searching to ADirEx, and a new feature to scan a cursor into another cursor. He also provides the source code, so a C++ guy could help, as long as it doesn't end up getting called ADirSEX. I'll gladly take any input you have on making it suitable. |
Please, there's no need for you to be like that, Mike. Here's the concept for what I need (expressed as if what will be provided were a VFP procedure). Three parameters: [1] a list of file names, [2] a grep expression, and [3] the result. For each file in [1], the grep expression in [2] is used to determine whether the file is to be included in the result in [3].
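A hypothetical VFP shape for that contract (the procedure name and parameters are illustrative only; the real matching would be done by the external utility, with the VFP body below standing in to show the intended behavior):
* Hypothetical contract -- names are illustrative, not an actual API.
PROCEDURE GrepFileList(tcFileList, tcGrepExpr, tcResultFile)
    * tcFileList:   text file, one fully qualified file name per line
    * tcGrepExpr:   expression used to test the contents of each file
    * tcResultFile: receives one line for each file with a match
    LOCAL loRegEx, laFiles[1], lnCount, lnI, lcResult
    loRegEx = CREATEOBJECT("VBScript.RegExp")
    loRegEx.Pattern = tcGrepExpr
    lcResult = ""
    lnCount = ALINES(laFiles, FILETOSTR(tcFileList))
    FOR lnI = 1 TO lnCount
        IF loRegEx.Test(FILETOSTR(laFiles[lnI]))
            lcResult = lcResult + laFiles[lnI] + CHR(13) + CHR(10)
        ENDIF
    ENDFOR
    STRTOFILE(lcResult, tcResultFile)
ENDPROC
|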
I always get attempted shoot downs from everyone. Have you no imagination? You are content with a status quo? I swear Mr. Scott could travel back in time with a replicator and the majority of average intelligence humans would say, Naw, I'm good. |
I agree with your concept. As I see it, I've already used Christian's ADirEx to do 2 queries and compare them to find differences with a single SQL command. Can't do that with ADIR and arrays. Status quo. Ha. |
In your post that initiated this issue, you discussed the possibility of maintaining a table of files to be searched, presumably as a way to shorten the time GF needs to search the files. You wrote: "This could be a separate project. Do you see it saving you time and/or reducing programming?"
Yes, I fully agree that this should be a separate project. GoFish is an advanced code search tool (as stated on its home page) and thus I do not believe that its scope should include maintaining the list of files to be searched. A separate project would be the correct path, especially considering that such a project might well grow into significant complexity unrelated to searching code.
No, it would not save any time or reduce any programming; nor would it require any additional time or programming. There is already a "Custom UDF" option in the Scope dropdown that allows anybody to write their own code to add records to a cursor of all the files they would like to have searched. In this case, that would be one line of code to read each record from the table of files and insert it into the result cursor.
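A minimal sketch of such a UDF (the table, cursor, and field names are assumptions for illustration; consult GF's Custom UDF documentation for the actual contract):
* Hypothetical Custom UDF -- cursor and field names are assumed.
PROCEDURE MyProjectScope
    * Read the externally maintained table of files and add each
    * file name to the cursor of files GF will search.
    SELECT MyFileTable
    SCAN
        INSERT INTO c_FilesToSearch (FullPath) VALUES (MyFileTable.FullPath)
    ENDSCAN
ENDPROC
|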
And, as always on this planet, limited thinkers do not try to think outside the box. You originally refused to even entertain the idea of using Grep. Before Grep, GoFish was not "Advanced". What you don't understand is this: I look for any way to make things as fast as possible for me, and possibly for the mob, DESPITE the resistance, rudeness, attacks and even illegal actions of the mob. A file monitoring utility would mean no need to scan the directory itself every time. Fox can roar through an existing cursor of files; building that list takes more time than updating the cursor would. The benefit would be to extract all files with certain names and then have something like ADirEx scan and regex those files, since grep, in its authors' limited thinking, cannot do what FoxPro and ADirEx could do.
Running dir via wscript.shell takes 2.869 seconds on my computer. DirX takes 0.249 seconds. If the cursor already existed, a query like the one in my opening post would take 0.022 seconds, resulting in 6,995 records. A new vfp2c32.fll feature to scan and grep those files, all in some wonderland of speed between Fox and Windows, sounds good to me. If the scanning of the files and the regex - which the vfp2c32 source code already mentions - can be done in one massive burst, that seems worthwhile to me and to GoFish. I may make the changes to vfp2c32 myself and offer them to Christian.
We have 52,000 pictures, and we add to that pile intermittently. DirX takes .478 seconds to put those files into a cursor and index them. A query on that indexed cursor takes .011 seconds to find 2 files, which a modified vfp2c32 could then regex. I'd gladly accept that. So, you can have something that recursively scans the folders in .4 seconds instead of 2.8, and potentially something that rips through 50,000 filenames in .010 seconds.
By the way, I believe you mentioned using GoFish5 to search the source of GoFish7. If you build GoFish7.app and put it in a separate folder, you can use GoFish7 to gofish the GoFish source code.
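An illustrative query of the kind described (the cursor, field, and file names are invented for the example; the actual query isn't shown in this thread):
* Illustrative only -- names are invented; assumes an index on the
* file-name expression so the lookup is Rushmore-optimized.
SELECT * FROM c_Pictures ;
    WHERE UPPER(JUSTFNAME(FullPath)) == "SUNSET2019.JPG" ;
    INTO CURSOR c_Hits NOFILTER
|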
The monitoring function would be optional. |
EDIT: Wait before you try this. I am going to incorporate PCRE so we get a better set of RegEx abilities. I made a tiny C++ program to read filenames from one file, perform a regexp on each, and output the matching files to output.txt. I tested it with an input file of all ??a files in my project folder, which is 5,471 files:
dir *.??a /s /b > input.txt
This should be useful for scanning the first set of matches to produce a second set of matches. |
📝 Provide a description of the new feature
What is the expected behavior of the proposed feature? In what scenario would it be used?
@Jimrnelson
I suspect that even with Grep, GoFish is still using some method to access the directory, such as recursing folders with ADIR or multiple calls to DIR *.prg>outputfile, DIR *.??a>outputfile.txt
What if there was a table that was updated in the background while a developer works? GoFish could access that table and do a single query like this:
select * from c_temp where inlist(fileext,'PRG','SCA','FRA','VCA','LBA','MNA') into cursor c_temp1 NOFILTER
That takes .4 seconds on a cursor containing all the files on my C drive: 992,000 files. I tried my DirX on my dev folder to build a cursor of all files, which took .21 seconds; the query above then took .012 seconds.
This could be a separate project. Do you see it saving you time and/or reducing programming? It would need start, stop and refresh functions. It should use buffering to update the table.
If you'd like to see this feature implemented, add a 👍 reaction to this post.