Any way to install without Docker on Linux system? #16
Comments
Installing and running wp2txt on CentOS is, of course, possible! However, I'm afraid I don't use CentOS and cannot provide detailed instructions. On Ubuntu, you can install wp2txt by following these steps. I believe most of the installation process can be done on CentOS by replacing the Please note that the commands include uninstalling the existing Ruby to use the latest Ruby version as much as possible. Make sure that no other programs depend on the existing Ruby before executing.
Now, the |
Thanks for the steps! The instructions make sense. However, I found another solution yesterday: https://github.com/daveshap/PlainTextWikipedia It seemed to work OK for my needs, so I will revisit your solution later. It does leave some HTML in the plain text, though, which I have to remove manually with regexes like this: `text = re.sub(r'{[^}]*}', '', text)  # replace stuff in {} braces` It seems your tool is more robust and does not leave HTML in the plain text? At least your sample looks nice and clean here: https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt Although the page I had trouble parsing was this one:
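For anyone doing the same kind of post-processing, here is a minimal sketch of the regex cleanup described in the comment above. It is not part of wp2txt or PlainTextWikipedia; the function name `strip_leftover_markup` and both patterns are illustrative assumptions. It repeatedly removes the innermost `{...}` span, so nested templates also disappear, and then drops anything that looks like an HTML tag.

```python
import re

# Hypothetical helper, not part of wp2txt or PlainTextWikipedia.
# Innermost brace span: no braces allowed inside, so nested templates
# are peeled away one layer per pass.
BRACE_SPAN = re.compile(r"\{[^{}]*\}")
# Anything shaped like an opening or closing HTML tag, e.g. <b>, </ref>.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_leftover_markup(text: str) -> str:
    """Remove leftover {...} template spans and HTML tags from plain text."""
    previous = None
    # Repeat until a pass changes nothing, which handles nesting.
    while previous != text:
        previous = text
        text = BRACE_SPAN.sub("", text)
    text = HTML_TAG.sub("", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


if __name__ == "__main__":
    sample = "Paris {{cite web|url=x}} is <b>big</b>."
    print(strip_leftover_markup(sample))
```

This is only a rough filter: it also deletes any legitimate text that happens to sit inside braces, and it leaves unbalanced braces untouched, so a dedicated extractor is likely to give cleaner results on real dump data.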
I agree that it can be challenging for tools like WP2TXT to be completely reliable due to the user-created nature of Wikipedia articles. There may be instances where tags and directives are left uncorrected, which can affect the tool’s accuracy. That being said, I still believe that WP2TXT is a useful tool, especially with its features for extracting page summaries and categories. I encourage you to give it a try sometime. However, there is another tool that is similar to WP2TXT and more widely used for removing MediaWiki tags from Wikipedia dump data. You might want to check it out as well: https://github.com/attardi/wikiextractor Good luck!
I am using a Google Cloud machine, so I would prefer not to use up too much disk space with Docker. I am running CentOS 8.