Any way to install without Docker on Linux system? #16

Open
feacluster opened this issue May 16, 2023 · 3 comments

@feacluster

I am using a Google Cloud machine, so I would prefer not to use up too much disk space with Docker. I am running CentOS 8.

@yohasebe
Owner

yohasebe commented May 18, 2023

Installing and running wp2txt on CentOS is, of course, possible! However, I'm afraid I don't use CentOS and cannot provide detailed instructions.

On Ubuntu, you can install wp2txt by following the steps below. I believe most of the process carries over to CentOS if you replace the apt commands with their yum equivalents.

Note that the commands first uninstall any existing Ruby so that a recent Ruby version can be installed through rbenv. Make sure that no other programs depend on the existing Ruby before running them.

# Uninstall existing Ruby
sudo apt purge ruby rbenv ruby-build
rm -rf ~/.rbenv

# Install tools required for Ruby installation
sudo apt update
sudo apt install gcc make
sudo apt install libssl-dev zlib1g-dev

# Install lbzip2 to speed up wp2txt
sudo apt install lbzip2

# Install Ruby
git clone --depth 1 https://github.com/rbenv/rbenv.git ~/.rbenv
cd ~/.rbenv && src/configure && make -C src
# rbenv is not on PATH yet at this point, so use the install directory directly
git clone --depth 1 https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build

# Add path to .bashrc (if using bash)
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc

# Reload .bashrc
source ~/.bashrc

# Check available Ruby versions for installation
rbenv install -l

# Install Ruby with specified version (here, version 3.2.0)
rbenv install 3.2.0

# Switch Ruby version
rbenv global 3.2.0

# Install wp2txt
gem install wp2txt

Now, the wp2txt command should be available.
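
For CentOS 8 specifically, the package-installation steps above would probably translate along the following lines. This is an untested sketch, not an official procedure: the package names `openssl-devel` and `zlib-devel`, and the assumption that `lbzip2` comes from the EPEL repository, are my guesses and should be checked with `dnf search` first. The helper names `to_rpm_name` and `run` are made up for this example. By default the script only prints the commands; set `RUN=1` to actually execute them.

```shell
#!/bin/sh
# Sketch: translate the Ubuntu package steps above to CentOS 8.
# Package names are assumptions; verify them with `dnf search <name>`.

# Map a Debian/Ubuntu package name to its likely RHEL-family name.
to_rpm_name() {
  case "$1" in
    libssl-dev) echo "openssl-devel" ;;
    zlib1g-dev) echo "zlib-devel" ;;
    *)          echo "$1" ;;   # gcc, make, lbzip2 are assumed to keep their names
  esac
}

# Print a command, and execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Build the CentOS package list from the Ubuntu one used above.
pkgs=""
for deb in gcc make libssl-dev zlib1g-dev; do
  pkgs="$pkgs $(to_rpm_name "$deb")"
done

# Build tools and headers needed to compile Ruby
run sudo dnf install -y $pkgs

# lbzip2 is assumed to live in EPEL, so enable that repository first
run sudo dnf install -y epel-release
run sudo dnf install -y lbzip2
```

The rbenv, ruby-build, and `gem install wp2txt` steps do not involve the system package manager, so they should work unchanged. If the Ruby build complains about a missing libyaml or psych, installing the libyaml development headers is the usual fix.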

@feacluster
Author

Thanks for the steps! The instructions make sense. However, I found another solution yesterday:

https://github.com/daveshap/PlainTextWikipedia

It seems to work OK for my needs, so I will revisit your solution later on. It does leave some HTML and markup in the plain text, though, which I have to remove afterwards with a few regexes like this:

text = re.sub(r'{[^}]*}', '', text)  # remove anything inside {} braces
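
One caveat with a single-brace pattern like this: on a double-brace template such as `{{x}}` it matches `{{x}` and leaves a stray `}` behind, and nested templates fare worse. A possible alternative, sketched here with `sed` and assuming each template starts and ends on the same line, is to delete innermost `{{...}}` pairs repeatedly until none remain. The function name `strip_templates` is made up for this example.

```shell
#!/bin/sh
# Remove {{...}} templates from stdin, innermost first, one line at a time.
# Templates that span several lines are NOT handled by this sketch.
strip_templates() {
  # :a  defines a label; the s command deletes every innermost {{...}};
  # ta  jumps back to the label for as long as a deletion happened.
  sed -E -e ':a' -e 's/\{\{[^{}]*\}\}//g' -e 'ta'
}

printf 'Green {{x|y}} is a {{a {{b}} c}}color.\n' | strip_templates
# prints "Green  is a color." (the space on each side of the first template is kept)
```

A real Wikipedia page also contains tables, references, and HTML tags, so this only covers one kind of leftover markup.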

It seems your tool is more robust and does not leave HTML in the plain text? At least your sample output looks nice and clean here:

https://raw.githubusercontent.com/yohasebe/wp2txt/master/data/output_samples/testdata_en_summary.txt

Although the page I had trouble parsing was this one:

https://simple.wikipedia.org/wiki/Green

@yohasebe
Owner

I agree that it is hard for tools like WP2TXT to be completely reliable, because Wikipedia articles are written by many different users and their markup is not uniform. There may be cases where tags and directives are left in the output, which affects the quality of the extracted text.

That being said, I still believe that WP2TXT is a useful tool, especially with its features for extracting page summaries and categories. I encourage you to give it a try sometime. However, there is another tool that is similar to WP2TXT and more widely used for removing MediaWiki tags from Wikipedia dump data. You might want to check it out as well:

https://github.com/attardi/wikiextractor

Good luck!
