Cain stuck file copy #12

Open
alexbarta opened this issue Jan 13, 2019 · 19 comments

@alexbarta
Contributor

Hi Maor,

I'm trying to restore, but almost every time some file gets stuck during the copy and cain 0.5.1 hangs.

I tried tuning the buffer size and parallelism without success; cain randomly gets stuck on some file copy. After a while in this state, the TCP connection to minio disappears from the netstat output, but the cain process stays alive.

Any idea how to increase the verbosity of the copy process?

Regards

@maorfr
Contributor

maorfr commented Jan 13, 2019

Could it be that these are large files?
Can you check in k8s if the file size changes during the copy?

@alexbarta
Contributor Author

Well, the schema has only a few records, since I'm trying this on a minimal test installation. If I do a du on the minio folder, the total is less than 13M:

du -skh minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/*
3.4M    minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0
3.4M    minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-1
3.4M    minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-2

By the way, I'm running Kubernetes v1.11.5 with a flannel host-gw setup.

@alexbarta
Contributor Author

on cassandra-0 the total data amounts to:

sudo du -skh /mnt/disks/cassandra/data/thingsboard
9.4M    /mnt/disks/cassandra/data/thingsboard

same on the other nodes

@maorfr
Contributor

maorfr commented Jan 13, 2019

Let's try to narrow this down.
Can you try to do a copy from minio to k8s using skbn?

@alexbarta
Contributor Author

Sure, give me a minute.

@alexbarta
Contributor Author

alexbarta commented Jan 13, 2019

OK, skbn seems to be working properly.

I created a file on minio:

sudo dd if=/dev/zero of=/mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.50469 s, 429 MB/s

Then I ran skbn to copy that file into the k8s container:

 kubectl run cassandra-restore  --rm  --serviceaccount='cassandra-backup' -i --tty --restart=Never --image-pull-policy=IfNotPresent --image nuvo/skbn --env 'AWS_ACCESS_KEY_ID=admin' --env 'AWS_SECRET_ACCESS_KEY=*****' --env 'AWS_S3_NO_SSL=true' --env 'AWS_S3_FORCE_PATH_STYLE=true' --env 'AWS_S3_ENDPOINT=http://minio-svc.cfs.svc.cluster.local:9000' --command -- sh
If you don't see a command prompt, try pressing enter.
~ $
~ $ skbn cp --src s3://db-backup/abigfile --dst k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:02:54 [1/1] copy: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:03:00 [1/1] done: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data

And checked the copied file against the original:

 md5sum /mnt/disks/cassandra/stdin
cd573cfaace07e7949bc0c46028904ff  /mnt/disks/cassandra/stdin

md5sum /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile
cd573cfaace07e7949bc0c46028904ff  /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile

@alexbarta
Contributor Author

alexbarta commented Jan 13, 2019

Maor, looking at your skbn PerformCopy code

https://github.com/nuvo/skbn/blob/42781bdb9d5cd81fcda5a6ac44a17e0480fb0e94/pkg/skbn/skbn.go#L139

I see you are using nio buffers. Maybe the hang is caused by a race condition between the pipew and piper goroutines. Converting the piper goroutine to a plain function call could be a good test to see whether that is the cause.

When cain gets stuck I only see the "copy:" log line; the corresponding "done:" line never appears.

What do you think?

@maorfr
Contributor

maorfr commented Jan 13, 2019

These routines are running concurrently, allowing copy to be done using a pipe. This has to be 2 goroutines...

@maorfr
Contributor

maorfr commented Jan 13, 2019

See nuvo/skbn#3 for details

@alexbarta
Contributor Author

Then the hang is in either the Download or the Upload function.

@maorfr
Contributor

maorfr commented Jan 13, 2019

Probably in download. Can you try the same again, but with a file that gets stuck?

@alexbarta
Contributor Author

Unfortunately it is not one particular file: when running cain, it randomly stops on a different (very small) file every time. Only a couple of times did it finish the job.

The funny thing is that backup runs 2x faster and never gets stuck.

@alexbarta
Contributor Author

This is a short GIF of the hang:
[image: cainstuck]

@maorfr
Contributor

maorfr commented Jan 13, 2019

If minio is a pod in the cluster, you can try treating it as a k8s://... source.
Give it a shot, as a workaround :)

@alexbarta
Contributor Author

Cool idea! I will try it, thanks.

@alexbarta
Contributor Author

No luck, it got stuck here this time :(

cain restore --src 'k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster' -n cfs -k thingsboard -t 20190112212203 --cassandra-data-dir /cassandra_data/data --buffer-size 1 -l app=cassandra
...
2019/01/13 17:37:18 [0372/1674] copy: k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0/event_by_id/manifest.json -> k8s://cfs/cassandra-0/cassandra/cassandra_data/data/thingsboard/event_by_id-42d57b20174511e986ce69f7ad260f0d/manifest.json

@maorfr
Contributor

maorfr commented Jan 13, 2019

I want to assume this is an issue with minio, but can't verify at this time...

@alexbarta
Contributor Author

Well, using k8s:// gives the same result, so I guess it is something that happens during PerformCopy.

@Bfoster-melrok

Is this project still active?
I seem to be having the same issue when writing from a Cassandra cluster on EKS to S3. I tried multiple times and it gets stuck at a random point each time.


3 participants