outernet-rtlsdr.py stops outputting 'normal' PDUs after a while #2
Comments
Hi Darren, I have received similar reports when I first released gr-outernet, but I never looked into this in depth, as I was busy looking at the network protocol (what later became free-outernet) and only some users suffered this problem, so we were not sure if it was a hardware issue. Looking at the flowgraph, the only thing that I could imagine falling into a failure state after running for a long time is the HDLC Deframer. Could you do a small test? Just replace the four HDLC Deframer blocks by the stock GNU Radio ones (these are also called "HDLC Deframer" but you can tell them apart because it is under the "Packet Operators" category and its parameters are "Min length" and "Max length"). If it doesn't fail with the stock deframers, then it is a problem of the HDLC Deframer in gr-kiss and I'll have to look at that more carefully. If the problem persists, then we'll have to think of something else. In this case, it would help to have lots of plots (constellation, frequency, time, etc.) to see if anything changes during the failure state. |
I have replaced the KISS Deframers with the stock ones and have it running OK. I'll let you know how it goes, |
It failed again with the stock HDLC deframers. I set Min Length to 4 and Max Length to 1024 (in all 4 instances). Here's the last from outernet_rtlsdr_stock.py:
and the full output from free-outernet.py covering the run of outernet_rtlsdr_stock.py
|
I can confirm this issue. Also tried replacing the HDLC block and after a while, the frame output stops and only
is printed on the terminal window. I am running gnuradio on Ubuntu 16.04 64-bits with an RTL-SDR receiver and using the Americas sat. I'm available for helping with any testing. 73, Edson PY2SDR |
I can't think right now of any possible software failure that gives this problem. Some tests you can do to try to rule out hardware failure. Connect a "File sink" block directly to the "RTL-SDR Source" block and record several minutes of samples (I assume that the decoder will still be working fine if you only record a few minutes). Disable the "File sink" and replace the "RTL-SDR Source" by a "File source" pointing to the file you just recorded, with the parameter "Repeat" set to "Yes". We know that the samples in the recording are fine, the question now is if the software will run into troubles when processing them after some time. The good thing is that if you don't put a "Throttle" block, decoding will happen as fast as your CPU allows, so if it is a software problem that happens only after a while it should happen much earlier when doing this. Another possible test is to put the "File sink" as before and run and record from the RTL-SDR until the decoder runs into problems. The good thing is that if there is a problem with the samples, we will be able to see it at the final part of the recording. The bad thing is that recording chews a lot of disk space, about 1.15GB per minute. It might also be something really simple, such as the XO heating up after the RTL-SDR has been running for a while, and so the frequency moving out of the receiver filter. We should rule out this possibility. It will be obvious if this happens just by looking at a frequency plot of the filtered signal. Look at this post to see how to do this and other plots which might be helpful in debugging this. |
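To put a number on the oscillator-drift idea: the frequency error in Hz is the drift in ppm times the carrier frequency in MHz. The sketch below is only a rough illustration. The 1545.94 MHz carrier is the Alphasat frequency from the original report, and the 5 kHz half-width is a made-up placeholder, not the actual filter setting of the flowgraph.

```python
def drift_hz(carrier_hz, ppm):
    """Frequency error in Hz for an oscillator error of `ppm` parts per million."""
    return carrier_hz * ppm / 1e6


def stays_in_passband(carrier_hz, ppm, half_width_hz):
    """True if the drifted signal centre is still within +/- half_width_hz."""
    return abs(drift_hz(carrier_hz, ppm)) <= half_width_hz


if __name__ == "__main__":
    CARRIER_HZ = 1545.94e6   # Alphasat downlink used in the original report
    HALF_WIDTH_HZ = 5e3      # hypothetical value, check the real filter settings
    for ppm in (0.5, 1, 2, 5):
        offset = drift_hz(CARRIER_HZ, ppm)
        ok = stays_in_passband(CARRIER_HZ, ppm, HALF_WIDTH_HZ)
        print(f"{ppm:>4} ppm -> {offset:8.1f} Hz, inside passband: {ok}")
```

At L-band every ppm of drift moves the signal by roughly 1.5 kHz, which is why a frequency plot of the filtered signal shows this kind of problem at a glance.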
I left the sw running today, it seemed to enter the "failed" state shortly after I went to work. Scrolling back through the output, I can see that the ffff PDUs aren't the only ones coming out. The last few are shown below:
The free-outernet script said this about them:
Not sure that info will be of use though. Here, I'm using GNU Radio and friends, freshly re-built with PyBOMBS, on Ubuntu 14.04 x86_64. The RX hardware is the Outernet kit, with 30m of CLF-400 coax between the LNA and RTL dongle. Regarding the possibility of receiver drift causing the downlink signal to slip out of the passband, that doesn't seem to be the case: the signal is sitting nicely in the passband all the time. For the constellation plot, it's not as tight as I would like, with a fair few samples around the 0 crossing, but it doesn't seem to look any worse in the "failed" state. The spectrum plot looks very similar to the one in the blog post you mention, typically with similar levels of grass-like noise. Also note that restarting outernet-rtlsdr after it has gone into the failed state does always seem to work as expected right away. I will have a go at recording the raw feed as you suggest. I shall also try dropping the RX gain and bringing it back up again to see if I can induce failure that way. I'll also run rtl_test for a good long while, to see what happens. I also have some fans blowing on the RTL dongle now, so I'll see if that makes any difference. |
Thanks for your tests Darren. The fact that the constellation plot looks normal even in the failed state suggests that signal processing up to Viterbi decoding is working fine. This almost rules out a hardware error. This is very weird, indeed. The remaining part of processing is Viterbi decoding, descrambling and HDLC deframing. Some part of this process seems to run into a wrong state where it no longer works. We've tested 2 completely different implementations of HDLC deframing, obtaining the same results. So I don't think the problem is there. Also, the packets you get in the failure state look like what you would get if you feed garbage (random bits) to an HDLC deframer. Every once in a while you would get a small HDLC frame with valid CRC-16 out of random input. The descrambler is very simple, so I would doubt that it's possible for it to run into some failure state. I've read the descrambler code again and it seems OK. This leaves the Viterbi decoder. It is a complex algorithm, but this is part of GNU Radio and the complex code comes from Phil Karn's libfec, so I assume that this is very well tested. I'll try to stress test the Viterbi decoder just in case I see it failing. If you're going to do recordings, please do one for me. I'll loop it in my machine to see if it fails. It doesn't have to be very long. A couple of minutes will suffice. Another idea off the top of my head: GNU Radio runs each block in a separate thread. If one of these threads dies, sometimes the flowgraph will seem to run normally but it won't. Perhaps even your OS is killing a thread due to an out of memory condition, so watch out |
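The point about random bits occasionally yielding a frame with a valid CRC-16 can be checked with a small experiment. This is a minimal sketch, assuming the frame check sequence is the usual HDLC one (CRC-16/X.25, sent low byte first); the helper names are illustrative and are not taken from gr-kiss or GNU Radio.

```python
import random


def crc16_x25(data: bytes) -> int:
    """CRC-16/X.25: reflected polynomial 0x8408, init 0xFFFF, final XOR 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


def add_fcs(payload: bytes) -> bytes:
    """Append the 16-bit frame check sequence, low byte first."""
    return payload + crc16_x25(payload).to_bytes(2, "little")


def frame_is_valid(frame: bytes) -> bool:
    """Check the trailing FCS of a frame with at least one payload byte."""
    if len(frame) < 3:
        return False
    return crc16_x25(frame[:-2]) == int.from_bytes(frame[-2:], "little")


def count_false_positives(trials: int, length: int, seed: int = 0) -> int:
    """Count how many random byte strings of `length` pass the CRC check."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        frame = bytes(rng.getrandbits(8) for _ in range(length))
        hits += frame_is_valid(frame)
    return hits


if __name__ == "__main__":
    trials = 200_000
    hits = count_false_positives(trials, length=4)
    print(f"{hits} of {trials} random 4-byte frames passed the CRC")
    print(f"expected about {trials / 65536:.1f} (one in 65536)")
```

A 16-bit check passes for about one random candidate in 65536, so a deframer fed garbage at line rate will still emit a short "valid" frame now and then, which fits the sporadic small PDUs seen in the failed state.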
Before switching to recording instead of live processing I've confirmed that fan-cooling the RTL dongle didn't cure the problem. Nothing was evident in the output of dmesg. Would it be more manageable to record the 48kHz from the frequency translating filter? I can't think of an obvious downside to that idea? |
Yes. Recording at 48kHz after frequency translation will be OK. |
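For reference, the disk usage at both rates can be estimated as follows. This is a minimal sketch, assuming the file sink stores GNU Radio complex samples (two 32-bit floats, 8 bytes each). The 2.4 Msps input rate is an assumption inferred from the "about 1.15GB per minute" figure quoted earlier, not read from the flowgraph.

```python
BYTES_PER_COMPLEX_SAMPLE = 8  # two 32-bit floats per sample


def recording_bytes(sample_rate_hz, seconds):
    """Size in bytes of a raw complex recording of the given duration."""
    return int(sample_rate_hz * seconds * BYTES_PER_COMPLEX_SAMPLE)


if __name__ == "__main__":
    for label, rate in (("RTL-SDR output (assumed 2.4 Msps)", 2.4e6),
                        ("after frequency translation (48 kHz)", 48e3)):
        per_minute = recording_bytes(rate, 60)
        print(f"{label}: {per_minute / 1e6:.2f} MB per minute")
```

Recording after decimation is 50 times smaller under these assumptions, so a capture long enough to include the failure stays manageable.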
I made a couple of recordings at 48 kHz. The first was about 4 mins of collection and is about 112MB. There were no anomalies during recording, but the sw does twitch sometimes during a playback loop, though not at the same points in the recording. I have also made a longer recording, about 30 mins of collection. It is about 840MB in a bz2 archive. The sw did go into a failed state in the last two minutes of recording. I haven't tried playing it back yet. That will have to wait until another day.
I'm not sure if I can post files of this size here as attachments. Have you any preferences/suggestions as to how I provide the files to you? I don't have any facilities to hand, or any preferences/suggestions myself. |
Today I finally could sit down and do some further testing. I've added a file sink block right after the low pass filter. I left the flowgraph running until the problem occurred. The good news is that I could reproduce the problem. The bad news is that if I split the file in smaller chunks, the problem does not occur. This is rather strange. I tried keeping the rtl-sdr source running to introduce throttling, but there is no difference between having throttling or not. I am pasting here two sets of data from the flowgraph terminal and from free-outernet.py (older version -- I am having a hard time trying to get zfec installed on my Ubuntu 16.04 machine). The following packet is the last one received when the problem occurs.
|
On my previous comment... The problem occurs between t = 2017-01-28 14:46:34 UTC and t = 2017-01-28 14:46:34 UTC. During this time interval, there are two malformed LDP packets and one file service. However, I am not sure this is relevant since the problem occurs only when I replay the whole file (about 1 GB). If I split the file in four parts and try to play the chunk that should have the problem, the problem does not occur. I am baffled! Daniel, if you want, I can place the huge file on Dropbox. 73, Edson PY2SDR |
One more observation. When the problem occurs, the packet
gets printed every minute. This packet seems to appear right before every time packet. |
I left the flowgraph running after the problem occurred and so far it has not restored normal decoding. From time to time some packets with length greater than 2 bytes are decoded, but they seem bogus. The feed-forward AGC thread is very CPU hungry.
73, Edson PY2SDR |
The feed-forward AGC is very CPU intensive. You could make it less intensive by reducing the "Num. samples" parameter, but I don't think this has anything to do with the problem. Please upload the recording to Dropbox. I'll try to process it in my system to see if I can reproduce the problem and perhaps run some tests. It's not that weird that if you split the file in segments then you're not able to reproduce the failure state. It must be something about the whole history of the data processing. |
I have been running Edson's recording with the same GRC flowgraph as him in my system for several hours. I don't get the failure state. Edson gets it always on the first repeat of the recording. It seems that this is some problem with software versions, ABIs, or something. I'm running GNU Radio 3.7.10.1 on Gentoo. Perhaps it's a good idea to compile GNU Radio using PyBOMBS and see if that also suffers the problem. |
Hi Daniel, This afternoon I did a fresh install of GNURadio using PyBOMBS on a local directory (I had issues before while using the default prefix). Compiled and installed a fresh copy of gr-outernet and gr-kiss. The flowgraph for a modified outernet-rtlsdr (to use a prerecorded sample file as source instead of the rtlsdr source) runs and stops at the first round exactly at the same location when using the recorded sample file. :-( If there is an ABI or versioning issue, it may be in one of the dependencies, not with gnuradio itself. I am considering correlating the data streams before and after the viterbi decoders when the decoding fails and when it works, but due to the huge amount of data, I am not sure this would be productive. 73, Edson PY2SDR |
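One way to keep the comparison of the streams before and after the Viterbi decoder manageable is to re-encode the decoder output and log only the mismatch rate per window against the hard-decision input symbols. The sketch below assumes the standard K=7, rate 1/2 generators (171 and 133 octal, written bit-reversed as 0x4F and 0x6D to match the shift direction used here). The symbol order, any inversion of the second symbol, and the alignment for the decoder's delay are assumptions that would have to be matched to the real flowgraph.

```python
POLY_A = 0x4F  # 171 octal, bit-reversed
POLY_B = 0x6D  # 133 octal, bit-reversed


def parity(x):
    """Parity (0 or 1) of the set bits in x."""
    return bin(x).count("1") & 1


def conv_encode(bits):
    """Rate 1/2, K=7 convolutional encoder; emits two symbols per input bit."""
    state = 0
    symbols = []
    for bit in bits:
        state = ((state << 1) | (bit & 1)) & 0x7F
        symbols.append(parity(state & POLY_A))
        symbols.append(parity(state & POLY_B))
    return symbols


def windowed_mismatch(decoded_bits, received_symbols, window_symbols=2048):
    """Mismatch rate per window between re-encoded output and received symbols."""
    expected = conv_encode(decoded_bits)
    n = min(len(expected), len(received_symbols))
    rates = []
    for start in range(0, n, window_symbols):
        stop = min(start + window_symbols, n)
        errors = sum(expected[i] != received_symbols[i] for i in range(start, stop))
        rates.append(errors / (stop - start))
    return rates


if __name__ == "__main__":
    import random
    rng = random.Random(1)
    bits = [rng.getrandbits(1) for _ in range(4096)]
    symbols = conv_encode(bits)
    # Simulate a decoder whose output turns to garbage halfway through.
    broken = bits[:2048] + [rng.getrandbits(1) for _ in range(2048)]
    print(windowed_mismatch(broken, symbols))
```

While the decoder is healthy the rate stays near the channel error rate. If it jumps towards 0.5 and stays there, the decoder output has stopped tracking its input, and only one number per window needs to be stored.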
Changing out the kiss HDLC Deframer with the Packet Operators HDLC Deframer works for a while, but also hangs after a period of time, though it does seem to work a little better. The bit stream is still going into the HDLC Deframer but it still quits working after a period of time and does not output the bogus messages like the kiss version. It doesn't appear to be an open-files or stack-size limit. I bumped both up and it still fails. I'm going to try to get a debugger on it, but I have no experience debugging Python. |
Daniel, |
If you can make a recording that reproduces this behaviour, please try to do so. I still haven't been able to get this issue even once on my Gentoo system. Probably some weird bug with libraries. |
Daniel,
Here is the kiss file. Do you want me to make a flow graph and record the
raw data? It would help if you supplied a grc file that I can modify that
gives you what you want.
Larry
|
I would need an IQ recording and the same flowgraph that makes your system fail when playing back the recording. |
Daniel,
Here are the files:
https://drive.google.com/open?id=0B9jK9VSzrCGLNS14elJWQi1oT28
https://drive.google.com/open?id=0B9jK9VSzrCGLRXl5QUMydUgxRkU
https://drive.google.com/open?id=0B9jK9VSzrCGLeFN4NWh5a1lST2s
|
I've had to modify the .grc file to make it read from the recording (it was using RTL-SDR input). Please check if this grc file fails when playing back the recording once: https://drive.google.com/open?id=0B2pPGQkeEAfdZWJMTlZVZFJtZGM In my system, this grc file processes your whole .raw recording without problems. |
Daniel,
|
Sorry. As I said, I could not reproduce the issue on my system. We suspect that it is a very subtle issue perhaps having to do with system libraries or whatever. Something very difficult to track down. |
Hi Daniel,
Would it help if I give you access to my Xubuntu system in order to further
investigate the problem?
73, Edson PY2SDR
|
Something interesting: I've run the file several times and it failed in different places. Once it's confused it never recovers and continues to output bogus data. |
Hi Larry,
The same occurs here in my system. Once the decoder stops, it does not
recover. I do however see the decoder output some longer packets besides
0xFF 0xFF, but I suspect they are all just garbage.
Regards,
Edson
|
Hi Edson, Remote access to a machine where I can replicate the problem would be of help. I'm quite busy these days, so it will have to wait until the next few weeks. I can't promise anything, as this looks like something very tough to track down, even if you can reproduce the bug. |
Daniel, |
I would have to look at the KISS recording that triggers this error to be sure, but it seems to me that the problem is that a malformed packet could trick the LDP defragmenter into believing that there is a wrong number of fragments and so FEC decoding will fail. I think this patch (which I already applied to free-outernet) daniestevez/free-outernet@b5b4c7b will prevent this kind of errors. |
That occurred after running it for a week; I had never had to restart
free-outernet before. It's gr-outernet that seems to get into its bad mode daily
in the late evenings, but not always. I may not be able to reproduce that one or
capture data for it, since it has only happened once in the months I've been
using it.
|
I have discovered a bug in the "Decode CCSDS 27" block in GNU Radio. This block stops working properly after processing many symbols. I think that this bug explains the issue you are experiencing with gr-outernet. More information in Degradation bug in GNU Radio "Decode CCSDS 27". In 5e59c71 I have replaced the "Decode CCSDS 27" block by the "CC Decoder" block, which seems to work well. Please update gr-satellites and test to see if this fixes the issue. |
Using the current GIT head, I've found that outernet_rtlsdr.py stops producing normal PDUs after some time, when receiving the Alphasat bird (providing the European and African service). This typically happens 20 minutes to 1 hour after starting, although it does output something at approximately 1 minute intervals when in that failed state.
At first, I used a modified grc file, tweaked for the Alphasat downlink frequency, a variable channel offset and with some more graphical widgets to try and find the best tuning solution. I experienced the problem with the generated python from the modified grc flowgraph, but was hesitant at first to consider it a bug in gr-outernet, as I had, by accident, deleted a couple of lines in the flow graph and I could not be sure that I hadn't mucked up one of the descrambler/deframer paths, which might, I imagine, give similar symptoms under appropriate circumstances.
Today I minimally modified a pristine version of the grc file, just changing the centre_freq to 1545e6 and freq to be 1545.94e6, and fixing the fine tune to be 400Hz. This still failed in the manner reported.
When restarted, the generated python immediately starts producing PDUs as expected. At all times, both when working and when not working, the signal is about 10 to 12 dB above the noise floor and the constellation looks pretty good.
By way of comparison, I found the Outernet-In-A-Box stuff running in qemu to report a similar SNR of about 10-12dB and a PER of about 0.01, although it would frequently lose frame lock. The LOS to Alphasat is more or less down the street outside my house, so I had wondered if I was having multipath issues with reflections off moving traffic, but I couldn't easily correlate lock drop-outs with traffic noise. I had also been of the opinion that it might also work better with a different tuning solution, as I generally have intermod problems here. Generally when outernet_rtlsdr.py is behaving normally, it outputs full-size PDUs monotonically at the expected 1Hz rate and barely misses any, though I find the update rate of the constellation plot a bit too slow to be sure that it isn't collapsing at all.
I'm running outernet_rtlsdr.py and free-outernet.py with no options.
Here's the tail end of the output from outernet_rtlsdr.py, starting around the time of the last time packet it received correctly:
In the failed state, most of the output is of PDUs of length 2, having the 0xFFFF content, although other pathologically small PDUs can occur with a variety of content.
Here's the complete output from free-outernet.py during that run of outernet_rtlsdr.py.
I have received about 1.8MB of tbz2 files in the opaks directory in total, so things seem to be working well enough when they are working.
Sorry if this is a bit long and waffly. Thanks for your considerable efforts in reverse engineering the Outernet signal and implementing the receiver/decoder. I found the video of your presentation to be fascinating.
Cheers,
Darren, G0HWW