Assembly segmentation fault during retq


I have a bit of assembly code that I call using callq. Upon calling retq, the program crashes with a segmentation fault.

    .globl  main
main:                       # def main():
    pushq   %rbp
    movq    %rsp, %rbp
    callq   input           # input()
    movq    %rax, %r8
    callq   r8_digits_to_stack
    # program doesn't get here before the segmentation fault
    jmp     exit_0

# put the binary digits of r8 on the stack, last digit first (lowest)
# uses: rcx, rbx
r8_digits_to_stack:
    movq    %r8, %rax       # a copy to pop digits off of
loop_digits_to_stack:
    cmpq    $0, %rax        # if our copy is zero, we're done!
    jle     return
    movq    %rax, %rcx      # make a copy to extract the digit from
    andq    $1, %rcx        # the last digit
    pushq   %rcx            # push the last digit onto the stack
    sarq    %rax            # knock off the last digit for the next loop
    jmp     loop_digits_to_stack

# return to wherever we were last called from
return:
    retq

# exit with code 0
exit_0:
    movq    $0, %rax        # return 0
    popq    %rbp
    retq

where input is a C function that returns keyboard input in %rax.

I assume it might have something to do with the fact that I'm manipulating the stack; is that the case?

I think one of your return paths doesn't pop %rbp. Either leave out the

    pushq   %rbp
    movq    %rsp, %rbp

    pop     %rbp

altogether (gcc's default is -fomit-frame-pointer).

Or fix the non-return-zero path so it also pops %rbp.
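
For concreteness, here is a minimal sketch of the first option, assuming input follows the normal SysV calling convention. Note that this only balances the frame-pointer handling; the digits that r8_digits_to_stack leaves on the stack are a separate problem, covered below:

    .globl  main
main:                       # no frame pointer at all
    subq    $8, %rsp        # keep %rsp 16-byte aligned across the calls (SysV ABI)
    callq   input
    movq    %rax, %r8
    callq   r8_digits_to_stack
    xorl    %eax, %eax      # return 0
    addq    $8, %rsp        # undo the alignment adjustment
    retq                    # single exit path, nothing left to pop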


Actually, you're screwed because your function appears to be designed to put stuff on the stack and never take it off. If you want to invent your own ABI where space below the stack pointer can be used to return arrays, that's interesting, but you'll have to keep track of how big the array is so you can adjust %rsp to point at the return address before ret.
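
As a rough sketch of that bookkeeping (not a recommendation), the callee could keep a count of its pushes in some free register, say a hypothetical %rdi, and step %rsp back over the array just before returning:

    # hypothetical: %rdi = number of qwords this function pushed
    leaq    (%rsp,%rdi,8), %rsp   # %rsp points at the return address again
    retq                          # the array now lives below the caller's %rsp,
                                  # where any later push or a signal handler can clobber it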

I recommend against loading the return address into a register and replacing the later ret with jmp *%rdx or something. That throws off the call/return address prediction logic in modern CPUs and causes a stall the same as a branch mispredict (see http://agner.org/optimize/). CPUs hate mismatched call/ret. I can't find the specific page to link right now.

See https://stackoverflow.com/tags/x86/info for other useful resources, including the ABI documentation on how functions normally take args.


You could copy the return address down below the array you pushed and then run ret, to return with %rsp modified. But unless you need to call a long function from multiple call sites, it's better to just inline it at the 1 or 2 call sites.
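
A hedged sketch of that copy-the-return-address-down idea, again assuming a hypothetical push count in %rdi:

    # the return address pushed by call sits just above the %rdi pushed qwords
    movq    (%rsp,%rdi,8), %rdx   # fetch the original return address
    pushq   %rdx                  # duplicate it just below the array
    retq                          # pops the copy; %rsp now points at the last-pushed digit

The caller then finds the array starting at %rsp, with the stale return-address slot still sitting at its top end.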

If it's too big to inline at many call sites, your best bet, instead of using call and then copying the return address down to a new location, is to emulate call and ret. The caller does:

    put the args in registers
    lea   .ret_location(%rip), %rbx
    jmp   my_weird_helper_function
.ret_location:      # in nasm/yasm, labels starting with . are local labels, and don't show up in the object file.
                    # the GNU assembler might treat symbols starting with .L the same way.
    ...

my_weird_helper_function:
    use the args, potentially modifying the stack
    jmp   *%rbx     # return

You'd need a reason to use this, and you'll have to justify / explain it with a lot of comments, because it's not what readers are expecting. First of all, what are you going to do with the array you pushed onto the stack? Are you going to find its length by subtracting %rsp and %rbp or something?
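
If you did go the subtraction route, it would be something like this sketch (assuming %rbp still points just above the array and nothing else got pushed in between):

    movq    %rbp, %rcx
    subq    %rsp, %rcx            # bytes between the frame pointer and the stack top
    shrq    $3, %rcx              # /8 = number of qword digits in the array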

Interestingly, even though push has to modify %rsp as well as do a store, it has one-per-clock throughput on recent CPUs. Intel CPUs have a stack engine so that stack ops don't have to wait for %rsp to be computed in the out-of-order engine when it's only been changed by push/pop/call/ret. (Mixing push/pop with mov 4(%rsp), %rax or whatever results in extra uops being inserted to sync the out-of-order engine's %rsp with the stack engine's offset.) Intel/AMD CPUs can only do one store per clock anyway, but Intel SnB and later can pop twice per clock.

So push/pop is not a terrible way to implement a stack data structure, especially on Intel.


Also, your code is structured weirdly. main() is split across r8_digits_to_stack. That would be fine if you were taking advantage of falling through from one block into the other, but you never do, so it costs a jmp in main for no benefit and a huge readability downside.

Let's pretend the loop is part of main, since we already talked about how it's super-weird to have a function return with %rsp modified.

Your loop can be simpler, too. Structure things so you have a jcc back to the top of the loop, when possible.

There's a small benefit to avoiding the upper registers (r8-r15): 32-bit instructions with the classic registers don't need a REX prefix byte. So let's pretend we have our starting value in %rax.
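
For instance, per the standard x86-64 encoding rules:

    xor    %ecx, %ecx     # 31 c9     - 2 bytes, no REX prefix needed
    xor    %r8d, %r8d     # 45 31 c0  - 3 bytes, REX prefix required to reach r8d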

digits_to_stack:   # put each bit of %rax into its own 8-byte element on the stack, for maximum space-inefficiency

    movq   %rax, %rdx  # save a copy
    xor    %ecx, %ecx  # setcc is only available with byte operands, so zero all of %rcx

    # we need to test at the top, after transforming the while() into a do{}while
    test   %rax, %rax  # fewer insn bytes to test for zero this way
    jz     .Lend       # alternative: jmp to a test at the end of the loop, and begin with the first iteration there

.align 16
.Lpush_loop:
    shr    $1, %rax    # shift the low bit into CF, and set ZF based on the result
    setc   %cl         # set %cl to 0 or 1, based on the carry flag
    # movzbl %cl, %ecx # zero-extend
    pushq  %rcx
#.Lfirst_iter_entry
    # test %rax, %rax  # not needed, flags are still set from shr
    jnz    .Lpush_loop
.Lend:

This version still kinda sucks, because on Intel P6 / SnB CPU families, reading a wider register after writing a smaller part of it leads to a slowdown (a stall on pre-SnB, or an extra uop on SnB and later). On others, including AMD and Silvermont, partial registers aren't tracked separately, so writing %cl has a dependency on the previous value of %rcx. (Writing a 32-bit reg zeros the upper 32, which avoids the partial-reg dependency problem.) A movzx to zero-extend the byte to long, like Sandybridge does implicitly, would give a speedup on older CPUs.

This won't quite run at one cycle per iteration on Intel, but it might on AMD. A mov / and $1 isn't bad either, but and affects the flags, making it harder to loop based on shr setting the flags.
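
As a sketch of the movzx fix mentioned above, the loop body with the zero-extend uncommented would look like this; setc and movzbl don't write flags, so the jnz still tests the ZF left by shr:

.Lpush_loop:
    shr    $1, %rax       # low bit into CF, ZF set from the result
    setc   %cl            # %cl = 0 or 1 from CF
    movzbl %cl, %ecx      # zero-extend: breaks the dependency on the old %rcx value
    pushq  %rcx
    jnz    .Lpush_loop    # still tests shr's ZF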

Note that the old version's sarq %rax shifts in sign bits, not zeros, so with a negative input the old version is an infinite loop (and segfaults when it runs out of stack space, once push tries to write to an unmapped page).
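
A tiny illustration of the difference (values chosen just for demonstration):

    movq   $-2, %rax
    sarq   %rax        # arithmetic shift: %rax becomes -1 and stays -1 forever
    movq   $-2, %rax
    shrq   %rax        # logical shift: %rax = 0x7fffffffffffffff, and repeated shifts do reach 0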

