Assembly segmentation fault during retq
I have a bit of assembly code that I call using callq. Upon calling retq, the program crashes with a segmentation fault.
.globl main
main:                               # def main():
        pushq   %rbp                #
        movq    %rsp, %rbp          #
        callq   input               # input
        movq    %rax, %r8
        callq   r8_digits_to_stack  # the program is not getting past here before the segmentation fault
        jmp     exit_0

# put the binary digits of r8 on the stack, last digit first (lowest)
# uses: rcx, rbx
r8_digits_to_stack:
        movq    %r8, %rax           # copy to pop digits off of
loop_digits_to_stack:
        cmpq    $0, %rax            # if our copy is zero, we're done!
        jle     return
        movq    %rax, %rcx          # make a copy to extract the digit from
        andq    $1, %rcx            # last digit
        pushq   %rcx                # push the last digit onto the stack
        sarq    %rax                # knock off the last digit for the next loop
        jmp     loop_digits_to_stack

# return to wherever we were last called from
return:
        retq

# exit with code 0
exit_0:
        movq    $0, %rax            # return 0
        popq    %rbp
        retq
where input is a C function that returns keyboard input in %rax.
I assume it might have something to do with the fact that I'm manipulating the stack; is this the case?
I think one of your return paths doesn't pop %rbp. You could leave out the

pushq %rbp
movq %rsp, %rbp
pop %rbp

altogether; that's gcc's default (-fomit-frame-pointer). Or fix the non-return-zero path so it pops %rbp too.
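A minimal sketch of the first option, with no frame pointer at all (the subq/addq $8 pair stands in for the push/pop just to keep %rsp 16-byte aligned at the call site, as the ABI expects; it doesn't address the pushes the helper leaves behind, discussed below):

.globl main
main:
        subq    $8, %rsp        # re-align %rsp to 16 bytes for the call (replaces pushq %rbp)
        callq   input
        movq    %rax, %r8
        # ... do the work, leaving %rsp where it was ...
        addq    $8, %rsp
        movq    $0, %rax        # return 0
        retq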
Actually, you're screwed, because your function appears to be designed to put stuff on the stack and never take it off. If you want to invent your own ABI where the space below the stack pointer can be used to return arrays, that's interesting, but you'll have to keep track of how big the array is so you can adjust %rsp to point at the return address before the ret.
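As a sketch of what that could look like, assuming a hypothetical digit count kept in %rcx (and remembering that under the standard SysV ABI only the 128 bytes below %rsp, the red zone, are safe from asynchronous clobbering, and only until the caller itself pushes or calls anything):

r8_digits_to_stack:
        movq    %r8, %rax
        xorl    %ecx, %ecx              # hypothetical addition: count the qwords we push
.Ldigit_loop:
        testq   %rax, %rax
        jz      .Ldigits_done
        movq    %rax, %rdx
        andq    $1, %rdx                # isolate the low bit
        pushq   %rdx
        incq    %rcx
        shrq    $1, %rax                # shr, not sar, so negative inputs still terminate
        jmp     .Ldigit_loop
.Ldigits_done:
        leaq    (%rsp,%rcx,8), %rsp     # step back over the array: %rsp -> return address
        retq                            # the digits now sit below %rsp in the caller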
I'd recommend against loading the return address into a register and then replacing the later ret with

jmp *%rdx

or something. That will throw off the call/return address prediction logic in modern CPUs and cause a stall, the same as a branch mispredict (see http://agner.org/optimize/). CPUs hate mismatched call/ret; I can't find the specific page to link right now.
See https://stackoverflow.com/tags/x86/info for other useful resources, including ABI documentation on how functions take args.
You could copy the return address down below the array you pushed, and then run ret, returning with %rsp modified. Unless you need to call a long function from multiple call sites, though, it's better to just inline it in the one or two call-sites.
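A sketch of that return-address-copying trick, again assuming a hypothetical count of pushed qwords in %rcx:

        # end of the helper, after the pushes; %rcx = number of qwords pushed (hypothetical)
        movq    (%rsp,%rcx,8), %rdx     # the real return address sits just above the array
        pushq   %rdx                    # duplicate it immediately below the array
        retq                            # pops the copy: we return with %rsp pointing at the lowest digit

        # caller, after the call: digits at (%rsp) .. -8(%rsp,%rcx,8), plus one stale
        # return-address slot above them; clean up later with  leaq 8(%rsp,%rcx,8), %rsp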
If it's too big to inline at many call sites, your best bet, instead of using call and then copying the return address down to a new location, is to emulate call and ret. The caller does:
        # put args in registers
        lea     .ret_location(%rip), %rbx
        jmp     my_weird_helper_function
.ret_location:          # in NASM/YASM, labels starting with . are local labels, and don't show up in the object file.
                        # The GNU assembler might treat symbols starting with .L that way.
        ...

my_weird_helper_function:
        # use the args, potentially modifying the stack
        jmp     *%rbx   # return
You need a good reason to use this, and you'll have to justify / explain it with a lot of comments, because it's not what readers will be expecting. First of all, what are you going to do with the array you pushed onto the stack? Are you going to find its length by subtracting %rsp and %rbp or something?
Interestingly, even though push has to modify %rsp as well as do a store, it has one-per-clock throughput on recent CPUs. Intel CPUs have a stack engine to make stack ops not have to wait for %rsp to be computed in the out-of-order engine when it has only been changed by push/pop/call/ret. (Mixing push/pop with mov 4(%rsp), %rax or whatever results in extra uops being inserted to sync the OoO engine's %rsp with the stack engine's offset.) Intel/AMD CPUs can only do one store per clock anyway, but Intel SnB and later can pop twice per clock.
So push/pop is not a terrible way to implement a stack data structure, especially on Intel.
Also, your code is structured weirdly. main() is split across r8_digits_to_stack. That would be fine, but you're not taking advantage of falling through from one block into the other anywhere, so it just costs a jmp in main for no benefit, and is a huge readability downside.
Let's pretend the loop is part of main, since we already talked about how it's super-weird to have a function return with %rsp modified.

Your loop can be simpler, too. Structure things so a conditional branch (jcc) at the bottom of the loop jumps back to the top, when possible.
There's a small benefit to avoiding r8-r15 (the upper half of the 16 registers): 32bit insns using the classic registers don't need a REX prefix byte. So let's pretend we have our starting value in %rax.
digits_to_stack:                # put each bit of %rax in its own 8-byte element on the stack, for maximum space-inefficiency
        movq    %rax, %rdx      # save a copy
        xor     %ecx, %ecx      # setcc is only available for byte operands, so zero %rcx

        # we need a test at the top after transforming the while() into a do{}while
        test    %rax, %rax      # fewer insn bytes to test for 0 this way
        jz      .Lend           # another option: jmp to a test at the end of the loop, to begin the first iteration there.

        .align 16
.Lpush_loop:
        shr     $1, %rax        # shift the low bit into CF, and set ZF based on the result
        setc    %cl             # set %cl to 0 or 1, based on the carry flag
        # movzbl  %cl, %ecx     # zero-extend
        pushq   %rcx
#.Lfirst_iter_entry
        # test  %rax, %rax      # not needed, flags are still set from shr
        jnz     .Lpush_loop
.Lend:
This version still kinda sucks, because on Intel P6 / SnB CPU families, using a wider register after writing a smaller part of it leads to a slowdown (a stall on pre-SnB, or an extra uop on SnB and later). Others, including AMD and Silvermont, don't track partial registers separately, so writing %cl has a dependency on the previous value of %rcx. (Writing a 32bit reg zeros the upper 32, which avoids the partial-reg dependency problem.) Using movzx to zero-extend the byte to a long does explicitly what Sandybridge handles implicitly, and will give a speedup on older CPUs.
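On those older CPUs the loop body could use the commented-out movzbl, something like this sketch (only the zero-extension changes; the xor of %ecx ahead of the loop then becomes unnecessary):

.Lpush_loop:
        shr     $1, %rax        # low bit -> CF, ZF set from the shifted result
        setc    %cl             # CF -> %cl
        movzbl  %cl, %ecx       # writes all of %ecx (and zeros the top 32 of %rcx): no partial-reg merge when pushing
        pushq   %rcx
        jnz     .Lpush_loop     # setc/movzbl/push leave the flags alone, so this still tests shr's result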
Either way, this won't quite run at a single cycle per iteration on Intel, but it might on AMD. A mov / and $1 combination is not bad either, but and affects the flags, making it harder to loop based on shr setting the flags.
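For comparison, a sketch of the mov/and variant; it has to be ordered so that and's flag result is already dead and shr's is the one the loop branch sees:

.Lpush_loop_and:
        movq    %rax, %rcx
        andq    $1, %rcx        # isolate the low bit (this clobbers the flags...)
        pushq   %rcx
        shrq    $1, %rax        # ...so do the shift last: its ZF is what the branch tests
        jnz     .Lpush_loop_and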
Note that your old version's sarq %rax shifts in sign bits, not zeros, so with negative input the old version is an infinite loop (and segfaults once it runs out of stack space, when push tries to write to an unmapped page).
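Putting the pieces together, a minimal sketch of a restructured main with the loop inlined (assuming, as in the question, that input is a C function returning a number in %rax; the digits are simply discarded before returning, since the question doesn't show what consumes them):

.globl main
main:
        pushq   %rbp                # %rbp is callee-saved; this also re-aligns %rsp to 16 for the call
        movq    %rsp, %rbp          # remember where the stack was so cleanup is trivial
        callq   input               # number to convert, in %rax
        testq   %rax, %rax
        jz      .Ldone              # zero has no bits to push
.Lpush_loop:
        shrq    $1, %rax            # low bit -> CF, ZF from the shifted value (zeros shifted in, so even negative input terminates)
        setc    %cl
        movzbl  %cl, %ecx           # zero-extend so pushq %rcx stores a clean 0 or 1
        pushq   %rcx                # one 8-byte slot per bit, lowest bit pushed first
        jnz     .Lpush_loop
        # ... the digits now live from (%rsp) up to -8(%rbp); use them here ...
.Ldone:
        movq    %rbp, %rsp          # discard everything the loop pushed
        popq    %rbp
        xorl    %eax, %eax          # return 0
        retq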