Writing 64-bit programs
by Jeremy Gordon

This file is intended for those interested in writing 64-bit programs for the AMD64 and EM64T processors running on x64 (64-bit Windows), using GoAsm (assembler), GoRC (resource compiler) and GoLink (linker). It may also be of interest to those writing 64-bit assembler programs for Windows using other tools.

Contents

Introduction to 64-bit programming:
How easy is 64-bit programming?
Differences between 32-bit and 64-bit executables
Differences between Win32 and Win64 (for AMD64/EM64T)
Differences between x86 and x64 processors:
registers
instructions
RIP-relative addressing
call address sizes
64-bit programming in practice
Changes to Windows data types
Alignment requirements
Windows structures in 64-bit programming
Choice of register
Zero-extension of results into 64-bit registers
Sign-extension of results into qwords
Automatic stack alignment
Using the same source code for both 32 and 64-bits
Converting existing 32-bit code to 64-bit
Using AdaptAsm.exe to help with the conversion
Some pitfalls to avoid when converting existing source code
Switching using /x64 and /x86 in conditional assembly
Assembling and linking to produce the executable
Some code optimisation and refinement done by GoAsm
Some tips to reduce the size of your code
Demonstration files
Hello64World1 (simple 64-bit console program)
Hello64World2 (simple 64-bit windows program)
Hello64World3 (switchable 32-bit or 64-bit windows program)
More information, references and links

Introduction to 64-bit programming

How easy is 64-bit programming?

Despite the differences between the 64-bit processors and their 32-bit counterparts, and between the x64 (Win64) operating system and Win32, using GoAsm to write 64-bit Windows programs is just as easy as it was in Win32.

In fact, you can readily use the same source code to create executables for both platforms if you follow a set of rules.

You can also convert existing 32-bit source code to 64-bits and some of the work required to do this can be done automatically using AdaptAsm.

Differences between 32-bit and 64-bit executables

Although 32-bit and 64-bit executables are based on the same PE (Portable Executable) format, in fact there are a number of major differences. The extent of those differences means that 32-bit code will only run on Win64 using the Windows on Windows (WOW64) subsystem. This works by intercepting API calls from the executable and converting the parameters to suit Win64. 64-bit code will not work at all on 32-bit platforms.

The executable contains a flag which tells the system at load-time whether it is 32-bit or 64-bit. If the x64 loader sees a 32-bit executable, WOW64 kicks in automatically. This means that 32-bit and 64-bit code cannot be mixed within the same executable.

The significance of the above is that the programmer has to choose between:-

  • Making one version of the application (Win32). This will work on both platforms.
  • Making two versions of the application (one for Win32 and one for Win64).
For those who are interested in PE file internals, here is a summary of the main differences between 32-bit and 64-bit executables:-
  • The PE file format for Win64 files is called 'PE+'.
  • The size of optional header field in the COFF header is 0F0h in a PE+ file and 0E0h in a PE file.
  • The 'machine type' in the COFF header is not 14Ch (as it is for x86 processors), but is 8664h (for the AMD64 processor).
  • The 'magic number' at the beginning of the optional header is 20Bh instead of 10Bh.
  • The 'majorsubsystemversion' in a PE+ file is 5 instead of 4 in a PE file.
  • The executable 'image' (the code/data as loaded in memory) of a Win64 file is limited in size to 2GB. This is because the AMD64/EM64T processors use relative addressing for most instructions, and the relative address is kept in a dword. A dword is only capable of holding a relative value of ±2GB.
  • The import address table (where the loader overwrites the addresses of external calls such as the addresses of APIs in system Dlls) is enlarged to 64-bits, as is the import look-up table. This is because the address of external calls could be anywhere in memory.
  • The preferred image base, SizeofStackReserve, SizeofStackCommit, SizeofHeapReserve and SizeofHeapCommit fields in the optional header are enlarged from 4 to 8 bytes.
  • The default base address in Win64 is 400000h as in Win32 files.
  • 64-bit executables which provide properly for full Win64 exception handling contain a .pdata section holding the tables required for this.
You can view the internals of the PE file using Wayne J. Radburn's PEview.
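The header fields listed above are easy to check for yourself. Here is a minimal sketch in Python (not part of the 'Go' tools) which reads the machine type, the size of optional header field and the magic number from a PE image. The names classify_pe and fake_header are hypothetical, and fake_header only builds a tiny synthetic header for the demonstration. To try it on a real file you would pass in the file's bytes instead:-

```python
import struct

MACHINE_NAMES = {0x014C: "x86", 0x8664: "AMD64"}
MAGIC_NAMES = {0x010B: "PE", 0x020B: "PE+"}


def classify_pe(image):
    """Return (machine, format, size of optional header) for a PE image."""
    # The dword at offset 3Ch of the DOS header points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4
    # COFF header: the machine type is the first word and the size of
    # the optional header is the word at offset 16.
    (machine,) = struct.unpack_from("<H", image, coff)
    (opt_size,) = struct.unpack_from("<H", image, coff + 16)
    # The optional header follows the 20-byte COFF header and begins
    # with the magic number.
    (magic,) = struct.unpack_from("<H", image, coff + 20)
    return (MACHINE_NAMES.get(machine, hex(machine)),
            MAGIC_NAMES.get(magic, hex(magic)),
            opt_size)


def fake_header(machine, magic, opt_size):
    """Build a synthetic header (hypothetical helper, for the demo only)."""
    image = bytearray(0x80)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, 0x40)        # PE signature at 40h
    image[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", image, 0x44, machine)     # machine type
    struct.pack_into("<H", image, 0x44 + 16, opt_size)
    struct.pack_into("<H", image, 0x44 + 20, magic)  # optional header magic
    return bytes(image)


print(classify_pe(fake_header(0x8664, 0x20B, 0xF0)))
print(classify_pe(fake_header(0x014C, 0x10B, 0xE0)))
```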

Differences between Win32 and Win64 (for AMD64/EM64T)

Here are the main differences between Win32 and Win64 of relevance to the assembler or Windows programmer:-
  • Calling convention. Win32 uses the STDCALL convention whereas Win64 uses the FASTCALL convention. In STDCALL all parameters which are sent to an API are PUSHed on the stack. In Win32 the stack pointer (ESP) is reduced by 4 bytes for each PUSH. In STDCALL it is the responsibility of the API to restore the stack to equilibrium.
    In FASTCALL, the first four parameters are sent to the API in registers (in this order: RCX,RDX,R8 and R9), but the fifth and subsequent parameters are PUSHed on the stack. In Win64, the stack pointer (RSP) is reduced by 8 bytes for each PUSH. Unlike STDCALL, it is not the responsibility of the API to clear up the stack. Instead this must be done by the caller to the API. The caller must also ensure that there is space on the stack for the API to store the parameters which are passed in registers. In practice this is achieved by reducing the stack pointer by 32 bytes just before the call.
    Note that in GoAsm all the work required by the FASTCALL calling convention is done automatically if you use INVOKE or ARG followed by INVOKE. See coding to comply with FASTCALL calling convention. The use of ARG and INVOKE is described in the relevant part of the GoAsm manual.
    Note that GoAsm does not yet do this for parameters which need to be sent in the XMM registers (ie. in floating point instructions).
  • Windows uses the FASTCALL convention to call the window procedures and other callback procedures in your application. This means that your window procedures will pick up the parameters in a different way under Win64. Also the window procedures no longer have to restore the stack to equilibrium.
    Note that GoAsm will implement these things automatically if you use FRAME...ENDF. The use of FRAME...ENDF is described in the relevant part of the GoAsm manual.
  • All functions using a stack frame (including window procedures) need to follow certain rules if they wish to make use of exception handling. The tools also need to add exception frame records to the executable. This will also be handled automatically by the 'Go' tools. Note this is not yet available.
  • Register volatility. In Win32, window procedures and other callback procedures have to restore the values in the EBP,EBX,EDI and ESI registers before returning to the caller (if the values in those registers are changed). This is something that is also done by the Windows APIs (these registers will not change when you call an API). These are called the 'non-volatile' registers. In Win64, this list of registers is extended to RBP,RBX,RDI,RSI,R12 to R15 and XMM6 to XMM15.
    The 'volatile' registers are those which may be changed by APIs, and which you do not need to save and restore in your window procedures and other callback procedures. In Win32 the general purpose volatile registers were EAX,ECX and EDX. These have now been extended to RAX,RCX,RDX, and R8 to R11.
  • You might not have expected this, but in 64-bit assembly for the AMD64, pointers to code and data whose addresses are within the executable are still only 32-bits. This ties in with the fact that RIP-relative addressing limits the size of the executable to 2GB. Pointers to external addresses, such as functions in Dlls, are 64-bits wide so that the function can be anywhere in memory (see call address sizes).
  • In Win64 the data size of all handles and pointers is now 64-bits instead of 32-bits. See Changes to Windows data types for more.
  • In Win64 there are stricter requirements for the alignment of the stack, data, and for structures (see alignment of structures and structure members).
  • The Windows APIs have been modified to work in 64-bits. There are, however, a small number of new APIs to handle the extra requirements of 64-bit operation. These include:-
    GetClassLongPtr
    GetWindowLongPtr
    SetClassLongPtr
    SetWindowLongPtr
    Note that just as in Win32, you can make your application with either the ANSI or the Unicode version of the APIs. See Writing Unicode programs.
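To picture where the parameters end up under FASTCALL, here is a minimal sketch in Python. It covers only integer and pointer parameters (not those sent in the XMM registers), and the function name place_parameters is hypothetical. The stack offsets are those seen by the caller just before the CALL, assuming the usual arrangement in which the fifth and later parameters sit immediately above the 32 bytes reserved for the register parameters:-

```python
REGISTER_ORDER = ("RCX", "RDX", "R8", "R9")
SHADOW_SPACE = 32  # bytes reserved by the caller for the register parameters


def place_parameters(count):
    """Say where each parameter is found just before the CALL."""
    places = []
    for index in range(count):
        if index < len(REGISTER_ORDER):
            # The first four parameters travel in registers.
            places.append(REGISTER_ORDER[index])
        else:
            # Later parameters sit above the 32-byte area, 8 bytes each.
            offset = SHADOW_SPACE + 8 * (index - len(REGISTER_ORDER))
            places.append("[RSP+%02Xh]" % offset)
    return places


print(place_parameters(6))
```

With six parameters the last two are found at [RSP+20h] and [RSP+28h].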

Differences between x86 and x64 processors

The main differences are the expanded register range, some changes to instructions, and the use of RIP-relative addressing. The notes below refer to the AMD64 in 64-bit mode. In this mode the AMD64 can also run 32-bit executables naturally.

Registers

The AMD64 adds several new registers to those available in the 86 series of processors, and also adds new ways to address the existing registers.
  • The EAX,EBX,ECX,EDX,ESI,EDI,EBP and ESP 'general purpose' registers are all enlarged to 64-bits. The enlarged registers are accessed using RAX,RBX,RCX,RDX,RSI,RDI,RBP and RSP
  • You can still access the low dword of these registers (ie. the least significant 32 bits) by using the existing names EAX,EBX,ECX,EDX,ESI,EDI,EBP and ESP.
  • You can still access the lowest word of these registers (ie. the least significant 16 bits) by using the existing names AX,BX,CX,DX,SI,DI,BP and SP.
  • You can still access the first byte of RAX,RBX,RCX and RDX (ie. the least significant 8 bits) by using the existing names AL,BL,CL,DL as in the 86 processor. But you can now also address the first byte of the 'index' registers by using SIL,DIL,BPL and SPL. So for example SIL is the least significant 8 bits of the index register RSI.
  • You can still access the second byte of RAX,RBX,RCX and RDX (bits 8 to 15) by using the existing names AH,BH,CH,DH as in the 86 processor. However, the opcodes for this have been altered in the AMD64 processor. They now clash with the opcodes required to address the byte versions of the extended registers R8 to R15. So you cannot use AH,BH,CH,DH and R8B to R15B in the same instruction.
  • There are eight new 64-bit registers (the 'extended registers') named R8 to R15.
  • The low dword of these registers (ie. the least significant 32 bits) can be addressed using the R8D to R15D forms.
  • The low word of these registers (ie. the least significant 16 bits) can be addressed using the R8W to R15W forms.
  • The first byte of these registers (ie. the least significant 8 bits) can be addressed using the R8B to R15B forms.
  • There are 8 new XMM (128-bit) registers named XMM8 to XMM15.
  • The 64-bit MMX registers (MM0 to MM7) are still available. As in the 86 processor they are also used as floating point registers (ST0 to ST7) for the x87 floating point instructions.
  • The instruction pointer is now in the 64-bit RIP register.
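The way the narrower register names overlay the 64-bit register can be sketched with simple masks. This is only an illustration in Python of which bits each name refers to. The function name register_views is hypothetical:-

```python
def register_views(rax):
    """Return the value seen through each name which overlays RAX."""
    rax &= 0xFFFFFFFFFFFFFFFF
    return {
        "RAX": rax,               # all 64 bits
        "EAX": rax & 0xFFFFFFFF,  # low dword
        "AX": rax & 0xFFFF,       # low word
        "AL": rax & 0xFF,         # first byte (bits 0 to 7)
        "AH": (rax >> 8) & 0xFF,  # second byte (bits 8 to 15)
    }


views = register_views(0x1122334455667788)
for name, value in views.items():
    print("%s = %Xh" % (name, value))
```

The same masks apply to the dword, word and byte forms of the extended registers (for example R8D, R8W and R8B), except that those registers have no equivalent of AH.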

Instructions

  • There are some instructions which are not available in the AMD64. The opcodes are now used for other purposes. The full list is contained in the AMD64 manuals, but includes AAA, AAD, AAM, AAS, DAA and PUSH and POP operations using CS,DS,ES and SS.
  • Instructions are enlarged to allow for the new registers and register forms of address, for example:-
  • The string instructions are now enlarged to allow for 64-bit addressing. For example, the repeat prefixes REP, REPZ and REPNZ use RCX rather than ECX, the loop instructions LOOP, LOOPZ and LOOPNZ use RCX rather than ECX, and the table look-up instruction XLATB uses RBX rather than EBX.
  • Apart from the above, the only new instruction of any note usable by programmers is MOVSXD which can move 32-bits of data from a register or from memory into a 64-bit register, sign extending bit 31 into all higher bits. There are also a handful of new system instructions.
  • In the AMD64, each PUSH and POP instruction moves the stack pointer by 8 bytes instead of 4 bytes as in the 86 processor. This means that PUSH 32-bit register is no longer a recognised instruction on the AMD64. To help with compatibility of source code, GoAsm treats (for example) PUSH EAX as equivalent to PUSH RAX. In /x86 mode, GoAsm treats PUSH RAX as equivalent to PUSH EAX. So it does not really matter which you use.
  • PUSH immediate on the AMD64 takes a 32-bit immediate (number) value and sign extends bit 31 into all higher bits. There is no single instruction capable of taking a 64-bit immediate value and PUSHing that onto the stack. For this reason PUSH ADDR THING is not a recognised instruction on the AMD64 (the offset value is treated as an immediate). The problem here is that the actual immediate value of any particular offset is unknown until link-time, and at assemble-time it is impossible for the assembler to know whether the offset is above 7FFFFFFFh and so would be affected by the sign extension.

    Therefore in GoAsm, PUSH ADDR THING makes use of the R11 register and takes advantage of the shorter RIP-relative addressing of LEA: the address of THING is first loaded into R11 using LEA, and R11 is then PUSHed.

  • The 3DNow! instructions are still available in the AMD64. It's not clear whether these instructions are now available on processors supporting Intel EM64T technology.

RIP-Relative addressing

Some instructions in the AMD64 processor which address data or code, use RIP-Relative addressing to do so. The relative address is contained in a dword which is part of the instruction. When using this type of addressing, the processor adds three values: (a) the contents of the dword containing the relative address (b) the length of the instruction and (c) the value of RIP (the current instruction pointer) at the beginning of the instruction. The resulting value is then regarded as the absolute address of the data and code to be addressed by the instruction. Since the relative address can be a negative value, it is possible to address data or code earlier in the image from RIP as well as later. The range is roughly ±2GB, depending on the instruction size. Since relative addressing cannot address outside this range, this is the practical size limit of 64-bit images.
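The sum carried out by the processor can be sketched as follows. This is an illustration in Python only. The function names and the example addresses are hypothetical, and the 7-byte instruction length is just an assumed value for the demonstration:-

```python
def sign_extend32(value):
    """Treat the low 32 bits of value as a signed dword."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def rip_relative_target(rip, length, disp32):
    """Absolute address = RIP at start + instruction length + relative dword."""
    return (rip + length + sign_extend32(disp32)) & 0xFFFFFFFFFFFFFFFF


def encode_disp32(rip, length, target):
    """Work out the dword an assembler or linker would insert for target."""
    disp = target - (rip + length)
    if not -0x80000000 <= disp <= 0x7FFFFFFF:
        raise ValueError("target is out of range of a relative dword")
    return disp & 0xFFFFFFFF


# An assumed 7-byte instruction at 401000h addressing data at 403000h:
forward = encode_disp32(0x401000, 7, 0x403000)
# The same instruction addressing something earlier in the image:
backward = encode_disp32(0x401000, 7, 0x400800)
print("%Xh %Xh" % (forward, backward))
```

The second value shows a negative relative address being used to reach backwards in the image, and the range check in encode_disp32 is the source of the ±2GB limit.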

RIP-relative addressing happens 'behind the back' of the user. The processor uses it if the opcodes contain certain values (in the ModRM byte, the Mod field equals 00 binary, and the r/m field equals 101 binary). You cannot control this except by changing the type of instructions you use. Generally here are the rules which govern whether or not an instruction uses RIP-relative addressing:-

  • Addresses in data cannot use RIP-relative addressing since the value of RIP cannot be known at the time when those addresses are set. Instead, an absolute address for insertion is calculated at link-time. So for example the following instructions do not use RIP-relative addressing but instead use absolute addresses:-
    Note that in practice, the absolute address is contained in a dword and not in a qword. This is why in the above examples data and code addresses can be contained within a dword data declaration. This restriction is feasible because the practical image size is limited to 2GB anyway because of the restrictions imposed by RIP-relative addressing.
  • Offsets converted to immediate values either at assemble-time or at link-time use absolute addressing rather than relative addressing. For example the following instructions do not use RIP-relative addressing but instead use absolute addresses:-
    However, GoAsm actually codes MOV RAX,ADDR MyDataLabel3 and similar instructions using the shorter LEA instruction, which does use RIP-relative addressing.
    Also note that for a MOV to memory of an ADDR, GoAsm makes use of the R11 register and takes advantage of the shorter RIP-relative addressing of LEA with the following coding:-
  • Here are examples of other instructions which use RIP-relative addressing:-
    Note in the case of an external call, the relative address points to the Import Address Table. Since the table is now enlarged to 64-bits, it is possible to call a code label anywhere in memory.
  • LEA uses RIP-relative addressing, for example:-
  • RIP-relative addressing is not used where the data or code label is supplemented by an index register. Although this may seem odd, the reason appears to be that adding information about the register to the opcodes means that the processor can no longer recognise the instruction as one which uses RIP-relative addressing (in the ModRM byte, the Mod field no longer equals 00 binary, and the r/m field no longer equals 101 binary). This means that the following instructions use absolute addresses rather than RIP-relative ones:-
    Because RIP-relative addressing is not being used here, for these types of instructions to work properly, the Image Base should be well below 7FFFFFFFh. These types of instructions would need to be adjusted if using a larger Image Base or when linking with the /LARGEADDRESSAWARE option.
Bearing in mind that the image size is limited to 2GB by the above arrangements, it might be thought that the advantages of RIP-relative addressing are somewhat limited. This seems to be the case. It appears that the only advantage is that it lessens the number of relocations which would need to be carried out by the loader if a DLL is loaded at an address which is unexpected. The loader then would need to adjust all absolute addresses to suit the actual image base, but relative addresses would not have to be altered since they refer to other parts of the virtual image of the executable itself. However, it is good practice for the programmer to choose a suitable image base at link-time to avoid the need for relocations in a DLL in the first place. A good example of this is the system DLLs themselves. They all have a different image base which effectively avoids any prospective clashes of the image in memory which would require relocation at load-time.

Call address sizes

In 64-bit assembly, a simple call to a code label will be coded as an E8 RIP-relative call, using a dword to provide the offset from RIP. The destination of this call might be an internal code label (ie. a procedure or function within the executable itself). Or it might be to an external code label, such as an API in a system Dll or to a code label exported by another exe or Dll. The first destination of a call to an external code label is to the Import Address Table which is part of the executable itself. This table is written over by the loader when the executable starts. Therefore during run-time the table contains the absolute addresses in virtual memory of the eventual destination of the call. In a 64-bit executable, the table contains 64-bit values, so the E8 RIP-relative call is capable of calling a procedure or function anywhere in memory.

Calls to memory addresses either held in a label, or in registers, or in memory pointed to by registers, however, are dealt with in a different way. They are not channelled through the Import Address Table. These calls must also permit the destination of the call to be anywhere in memory. In order to achieve this they must themselves use 64-bit absolute addresses. Examples of these types of calls are:-

Here you need to be careful that you are in fact giving a qword to the call, and not just a dword.
See some pitfalls to avoid when converting existing source code.

Changes to Windows data types

Here is a list of the changes to data types between 32 and 64-bits:-

All handles now qwords not dwords

exceptions:- HRESULT, HFILE which remain dwords, and HALF_PTR (see below)

All pointers now qwords not dwords

exceptions:- HALF_PTR, and UHALF_PTR which are now dwords instead of a word and POINTER_32 which remains a 32-bit pointer

WPARAM and LPARAM now qwords not dwords

Here is a list of the data types which remain the same:-

Using the switched type indicator

The above change of a data type may require a corresponding change to a type indicator. The letter P is reserved as a type indicator in all situations when GoAsm might expect to find one. So you can have this switch:- P can be switched to the equivalent of any of the pre-defined type indicators that is B, W, D, Q or T. In this case it is switched either to Q (value 8) or to D (value 4). Therefore you can control the size of the instruction with it, for example:-

Alignment requirements

The requirements of the system in Win64 for correct alignment of the stack pointer, data, and structure members are much stricter than in Win32. Wrong alignment can cause at best a loss of performance and at worst, an exception or program exit.

Stack alignment

The stack pointer (RSP) must be 16-byte aligned when making a call to an API. However, this is organised automatically by GoAsm if you use INVOKE (see automatic stack alignment).
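If you are organising the stack yourself, the arithmetic can be sketched as follows. This is a minimal illustration in Python and is not necessarily how GoAsm does it. The function name aligned_reserve and the example stack pointer value are hypothetical. It assumes 8 bytes for each parameter, with a minimum of 32 bytes for the four parameters sent in registers:-

```python
def aligned_reserve(rsp, parameter_count):
    """Bytes to subtract from RSP so that it is 16-byte aligned at the CALL."""
    # Room for every parameter, but never less than the 32 bytes
    # needed for the four parameters which are sent in registers.
    needed = 8 * max(parameter_count, 4)
    # Round the new stack pointer down to a 16-byte boundary.
    new_rsp = (rsp - needed) & ~0xF
    return rsp - new_rsp


# On entry to a procedure RSP is typically 8 bytes off a 16-byte boundary
# because the return address has just been pushed.
rsp_on_entry = 0x12FF58
for count in (4, 5, 6):
    print(count, aligned_reserve(rsp_on_entry, count))
```

Because the rounding is downwards, any extra padding lies above the parameters, so their positions relative to the new RSP are not disturbed.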

Data alignment

All data must be aligned on a 'natural boundary'. So a byte can be byte-aligned, a word should be 2-byte aligned, a dword should be 4-byte aligned, and a qword should be 8-byte aligned. A tword should also be qword aligned. GoAsm deals with alignment automatically for you when you declare local data (within a FRAME or USEDATA area). But you will need to organise your own data declarations to ensure that the data is properly aligned. The easiest way to do this is to declare all qwords first, then all dwords, then all words and finally all bytes. Twords (being 10 bytes) would put out the alignment for later declarations, so you could declare all those first and then put the data back into alignment ready for the qwords by using ALIGN 8.

As for strings, in accordance with the above rules, Unicode strings must be 2-byte aligned, whereas ANSI strings can be byte aligned.

When structures are used they need to be aligned on the natural boundary of the largest member. All structure members must also be aligned properly, and the structure itself needs to be padded to end on a natural boundary (the system can write in this area). Because of the importance of this, from Version 0.56 (beta), GoAsm aligns structures automatically for you. See automatic alignment and padding of structures and structure members for more.

Windows structures in 64-bit programming

Windows often uses structures to send and receive information using the APIs. In 64-bits these structures are likely to be significantly different from their 32-bit counterparts because of the enlargement of many data types to 64-bits. See changes to Windows data types.

Take for example the WNDCLASS structure which is used when you want to register a window class:-

A number of the members are now qwords, whereas previously they were dwords as you can see from the 32-bit version below. The class style at offset +0h remains a dword, but then in the 64-bit version, padding of four bytes is required because the next member is a qword. This complies with the requirement that structure members are aligned on their natural boundary. A qword is used to provide space for the pointers firstly to the window procedure itself at +8h, to the menu name at +38h and to the window class name at +40h. This is despite the fact that 64-bit programming as implemented by Win64 for the AMD64 processor only uses 32-bit pointers where those pointers give the addresses of internal data. Presumably the reason for this is that the same structures are being used here as are used for the IA64 family of processors (which use 64-bit pointers to internal data). Handles in the structure are also enlarged to 64-bits.

Here is another example, this time the structure DRAWITEMSTRUCT. First, let's have a look at the 32-bit version in the form you would find it in the SDK:-

In 64-bits this structure becomes:-

It is also a requirement that the structure is enlarged so that it ends on the natural boundary of its largest member. This is achieved by adding the necessary padding at the end of the structure. So PAINTSTRUCT becomes:-

In practice it was found that the system wrote to the area of padding at +44h when using PAINTSTRUCT in certain circumstances. This shows the importance of complying with these rules (otherwise you could find that data after the structure could be written over).

Note that the beginning of structures must be aligned on the natural boundary of the largest member as well. All the above rules ensure, therefore, that qwords in the structure are always qword aligned.

Automatic alignment and padding of structures and structure members

As we have seen correct alignment of structures and structure members is crucial for proper operation of 64-bit code. Unfortunately the Windows header files containing the structure definitions do not necessarily contain the necessary padding to achieve such alignment.

So from Version 0.56 (beta), GoAsm does this work automatically for you as follows:-

  1. GoAsm always aligns the structure itself to the correct data boundary.
  2. GoAsm always pads if necessary to ensure that structure members are on their natural boundary. So in the MSG structure example below, the padding at +0Ch could be left out. It would be inserted automatically.
  3. GoAsm always adds padding at the end of a structure so that the structure ends on a natural boundary. So in the example below the padding at +2Ch could be left out. It would be inserted automatically.
  4. The symbols created when using a structure are automatically adjusted to suit the alignment and padding which is applied.
You can see what alignment and padding GoAsm has added to your source code if you specify /l in GoAsm's command line. This will create a list file. Also you can view the effect in a debugger.
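The rules applied in the numbered steps above can be sketched as a small layout calculation. This is an illustration in Python of the rules only, not of GoAsm's actual code. The function name lay_out is hypothetical, and it assumes every member is a simple byte, word, dword or qword whose natural boundary equals its size (the POINT member of MSG is written out as its two dwords):-

```python
def lay_out(members):
    """members is a list of (name, size in bytes) for simple members.

    Returns (offsets, total size) after padding each member to its
    natural boundary and padding the end of the structure to the
    natural boundary of its largest member.
    """
    offsets = {}
    offset = 0
    largest = 1
    for name, size in members:
        # Pad so that the member starts on its natural boundary.
        offset = (offset + size - 1) // size * size
        offsets[name] = offset
        offset += size
        largest = max(largest, size)
    # Pad the end of the structure to the boundary of the largest member.
    total = (offset + largest - 1) // largest * largest
    return offsets, total


MSG64 = [("hwnd", 8), ("message", 4), ("wParam", 8), ("lParam", 8),
         ("time", 4), ("pt.x", 4), ("pt.y", 4)]

offsets, total = lay_out(MSG64)
for name, offset in offsets.items():
    print("+%02Xh %s" % (offset, name))
print("size %02Xh" % total)
```

For the 64-bit MSG structure this places wParam at +10h (so there is padding at +0Ch) and gives a total size of 30h (so there is padding at +2Ch).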

Structures - the overall picture

  • If you are writing source code for both 32 and 64-bit versions of your program, this will be made much easier if you use conditional assembly to switch the correct structures at assemble-time, and then instead of filling the structures using the offset values, you fill them using the member names. Using this method, GoAsm finds the correct offset for you automatically. This technique has been used in the demonstration file Hello64World3.
  • You can use conditional assembly to switch whole banks of structures in one go. These can be contained in include files containing 32-bit structures and 64-bit structures respectively.
  • Since GoAsm aligns and pads the structures automatically for you, you can use the 64-bit structure definitions already available in include files, or you can make your own from the Windows header files using Wayne J Radburn's xlatHinc utility.

Choice of register

  • One main thing to remember is that all Windows handles are 64-bits so the APIs will provide them in RAX rather than in EAX.
  • The same goes for Windows pointers. For example you may ask Windows for some memory. The address of the memory will be returned in RAX and not in EAX.
    So this means that saving a returned handle or pointer from EAX is bad 64-bit coding, whereas saving it from RAX is good.
  • Since all pointers to internal data and code labels are 32-bits, in theory it is possible to use the 32-bit versions of the general purpose registers (EAX to ESP) for all such pointers so for example, you could use MOV [ESI],AL instead of MOV [RSI],AL.

    However, I do advise against this for the following five reasons:-

    1. It means you have to keep track of which pointers are internal ones and which are external ones. You must allow for the external ones being 64-bits.
    2. You may need two sets of procedures which are oft-used in your program, one using 32-bit register pointers and one using 64-bit register pointers.
    3. The string instructions such as LODSB, MOVSW, STOSD, CMPSQ and SCASB use RSI and RDI in a 64-bit program rather than ESI and EDI. And the repeat prefixes REP, REPZ and REPNZ use RCX instead of ECX.
    4. Using the 32-bit versions of these instructions in a 64-bit program produces code which is one byte larger than the 64-bit version. This is because in a 64-bit program, MOV [RSI],AL is the default and to convert this to MOV [ESI],AL requires a 67h override byte.
    5. You can still use the same source code to make both 32-bit and 64-bit programs provided you only use the general purpose registers, RAX to RSP. This is because when you use the /x86 switch with GoAsm these registers are automatically regarded as EAX to ESP instead.

    You can automate the required changes to existing 32-bit code using AdaptAsm.

  • If you need to use the R8 to R15 registers, remember that R8 to R11 are volatile (they will not be maintained by the APIs). If you use the non-volatile R12 to R15 registers within window procedures and callback procedures then you must ensure that they are restored after use. This can be done by using PUSH at the beginning and POP at the end of the procedure which uses them, or by using the USES statement.
  • When passing parameters to an API using INVOKE, you may need to take into account that in the FASTCALL calling convention the parameters have to be sent to the API in the RCX,RDX,R8 and R9 registers. Therefore you would not wish to pass parameters in registers which will be overwritten by GoAsm (you will get an error message if you try to do this).

    For example this is bad and will show an error:- It's bad because if it were allowed, it would translate to:- so it can be seen that the contents of the registers are being overwritten before they are being used to establish the parameters.

    Better would be:- Which translates to:- Note that GoAsm does not bother to code MOV R8,R8
    Even better would be:- which requires no further code to pass the parameters since they are already in the correct registers. So this is very efficient code.

See also some tips to reduce the size of your code which has some additional implications for your choice of registers
and also some pitfalls to avoid when converting existing source code.

Zero-extension of results into 64-bit registers

Take care when mixing the 64-bit registers and their 32-bit counterparts because the processor can change the contents of the whole 64-bit register when this is not obvious. This is because when writing results to a 32-bit register the processor will zero-extend the result into the whole 64-bits of the register. So, for example:- but the processor will zero extend the result into RAX, in other words it will zero the whole of the high dword of RAX. The result in RAX is 00000000 0F0F0F0Fh not 0FFFFFFFF 0F0F0F0Fh as expected. This happens irrespective of the value of bit 31 of RAX (this is not the same as sign-extension).

A similar thing happens when using other instructions. Here is an example with XOR:- The actual result in RAX is zero.

And it also happens with the MOV instruction, for example:- The result is RCX=88888888h.
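The zero-extension rule can be checked with a small C model. This is a sketch, not processor documentation: the function names are made up for the illustration, and a 64-bit register is represented simply as a uint64_t. A 32-bit write replaces the whole register with the zero-extended result, while a 16-bit write leaves the upper bits alone.

```c
#include <stdint.h>

/* Write to the 32-bit form of a register (for example EAX):
   the high dword of the 64-bit register is cleared. */
static uint64_t write_reg32(uint64_t old64, uint32_t value32) {
    (void)old64;                 /* the old contents play no part */
    return (uint64_t)value32;
}

/* Write to the 16-bit form (for example AX): upper 48 bits are kept. */
static uint64_t write_reg16(uint64_t old64, uint16_t value16) {
    return (old64 & ~(uint64_t)0xFFFF) | value16;
}

/* AND EAX,mask modelled on the full RAX value. */
static uint64_t and_eax(uint64_t rax, uint32_t mask) {
    return write_reg32(rax, (uint32_t)rax & mask);
}

/* XOR EAX,EAX modelled on the full RAX value. */
static uint64_t xor_eax_eax(uint64_t rax) {
    uint32_t eax = (uint32_t)rax;
    return write_reg32(rax, eax ^ eax);
}
```

Starting from a register filled with ones, and_eax with mask 0F0F0F0Fh gives 00000000 0F0F0F0Fh, matching the result described above.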

You can take advantage of zero-extension in various ways. Some examples are given in some tips to reduce the size of your code. Take also this example, where the structure RECT (which is four dwords) contains values which must be passed to the API MoveWindow as qwords:- Here only 32-bit registers are used to extract the information from the RECT structure, but we know that the high part of the 64-bit version of each of those registers is set to zero.

It is possible that there is a performance loss in relying on zero-extension. Some of the documentation suggests that the processor has to carry out an additional operation to zero the high bits of the register.

Sign-extension of results into qwords

You may wonder about the difference between the following instructions:- These code differently and do different things. The dword version places the value 12345678h into the dword at the label THING as you would expect. The qword version does the same, but also zeroes the dword at THING+4. This is because it sign-extends the result into the qword at the label THING. So if the high bit is set, the qword version will fill THING+4 with 0FFFFFFFFh. In other words, the 32-bit values in these instructions are regarded as signed numbers, and written to memory accordingly.

The same happens if you use a register to address the data area for example:- Note that you can't put more than 4 bytes into memory directly using the MOV instruction even though you are using 64-bit code, so this shows an error:- Instead, to achieve this result you would use the following code:-

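The difference between the dword and qword stores of a 32-bit immediate can be modelled in C. This is a sketch with hypothetical function names: the qword at THING is represented as a uint64_t, and the sign-extension is written out by hand so that the result does not depend on compiler-specific conversions.

```c
#include <stdint.h>

/* Qword store of a 32-bit immediate: bit 31 of the immediate is
   copied into every bit of the high dword. */
static uint64_t store_qword_imm32(uint32_t imm32) {
    if (imm32 & 0x80000000u)
        return 0xFFFFFFFF00000000ull | imm32;   /* high bit set: fill with ones */
    return (uint64_t)imm32;                     /* high bit clear: fill with zeros */
}

/* Dword store of the same immediate: only the low dword changes,
   whatever was at THING+4 stays as it was. */
static uint64_t store_dword_imm32(uint64_t old_qword, uint32_t imm32) {
    return (old_qword & 0xFFFFFFFF00000000ull) | imm32;
}
```

So a qword store of 12345678h leaves zero at THING+4, while a qword store of 87654321h leaves 0FFFFFFFFh there.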
Automatic stack alignment

The stack pointer (RSP) must be 16-byte aligned when making a call to an API. With some APIs this does not matter, but with other APIs wrong stack alignment will cause an exception. Some APIs will handle the exception themselves and align the stack as required (this will, however, cause performance to suffer). Other APIs (at least on early builds of x64) cannot handle the exception and unless you are running the application under debug control, it will exit.

Because of this requirement, the Win64 documentation states that you can only call an API within a stack frame. This is because it is assumed that only within a stack frame can the stack be guaranteed to be aligned properly. A call out of the stack frame will misalign the stack by 8 bytes.

This requirement is very restrictive to assembler programmers, and causes compilers a big headache. GoAsm's solution to this problem is to insert special coding before and after each API call (when INVOKE is used) to ensure that the stack is always properly aligned at the time of the call. This liberates the assembler programmer, and means that:-

  • Calls to APIs (using INVOKE) can be made anywhere in your code. They can be made from procedures called by other procedures without worrying about the stack pointer.
  • PUSHes and POPs can be used in the usual way to save and restore registers, memory addresses and contents of memory without having to worry that this puts the stack out of alignment.
  • You can use the same source code both for 32-bit and 64-bit versions of your application (there is no requirement for stack alignment in 32-bits).
The overhead for aligning the stack at the time of each API call is an additional nine bytes per API, which seems a small price to pay for the advantages gained. To keep down the size of the code as much as possible, GoAsm takes a number of opportunities to optimise the code particularly when inserting the parameters. See some optimisation done by GoAsm for details. See also coding to achieve automatic stack alignment.
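The arithmetic behind the alignment requirement is simple enough to sketch in C. The following is not the code GoAsm emits; it is only a model, with hypothetical function names, of why a single 8-byte PUSH (or the return address stored by a CALL) takes an aligned stack pointer out of alignment, and of how rounding down to a multiple of 16 restores it.

```c
#include <stdint.h>

/* True when the stack pointer is a multiple of 16. */
static int is_aligned16(uint64_t rsp) {
    return (rsp & 0xF) == 0;
}

/* Effect on RSP of pushing one qword (a PUSH, or the return
   address stored by CALL): the stack grows down by 8 bytes. */
static uint64_t push_qword(uint64_t rsp) {
    return rsp - 8;
}

/* Round RSP down to the next 16-byte boundary. */
static uint64_t align_down16(uint64_t rsp) {
    return rsp & ~(uint64_t)0xF;
}
```

Rounding down rather than up matters because the stack grows toward lower addresses, so the bytes skipped are unused space below the data already pushed.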

Using the same source code for both 32 and 64-bits

The GoAsm manual describes the use of ARG and INVOKE in the section dealing with calls to Windows APIs in 32-bits and 64-bits and the use of FRAME...ENDF in the section dealing with callback stack frames in 32-bits and 64-bits. GoAsm's ARG and INVOKE and FRAME...ENDF constructs effectively deal with the changes in the calling convention in 64-bit programming.

Bringing together all those considerations and also those set out above, it is perfectly possible to use the same source code to create executables for both 32-bit and 64-bit platforms.

To recap, here are the rules which must be followed to do this:-

  • When calling APIs use INVOKE in your code instead of CALL.
  • When passing parameters to APIs use ARG in your code instead of PUSH, alternatively give the parameters after INVOKE.
  • Use FRAME .. ENDF in your code when using LOCAL data or picking up parameters sent to a window procedure (or other similar callback procedure).
  • If you want to use the new registers R8-R15, XMM8-XMM15, or the new 8, 16 and 32-bit addressed registers, make sure they are used only within switched 64-bit source code using conditional assembly.
  • Use the 64-bit form of the general purpose registers (RAX,RBP,RBX,RCX,RDX,RDI,RSI, and RSP) for pointers. When GoAsm assembles for 32-bit, it will automatically reduce these registers to their 32-bit counterparts.
  • If you have used PUSHFD and POPFD to save and restore the flags, change this to PUSHF and POPF or PUSH FLAGS and POP FLAGS.
  • Ensure that structures, data sizes, and type indicators are correct for 32/64-bit use, if necessary by using conditional assembly.
  • Use /x64 in the command line to create a 64-bit executable, and /x86 in the command line to create a 32-bit executable.
The 'Go' tools will do the rest of the work.

Note that /x86 should not be used in the command line for Win32 source code (use it only for 32/64-bit switchable source code).

See the file Hello64World3 for example source code which can make either a simple Win32 'Hello World' Window program or a Win64 one.

Converting existing 32-bit code to 64-bit

Bringing together all the above considerations, this is what you need to do to convert existing 32-bit source code to 64-bit source.
  • Change all CALLs to APIs to INVOKE. Do not change any CALLs to non-APIs.
  • If you have used PUSH to send parameters to an API in your 32-bit source, change this to ARG. Do not use ARG for any other PUSHes.
  • Change all the 32-bit general purpose registers used as pointers (that is, within square brackets) to their 64-bit counterparts (RAX,RBP,RBX,RCX,RDX,RDI,RSI, and RSP). This will keep your code shorter, and ensure that pointers to external data work properly. Remember also to use only RSI, RDI and RCX with your string instructions and repeat prefixes. See choice of registers.
  • Ensure that registers which contain system handles and other values provided by the system are changed to their 64-bit counterparts (RAX,RBP,RBX,RCX,RDX,RDI,RSI, and RSP).
  • Adjust all other register use as required. Generally for other use, the existing registers will work perfectly well, but do not mix the use of 32-bit and 64-bit registers because of zero-extension of results. There is no need to change PUSHes and POPs of registers. These changes are done automatically by GoAsm because the opcodes are the same (for example PUSH EAX is regarded the same as PUSH RAX and vice versa).
  • Ensure that structures, data sizes, and type indicators are correct for 64-bit use.
  • Check that your JECXZ instructions are changed to JRCXZ if appropriate.
  • Since 64-bit code tends to be a little larger than 32-bit code, when you re-assemble your code using the /x64 switch, you may find that some short jumps have to be re-organised.
AdaptAsm can do some of the above work for you.

Using AdaptAsm.exe to help with the conversion

AdaptAsm comes packaged with GoAsm and I originally wrote it to help to convert source code used for other assemblers to GoAsm syntax. I have now extended it to help towards the conversion of 32-bit source code to 64-bit source code. This works both on GoAsm source code and also source code for other assemblers.

For full details of AdaptAsm's other rôles see the GoAsm manual.

You use AdaptAsm from the command line using the following:- If no input extension is specified, .asm is assumed.
If no output extension is specified, .adt is assumed
The command line switches are:-
What AdaptAsm does when helping to adapt a file to 64-bits using the /x64 switch
  • CALLs to APIs are changed to INVOKE (CALLs to non-APIs are not affected). AdaptAsm does this by looking at lists of APIs in '.h.txt' files in the same folder as AdaptAsm.exe. See the '.h.txt' files for more information about these files. This works with all types of calls even if enclosed in square brackets and even if dependent on a define (equate) or a switch, for example:-
  • PUSH is changed to ARG for the parameters sent to the API. AdaptAsm does this by counting the correct number of parameters back from the CALL and comparing this with the correct number of parameters in the lists of APIs in '.h.txt' files in the same folder as AdaptAsm.exe. See the '.h.txt' files for more information about these files.
    Here are some simple examples:- You may have preserved registers across API calls and these are unaffected, for example:- However, if you have mixed these two uses of PUSH, AdaptAsm will show an error by changing the PUSH to ARG and noting the problem in the log file:- If AdaptAsm cannot find all the expected parameters, it shows an error by changing the CALL to INVOKE and noting the problem in the log file, for example:- This means that this type of thing, which could be done in 32-bits, will show up as an error by AdaptAsm (and rightly so, since in 64-bit assembler each CALL must immediately follow the parameters):-
  • 32-bit general purpose registers in square brackets are changed to their 64-bit counterparts so that they can be used for both 32-bit and 64-bit assembly, for example:-
  • Where a pointer is used with a 32-bit general purpose register, the register is changed to its 64-bit counterpart, for example:-
  • Although not strictly necessary, for good measure 32-bit general purpose registers after PUSH, POP and INVOKE are changed to their 64-bit counterparts, for example:-

What AdaptAsm does not do (and you need to do by hand)
  • AdaptAsm cannot decide for you which register to use in other circumstances. You will have to decide this on a case-by-case basis; see choice of registers for some guidance on this.
  • AdaptAsm does not ensure that structures and data sizes are correct for 64-bit use, nor that the pointers to structures and strings are properly aligned.

The 'h.txt' files used by AdaptAsm with the /x64 switch

These files are text files containing lists of APIs and the number of parameters required by each API. AdaptAsm looks inside its own folder for such h.txt files. The 'h.txt' files are created from Microsoft header files using a clever javascript file ApiParamCount.js, written by Leland M George of West Virginia, who has kindly donated it to the public domain. This js file is shipped with AdaptAsm together with some ready-made h.txt files containing the most commonly used APIs. If your program uses APIs declared in other header files you can make your own 'h.txt' files using the js file. There are two ways to use the js file:-
  • Either drag and drop the header file onto the js file (an h.txt file will be made in the same folder)
  • From the command line using the following command (for example):-
    cscript ApiParamCount.js WinNT.h
    or
    wscript ApiParamCount.js WinNT.h
    which commands start the Windows Scripting Host which handles JavaScript files outside Web page environments.

    If you need to download the Windows Scripting Host you can get it from this Microsoft site.

Alternatively you can make your own h.txt file or edit the existing ones. The format is as follows:-
  • The first API name must start at the beginning of the file and subsequent ones at the beginning of a line.
  • New lines are made using carriage return (ascii 13) followed by linefeed (ascii 10).
  • A comma immediately follows the API name.
  • The number of parameters required by the API immediately follows the comma and is written as an ascii decimal character. If the API does not take any parameters the number is zero.
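The line format described above (API name, comma, decimal parameter count, then carriage return and linefeed) is easy to read back. The C sketch below shows one way to parse a single line; parse_api_line is a hypothetical helper written for this illustration and is not part of AdaptAsm or ApiParamCount.js.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse one h.txt line of the form "ApiName,count\r\n".
   On success, copies the name into 'name' (capacity name_cap bytes),
   stores the count in *param_count and returns 1. Returns 0 if the
   line has no name, no comma, no digit after the comma, or if the
   name does not fit into the buffer. */
static int parse_api_line(const char *line, char *name, size_t name_cap,
                          int *param_count) {
    const char *comma = strchr(line, ',');
    if (comma == NULL || comma == line)
        return 0;                              /* missing comma or empty name */
    size_t len = (size_t)(comma - line);
    if (len >= name_cap)
        return 0;                              /* name too long for buffer */
    if (!isdigit((unsigned char)comma[1]))
        return 0;                              /* count must follow the comma directly */
    memcpy(name, line, len);
    name[len] = '\0';
    *param_count = (int)strtol(comma + 1, NULL, 10);  /* stops at CR/LF */
    return 1;
}
```

For example, the line "MessageBoxA,4" followed by CR and LF yields the name MessageBoxA and a count of 4.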

Switching using x64 and x86 in conditional assembly

As well as switching to 64-bit or 32-bit assembly, specifying /x64 or /x86 in GoAsm's command line also permits these words to be tested in conditional assembly. So, for example, you can switch two different generalised window procedures in this way:- Note that the words 'x64' and 'x86' are not case sensitive.

Here is another example to switch include files including structures:-

Some pitfalls to avoid when converting existing source code

  • Forgetting that API parameters are always qwords.
    Your existing 32-bit source code will have been written on the correct assumption that each parameter is a dword. For example:- In 32-bits this is good coding because there is a dword at [SYSTEM_INFO+4h] (the dword here holds the system's memory page size; this assumes the structure was filled in using a call to the GetSystemInfo API).
    In 64-bits this is bad because the value at +4h is still a dword, but you are now sending a qword to VirtualFree and not just a dword. This should be coded as follows instead:- Note that in practice, because the MOV EAX line itself zeroes the top part of RAX, you could remove the first line of this example altogether!

    A similar problem arises when interrogating the system and receiving information into data. Your existing 32-bit code may well look something like this:- Here the call puts a 32-bit value into the dword SIZEOF_WORKAREA which is correct. However assembling and running the same code in a 64-bit system would overwrite the next dword in memory as well (a qword is sent not a dword). So you need to enlarge SIZEOF_WORKAREA to a qword.

  • Forgetting that all calls are now to 64-bit values.
    This can easily be forgotten when using tables to control movement of execution around your code. Take the case of a simple table of labels for example:- or This will call a 64-bit address with CODELABEL's address in the low dword and 2 in the high dword. This will produce an error at run-time. The solution for internal calls is to code as follows:- or This code ensures that the high dword of the 64-bit address holds zero. This works because all pointers to internal data and code labels are 32-bits.
  • Forgetting that all Windows handles are now 64-bit values.
    In Win64, system handles are enlarged to 64-bits so it is unsafe to assume that they will always fit into 32-bits.
    So this means that:- is bad 64-bit coding, whereas is correct.
  • Forgetting that all POPs are now to qwords.
    Your existing 32-bit source code may POP into dwords in memory. For example:- In 64-bits a RECT structure is still 4 dwords just as it was in 32-bits. However the second POP in the above code would rub out the second dword in the structure because the POP is in fact 64-bits, not 32-bits.

    Correct coding for 64-bits would be:-

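Several of the pitfalls in this section come down to the same thing: an 8-byte access aimed at a 4-byte field also touches the field next to it. The C sketch below models that on a 16-byte buffer laid out like a RECT (four dwords). The helper names are hypothetical, and the bytes are written in little-endian order by hand so that the model behaves the same on any host.

```c
#include <stdint.h>

/* Store the low 'nbytes' bytes of 'value' at 'dest', lowest byte first. */
static void store_le(uint8_t *dest, uint64_t value, int nbytes) {
    for (int i = 0; i < nbytes; i++)
        dest[i] = (uint8_t)(value >> (8 * i));
}

/* Read a dword stored lowest byte first. */
static uint32_t load_le32(const uint8_t *src) {
    return (uint32_t)src[0] | ((uint32_t)src[1] << 8) |
           ((uint32_t)src[2] << 16) | ((uint32_t)src[3] << 24);
}

/* Read a qword: the dword at 'src' ends up in the low half and the
   neighbouring dword at 'src + 4' in the high half. */
static uint64_t load_le64(const uint8_t *src) {
    return (uint64_t)load_le32(src) | ((uint64_t)load_le32(src + 4) << 32);
}
```

Calling store_le(rect, 5, 8) on a buffer holding left, top, right, bottom writes 5 into left and zero into top, which is the "rubbed out" second dword described for the qword POP. Likewise load_le64(rect + 4) returns top and right packed together, which is what a qword read of a dword field delivers.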
Assembling and linking to produce the executable

To make a 64-bit object file with GoAsm use this command line:-
    GoAsm /x64 filename
where filename is the name of your asm file written either as a 64-bit source file or a 32/64 switchable source file. Use /x86 instead of /x64 when assembling a 32/64 switchable source file to make a 32-bit version.
The object file created by GoAsm can be sent to GoLink or another linker in the usual way.
GoLink automatically senses whether the object file is 32 or 64-bit and creates the correct type of executable to suit.
You cannot mix 32-bit and 64-bit object files. GoLink will show an error if you try to do this.
You do not necessarily need to make 64-bit executables on a 64-bit machine. This is because the DLL names given to GoLink simply tell the linker that the DLL contains the APIs used by the application and these tend to be the same between the two platforms. If your application calls APIs specific to the 64 bit system however, this does not work.

Some optimisation and refinement done by GoAsm

GoAsm always aims to produce the tightest possible code from your source. In the case of x64, GoAsm has not yet taken up all opportunities to optimise the code. This is because there are still some unknowns, such as effects on performance of optimised code on x64.

The optimisations and refinements are listed here to help you when you look at the code produced by GoAsm in the debugger.

GoAsm optimisations and refinements in all code

None of these affect the flags or adversely affect performance.
  • MOV 64-bit register,ADDR label changed to LEA 64-bit register,label. This saves 5 bytes. One important difference between the two instructions is that the MOV version uses an absolute relocation (hence in theory it needs to leave space for a 64-bit value to be inserted by the linker). The LEA instruction uses RIP-relative addressing and so it can do the same job but requires only a 32-bit space for the relative address.
  • PUSH or ARG ADDR Non_Local_Label also uses LEA as well as the R11 register as follows:- See explanation for this. Note that this will also take place with INVOKE when pushing arguments with ADDR, which also includes use of pointers to a string or raw data (ex. 'Hello' or <'H','i',0>).

This affects the flags.
  • PUSH or ARG ADDR Local_Label is coded as follows:-

Additional optimisations and refinements only when INVOKE is used

These may affect the flags which does not matter when calling an API. Those that rely on zero-extension may require another operation from the processor, but it is assumed that this does not matter when calling an API. It is more important to keep the code size down.
  • A register parameter containing zero is optimised using XOR 32-bit register. This is a saving of between 7 and 8 bytes over the MOV equivalent.
  • A register parameter containing a number (an 'immediate') which can fit into 32-bits is changed to use a 32-bit register, saving between 1 and 5 bytes depending on the register and the number.
  • A register parameter containing -1 is achieved by using OR 64-bit register,-1 saving 6 bytes.
  • If the parameter is already in the correct register no further code is emitted because it is not required.
  • The coding to achieve automatic stack alignment and to adjust the stack for the FASTCALL calling convention is as follows (which one is used depends on the number of parameters):- or

Some tips to reduce the size of your codetop

Note that some of these optimisations may adversely affect performance.
  • Using the 64-bit registers (RAX to RSP) as pointers to memory (for example MOV [RSI],AL) saves a byte over using the 32-bit versions (for example MOV [ESI],AL). This is because in such instructions a 67h override byte is needed for the 32-bit version.
  • The opposite is the case when you use registers to hold immediates (numbers). In those cases using the enlarged registers (RAX to RSP) and the extended registers (R8 to R15) or any of the new register addressing methods, adds at least a byte to each instruction. For example, MOV RAX,23456h is 2 bytes larger than MOV EAX,23456h. The contrast is even greater using larger numbers which are above 7FFFFFFFh because these have to be coded as full 64-bit numbers if you use a 64-bit register. So for example MOV RAX,80234560h codes 5 bytes larger than MOV EAX,80234560h. If the number you wish to move will fit into a byte, then even greater savings can be achieved, for example MOV AL,88h codes as 2 bytes, but MOV RAX,88h is 10 bytes.
  • DEC and INC (with a register) now use two bytes, whereas in x86 processors they were very frugal, using only one byte. But there is still an advantage in using these over SUB register,1 or ADD register,1, which are one byte longer. SUB or ADD can still be used if you need to test the carry flag after the instruction.
  • In 64-bit programming LEA register,Label is 5 bytes shorter than MOV register,ADDR Label yet they achieve the same result. In GoAsm source code however, you can use either since GoAsm automatically uses the shortest form.
  • PUSH ADDR THING codes as 9 bytes, whereas if you use LEA RAX,THING followed by PUSH RAX instead, this is 8 bytes. However, it changes the content of the RAX register.
  • Zero a register using XOR. XOR RAX,RAX is 3 bytes, whereas MOV RAX,0 is 10 bytes (because the instruction takes a 64-bit immediate value). However, XOR affects the flags, MOV does not.
  • XOR EAX,EAX is even shorter at 2 bytes and it does zero the whole RAX register. See zero-extension of results.
  • A good way to fill a register with -1, is to use OR register,-1 which in the case of a 64-bit register is 4 bytes, a saving of 6 bytes over MOV register,-1. However, OR affects the flags, but MOV does not.
  • Compares in the range -80h to +7Fh code as 4 bytes (eg. CMP RDX,-80h to RDX,7Fh) but outside that range they code as 7 bytes (so eg. CMP RDX,80h is 7 bytes).
  • You can still use LEA to do intra-register arithmetic for example LEA RAX,[RAX+RAX*2] which multiplies RAX by three. This codes as 4 bytes.
See also general tips for programming in GoAsm help.
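The size rule for compares given in the list above can be written as a small C function. This is a sketch for a register such as RDX only; cmp_r64_imm_length is a hypothetical name, and the byte counts assume the usual encoding of a REX prefix, an opcode byte, a ModRM byte and either a one-byte or a four-byte sign-extended immediate.

```c
#include <stdint.h>

/* Encoded length in bytes of CMP r64,imm for a register such as RDX,
   or -1 when the immediate does not fit a sign-extended 32-bit field
   and so cannot be encoded in a single CMP instruction. */
static int cmp_r64_imm_length(int64_t imm) {
    if (imm >= -0x80 && imm <= 0x7F)
        return 4;   /* prefix + opcode + ModRM + 1-byte immediate */
    if (imm >= INT32_MIN && imm <= INT32_MAX)
        return 7;   /* prefix + opcode + ModRM + 4-byte immediate */
    return -1;
}
```

The jump from 4 to 7 bytes at 80h is why keeping loop limits and comparison constants inside the signed byte range pays off when code size matters.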

More information, references and links

Information about the AMD64
AMD information for developers
AMD and industry partners' AMD64 site
František Gábriš much early 64-bit work including sample source code.
Intel 64 Technology site

Newsgroups and forums:-
64-bit assembler forum
AMD developer forum
Planet 64
Extended 64
Start64 forum

Copyright © Jeremy Gordon 2006-2016

By Chris Lomont

Introduction

For years, PC programmers used x86 assembly to write performance-critical code. However, 32-bit PCs are being replaced with 64-bit ones, and the underlying assembly code has changed. This white paper is an introduction to x64 assembly. No prior knowledge of x86 code is needed, although it makes the transition easier.
x64 is a generic name for the 64-bit extensions to Intel's and AMD's 32-bit x86 instruction set architecture (ISA). AMD introduced the first version of x64, initially called x86-64 and later renamed AMD64. Intel named their implementation IA-32e and then EM64T. There are some slight incompatibilities between the two versions, but most code works fine on both versions; details can be found in the Intel® 64 and IA-32 Architectures Software Developer's Manuals and the AMD64 Architecture Tech Docs. We call this intersection flavor x64. Neither is to be confused with the 64-bit Intel® Itanium® architecture, which is called IA-64.
This white paper won't cover hardware details such as caches, branch prediction, and other advanced topics. Several references will be given at the end of the article for further reading in these areas.
Assembly is often used for performance-critical parts of a program, although it is difficult to outperform a good C++ compiler for most programmers. Assembly knowledge is useful for debugging code - sometimes a compiler makes incorrect assembly code and stepping through the code in a debugger helps locate the cause. Code optimizers sometimes make mistakes. Another use for assembly is interfacing with or fixing code for which you have no source code. Disassembly lets you change/fix existing executables. Assembly is necessary if you want to know how your language of choice works under the hood - why some things are slow and others are fast. Finally, assembly code knowledge is indispensable when diagnosing malware.

Architecture

When learning assembly for a given platform, the first place to start is to learn the register set.
General Architecture
Since the 64-bit registers allow access for many sizes and locations, we define a byte as 8 bits, a word as 16 bits, a double word as 32 bits, a quadword as 64 bits, and a double quadword as 128 bits. Intel stores bytes 'little endian,' meaning lower significant bytes are stored in lower memory addresses.
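Little-endian storage is easy to see by laying a value out byte by byte. The following C sketch uses two hypothetical helpers that write and read a double word lowest byte first; because the byte order is spelled out in the code, it gives the same layout on any host.

```c
#include <stdint.h>

/* Store a 32-bit value the way x64 keeps it in memory:
   least significant byte at the lowest address. */
static void store_le32(uint8_t out[4], uint32_t value) {
    out[0] = (uint8_t)(value);
    out[1] = (uint8_t)(value >> 8);
    out[2] = (uint8_t)(value >> 16);
    out[3] = (uint8_t)(value >> 24);
}

/* Rebuild the value from its four bytes. */
static uint32_t load_le32(const uint8_t in[4]) {
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```

Storing 12345678h this way places the bytes 78h, 56h, 34h, 12h at increasing addresses, which is the order a memory dump in a debugger shows.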

Figure 1 shows sixteen general purpose 64-bit registers, the first eight of which are labeled (for historical reasons) RAX, RBX, RCX, RDX, RBP, RSI, RDI, and RSP. The second eight are named R8-R15. By replacing the initial R with an E on the first eight registers, it is possible to access the lower 32 bits (EAX for RAX). Similarly, for RAX, RBX, RCX, and RDX, access to the lower 16 bits is possible by removing the initial R (AX for RAX), and the lower byte of these by switching the X for L (AL for AX), and the higher byte of the low 16 bits using an H (AH for AX). The new registers R8 to R15 can be accessed in a similar manner like this: R8 (qword), R8D (lower dword), R8W (lowest word), R8B (lowest byte MASM style, Intel style R8L). Note there is no R8H.
There are odd limitations accessing the byte registers due to coding issues in the REX opcode prefix used for the new registers: an instruction cannot reference a legacy high byte (AH, BH, CH, DH) and one of the new byte registers at the same time (such as R11B), but it can use legacy low bytes (AL, BL, CL, DL). This is enforced by changing (AH, BH, CH, DH) to (BPL, SPL, DIL, SIL) for instructions using a REX prefix.
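The naming scheme for the sub-registers maps onto simple shifts and truncations. The C sketch below treats RAX as a uint64_t and extracts the parts that EAX, AX, AL and AH name; the function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

static uint32_t part_eax(uint64_t rax) { return (uint32_t)rax; }        /* low 32 bits   */
static uint16_t part_ax(uint64_t rax)  { return (uint16_t)rax; }        /* low 16 bits   */
static uint8_t  part_al(uint64_t rax)  { return (uint8_t)rax; }         /* lowest byte   */
static uint8_t  part_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }  /* bits 8 to 15  */
```

With RAX holding 1122334455667788h, these give EAX=55667788h, AX=7788h, AL=88h and AH=77h. The same truncations describe R8D, R8W and R8B, for which there is no high-byte counterpart.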
The 64-bit instruction pointer RIP points to the next instruction to be executed, and supports a 64-bit flat memory model. Memory address layout in current operating systems is covered later.
The stack pointer RSP points to the last item pushed onto the stack, which grows toward lower addresses. The stack is used to store return addresses for subroutines, for passing parameters in higher level languages such as C/C++, and for storing 'shadow space' covered in calling conventions.
The RFLAGS register stores flags used for results of operations and for controlling the processor. This is formed from the x86 32-bit register EFLAGS by adding a higher 32 bits which are reserved and currently unused. Table 1 lists the most useful flags. Most of the other flags are used for operating system level tasks and should always be set to the value previously read.
Table 1 - Common Flags

Symbol  Bit  Name            Set if...
CF      0    Carry           Operation generated a carry or borrow
PF      2    Parity          Last byte has even number of 1's, else 0
AF      4    Adjust          Denotes Binary Coded Decimal in-byte carry
ZF      6    Zero            Result was 0
SF      7    Sign            Most significant bit of result is 1
OF      11   Overflow        Overflow on signed operation
DF      10   Direction       Direction string instructions operate (increment or decrement)
ID      21   Identification  Changeability denotes presence of CPUID instruction
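The bit positions in Table 1 translate directly into masks. The C sketch below defines them and models four of the flags (CF, ZF, SF, OF) for a 32-bit addition. It is a simplified illustration with hypothetical names: PF and AF are left out, and the value returned contains only the modelled flags, not a complete RFLAGS image.

```c
#include <stdint.h>

/* Masks built from the bit numbers in Table 1. */
enum {
    FLAG_CF = 1u << 0,
    FLAG_PF = 1u << 2,
    FLAG_AF = 1u << 4,
    FLAG_ZF = 1u << 6,
    FLAG_SF = 1u << 7,
    FLAG_DF = 1u << 10,
    FLAG_OF = 1u << 11,
    FLAG_ID = 1u << 21
};

/* Model of CF, ZF, SF and OF after a 32-bit ADD of a and b. */
static uint32_t flags_after_add32(uint32_t a, uint32_t b) {
    uint32_t r = a + b;          /* wraps modulo 2^32, as the hardware does */
    uint32_t f = 0;
    if (r < a)
        f |= FLAG_CF;            /* unsigned result wrapped around */
    if (r == 0)
        f |= FLAG_ZF;
    if (r & 0x80000000u)
        f |= FLAG_SF;
    if ((~(a ^ b) & (a ^ r)) & 0x80000000u)
        f |= FLAG_OF;            /* operands share a sign that the result lacks */
    return f;
}
```

Adding 1 to 0FFFFFFFFh sets CF and ZF (unsigned wrap to zero), while adding 1 to 7FFFFFFFh sets SF and OF (signed overflow into the sign bit).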


The floating point unit (FPU) contains eight registers FPR0-FPR7, status and control registers, and a few other specialized registers. FPR0-7 can each store one value of the types shown in Table 2. Floating point operations conform to IEEE 754. Note that most C/C++ compilers support the 32 and 64 bit types as float and double, but not the 80-bit one available from assembly. These registers share space with the eight 64-bit MMX registers.
Table 2 - Floating Point Types

Data Type           Length (bits)  Precision (bits)  Decimal digits Precision  Decimal Range
Single Precision    32             24                7                         1.18*10^-38 to 3.40*10^38
Double Precision    64             53                15                        2.23*10^-308 to 1.79*10^308
Extended Precision  80             64                19                        3.37*10^-4932 to 1.18*10^4932


Binary Coded Decimal (BCD) is supported by a few 8-bit instructions, and an oddball format supported on the floating point registers gives an 80 bit, 17 digit BCD type.
The sixteen 128-bit XMM registers (eight more than x86) are covered in more detail under SIMD Architecture.
Final registers include segment registers (mostly unused in x64), control registers, memory management registers, debug registers, virtualization registers, performance registers tracking all sorts of internal parameters (cache hits/misses, branch hits/misses, micro-ops executed, timing, and much more). The most notable performance opcode is RDTSC, which is used to count processor cycles for profiling small pieces of code.
Full details are available in the five-volume set 'Intel® 64 and IA-32 Architectures Software Developer's Manuals' at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. They are available for free download as PDF, order on CD, and often can be ordered for free as a hardcover set when listed.
SIMD Architecture
Single Instruction Multiple Data (SIMD) instructions execute a single command on multiple pieces of data in parallel and are a common usage for assembly routines. MMX and SSE commands (using the MMX and XMM registers respectively) support SIMD operations, which perform an instruction on up to eight pieces of data in parallel. For example, eight bytes can be added to eight bytes in one instruction using MMX.
The eight 64-bit MMX registers MMX0-MMX7 are aliased on top of FPR0-7, which means any code mixing FP and MMX operations must be careful not to overwrite required values. The MMX instructions operate on integer types, allowing byte, word, and doubleword operations to be performed on values in the MMX registers in parallel. Most MMX instructions begin with 'P' for 'packed'. They cover arithmetic, shift/rotate, and comparison operations, e.g. PCMPGTB, 'Compare packed signed byte integers for greater than'.
The sixteen 128-bit XMM registers allow parallel operations on four single or two double precision values per instruction. Some instructions also work on packed byte, word, doubleword, and quadword integers. These instructions, called the Streaming SIMD Extensions (SSE), come in many flavors: SSE, SSE2, SSE3, SSSE3, SSE4, and perhaps more by the time this prints. Intel has announced more extensions along these lines called Intel® Advanced Vector Extensions (Intel® AVX), with a new 256-bit-wide datapath. SSE instructions contain move, arithmetic, comparison, shuffling and unpacking, and bitwise operations on both floating point and integer types. Instruction names include such beauties as PMULHUW and RSQRTPS. Finally, SSE introduced some instructions for memory pre-fetching (for performance) and memory fences (for multi-threaded safety).
Table 3 lists some command sets, the register types operated on, the number of items manipulated in parallel, and the item type. For example, using SSE3 and the 128-bit XMM registers, you can operate on 2 (must be 64-bit) floating point values in parallel, or even 16 (must be byte sized) integer values in parallel.
To find which technologies a given chip supports, there is a CPUID instruction that returns processor-specific information.
Table 3

Technology          Register size/type  Item type  Items in Parallel
MMX                 64 MMX              Integer    8, 4, 2, 1
SSE                 64 MMX              Integer    8, 4, 2, 1
SSE                 128 XMM             Float      4
SSE2/SSE3/SSSE3...  64 MMX              Integer    2, 1
SSE2/SSE3/SSSE3...  128 XMM             Float      2
SSE2/SSE3/SSSE3...  128 XMM             Integer    16, 8, 4, 2, 1
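What "eight bytes added to eight bytes in one instruction" means can be shown with a scalar model. The C sketch below imitates a packed byte addition on a 64-bit value one lane at a time. It is a portable illustration with a hypothetical function name, not an intrinsic: each lane wraps around on its own and no carry crosses into the neighbouring lane.

```c
#include <stdint.h>

/* Lane-by-lane model of a packed byte add on a 64-bit register:
   eight independent 8-bit additions, each wrapping modulo 256. */
static uint64_t packed_add_bytes(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint8_t sum = (uint8_t)(x + y);          /* carry out of the lane is dropped */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

The difference from an ordinary 64-bit addition shows up when a lane overflows: adding 1 to a low byte of FFh gives zero in that lane and leaves the next lane untouched, whereas a plain add would carry into it.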


Tools


Assemblers
An Internet search reveals x64-capable assemblers such as the Netwide Assembler NASM, a NASM rewrite called YASM, the fast Flat Assembler FASM, and the traditional Microsoft MASM. There is even a free IDE for x86 and x64 assembly called WinASM. Each assembler has varying support for other assemblers' macros and syntax, but assembly code is not source-compatible across assemblers like C++ or Java* are.
For the examples below, I use the 64-bit version of MASM, ML64.EXE, freely available in the platform SDK. Note that MASM syntax is of the form Instruction Destination, Source
Some assemblers reverse source and destination, so read your documentation carefully.
C/C++ Compilers
C/C++ compilers often allow embedding assembly in the code using inline assembly, but Microsoft Visual Studio* C/C++ removed this for x64 code, likely to simplify the task of the code optimizer. This leaves two options: use separate assembly files and an external assembler, or use intrinsics from the header file 'intrin.h' (see Birtolo and MSDN). Other compilers feature similar options.
Some reasons to use intrinsics:

  • Inline asm not supported in x64.
  • Ease of use: you can use variable names instead of having to juggle register allocation manually.
  • More cross-platform than assembly: the compiler maker can port the intrinsics to various architectures.
  • The optimizer works better with intrinsics.

For example, Microsoft Visual Studio* 2008 has an intrinsic
unsigned short _rotr16(unsigned short a, unsigned char b)
which rotates the bits in a 16-bit value right b bits and returns the answer. Doing this in C gives
unsigned short a1 = (b>>c)|(b<<(16-c));
which expands to fifteen assembly instructions (in debug builds - in release builds whole program optimization made it harder to separate, but it was of a similar length), while using the equivalent intrinsic
unsigned short a2 = _rotr16(b,c);
expands to four instructions. For more information read the header file and documentation.
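For compilers without that intrinsic, the rotate can be written portably. The sketch below is a hypothetical helper (rotr16_portable is not a Microsoft function) that computes the same result as the C expression above, while reducing the shift count modulo 16 and widening to unsigned 32-bit arithmetic so that no shift overflows a signed int. Optimizing compilers often recognize this pattern and emit a single rotate instruction, though that is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>

// Portable 16-bit rotate right, equivalent in result to (b >> c) | (b << (16 - c)).
inline std::uint16_t rotr16_portable(std::uint16_t value, unsigned shift) {
    shift &= 15u;                    // rotating by 16 is the same as rotating by 0
    const std::uint32_t v = value;   // widen before shifting
    const std::uint32_t rotated = (v >> shift) | (v << (16u - shift));
    return static_cast<std::uint16_t>(rotated & 0xFFFFu);  // keep the low 16 bits
}
```

For instance, rotating 0x1234 right by 4 bits moves the low nibble to the top and yields 0x4123.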

Instruction Basics


Addressing Modes
Before covering some basic instructions, you need to understand addressing modes, which are ways an instruction can access registers or memory. The following are common addressing modes with examples:

  • Immediate: the value is stored in the instruction. ADD EAX, 14 ; add 14 into 32-bit EAX
  • Register to register ADD R8B, AL ; add 8-bit AL into R8B (the low byte of R8, which Intel's manuals call R8L)
  • Indirect: this allows using an 8, 16, or 32 bit displacement, any general purpose registers for base and index, and a scale of 1, 2, 4, or 8 to multiply the index. Technically, these can also be prefixed with segment FS: or GS: but this is rarely required. MOV R8W, 1234[8*RAX+RCX] ; move word at address 8*RAX+RCX+1234 into R8W
    There are many legal ways to write the same memory operand. A size specifier such as dword ptr tells the assembler how to encode the MOV instruction.
  • RIP-relative addressing: this is new for x64 and allows accessing data tables and such in the code relative to the current instruction pointer, making position independent code easier to implement.

MOV AL, [RIP] ; RIP points to the next instruction aka NOP
NOP

Unfortunately, MASM does not allow this form of opcode, but other assemblers like FASM and YASM do. Instead, MASM embeds RIP-relative addressing implicitly.
MOV EAX, TABLE ; uses RIP-relative addressing to load the dword at TABLE

  • Specialized cases: some opcodes use registers in unique ways based on the opcode. For example, signed integer division IDIV with a 64-bit operand divides the 128-bit value in RDX:RAX by that operand, storing the quotient in RAX and the remainder in RDX.
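The address arithmetic behind the indirect and RIP-relative modes above can be sketched in C++. Both function names are hypothetical, and the code models only the address calculation, not the memory access itself.

```cpp
#include <cassert>
#include <cstdint>

// Effective address of an indirect operand: base + index * scale + displacement.
// The scale must be 1, 2, 4, or 8; the displacement is sign-extended to 64 bits.
inline std::uint64_t effective_address(std::uint64_t base, std::uint64_t index,
                                       unsigned scale, std::int32_t displacement) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    // Unsigned arithmetic wraps modulo 2^64, as the hardware calculation does.
    return base + index * scale
         + static_cast<std::uint64_t>(static_cast<std::int64_t>(displacement));
}

// RIP-relative operand: the 32-bit displacement is added to the address of the
// NEXT instruction, not the current one.
inline std::uint64_t rip_relative_address(std::uint64_t next_instruction,
                                          std::int32_t displacement) {
    return next_instruction
         + static_cast<std::uint64_t>(static_cast<std::int64_t>(displacement));
}
```

For the operand 1234[8*RAX+RCX], the call would be effective_address(rcx, rax, 8, 1234), with rcx and rax standing for the register contents at that point.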


Instruction Set
Table 4 lists some common instructions. An entry ending in * stands for a family of opcodes that differ in the suffix.
Table 4 - Common Opcodes

Opcode       Meaning                                     Opcode              Meaning
MOV          Move to/from/between memory and registers   AND/OR/XOR/NOT      Bitwise operations
CMOV*        Various conditional moves                   SHR/SAR             Shift right logical/arithmetic
XCHG         Exchange                                    SHL/SAL             Shift left logical/arithmetic
BSWAP        Byte swap                                   ROR/ROL             Rotate right/left
PUSH/POP     Stack usage                                 RCR/RCL             Rotate right/left through carry bit
ADD/ADC      Add/with carry                              BT/BTS/BTR          Bit test/and set/and reset
SUB/SBB      Subtract/with borrow                        JMP                 Unconditional jump
MUL/IMUL     Multiply unsigned/signed                    JE/JNE/JC/JNC/J*    Jump if equal/not equal/carry/not carry/many others
DIV/IDIV     Divide unsigned/signed                      LOOP/LOOPE/LOOPNE   Loop with ECX
INC/DEC      Increment/Decrement                         CALL/RET            Call subroutine/return
NEG          Negate                                      NOP                 No operation
CMP          Compare                                     CPUID               CPU information


A common instruction is the LOOP instruction, which decrements RCX, ECX, or CX depending on usage, and then jumps if the result is not 0.
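A minimal C++ sketch of those semantics follows. The assembly sequence in the comment is an assumed illustration, not the article's listing, and run_loop is a hypothetical name. Because LOOP decrements before it tests, the body always runs at least once, which is why the emulation uses do/while.

```cpp
#include <cassert>
#include <cstdint>

// Emulates (assumed sequence):
//            XOR EAX, EAX      ; clear the accumulator
//            MOV RCX, count    ; set the loop counter
//     again: INC EAX           ; loop body
//            LOOP again        ; decrement RCX, jump if the result is not 0
// Returns the final accumulator value, i.e. how many times the body ran.
inline std::uint64_t run_loop(std::uint64_t count) {
    assert(count != 0);       // a zero count would wrap around and run 2^64 times
    std::uint64_t rax = 0;
    std::uint64_t rcx = count;
    do {
        ++rax;                // loop body
        --rcx;                // LOOP decrements the counter first...
    } while (rcx != 0);       // ...and jumps back only if the result is nonzero
    return rax;
}
```

With a count of 10 the body runs ten times, so run_loop(10) returns 10.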


Less common opcodes implement string operations, repeat instruction prefixes, port I/O instructions, flag set/clear/test, floating point operations (these usually begin with an F and support move, to/from integer conversion, arithmetic, comparison, transcendental, algebraic, and control functions), cache and memory opcodes for multithreading and performance issues, and more. The Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2, in two parts, covers each opcode in detail.

Operating Systems

64-bit systems allow addressing 2 to the 64th power bytes of data in theory, but no current chips allow accessing all 16 exabytes (18,446,744,073,709,551,616 bytes). For example, AMD architecture uses only the lower 48 bits of an address, and bits 48 through 63 must be a copy of bit 47 or the processor raises an exception. Thus addresses are 0 through 00007FFF`FFFFFFFF, and from FFFF8000`00000000 through FFFFFFFF`FFFFFFFF, for a total of 256 TB (281,474,976,710,656 bytes) of usable virtual address space. Another downside is that addressing all 64 bits of memory requires a lot more paging tables for the OS to store, using valuable memory for systems with less than all 16 exabytes installed. Note these are virtual addresses, not physical addresses.
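The rule that bits 48 through 63 must copy bit 47 can be expressed as a short check. The C++ sketch below assumes a 48-bit implementation as described above; is_canonical48 is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// On a 48-bit implementation, bits 47..63 must be all zeros or all ones,
// which is the same as saying bits 48-63 are a copy of bit 47.
inline bool is_canonical48(std::uint64_t address) {
    const std::uint64_t top17 = address >> 47;   // bits 47..63
    return top17 == 0 || top17 == 0x1FFFFu;
}

// Size checks for the figures quoted in the text.
static_assert((std::uint64_t{1} << 48) == 281474976710656ULL, "2^48 bytes = 256 TB");
static_assert((std::uint64_t{1} << 44) == 17592186044416ULL,  "2^44 bytes = 16 TB");
```

The two valid ranges in the text correspond to the two accepted values of the top 17 bits: all zeros for the lower half and all ones for the upper half.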
As a result, many operating systems use the higher half of this space for the OS, starting at the top and growing down, while user programs use the lower half, starting at the bottom and growing upwards. Current Windows* versions use 44 bits of addressing (16 terabytes = 17,592,186,044,416 bytes). The resulting addressing is shown in Figure 2. The resulting addresses are not too important for user programs since addresses are assigned by the OS, but the distinction between user addresses and kernel addresses is useful for debugging.
A final OS-related item relates to multithreaded programming, but this topic is too large to cover here. The only mention is that there are memory barrier opcodes for helping to keep shared resources uncorrupted.


Figure 2 - Memory Addressing



Calling Conventions

Interfacing with operating system libraries requires knowing how to pass parameters and manage the stack. These details on a platform are called a calling convention.
A common x64 calling convention is the Microsoft 64 calling convention used for C style function calling (see MSDN, Chen, and Pietrek). Under Linux* this would be called an Application Binary Interface (ABI). Note the calling convention covered here is different than the one used on x64 Linux* systems.
For the Microsoft* x64 calling convention, the additional register space let fastcall be the only calling convention (under x86 there were many: stdcall, thiscall, fastcall, cdecl, etc.). The rules for interfacing with C/C++ style functions:

  • RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right.
  • XMM0, 1, 2, and 3 are used for floating point arguments.
  • Additional arguments are pushed on the stack right to left, so the fifth argument sits at the lowest address, just above the shadow space.
  • Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
  • It is the caller's responsibility to allocate 32 bytes of 'shadow space' (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
  • It is the caller's responsibility to clean the stack after the call.
  • Integer return values (similar to x86) are returned in RAX if 64 bits or less.
  • Floating point return values are returned in XMM0.
  • Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
  • The stack is 16-byte aligned. The 'call' instruction pushes an 8-byte return address, so all non-leaf functions must adjust the stack by a value of the form 16n+8 when allocating stack space.
  • Registers RAX, RCX, RDX, R8, R9, R10, and R11 are considered volatile and must be considered destroyed on function calls.
  • RBX, RBP, RDI, RSI, R12, R13, R14, and R15 must be saved in any function using them.
  • Note there is no calling convention for the floating point (and thus MMX) registers.
  • Further details (varargs, exception handling, stack unwinding) are at Microsoft's site.
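The 16n+8 rule above reduces to a small calculation. The C++ sketch below uses hypothetical names (kShadowSpace, frame_adjustment) and assumes the function pushes no registers before adjusting RSP; each 8-byte push would shift the required form.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kShadowSpace = 32;   // home area for RCX, RDX, R8, and R9

// Smallest value of the form 16n + 8 that is >= bytes_needed. Subtracting it
// from RSP on entry (where RSP is 8 past a 16-byte boundary because of the
// pushed return address) restores 16-byte alignment for the next call.
constexpr std::size_t frame_adjustment(std::size_t bytes_needed) {
    return ((bytes_needed + 7) / 16) * 16 + 8;
}

static_assert(frame_adjustment(kShadowSpace) == 40,
              "shadow space only: SUB RSP, 28h");
static_assert(frame_adjustment(kShadowSpace + 16) == 56,
              "shadow space plus two stack arguments");
```

A function that calls others with at most four arguments therefore reserves 40 bytes: 32 for the shadow space plus 8 to realign the stack.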

Examples

Armed with the above, here are a few examples showing x64 usage. The first is a simple x64 standalone assembly program that pops up a Windows MessageBox.


Save this as hello.asm and assemble it with ML64, available in the Microsoft Windows* x64 SDK, as follows:
ml64 hello.asm /link /subsystem:windows /defaultlib:kernel32.lib /defaultlib:user32.lib /entry:Start
which makes a windows executable and links with appropriate libraries. Run the resulting executable hello.exe and you should get the message box to pop up.
The second example links an assembly file with a C/C++ file under Microsoft Visual Studio* 2008. Other compiler systems are similar. First make sure your compiler is an x64-capable version. Then

    1. Create a new empty C++ console project. Create a function you'd like to port to assembly, and call it from main.
    2. To change the default 32-bit build, select Build/Configuration Manager.
    3. Under Active Platform, select New...
    4. Under Platform, select x64. If it does not appear figure out how to add the 64-bit SDK tools and repeat.
    5. Compile and step into the code. Look under Debug/Windows/Disassembly to see the resulting code and interface needed for your assembly function.
    6. Create an assembly file, and add it to the project. It defaults to a 32 bit assembler which is fine.
    7. Open the assembly file properties, select all configurations, and edit the custom build step.
    8. Put command line
       and set outputs to
    9. Build and run.

For example, in main.cpp we put a function CombineC that does some simple math on five integer parameters and one double parameter, and returns a double answer. We duplicate that functionality in assembly in a separate file CombineA.asm in a function called CombineA. The C++ file is:



Be sure to give the functions extern "C" linkage to prevent C++ name mangling. The assembly file CombineA.asm contains


Running this should result in the value 1.97368 being output twice.
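As a rough sketch of the C++ side, the function below has the signature described above: five integer parameters, one double parameter, and a double result. The formula and the sample arguments are assumptions chosen so that the result matches the 1.97368 quoted above, and they should not be read as the article's exact listing. The comments note where each argument arrives under the Microsoft x64 calling convention.

```cpp
#include <cassert>
#include <cmath>

// ASSUMPTION: this formula is hypothetical, picked so that
// CombineC(1, 2, 3, 4, 5, 6.1) evaluates to 15 / 7.6, about 1.97368.
//
// Under the Microsoft x64 convention the arguments arrive as:
//   a -> ECX, b -> EDX, c -> R8D, d -> R9D,
//   e -> stack at [RSP+28h], f -> stack at [RSP+30h]
// (offsets at function entry: 8 bytes of return address, then 32 bytes of
// shadow space, then the stack arguments).
double CombineC(int a, int b, int c, int d, int e, double f) {
    return (a + b + c + d + e) / (f + 1.5);
}

// Declaration the C++ file needs for the assembly version. extern "C" turns
// off name mangling so the linker can find the plain symbol CombineA.
extern "C" double CombineA(int a, int b, int c, int d, int e, double f);
```

An assembly implementation has to read the fifth and sixth arguments from the stack, since only the first four argument positions are passed in registers.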

Conclusion

This has been a necessarily brief introduction to x64 assembly programming. The next step is to browse the Intel® 64 and IA-32 Architectures Software Developer's Manuals. Volume 1 contains the architecture details and is a good start if you know assembly. Other resources are assembly books and online assembly tutorials. To get an understanding of how your code executes, it is instructive to step through code in a debugger, looking at the disassembly, until you can read assembly code as well as your favorite language. For C/C++ compilers, debug builds are much easier to read than release builds, so be sure to start there. Finally, read the forums at masm32.com for a lot of material.


References


NASM: http://www.nasm.us/
YASM: http://www.tortall.net/projects/yasm/
Flat Assembler (FASM): http://www.flatassembler.net/
'Intel® 64 and IA-32 Architectures Software Developer's Manuals,' available online at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
'Compiler Intrinsics', available online at http://msdn.microsoft.com/en-us/library/26td21ds.aspx
Matt Pietrek, 'Everything You Need To Know To Start Programming 64-Bit Windows Systems', available online at http://msdn.microsoft.com/en-us/magazine/cc300794.aspx, 2009.

About the Author

Chris Lomont works as a research engineer at Cybernet Systems, working on projects as diverse as quantum computing algorithms, image processing for NASA, developing security hardware for United States Homeland Security, and computer forensics. Before that he obtained a PhD. in math from Purdue, three Bachelors degrees in physics, math, and computer science, worked as a game programmer, did brief stints in financial modeling, robotics work, and various consulting roles. The rest of his time is spent hiking with his wife, watching movies, giving talks, recreational programming, doing math research, learning more physics, playing music, and performing various experiments. Visit his website www.lomont.org or his electronic gadget site www.hypnocube.com.

Additional Resources